Phonic

Response Coding Pipeline

Timeline

4 weeks

Role

Programmer

Team

Phonic Engineering Team

Tools & Skills

Python, Clustering Algorithms, Dimensionality Reduction (PCA), NumPy, Pandas, NLTK,
scikit-learn, Seaborn

Purpose

This project's goal was to help Phonic users visualize categorical clusters from their collected surveys. I designed a machine learning pipeline that extracts and visualizes high-level relationships from customer survey data via dimensionality reduction of paragraph embeddings.

Overview

Our aim is to take on response coding by translating complex formats, such as audio surveys, into a simple, easy-to-analyze medium. We built a pipeline that takes as input a text file of documents (audio transcriptions, movie reviews, etc.) and outputs a plot displaying the clusters found among those documents. Our approach consists of three steps:

1. Using a machine-learning algorithm to create a numeric representation of each document
2. Reducing the dimensionality of the resulting vectors
3. Clustering the reduced data

Doc2Vec

We can represent each document in the text file as a point in a high-dimensional vector space by using Doc2Vec, a popular unsupervised algorithm for creating text embeddings.

Doc2Vec introduces an additional vector, the paragraph vector (identified by a Paragraph ID), to compensate for the less-predictable nature of documents. The paragraph vector stores the topic or concept of the document and aids in predicting the next set of words. Each document has its own unique paragraph vector, while the word vectors are shared across documents.

In this project we are using a pre-trained Doc2Vec model trained on Wikipedia’s English data.

To create and test the vectors, we are using a corpus of 100 IMDB movie reviews. Each line of the file holds a document.
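
A minimal sketch of this step, assuming a gensim Doc2Vec model saved at a hypothetical path such as `doc2vec_enwiki.model` and a corpus file with one document per line:

```python
from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize

# Hypothetical path to the pre-trained English-Wikipedia Doc2Vec model.
MODEL_PATH = "doc2vec_enwiki.model"

def infer_document_vectors(corpus_path, model_path=MODEL_PATH):
    """Infer one document vector per line of the corpus file."""
    model = Doc2Vec.load(model_path)
    vectors = []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            tokens = word_tokenize(line.lower())  # simple NLTK tokenization
            vectors.append(model.infer_vector(tokens))
    return vectors

# e.g. the IMDB review corpus described above (file name assumed here)
doc_vectors = infer_document_vectors("imdb_reviews.txt")
```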

Dimensionality Reduction

The vectors inferred from our Doc2Vec model have 300 dimensions, and many clustering algorithms perform poorly on high-dimensional datasets. Thankfully, dimensionality reduction algorithms exist, and Principal Component Analysis (PCA) is a popular choice. PCA performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized.

We import PCA from scikit-learn and fit the document vectors collected earlier, reducing them to a list of 2-D points:
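
A minimal sketch of that step, assuming the inferred vectors are collected in the `doc_vectors` list from the previous step:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stack the 300-dimensional document vectors into a single array.
X = np.vstack(doc_vectors)

# Project the vectors down to 2 dimensions so they can be plotted and clustered.
pca = PCA(n_components=2)
points = pca.fit_transform(X)  # shape: (n_documents, 2)
```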

Clustering

Mean Shift

Mean shift attempts to find dense areas of points. It is a centroid-based algorithm that uses a sliding window: the window repeatedly updates candidate centroids by computing the mean of the points inside it. The candidates are then post-processed into centroid points, which usually end up being the points of maximum density.

Mean shift has one drawback: the size of the sliding window (the bandwidth) can affect the resulting clusters. However, we did not notice any large discrepancies across datasets, as the visually apparent clusters were the ones mean shift selected as well. Given the advantage of not needing a preset number of clusters, we chose mean shift as the designated clustering algorithm for the pipeline. Above is a plot of the centroids (triangles) and color-coded clusters after running the algorithm on the IMDB movie review corpus.
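
A plot like that could be produced with a sketch along these lines, using scikit-learn and Seaborn and assuming the 2-D `points` array from the PCA step (the bandwidth quantile is an illustrative choice, not the tuned value):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate a sliding-window size (bandwidth) from the data, then run mean shift.
bandwidth = estimate_bandwidth(points, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(points)
centroids = ms.cluster_centers_

# Color-code each document by its cluster and mark centroids with triangles.
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=labels, palette="deep", legend=False)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="^", c="black", s=120)
plt.title("Mean shift clusters of IMDB review embeddings (PCA-reduced)")
plt.show()
```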

Other clustering algorithms explored: K-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

Conclusion

After analyzing the given problem, we looked at what Doc2Vec is and how it works. Then we inferred vectors from a test corpus and reduced their dimensionality through PCA to prepare them for clustering. Finally, we weighed the strengths and weaknesses of various clustering algorithms and implemented mean shift clustering in the pipeline.