
Survey Analysis Example

A common source of valuable unstructured text data is the free-response questions often asked in surveys. This data is difficult to work with, because typical techniques for finding insights in surveys focus on structured data like the answers to multiple-choice questions.

The data used to set up your first scope is in fact the 765 responses to the question: "What do you think people you work with just don't get about the data visualization work that you do?"

Let's take a look at how Latent Scope can help us pull some insight out of those responses by examining the clusters it has identified. Note that this page uses the exported scope from the "your first scope" guide, and much of this analysis could be done within the tool.

Before we get to the fun visualizations, let's take a look at the input data. The data is an extract of the full survey, where the answer to the question is stored in DataVizNotUnderstood, and we also have the multiple-choice answer for the respondent's Role:

As you can see, it's not very straightforward to pick out patterns just by scrolling through the text, so let's take a look at what Latent Scope gave us:

The map is created by going through the four-step process in Latent Scope:

  1. Embed - run each piece of text through an embedding model
  2. Project - run the high-dimensional embeddings through UMAP
  3. Cluster - run the 2-dimensional UMAP coordinates through HDBSCAN
  4. Label - ask an LLM to create a label by summarizing a list of text taken from each cluster
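
The four steps above can be sketched outside the tool roughly as follows. This is a minimal sketch, not Latent Scope's own code: it assumes the `sentence-transformers`, `umap-learn`, and `hdbscan` packages are installed, and the model name, parameter values, prompt wording, and the helper names `embed_project_cluster`, `group_by_cluster`, and `build_label_prompt` are illustrative choices rather than anything taken from the tool.

```python
from collections import defaultdict


def embed_project_cluster(texts, model_name="all-MiniLM-L6-v2", min_cluster_size=5):
    """Steps 1-3: embed the texts, project to 2D with UMAP, cluster with HDBSCAN.

    The heavy dependencies are imported here so the labeling helpers below
    can be used without them. Model name and parameters are assumptions.
    """
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # 1. Embed: one high-dimensional vector per text
    embeddings = SentenceTransformer(model_name).encode(texts)
    # 2. Project: reduce the embeddings to 2-dimensional coordinates
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
    # 3. Cluster: HDBSCAN assigns a cluster id to each point; -1 marks noise
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(coords)
    return coords, [int(label) for label in labels]


def group_by_cluster(texts, labels):
    """Collect texts per cluster id, skipping HDBSCAN noise (-1)."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        if label != -1:
            groups[label].append(text)
    return dict(groups)


def build_label_prompt(cluster_texts, max_items=10):
    """Step 4: build the prompt an LLM would be asked to turn into a label.

    The wording of the prompt is hypothetical.
    """
    sample = "\n".join(f"- {text}" for text in cluster_texts[:max_items])
    return "Summarize the common theme of these responses in a few words:\n" + sample
```

Calling `embed_project_cluster(texts)` on the list of responses, then `group_by_cluster` and `build_label_prompt` on the result, would yield one prompt per cluster to send to whichever LLM you use for labeling.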

What the process gives us is a workable categorization of the text data based on the patterns captured by the embedding model and the labeling of the LLM. Of course these automated steps are never perfect, so Latent Scope is designed to let you both tweak the parameters of each step and manually re-categorize data once the process is finished. For more details, see the explore and curate guide.
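
Conceptually, manual re-categorization amounts to remapping cluster ids. The snippet below is only a sketch of that idea on a plain list of labels; the function name `merge_clusters` and the cluster ids are hypothetical, and this is not how the tool itself stores or edits clusters.

```python
def merge_clusters(labels, sources, target):
    """Return a new label list where every id in `sources` becomes `target`."""
    sources = set(sources)
    return [target if label in sources else label for label in labels]


# Hypothetical cluster ids for six responses
labels = [3, 27, 3, 12, 27, 5]
merged = merge_clusters(labels, sources=[3], target=27)
print(merged)  # [27, 27, 27, 12, 27, 5]
```

Returning a new list rather than editing in place keeps the original assignment available, which makes it easy to compare the automated clustering with the curated one.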

Let's take a closer look at some of the clusters we got, starting with my favorite (because it's so true!):

As I mentioned, the process isn't perfect, and here we see a cluster that could have easily been combined with Cluster 27:

In fact, I did use the explore tool to combine a couple of clusters into this one:

How much time it takes to make data visualizations is clearly a common response, because there is also this other large cluster:

As you can see, the embeddings (and UMAP, and clustering) may separate text based on different concepts. The last two clusters are conceptually very similar, with the main difference being that most people used the word "viz" in one cluster and not in the other. I find it quite amazing that this level of separation is possible, but it may not always be the distinction you want to draw.
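
One quick way to check whether two clusters differ mainly by a single word is to measure how often that word appears in each. The sketch below assumes you have the texts of each cluster as plain Python lists; the function name `keyword_share` is hypothetical and the example responses are invented placeholders, not actual survey answers.

```python
import re


def keyword_share(texts, keyword):
    """Fraction of texts containing `keyword` as a whole word, case-insensitive."""
    if not texts:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)


# Invented placeholder responses, not actual survey answers
cluster_a = ["Good viz takes time", "Viz is iterative work", "A viz is not instant"]
cluster_b = ["Charts take time to build", "Design takes many iterations"]

print(keyword_share(cluster_a, "viz"))  # 1.0
print(keyword_share(cluster_b, "viz"))  # 0.0
```

A large gap between the two shares suggests the split is driven by vocabulary rather than meaning, which makes the clusters good candidates for merging.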

The theme of time and effort is expressed further in other clusters:

Whew! Data visualization certainly takes a lot of time and effort! There are many more clusters to explore (80 in fact). Instead of listing them all out, let's end with a little interactive choice:

What are you waiting for? Try Latent Scope out on your own data!