Latent Topic Analysis
I have been thinking about adding text clustering, in particular, latent topic analysis, into SQLFlow.
Even before the spread of the coronavirus in China, there were well-read posts on CSDN and WeChat Moments about code review skills, which are useful for programmers working from home.
These existing posts mostly translate the Google Style Guide and similar practice guides into Chinese. I am more interested in analyzing what and how developers of famous open-source projects (and in-company projects) say to each other during code review.
I verified that I can crawl code review comments by calling the GitHub API. Similarly, we can retrieve comments from in-company code review systems. Latent topic modeling seems an interesting approach for the analysis.
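A minimal sketch of the crawling step (the endpoint and JSON field names below follow the GitHub REST v3 API for pull-request review comments; the helper names are my own, not part of SQLFlow):

```python
import json
import urllib.request

def comments_url(owner, repo, page=1):
    # GitHub REST v3 endpoint listing pull-request review comments of a repo.
    return ("https://api.github.com/repos/%s/%s/pulls/comments?page=%d"
            % (owner, repo, page))

def to_rows(payload):
    # Flatten the JSON payload into (ID, author, date, comment) tuples,
    # ready to be inserted into a table like cr.comments.
    return [(c["id"], c["user"]["login"], c["created_at"], c["body"])
            for c in json.loads(payload)]

def fetch_review_comments(owner, repo, page=1):
    with urllib.request.urlopen(comments_url(owner, repo, page)) as resp:
        return to_rows(resp.read())
```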
Given some documents, each as a bag of words, we can train a latent topic model, say, latent Dirichlet allocation (LDA), to cluster words. If two words frequently co-occur in many documents, they tend to appear in the same cluster. For example, a cluster might contain "Bill", "Gates", "Microsoft", and "Windows", which makes the cluster a semantic topic. Hence the name latent topic analysis.
There are two stages in latent topic modeling:
- Training. The input is a set of documents, each a bag/set of words. The output is an integer matrix of size V×K, where V is the vocabulary size and K is the user-specified number of topics/clusters. The value at location (v, k) indicates the number of times the authors wrote word v to express idea/topic k.
- Inference/prediction. The input is a set of documents. For each document, the output is an integer vector of length K, whose k-th element is the number of words in the document that express idea/topic k.
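The two stages above can be sketched with scikit-learn's `LatentDirichletAllocation` (an illustration only; sklearn's variational implementation stores expected word counts as floats rather than integers, but the shapes and roles match):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "please add the copyright header to newly added files",
    "the copyright header is missing in this file",
    "rename this variable to follow the style guide",
    "this function name violates the style guide",
]

# Training: turn each document into a bag of words, then fit K=2 topics.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # shape: (num_docs, V)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
# lda.components_ has shape (K, V): how strongly each word expresses each topic.

# Inference/prediction: for each document, a length-K vector of topic weights
# (normalized to sum to 1 by scikit-learn).
doc_topics = lda.transform(X)  # shape: (num_docs, K)
```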
- Crawl GitHub comments into a table `cr.comments`:

  | (document) ID | author | date | comment | project |
  |---------------|--------|------|---------|---------|
  | af7x002 | wangkuiyi | 2019-02-10 | Please add the copyright header to newly added files | SQLFlow |

- Users can check the comments by authors, by dates, and/or by projects.
- Users can find major contributors to certain projects.

  ```sql
  SELECT author FROM cr.comments
  WHERE project = "tensorflow"
  GROUP BY author HAVING COUNT(*) > 1000;
  ```
- Users can check out comments by such authors, who are supposed to be good players in the open-source field.

  ```sql
  SELECT comment FROM cr.comments WHERE author IN (
    SELECT author FROM cr.comments GROUP BY author HAVING COUNT(*) > 1000
  );
  ```
- Similarly, users can check out comments by authors who rarely contribute, who are supposedly less good players.

  ```sql
  SELECT comment FROM cr.comments WHERE author IN (
    SELECT author FROM cr.comments GROUP BY author HAVING COUNT(*) < 3
  );
  ```
- Then, we can train a model using comments from "good" authors and another model using comments from "bad" authors.

  ```sql
  SELECT comment FROM cr.comments
  WHERE author = "Jeff Dean" AND project = "TensorFlow"
  TO TRAIN LatentDirichletAllocation
  WITH topics = 100
  INTO jeff_dean_comments_model;
  ```
- We can then run inference/prediction on some documents, which could be the set used to train the model, or a different set with a different number of documents. For example, we can run inference on the training documents to see which topics each document expresses.

  ```sql
  SELECT ID, comment FROM cr.comments
  TO PREDICT cr.jeff_comment_topics
  USING jeff_dean_comments_model;
  ```
The above statement generates a table `cr.jeff_comment_topics` with the following data schema:

| (document) ID | topic | weight |
|---------------|-------|--------|
| 97eff063 | 89 | 4 |
| 97eff063 | 17 | 1 |
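How the inference output might be flattened into such (ID, topic, weight) rows can be sketched as follows (illustration only; `topic_rows` is a hypothetical helper, and SQLFlow's actual writer may differ):

```python
def topic_rows(doc_ids, doc_topic_counts, min_weight=1):
    # doc_topic_counts[i][k] = number of words in document i assigned to topic k.
    # Emit one (ID, topic, weight) row per non-negligible topic, as in the
    # cr.jeff_comment_topics table above.
    rows = []
    for doc_id, counts in zip(doc_ids, doc_topic_counts):
        for topic, weight in enumerate(counts):
            if weight >= min_weight:
                rows.append((doc_id, topic, weight))
    return rows
```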
Given the prediction results, we can do a lot of things. For example, if we want to check out the top-N (say, N=4) documents of topic 89, we can write the following query.

```sql
SELECT ID FROM cr.jeff_comment_topics
WHERE topic = 89 ORDER BY weight DESC LIMIT 4;
```
If we want not only the document ID but also its content, we can write a nested join.

```sql
SELECT author, date, comment FROM cr.comments JOIN (
  SELECT ID FROM cr.jeff_comment_topics
  WHERE topic = 89 ORDER BY weight DESC LIMIT 4
) AS B ON B.ID = cr.comments.ID;
```