Latent Topic Analysis
I have been thinking about adding text clustering, in particular, latent topic analysis, into SQLFlow.
Even before the spread of the coronavirus in China, there were well-read posts on CSDN and WeChat Moments about code review skills, which are useful for programmers working from home.
These existing posts mostly translate the Google Style Guide and similar practice guides into Chinese. I am more interested in analyzing what and how developers of famous open-source projects (and in-company projects) say to each other during code review.
I verified that I can crawl code review comments by calling the GitHub API. Similarly, we can retrieve comments from in-company code review systems. Latent topic modeling seems an interesting approach for the analysis.
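A minimal sketch of the crawling step (the endpoint and JSON field names below follow the GitHub REST v3 API for pull-request review comments; the helper names are my own, not part of SQLFlow):

```python
import json
import urllib.request

def comments_url(owner, repo, page=1):
    # GitHub REST v3 endpoint listing pull-request review comments of a repo.
    return ("https://api.github.com/repos/%s/%s/pulls/comments?page=%d"
            % (owner, repo, page))

def to_rows(payload):
    # Flatten the JSON payload into (ID, author, date, comment) tuples,
    # ready to be inserted into a table like cr.comments.
    return [(c["id"], c["user"]["login"], c["created_at"], c["body"])
            for c in json.loads(payload)]

def fetch_review_comments(owner, repo, page=1):
    with urllib.request.urlopen(comments_url(owner, repo, page)) as resp:
        return to_rows(resp.read())
```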
Given some documents, each as a bag of words, we can train a latent topic model, say, latent Dirichlet allocation (LDA), to cluster words. If two words frequently co-occur in many documents, they tend to appear in the same cluster. For example, a cluster might contain "Bill", "Gates", "Microsoft", and "Windows", which makes the cluster a semantic topic. Hence the name latent topic analysis.
There are two stages in latent topic modeling:
- Training. The input is a set of documents, each a bag/set of words. The output is an integer matrix of size V×K, where V is the vocabulary size and K is the user-specified number of topics/clusters. The value at location (v, k) indicates the number of times the authors wrote word v to express idea/topic k.
- Inference/prediction. The input is a set of documents. For each document, the output is an integer vector of length K, whose k-th element is the number of words in the document that express idea/topic k.
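The two stages above can be sketched with scikit-learn's `LatentDirichletAllocation` (an illustration only; sklearn's variational implementation stores expected word counts as floats rather than integers, but the shapes and roles match):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "please add the copyright header to newly added files",
    "the copyright header is missing in this file",
    "rename this variable to follow the style guide",
    "this function name violates the style guide",
]

# Training: turn each document into a bag of words, then fit K=2 topics.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # shape: (num_docs, V)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
# lda.components_ has shape (K, V): how strongly each word expresses each topic.

# Inference/prediction: for each document, a length-K vector of topic weights
# (normalized to sum to 1 by scikit-learn).
doc_topics = lda.transform(X)  # shape: (num_docs, K)
```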
- Crawl GitHub comments into a table `cr.comments`:

  | (document) ID | author | date | comment | project |
  |---------------|--------|------|---------|---------|
  | af7x002 | wangkuiyi | 2019-02-10 | Please add the copyright header to newly added files | SQLFlow |

- Users can check the comments by authors, by dates, and/or by projects.
- Users can find major contributors to certain projects.

  ```sql
  SELECT author FROM cr.comments
  WHERE project = "tensorflow"
  GROUP BY author HAVING COUNT(*) > 1000;
  ```
- Users can check out comments by such authors, who are supposed to be good players in the open-source field.

  ```sql
  SELECT comment FROM cr.comments WHERE author IN (
    SELECT author FROM cr.comments GROUP BY author HAVING COUNT(*) > 1000
  );
  ```
- Similarly, users can check out comments by authors who rarely contribute, who are supposedly less good players.

  ```sql
  SELECT comment FROM cr.comments WHERE author IN (
    SELECT author FROM cr.comments GROUP BY author HAVING COUNT(*) < 3
  );
  ```
- Then, we can train a model using comments from "good" authors and another model using comments from "bad" authors.

  ```sql
  SELECT comment FROM cr.comments
  WHERE author = "Jeff Dean" AND project = "TensorFlow"
  TO TRAIN LatentDirichletAllocation
  WITH topics = 100
  INTO jeff_dean_comments_model;
  ```
- We can then run inference/prediction on some documents, which could be the set used to train the model, or a different set with a different number of documents. For example, we can run inference on the training documents to see which topics each document expresses.

  ```sql
  SELECT ID, comment FROM cr.comments
  TO PREDICT cr.jeff_comment_topics
  USING jeff_dean_comments_model;
  ```
The above statement generates a table `cr.jeff_comment_topics` with the following data schema:

| (document) ID | topic | weight |
|---------------|-------|--------|
| 97eff063 | 89 | 4 |
| 97eff063 | 17 | 1 |
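How the inference output might be flattened into such (ID, topic, weight) rows can be sketched as follows (illustration only; `topic_rows` is a hypothetical helper, and SQLFlow's actual writer may differ):

```python
def topic_rows(doc_ids, doc_topic_counts, min_weight=1):
    # doc_topic_counts[i][k] = number of words in document i assigned to topic k.
    # Emit one (ID, topic, weight) row per non-negligible topic, as in the
    # cr.jeff_comment_topics table above.
    rows = []
    for doc_id, counts in zip(doc_ids, doc_topic_counts):
        for topic, weight in enumerate(counts):
            if weight >= min_weight:
                rows.append((doc_id, topic, weight))
    return rows
```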
Given the prediction results, we can do a lot of things. For example, if we want to check out the top-N (say, N=4) documents of topic 89, we can write the following query.

```sql
SELECT ID FROM cr.jeff_comment_topics
WHERE topic = 89 ORDER BY weight DESC LIMIT 4;
```
If we want not only the document ID but also its content, we can write a nested join.

```sql
SELECT author, date, comment FROM cr.comments JOIN (
  SELECT ID FROM cr.jeff_comment_topics
  WHERE topic = 89 ORDER BY weight DESC LIMIT 4
) AS B ON B.ID = cr.comments.ID;
```