Albedo

A recommender system for discovering GitHub repos, built with Apache Spark.

Albedo is a fictional character in Dan Simmons's Hyperion Cantos series. Councilor Albedo is the TechnoCore's AI advisor to the Hegemony of Man.

Setup

$ git clone https://github.com/vinta/albedo.git
$ cd albedo
$ make up

Collect Data

First, create a personal access token (GITHUB_PERSONAL_TOKEN) on your GitHub settings page.

# get into the main container
$ make attach

# this step might take a few hours to complete,
# depending on how many repos you have starred and how many users you follow
$ (container) python manage.py migrate
$ (container) python manage.py collect_data -t GITHUB_PERSONAL_TOKEN -u GITHUB_USERNAME
# or import a prebuilt database dump instead of running collect_data
$ (container) wget https://s3-ap-northeast-1.amazonaws.com/files.albedo.one/albedo.sql
$ (container) mysql -h mysql -u root -p123 albedo < albedo.sql

# username: albedo
# password: hyperion
$ make run
$ open http://127.0.0.1:8000/admin/
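
To illustrate what the token is for, here is a minimal sketch of an authenticated request to the GitHub REST API endpoint that lists a user's starred repos. It is not the project's `collect_data` command; the helper name `build_starred_request` is hypothetical, and the request is only built, not sent.

```python
import urllib.parse
import urllib.request


def build_starred_request(username, token, page=1, per_page=100):
    """Build (but do not send) a request for one page of a user's starred repos."""
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    url = (
        "https://api.github.com/users/"
        f"{urllib.parse.quote(username)}/starred?{query}"
    )
    headers = {
        # Personal access tokens are passed in the Authorization header.
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github.v3+json",
    }
    return urllib.request.Request(url, headers=headers)


request = build_starred_request("GITHUB_USERNAME", "GITHUB_PERSONAL_TOKEN")
print(request.full_url)
```

Passing the result to `urllib.request.urlopen` would perform the call. Results are paginated, so a real collector has to walk through the pages.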

Start a Spark Cluster

You can also create a Spark cluster on Google Cloud Dataproc instead of running one locally.

# start a local Spark cluster in Standalone mode
$ make spark_start

Use Popularity as the Recommendation Baseline

See PopularityRecommenderBuilder.scala for complete code.
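
As a rough illustration of the idea behind a popularity baseline (not the Scala implementation itself), the sketch below recommends the most-starred repos to every user. Using star counts as the popularity signal and skipping repos the user has already starred are assumptions; the function name and sample data are made up.

```python
from collections import Counter


def recommend_popular(starrings, user, k=30):
    """Return the k most-starred repos that `user` has not starred yet."""
    # Count how many users starred each repo.
    star_counts = Counter(repo for _, repo in starrings)
    already_starred = {repo for u, repo in starrings if u == user}
    ranked = [
        repo
        for repo, _ in star_counts.most_common()
        if repo not in already_starred
    ]
    return ranked[:k]


# Hypothetical (user, repo) starring pairs.
starrings = [
    ("alice", "spark"), ("bob", "spark"), ("carol", "spark"),
    ("alice", "django"), ("bob", "django"),
    ("carol", "flask"),
]
print(recommend_popular(starrings, "carol", k=2))
```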

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.PopularityRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002017744675282716
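
NDCG@30 is the normalized discounted cumulative gain over the top 30 recommendations: hits near the top of the list count more than hits further down, and the score is divided by the best achievable score. A minimal sketch with binary relevance (a repo is relevant if the user actually starred it) follows. The project's evaluator may differ in details, and the function name is hypothetical.

```python
import math


def ndcg_at_k(recommended, relevant, k=30):
    """NDCG@k with binary relevance for a single user."""
    relevant = set(relevant)
    # Each hit contributes 1 / log2(position + 1), positions starting at 1.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Ideal ranking: all relevant items placed at the very top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_at_k(["x", "a"], ["a"], k=2))
```

The per-user scores are typically averaged over all test users to get a single number like the one above.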

Build the User Profile for Feature Engineering

See UserProfileBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.UserProfileBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
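
For a sense of what a user profile can contain, here is a small sketch that aggregates a user's starred repos into a few features. The feature names (`starred_count`, `top_language`, `top_language_ratio`) and the input shape are purely illustrative assumptions, not the columns produced by UserProfileBuilder.scala.

```python
from collections import Counter


def build_user_profile(starred_repos):
    """Summarize a user's starred repos into a flat feature dict."""
    languages = Counter(
        repo["language"] for repo in starred_repos if repo.get("language")
    )
    total = len(starred_repos)
    top_language, top_count = (
        languages.most_common(1)[0] if languages else (None, 0)
    )
    return {
        "starred_count": total,
        "top_language": top_language,
        "top_language_ratio": top_count / total if total else 0.0,
    }


# Hypothetical starred repos for one user.
repos = [
    {"language": "Scala"},
    {"language": "Python"},
    {"language": "Scala"},
    {"language": None},
]
print(build_user_profile(repos))
```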

Build the Item Profile for Feature Engineering

See RepoProfileBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.RepoProfileBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Train an ALS Model for Candidate Generation

See ALSRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.ALSRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.05209047292612741
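
ALS factorizes the user-repo starring matrix into latent factor vectors, and a user's predicted preference for a repo is the dot product of the two vectors. The sketch below shows only that scoring step for candidate generation, assuming the factors have already been trained. The factor values and function names are made up for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_candidates(user_factors, item_factors, k=30, exclude=()):
    """Score every repo for one user and keep the k highest-scoring ones."""
    excluded = set(exclude)
    scored = [
        (repo, dot(user_factors, factors))
        for repo, factors in item_factors.items()
        if repo not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional factors.
item_factors = {
    "spark": [0.9, 0.1],
    "django": [0.2, 0.8],
    "flask": [0.1, 0.9],
}
user_factors = [1.0, 0.0]
print(top_k_candidates(user_factors, item_factors, k=2))
```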

Build a Content-based Recommender for Candidate Generation

Elasticsearch's More Like This query will do the trick.

$ (container) python manage.py sync_data_to_es

See ContentRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,org.apache.httpcomponents:httpclient:4.5.2,org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.ContentRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002559563451967487
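
As an illustration of what a More Like This request looks like, the sketch below builds the JSON body of a `more_like_this` query that asks for documents similar to one already-indexed repo. The index name, document type, and field names are assumptions, not the project's actual mapping. The `_type` key applies to older Elasticsearch versions such as 5.x.

```python
import json


def build_more_like_this_query(index, doc_type, repo_id, fields, size=30):
    """Build a search body asking for documents similar to one indexed repo."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # Reference an existing document instead of passing raw text.
                "like": [
                    {"_index": index, "_type": doc_type, "_id": str(repo_id)}
                ],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


# Hypothetical index, type, id, and fields.
body = build_more_like_this_query(
    "repo", "repo_info", 42, ["description", "full_name"]
)
print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint, and the hits returned become content-based candidates.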

Train a Word2Vec Model for Text Vectorization

See Word2VecCorpusBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.Word2VecCorpusBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
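
Once word vectors are trained, a common way to vectorize a piece of text is to average the vectors of its words. The sketch below shows that step with made-up 2-dimensional vectors. Skipping out-of-vocabulary tokens is a choice made here for simplicity and may not match how the project or Spark handles them.

```python
def average_word_vectors(tokens, word_vectors, dim):
    """Represent a tokenized text as the mean of its known word vectors."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # out-of-vocabulary token, ignored in this sketch
        total = [t + v for t, v in zip(total, vector)]
        known += 1
    if known == 0:
        return total  # all zeros when no token is in the vocabulary
    return [t / known for t in total]


# Hypothetical word vectors.
word_vectors = {"spark": [1.0, 0.0], "cluster": [0.0, 1.0]}
print(average_word_vectors(["spark", "cluster", "unknown"], word_vectors, 2))
```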

Train a Logistic Regression Model for Ranking

See LogisticRegressionRanker.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.LogisticRegressionRanker \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.021114356461615493
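
In the ranking stage, a logistic regression model turns each candidate's feature vector into a probability that the user will star the repo, and the candidates are sorted by that probability. A minimal sketch of this scoring step, with hypothetical weights and features rather than anything learned by LogisticRegressionRanker.scala:

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidates, weights, bias=0.0, k=30):
    """Sort candidate repos by predicted probability, highest first."""
    scored = []
    for repo, features in candidates.items():
        z = bias + sum(w * x for w, x in zip(weights, features))
        scored.append((repo, sigmoid(z)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical candidates with two features each.
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
weights = [2.0, -1.0]
print(rank_candidates(candidates, weights))
```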

TODO

Build a recommender system with Spark: Factorization Machine
Build a recommender system with Spark: GBDT for Feature Learning
Build a recommender system with Spark: Item2Vec
Build a recommender system with Spark: PageRank and GraphX
Build a recommender system with Spark: XGBoost

