aozora analyzer

Scrape Aozora Bunko pages, parse Aozora Bunko data, and analyze Aozora Bunko novels.

aozora_parser

This script parses the HTML novel data in the aozorabunko repository.

First, clone the following repository.

git clone https://github.com/aozorabunko/aozorabunko

Next, move the cards directory to the root of this project.

mv ./aozorabunko/cards ./

Convert the file encoding from Shift-JIS to UTF-8.

find cards -name '*.html' -exec nkf -w --overwrite {} \;
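
If nkf is not installed, the same conversion can be approximated in Python. The following is a minimal sketch, not part of this repository: the function names `convert_to_utf8` and `convert_tree` are hypothetical, and it assumes every file is either Shift-JIS or already UTF-8. Files that are already valid UTF-8 are skipped, so running it twice should be harmless.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "shift_jis") -> bool:
    """Rewrite one file as UTF-8 in place. Return True only if it was rewritten."""
    raw = path.read_bytes()

    # Skip files that are already valid UTF-8 (this includes plain ASCII).
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        pass

    # Decode with the assumed source encoding; leave the file alone on failure.
    try:
        text = raw.decode(source_encoding)
    except UnicodeDecodeError:
        return False

    path.write_bytes(text.encode("utf-8"))
    return True


def convert_tree(root: Path, source_encoding: str = "shift_jis") -> int:
    """Convert every *.html file under root and return how many were rewritten."""
    converted = 0
    for html_path in root.rglob("*.html"):
        if convert_to_utf8(html_path, source_encoding):
            converted += 1
    return converted
```

Calling `convert_tree(Path("cards"))` would mirror the find command above. If some files fail to decode, passing `source_encoding="cp932"` may help, since that codec accepts a number of vendor extension characters that strict Shift-JIS rejects.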

Parse the novel HTML files.

python aozora_parser.py
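
What aozora_parser.py does internally is not described here. As an illustration only, the sketch below shows one way the body text of a converted file could be extracted with the standard library. It assumes the novel body sits inside `<div class="main_text">` and that readings are marked up with `<ruby>`, `<rt>` and `<rp>` tags, as is common in Aozora Bunko XHTML files. The names `MainTextExtractor` and `extract_main_text` are hypothetical.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text inside <div class="main_text">, dropping ruby readings."""

    SKIPPED_TAGS = {"rt", "rp"}  # ruby readings and their fallback parentheses

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.div_depth = 0   # <div> nesting depth inside main_text; 0 means outside
        self.skip_depth = 0  # greater than 0 while inside <rt> or <rp>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif "main_text" in (dict(attrs).get("class") or "").split():
                self.div_depth = 1
        elif self.div_depth > 0:
            if tag in self.SKIPPED_TAGS:
                self.skip_depth += 1
            elif tag == "br":
                self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only inside main_text and outside ruby readings.
        if self.div_depth > 0 and self.skip_depth == 0:
            self.chunks.append(data)


def extract_main_text(html: str) -> str:
    """Return the plain text of the main_text block of one HTML document."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

A file could then be processed with `extract_main_text(path.read_text(encoding="utf-8"))`. Dropping the `<rt>` readings keeps only the base text, which is usually what a later tokenization or vectorization step would want.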
