oskar-j/awesome-text-ml

Awesome software for Text ML

A curated list of awesome ML frameworks and text embeddings, focused on state-of-the-art (SOTA) libraries that are actively maintained on GitHub.

Frameworks and libraries

🐍 Python

Text processing

  • HanLP - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface. https://bbs.hankcs.com/

  • flair - A powerful library for applying state-of-the-art natural language processing (NLP) models, including named entity recognition (NER), part-of-speech (PoS) tagging, sense disambiguation and classification, with special support for biomedical data.

  • sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.

  • stanza - Official Stanford NLP Python Library for Many Human Languages. https://stanfordnlp.github.io/stanza/

Pipelines / block-programming

Distributed computing

Machine Learning

  • sklearn - Scikit-learn is a Python module for machine learning built on top of SciPy, including tools for text vectorization and vector space compression (dimensionality reduction). https://scikit-learn.org/stable/

  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. https://radimrehurek.com/gensim/

  • nlpaug - Data augmentation for NLP in your machine learning projects.

  • AugLy - A data augmentation library from Facebook Research for audio, image, text, and video.

Deep Learning

Natural Language Understanding

Text mining

  • dedupe - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.

Visualizations

  • Scattertext - Beautiful visualizations of how language differs among document types.

Big language models

  • BIG-bench - Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.

C++

Text processing

Currently empty 🪹

Knowledge 📚

Learning 101

  • Virgilio - Virgilio is an open-source initiative that aims to mentor and guide anyone in the world of Data Science.

Multiple languages

Python (and Python Notebooks)

  • practicalAI - A practical approach to machine learning to enable everyone to learn, explore and build. https://practicalai.me

  • nlp-recipes - Comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.

No longer maintained
