These code samples illustrate the usage of Lemmatizing and Sentiment Analysis SDKs for the Russian language.
To start using these technologies in your projects, you need to acquire a license. Get in touch at [email protected]
Sentiment Analysis technology can be consumed as an API on rapidapi.com:
https://rapidapi.com/insider-insider-default/api/russiansentimentanalyzer
Looking for a Chinese sentiment analyzer? Try out the Fuxi API: https://rapidapi.com/insider-insider-default/api/fuxiapi
Documentation in PDF form is available in Russian and in English.
Lemmatization consists of deriving the lemma and a POS tag for a given surface form (word). Because Russian is highly inflectional, it is very important to derive the word's lemma and use it instead of the stem, which is a cruder way of normalizing Russian text.
The application area of the lemmatizer is very wide:
- information retrieval (we have a token filter for Lucene / Solr / Elasticsearch; contact us if you need one)
- sentiment analysis (read on if you are interested in this)
- machine translation: to avoid sparsity in the space of word forms, one can lemmatize the words before translating
- your project / research
The dictionary contains on the order of 100k lemmas, which translates to several million word forms, including the grammatical cases as well as polysemous words (words with multiple meanings, homonyms).
For each word, the lemmatizer returns its POS tag. There can be several POS tags for a given word.
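To make the idea of several readings per word concrete, here is a minimal sketch of a lookup that returns every (lemma, POS) pair for a surface form. It is not the SDK: the `Analysis` type, the `analyze` function, the tag names `VERB` / `NOUN`, and the two-entry toy dictionary are all assumptions made for illustration.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One possible reading of a surface form (hypothetical type, not the SDK's)."""
    lemma: str
    pos: str


# Hand-made toy entries; the real dictionary is far larger and its tag set may differ.
TOY_DICTIONARY: dict[str, list[Analysis]] = {
    # "стали" is ambiguous: a past-tense form of "стать" or a form of the noun "сталь".
    "стали": [Analysis("стать", "VERB"), Analysis("сталь", "NOUN")],
    "книги": [Analysis("книга", "NOUN")],
}


def analyze(word: str) -> list[Analysis]:
    """Return every (lemma, POS) reading known for the word, or an empty list."""
    return TOY_DICTIONARY.get(word.lower(), [])


for reading in analyze("Стали"):
    print(reading.lemma, reading.pos)
```

A caller that needs a single lemma has to pick among the readings, for example by using the surrounding context.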
If you disagree with the lemma and POS tag predicted for a particular word, you can override this behaviour in your personal user dictionary. You do so by linking the word to an existing word whose grammatical features are the closest to those of your target. For instance, assuming the lemmatizer does not recognize the word инет (social media slang for Internet), you define it via the linked word Интернет (Internet):
инет\tинтернет
(\t denotes the tab character)
Documentation in PDF form is available in Russian.
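As a rough illustration of the entry format shown above, the following sketch reads tab-separated entries into a mapping from the new word to its linked word. The function name `parse_user_dictionary` is hypothetical, and the handling of blank lines and malformed entries is an assumption; the SDK may treat them differently.

```python
def parse_user_dictionary(text: str) -> dict[str, str]:
    """Parse 'word<TAB>linked word' lines into a {word: linked_word} mapping."""
    links: dict[str, str] = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        parts = [part.strip() for part in line.split("\t")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(
                f"line {line_number}: expected 'word<TAB>linked word'"
            )
        word, linked_word = parts
        links[word] = linked_word
    return links


sample = "инет\tинтернет\n"
print(parse_user_dictionary(sample))
```

Validating the file before handing it to the lemmatizer helps catch entries that were typed with spaces instead of a tab.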
The system returns one of the following labels for a given text (or sentence): NEUTRAL, POSITIVE, NEGATIVE.
Most of the time, especially when monitoring a brand / person / company in social or news media, it is important to know the sentiment directed at that particular object. In the following example:
I like Phone1, but Phone2 is ugly.
we expect to get the POSITIVE label for the object Phone1, and the NEGATIVE label for the object Phone2.
Because an object can be referred to using different words or word sequences (like "Android" or "Droid" etc), the system supports describing the target object with an array of object synonyms. The first object synonym to be found in the given text will trigger the sentiment detection algorithm.
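The synonym lookup can be sketched as follows. This is a simplified illustration, not the SDK's algorithm: the function `find_first_synonym` is hypothetical, "first" is interpreted here as the earliest position in the text (an assumption), and matching is exact and case-insensitive, so inflected Russian forms would not be found without lemmatizing first.

```python
import re
from typing import Optional


def find_first_synonym(text: str, synonyms: list[str]) -> Optional[tuple[str, int]]:
    """Return (synonym, start offset) of the earliest whole-word match, or None."""
    best: Optional[tuple[str, int]] = None
    for synonym in synonyms:
        # Lookarounds keep "Droid" from matching inside "Android".
        pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match and (best is None or match.start() < best[1]):
            best = (synonym, match.start())
    return best


text = "I like my Droid, but the Android updates are slow."
print(find_first_synonym(text, ["Android", "Droid"]))
```

In this sketch the order of the synonyms array does not matter; only the position of the match in the text does.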
Quality can be tuned by overriding existing sentiment words or introducing new ones in the user polarity dictionaries.
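The effect of a user polarity dictionary can be illustrated with a toy word-counting classifier. Everything here is an assumption made for the example: the names `BASE_POLARITY`, `merge_polarity` and `label`, the integer scores, and the scoring rule itself. The actual system is not described in this document and is certainly more elaborate.

```python
# Hypothetical built-in polarity entries: +1 positive, -1 negative.
BASE_POLARITY = {"like": 1, "ugly": -1}


def merge_polarity(base: dict[str, int], user: dict[str, int]) -> dict[str, int]:
    """User entries override base entries and may introduce new words."""
    merged = dict(base)
    merged.update(user)
    return merged


def label(text: str, polarity: dict[str, int]) -> str:
    """Sum word polarities and map the total to one of the three labels."""
    tokens = (token.strip(".,!?").lower() for token in text.split())
    score = sum(polarity.get(token, 0) for token in tokens)
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"


user_polarity = {"sick": 1}  # slang used approvingly in this toy domain
text = "This phone is sick!"
print(label(text, BASE_POLARITY))
print(label(text, merge_polarity(BASE_POLARITY, user_polarity)))
```

Without the user entry the sentence contains no known sentiment word; with it, the slang term contributes a positive score.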
A system for topical grouping of unstructured content. It scales to large collections: you can generate topics from your text silos on datasets as large as millions of texts. It supports multiple languages.
Access / subscribe to the API here: https://rapidapi.com/insider-insider-default/api/doctop