Describe the bug
A bit tricky bug to describe. There are two underlying issues:
To Reproduce
Say we have the following documents:
When computing TF-IDF for "sleep", the IDF is 0. There are three ways of computing IDF:
math.log10(number_of_docs / number_of_docs_with_word) (how we do it)
math.log10(1 + number_of_docs / number_of_docs_with_word) (how we do it with Smooth IDF)
math.log10(number_of_docs / (number_of_docs_with_word + 1)) (how it is recommended)
How scikit does it: idf(t) = log [ n / df(t) ] + 1, or idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1 if smooth = True.
To reproduce: Create Corpus (with above docs). Bow (TF-IDF). Data Table.
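For comparison, here is a minimal sketch of the IDF variants listed above, evaluated for a word that occurs in every document. The function names and the counts (3 documents, all containing the word) are assumptions made for illustration; they are not the documents from this report and not the add-on's actual code. The scikit-style variant uses the natural logarithm, which is my understanding of what scikit-learn does.

```python
import math


def idf_plain(n_docs, df):
    # log10(n / df): zero when the word occurs in every document
    return math.log10(n_docs / df)


def idf_one_plus(n_docs, df):
    # log10(1 + n / df): always positive
    return math.log10(1 + n_docs / df)


def idf_df_plus_one(n_docs, df):
    # log10(n / (df + 1)): negative when the word occurs in every document
    return math.log10(n_docs / (df + 1))


def idf_scikit_style(n_docs, df, smooth=True):
    # Formulas as quoted in this report; natural log is an assumption here.
    if smooth:
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1


# Hypothetical counts: 3 documents, the word appears in all 3.
n_docs, df = 3, 3
for name, fn in [
    ("plain", idf_plain),
    ("one_plus", idf_one_plus),
    ("df_plus_one", idf_df_plus_one),
    ("scikit_style", idf_scikit_style),
]:
    print(f"{name}: {fn(n_docs, df):.4f}")
```

With these counts the plain variant gives exactly 0, so TF-IDF for that word is 0 regardless of term frequency. The `df + 1` variant goes negative, and the scikit-style variant never drops below 1.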
Expected behavior
Orange version: 3.37.0
Text add-on version: 1.7.0