Write a routine that takes a set of email body texts and assigns each one a spam probability based on its similarity to the other emails in the set. The more similar an email is to the others, the more likely it is spam.
Compare the similarity of texts using the w-shingling algorithm.
Information about the algorithm: http://en.wikipedia.org/wiki/W-shingling
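
As a rough illustration of w-shingling, the sketch below splits a text into overlapping sequences of w words and compares two texts by the share of shingles they have in common (Jaccard resemblance). The function names `tokenize`, `shingles`, and `resemblance` are hypothetical and are not taken from this project's code; the tokenization rule and the handling of texts with no shingles are assumptions as well.

```javascript
// Hypothetical helpers for illustration; not this project's actual code.

// Split text into lowercase word tokens.
// Assumption: anything outside [A-Za-z0-9_] separates words.
function tokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// Build the set of unique w-word shingles (contiguous word sequences).
function shingles(text, w = 4) {
  const words = tokenize(text);
  const result = new Set();
  for (let i = 0; i + w <= words.length; i++) {
    result.add(words.slice(i, i + w).join(' '));
  }
  return result;
}

// Jaccard resemblance |A ∩ B| / |A ∪ B|, a value in [0, 1].
// Assumption: two texts with no shingles at all score 0.
function resemblance(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const s of a) {
    if (b.has(s)) shared++;
  }
  return shared / (a.size + b.size - shared);
}
```

For example, `resemblance(shingles(textA), shingles(textB))` gives 1 for two identical texts that are at least w words long, and 0 for texts that share no shingle.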
Help: node index.js --help
CLI: node index.js -p ./testFiles/
The CLI command has 4 options:
- Path to the directory with texts
- Similarity index in the range [0..1], 0.5 by default. It sets how similar two texts must be to be marked as the same; 1 means a full copy
- Minimum number of duplicates needed to mark a text as not unique, 3 by default
- Length of one shingle in words, 4 by default, minimum 3
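
One way the similarity index and the minimum number of duplicates could work together is sketched below. This is an assumption about the logic, not the project's actual implementation: the names `jaccard` and `flagDuplicates` and the shape of the result are hypothetical. Each text is represented by its set of shingles; a text counts as not unique when enough other texts reach the similarity threshold.

```javascript
// Hypothetical sketch; the real tool may combine these options differently.

// Jaccard resemblance of two shingle sets, in [0, 1].
function jaccard(a, b) {
  let shared = 0;
  for (const s of a) {
    if (b.has(s)) shared++;
  }
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// For each shingle set, count the other sets whose resemblance reaches
// `similarity`, and flag the text when that count reaches `minDuplicates`.
// The defaults mirror the documented defaults (0.5 and 3).
function flagDuplicates(shingleSets, similarity = 0.5, minDuplicates = 3) {
  return shingleSets.map((current, i) => {
    let duplicates = 0;
    shingleSets.forEach((other, j) => {
      if (i !== j && jaccard(current, other) >= similarity) duplicates++;
    });
    return { index: i, duplicates, notUnique: duplicates >= minDuplicates };
  });
}
```

With the defaults, a text is flagged only when at least three other texts in the set share half or more of their combined shingles with it.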
To run the tests: npm run test
There are 4 scenarios for quick testing: npm run one, npm run two, npm run three, npm run four