- Implement a program that compares two files for similarities.
- Implement a website that highlights similarities across files, a la the below.
`helpers.py`

- `lines`: Implement `lines` in such a way that, given two strings, `a` and `b`, it returns a `list` of the lines that are, identically, in both `a` and `b`. The `list` should not contain any duplicates. Assume that lines in `a` and `b` will be separated by `\n`, but the strings in the returned `list` should not end in `\n`. If both `a` and `b` contain one or more blank lines (i.e., a `\n` immediately preceded by no other characters), the returned `list` should include an empty string (i.e., `""`).
- `sentences`: Implement `sentences` in such a way that, given two strings, `a` and `b`, it returns a `list` of the unique English sentences that are, identically, present in both `a` and `b`. The `list` should not contain any duplicates. Use `sent_tokenize` from the Natural Language Toolkit to "tokenize" (i.e., separate) each string into a `list` of sentences. It can be imported with: `from nltk.tokenize import sent_tokenize`. Per its documentation, `sent_tokenize`, given a `str` as input, returns a `list` of English sentences therein. It assumes that its input is indeed English text (and not, e.g., code, which might coincidentally have periods too).
- `substrings`: Implement `substrings` in such a way that, given two strings, `a` and `b`, and an integer, `n`, it returns a `list` of all substrings of length `n` that are, identically, present in both `a` and `b`. The `list` should not contain any duplicates. Recall that a substring of length `n` of some string is just a sequence of `n` consecutive characters from that string. For instance, if `n` is `2` and the string is `Yale`, there are three possible substrings of length `2`: `Ya`, `al`, and `le`. Meanwhile, if `n` is `1` and the string is `Harvard`, there are seven possible substrings of length `1`: `H`, `a`, `r`, `v`, `a`, `r`, and `d`. But once we eliminate duplicates, there are only five unique substrings: `H`, `a`, `r`, `v`, and `d`.
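The three helpers described above can be sketched with Python sets, which discard duplicates and make "present in both" a simple intersection. This is only one possible approach, not the reference solution. It assumes that the order of the returned `list` does not matter, that `n` is a positive integer, and that `str.splitlines` is an acceptable way to split on `\n` (it also splits on other line boundaries such as `\r\n`). The inner helper `pieces` is a hypothetical name introduced for illustration.

```python
def lines(a, b):
    """Return the lines that appear in both a and b, without duplicates."""
    # splitlines() drops the trailing "\n" from each line and yields ""
    # for a blank line, so a blank line shared by both inputs shows up
    # as an empty string in the result.
    return list(set(a.splitlines()) & set(b.splitlines()))


def sentences(a, b):
    """Return the sentences that appear in both a and b, without duplicates."""
    # Imported here (rather than at the top) only so that the other two
    # helpers in this sketch run even where NLTK is not installed.
    # sent_tokenize also needs NLTK's Punkt tokenizer data to be available.
    from nltk.tokenize import sent_tokenize

    return list(set(sent_tokenize(a)) & set(sent_tokenize(b)))


def substrings(a, b, n):
    """Return the substrings of length n found in both a and b, without duplicates."""

    def pieces(s):
        # Every window of n consecutive characters; the range is empty
        # when s is shorter than n, so no substrings are produced.
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    return list(pieces(a) & pieces(b))
```

For example, `substrings("Harvard", "Harvard", 1)` yields the five unique characters `H`, `a`, `r`, `v`, and `d` in some order, matching the count given above.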
`templates/index.html`

Implement `templates/index.html` in such a way that it contains an HTML form via which a user can submit:
- a file called `file1`
- a file called `file2`
- a value of `lines`, `sentences`, or `substrings` for an `input` called `algorithm`
- a number called `length`
- a file called
Full instructions available here