This is a bag of tools I put together while organizing an evaluation of crawling. The evaluation needed to extract text from HTML, identify near-duplicates with a fingerprinting scheme, and count how many evaluation participants found each page. That data fed a capture-recapture analysis using R's Rcapture package, which I won't go into here.
There are two parts to this: a pile of short Python scripts for cleaning the JSON-formatted data, and a Java program that does document fingerprinting using MinHashing.
The code assumes the crawl data is JSON following a schema we used in the evaluation, which I will eventually copy here for completeness; for the time being, here's the two-penny tour. A file is a series of lines, each containing one JSON record corresponding to a crawl artifact. The essential fields are:
content_type
: This should be the Content-Type header from the response, and is used to decide how to treat binary vs. text content. Textual content is deduped using MinHash. Binary content is deduped by canonicalizing the URL (for historical reasons).

raw_content
: This is the raw content from the server.

url
: This is the URL for the raw_content.

team
: This is a string identifier that tells you this file came from this evaluation participant.
The files can be compressed with gzip.
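For concreteness, here is a minimal sketch (not one of the included scripts) of walking such a file; the field names follow the schema above, everything else is illustrative:

```python
# Minimal sketch: iterate over the records in one or more crawl files
# (optionally gzip-compressed) and print a few of the fields described above.
import gzip
import json
import sys

def open_crawl_file(path):
    """Open a crawl file, transparently handling gzip compression."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")

for path in sys.argv[1:]:
    with open_crawl_file(path) as f:
        for line in f:
            record = json.loads(line)  # one JSON record per line
            print(record["team"], record["content_type"], record["url"])
```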
To generate fingerprints for a bunch of such files, do

    mvn exec:java -Dexec.mainClass="gov.nist.crawleval.MinHash" -Dexec.args="$(echo *json.gz)" > hashes

The resulting hashes file has lines with three space-separated fields: fingerprint hash, team, URL.
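The fingerprinting itself lives in the Java code and I'm not reproducing its parameters here; as a rough Python sketch of the MinHash idea (shingle size and signature length are arbitrary choices for illustration, not what gov.nist.crawleval.MinHash uses):

```python
# Rough illustration of MinHash signatures; the shingle size and number of
# hash functions here are arbitrary, not the Java implementation's settings.
import hashlib

NUM_HASHES = 64  # assumed signature length
SHINGLE = 5      # assumed word-shingle size

def shingles(text, k=SHINGLE):
    """Return the set of overlapping k-word shingles of the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig
```

Two documents whose shingle sets overlap heavily will agree on most signature positions; how the Java code collapses a signature into the single fingerprint that appears in hashes is up to that implementation.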
The Python scripts are a hodgepodge of one-offs for cleaning and work scripts needed to run the evaluation. I'm including them for completeness; I don't have a better place to put them. The most useful for you, I'm guessing, are probably:
jcat.py
: Prettyprint a JSON crawl file.

hash2rcap.py
: Tabulate the hashes into the capture-recapture format that R wants (sketched below).

content-types.py
: Tabulate the content types in the input crawl file.
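For reference, the capture-recapture input is essentially a 0/1 capture-history matrix: one row per distinct page (fingerprint), one column per team. A hypothetical re-implementation of the tabulation might look like this (the real logic, including whatever column order and header R expects, is in hash2rcap.py):

```python
# Hypothetical sketch of the tabulation hash2rcap.py performs: turn the
# "fingerprint team url" lines into a 0/1 capture-history matrix with one
# row per fingerprint and one column per team.
import sys
from collections import defaultdict

captures = defaultdict(set)  # fingerprint -> set of teams that captured it
teams = set()

with open(sys.argv[1]) as f:  # e.g. the "hashes" file produced above
    for line in f:
        fingerprint, team, _url = line.split(None, 2)
        captures[fingerprint].add(team)
        teams.add(team)

teams = sorted(teams)
print("\t".join(teams))
for found_by in captures.values():
    print("\t".join("1" if t in found_by else "0" for t in teams))
```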
There is a script capture.R, which is not actually executable because R hates me, but it contains the code for computing site size estimates from the output of hash2rcap.py.