AOLIA-tools

This repository provides tools for working with the AOLIA corpus: a version of the documents from the AOL query log that can be scraped from The Internet Archive, representing documents close to how they appeared at the time the log was created.

Getting Started

Clone this repository and install dependencies:

git clone https://github.com/terrierteam/aolia-tools
cd aolia-tools
pip install -r requirements.txt

Downloading the Corpus

The downloader.py script downloads the AOLIA documents from The Internet Archive. Downloads run in parallel, and the full process takes about two days. The script automatically backs off when it detects rate limiting.

There are two ways you can run the download script. If you are using the ir-datasets package, you can simply run:

python downloader.py

This will automatically configure the script to work with ir-datasets.
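
Once the download completes, the documents should be accessible through the usual ir-datasets API. A minimal sketch, assuming the corpus is registered under the dataset ID aol-ia and exposes the standard docs_iter interface:

import ir_datasets

# Load the AOLIA document collection (dataset ID "aol-ia" assumed).
dataset = ir_datasets.load("aol-ia")

# Inspect a few of the downloaded documents.
for doc in dataset.docs_iter()[:3]:
    print(doc.doc_id, doc.title)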

If you do not want to use ir-datasets, you can specify the location of the aol.id2wb.tsv.gz file (downloadable here: https://macavaney.us/aol.id2wb.tsv.gz, MD5: afbf9b03e1a0fabc9f3fdd5105e6ae5a) using the --source argument, and the output directory for the downloaded files using the --path argument.

wget https://macavaney.us/aol.id2wb.tsv.gz
python downloader.py --source aol.id2wb.tsv.gz --path output_docs
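
If you want to verify the download, you can check the file against the MD5 listed above:

md5sum aol.id2wb.tsv.gz   # should print afbf9b03e1a0fabc9f3fdd5105e6ae5a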

The output directory will contain 16 files, split by the first character of the document IDs. Each file contains JSON-lines data compressed with lz4.
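
If you want to process these files directly, each shard can be read as lz4-compressed JSON lines. A minimal sketch in Python, assuming the lz4 package is installed; the shard filename below is hypothetical and should be replaced with one of the 16 files in your --path directory:

import json
import lz4.frame

# "output_docs/0" is a hypothetical shard name; substitute an actual
# file from the output directory.
with lz4.frame.open("output_docs/0", mode="rt") as f:
    for line in f:
        doc = json.loads(line)  # one JSON document per line
        print(doc)
        break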

For both settings, you can specify --parallel to change how many worker processes are used (default: 10), --backoff_threshold to change how many consecutive errors trigger a backoff (default: 10), and --backoff_duration to change how long (in seconds) a backoff waits before downloading resumes (default: 10). We found these settings to work well on our network.
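
For example, the following invocation spells out the default values explicitly:

python downloader.py --source aol.id2wb.tsv.gz --path output_docs --parallel 10 --backoff_threshold 10 --backoff_duration 10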

Building CARS Datasets

You can use the generate_cars.py script to generate input files that are usable by wasiahmad/context_attentive_ir, allowing you to run baselines like CARS, M-NSRF, and M-MatchTensor.

The script has two required arguments: --out_dir, which specifies the directory to which to save the dataset files, and --run, which specifies the gzipped TREC-formatted run file for all AOL queries. This file can be downloaded here (1.3GB, MD5: d464f3703384ddfca5c08ae4892c4400).

Right now, this script only works if you are using ir-datasets.

python generate_cars.py --out_dir path/to/context_attentive_ir/data/aolia --run path/to/aolia-title-bm25.run.partial.gz

Replacing CARS-Formatted Dataset Titles

To reproduce the Corpus=AOL17, Docs=AOLIA setting, we provide the replace_cars_titles.py script. This takes as input a CARS-formatted dataset file and outputs a new file that replaces the document texts with those from AOLIA.

Right now, this script only works if you are using ir-datasets.

python replace_cars_titles.py path/to/input/split.json path/to/output/split.json
