Skip to content

Latest commit

 

History

History
34 lines (20 loc) · 1.82 KB

File metadata and controls

34 lines (20 loc) · 1.82 KB

Archive.org collection transfer scripts

The provided Shell/Python scripts use the Internet Archive Python-/CL-interface to transfer selected filetypes from Internet Archive collections to a Hadoop cluster (HDFS).

The transfer happens in two steps:

  1. A list of all files to transfer will be created (create_filelist.py)
  2. While the files are being transfered, the script keeps track of the already transfered files to allow restarting the process anytime and continue where it stopped (download_files.py)

The downloaded files are being separated into different folders corresponding to their filetypes.

During transfer, a specified number of files will be downloaded into a local staging directory and copied to HDFS in a bunch (default 10, max_staging in download_files.py).

Usage

First of all, please install https://github.com/jjjake/internetarchive to have the required ia command available.

Next, please modify download.sh according to your needs to include your paths and required filetypes. download.sh calls the python scripts and should be used to start off the transfer process.

To be called with ./download.sh <COLLECTION_NAME>.
E.g., ./download.sh ArchiveIt-Collection-1234

Called by download.sh: ./create_filelist.py <COLLECTION_NAME> <GLOB_PATTERNS>.
E.g., ./create_filelist.py ArchiveIt-Collection-1234 *.cdx *.cdx.gz *.warc.gz

Called by download.sh: ./download_files.py <COLLECTION_NAME> <LOCAL_STAGING_PATH> <HDFS_PATH>
E.g., ./download_files.py ArchiveIt-Collection-1234 /mnt/ephemeral0/holzmann/tmp /user/holzmann/ia1