A C# implementation of wayback machine downloader. Download an entire archived website from the Internet Archive Wayback Machine. The files downloaded are the original ones not the Wayback Archive rewritten version.
For complete documentation you may want to consult the Wiki page here.
- Requirements
- Installation
- Basic Usage
- Advanced Usage
- Case Sensitive Parameter Names
- Downloading Snapshots for All Timestamps
- From Timestamp
- To Timestamp
- Limiting Between Two Timestamps
- Limiting The Number of Files to Download
- Exact URL
- Download Only Specific Files
- Excluding Specific Files
- Download All HTTP Status Codes
- Download Multiple Files at a Time
- Displaying the File List Without Downloading
- Log Files
- Considerations
- Contributing
- .NET Framework 4.0 or newer.
- For development use Visual Studio 2010 or newer. You can download the latest version of Visual Studio here. The Visual Studio Community Edition is free.
- This tool uses Command Line Parser 2.6.0 library.
- Download the ZIP file or clone this repository:
Then you can open the
mkdir [your-directory] cd [your-directory] git clone https://github.com/erlange/wbm-dl.git cd wbm-dl dir
.sln
and build the solution file with Visual Studio. - From Visual Studio, run this command from the Package Manager Console window:
PM> Install-Package CommandLineParser -Version 2.6.0
At the very basic, you should run wbm-dl
followed by the website name, for example http://yoursite.com
:
wbm-dl http://yoursite.com
or just
wbm-dl yoursite.com
Issuing the above command will download the website to the ./websites/yoursite.com
directory.
You must supply a valid URL address to download.
Some valid URL examples are shown below:
wbm-dl yoursite.com
wbm-dl http://yoursite.com
wbm-dl https://yoursite.com
The additional parameter list will display when run without any parameters:
wbm-dl (Wayback Machine Downloader)
http://erlange.github.com
-o, --out Output/destination directory
-f, --from From timestamp. Limits the archived result SINCE this timestamp.
Use 1 to 14 digit with the format: yyyyMMddhhmmss
If omitted, retrieves results since the earliest timestamp available.
-t, --to To timestamp. Limits the archived result UNTIL this timestamps.
Use 1 to 14 digit with the format: yyyyMMddhhmmss
If omitted, retrieves results until the latest timestamp available.
-l, --limit Limits the first N or the last N results. Negative number limits the last N results.
-a All timestamps. Retrieves snapshots for all timestamps.
-c, --count (Default: 1) Number of concurrent processes.
Can speed up the process but requires more memory.
-A, --All Retrieves snapshots for all HTTP status codes.
If omitted only retrieves the status code of 200
-e, --exact Downloads only the url provided and not the full site.
-O, --Only Restrict downloading to urls that match this filter.
-X, --eXclude Skip downloading of urls that match this filter.
-L, --list Displays only the list in a JSON format with the archived timestamps, does not download anything
--help Display this help screen.
--version Display version information.
The Wayback Machine Downloader uses case sensitive parameter names, such as -a
is different from -A
. Careful consideration should be taken when typing such parameter names.
-o, --out Output/destination directory
Optional. The -o
or --out
option specifies the directory in which you want the websites to be saved. A sub-directory called /websites
will be created under the specified directory.
wbm-dl yoursite.com -o c:/download
Will download to c:/download/websites
directory.
wbm-dl yoursite.com -o ./myFolder/web
Will download to [Current Directory]/myFolder/web/websites
directory.
By default, your files are archived in different snapshots for each timestamp. You can specify the -a
parameter to download all snapshot versions for each file.
The -a
parameter is not to be confused with -A
parameter, although they both can also be used in conjunction.
-a All timestamps. Retrieves snapshots for all timestamps.
Optional. The -a
parameter will download the file versions all timestamps. The timestamp of each snapshot will be used as a directory.
wbm-dl yoursite.com -o c:/download -a
Will download to the directory structure below:
c:/download/websites/yoursite.com/20180820202452/index.html
c:/download/websites/yoursite.com/20181019232937/index.html
c:/download/websites/yoursite.com/20190305194903/assets/logo.png
If this parameter is omitted the Wayback Machine Downloader will only download the latest snapshot version of each unique item.
-f, --from From timestamp.
Optional. You can limit the result by specifying the earliest timestamp in the yyyyMMddhhmmss format. This parameter is inclusive, in which the value is included to the result. The Wayback Machine Downloader will only fetch the snapshots since the timestamp specified.
wbm-dl yoursite.com -o c:/download -f 20171101210000
Will download only the snapshots since November 01, 2017 at 21:00:00
wbm-dl yoursite.com -o c:/download -f 2017
Will download only the snapshots since the year of 2017
wbm-dl yoursite.com -o c:/download -f 201707
Will download only the snapshots since July 2017
-t, --to To timestamp.
Optional. You can limit the result by specifying the latest timestamp in the yyyyMMddhhmmss format. This parameter is inclusive, in which the value is included to the result. The Wayback Machine Downloader will only fetch the snapshots until the timestamp specified.
wbm-dl yoursite.com -o c:/download -t 20180915220000
Will download only the snapshots until September 15, 2018 at 22:00:00
wbm-dl yoursite.com -o c:/download -t 2018
Will download only the snapshots until the year of 2018
wbm-dl yoursite.com -o c:/download -t 201804
Will download only the snapshots until April 2018.
You can combine both -f
and -t
parameters to limit the result between two timestamps. Since both parameters are inclusive, the from and to parameter values are included to the result.
wbm-dl yoursite.com -o c:/download -f 20171101210000 -t 20180915220000
Will download only the snapshots since November 01, 2017 21:00:00 until September 15, 2018 22:00:00.
wbm-dl yoursite.com -o c:/download -f 2017 -t 201804
Will download only the snapshots since 2017 until April 2018.
wbm-dl yoursite.com -o c:/download -f 2017 -t 2017
Will download only the snapshots during 2017.
-l, --limit Limits the first N or the last N results. Negative number limits the last N results.
Optional. You can limit the number of files to download by specifying the -l
parameter followed by a positive or negative integer value.
This -l
parameter is not to be confused with the -L
parameter. They both can be used in conjunction though.
wbm-dl yoursite.com -o c:/download -l 50
Will download only 50 files since the earliest timestamp. The earliest timestamp is included to the result.
wbm-dl yoursite.com -o c:/download -l -25
Will download only 25 files until the latest timestamp. The latest timestamp is included to the result.
-e, --exact Downloads only the url provided and not the full site.
Optional. Instead of downloading the entire websites you can use this -e
flag to download only the file you specify as the URL.
wbm-dl yoursite.com -o c:/download -e
Will download only the homepage html file of yoursite.com
-O, --Only
Optional. You can filter the download to only a specific condition, for example you only want to download files of certain types (e.g., .jpg, .pdf, .doc, etc). This parameter needs a string or a regex.
wbm-dl yoursite.com -o c:/download -O "^.*\.(jpg|gif|png|)$"
This will download only image files of .jpg, .gif and .png types.
wbm-dl yoursite.com -o c:/download -O "^.*\b(themes|green).*\b$"
This will download files containing the word themes
or green
in the path.
-X, --eXclude
Optional. In contrast with the -O
parameter, you can exclude specific files using -X
parameter. This parameter needs a string or a regex.
wbm-dl yoursite.com -o c:/download -X "^.*\.(jpg|gif|png|)$"
This will not download image files of .jpg, .gif and .png types.
wbm-dl yoursite.com -o c:/download -X "^.*\b(themes|green).*\b$"
This will exclude the files containing the word themes
or green
in the path.
-A, --All Retrieves snapshots for all HTTP status codes.
If omitted only retrieves the status code of 200
Optional. By default, the Wayback Machine Downloader will download the files responding only to the HTTP status code of 200 (HTTP status code for OK). This -A
flag will download responses with all HTTP status codes, such as 30x, 40x and 50x.
wbm-dl yoursite.com -o c:/download -A
-c, --count (Default: 1) Number of concurrent processes.
Can speed up the process but requires more memory.
Optional. You can speed up the download process significantly by specifying an (integer) number of concurrency with the -c
parameter.
wbm-dl yoursite.com -o c:/download -c 50
Will download maximum 50 files at a time.
-L, --list Displays only the list in a JSON format with the archived timestamps, does not download anything
Optional. This option will only display the file list in JSON format and save it to the /logs
directory. It won't download anything else.
wbm-dl yoursite.com -o c:/download -L
This will only display the file list on screen and save the list in the c:/download/logs
directory.
Upon completion, a /logs
directory containing a log file will be created under the /websites
directory.
The JSON-formatted log file contains completion status of each downloaded item. If errors occured the log files can further be examined to accommodate manual download with the source URL for each item.
This log file will not be generated when using the -L
or --list
flag.
The generated log filename will be yoursite.com.log.json
The JSON-formatted log file contains metadata as follows:
ErrorMsg
Contains the error message if error occured.Num
Line number.Original
Contains the original location of the item.Source
Contains the archived location of the item in the Internet Wayback Archive Machine. You can use the value for manually downloading.Status
Contains the HTTP status code. If flag-A
is omitted and no error occured the value will be200 (OK)
. If this value is empty an error might have occured. You can then consult theErrorMsg
to examine the error and use theSource
to manually download the individual file.Target
Contains full path in the output directory where the file is saved. If this value is empty an error might have occured. You can then consult theErrorMsg
to examine the error and use theSource
to manually download the individual file.Time
The time theSource
responds to the request. The time is inyyyyMMdd hh:mm:ss
.NET format and might not conform to the standard JSON datetime format.
Your archived website gets none but bigger over time. It can get so big with millions of files. Certain aspects must therefore come into considerations.
It is always advisable to limit the downloads each session with filtering options, including, but not limited to:
- Filtering by certain timestamps with
-f
or-t
options - Filtering by certain files with
-O
option - Do not download what you don't need with
-X
option - Minimize the number of simultaneous download by using small number to the
-c
option
It is a good ettiquete to crawl politely.
Avoid mass-scraping by overloading them with too many requests for too many big files as this will surely hurt the server.
If this occurs too often, they might take measures to block downloader tools such as this one, and in the long run, might lead to anti-scraping legal actions.
That said. So download wisely.
Windows has maximum of 248 characters on a directory path while a URL doesn't. This can lead to error due to this limitation and your files are not downloaded. In this case you can examine the log file and download manually from the source URL provided.
Contributions are welcome. Just pull an issue or pull request from GitHub.
Copyright © 2018 - [email protected]