Multithreaded-amazon-scraper

Description

This package lets you search for products on Amazon and scrape useful information about them (price, rating, number of customer reviews).

Requirements

Python 3
pip3

Dependencies

pip3 install -r requirements.txt

Usage

Clone this repo or download it as a zip archive.
Open a terminal (or cmd) in the download folder.
Run:

python3 example.py -w <word you want to search>

The step above creates a .json file (in the same directory as example.py) containing the products that were found.
For more help, run:

python3 example.py --help

Information fetched

Attribute name	Description
url	Product URL
title	Product title
price	Product price
rating	Product rating
review_count	Number of customer reviews
img_url	Image URL
bestseller	Whether the product is a best seller
prime	Whether the product is supported by Amazon Prime
asin	Product ASIN (Amazon Standard Identification Number)

Output Format

Output is provided as a JSON file. See products.json for an example, which was produced with the search word 'toaster'.
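
As a minimal sketch of consuming the output, the snippet below parses JSON text and filters on the attributes listed under Information fetched. It assumes the file holds a top-level list of objects and that rating is stored as a number or a numeric string; neither is confirmed here, so check products.json first. The helper name top_rated and the sample records are made up for illustration.

```python
import json


def top_rated(json_text, min_rating=4.0):
    """Return titles of products whose rating is at least min_rating."""
    products = json.loads(json_text)  # assumed: a top-level list of objects
    titles = []
    for product in products:
        try:
            rating = float(product.get("rating"))
        except (TypeError, ValueError):
            continue  # skip products with a missing or non-numeric rating
        if rating >= min_rating:
            titles.append(product.get("title"))
    return titles


# Made-up sample; real records come from the scraper's .json output.
sample = json.dumps([
    {"title": "Toaster A", "rating": "4.5", "price": "19.99"},
    {"title": "Toaster B", "rating": None, "price": "24.99"},
    {"title": "Toaster C", "rating": 3.9, "price": "14.99"},
])
print(top_rated(sample))
```

To try it on real output, pass the contents of the generated .json file (for example, the text read from products.json) instead of the sample string.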

Design Decisions

1. In scraper.py, the get_page_content method retries the request so that a valid connection to Amazon's servers can still be made even if a connection request was denied.
2. get_request returns None when requests.exceptions.ConnectionError occurs. The None ripples back through the calling functions so that the thread terminates normally instead of abruptly calling sys.exit(); the concern was that killing a thread that holds the GIL could lead to a deadlock.
3. get_page_content returns None if no valid page was found even after the retries, in addition to returning None when get_request itself returns None.
Decisions 2 and 3 were made because, in a multithreaded program, many threads work simultaneously, and 1 or 2 out of 10 or 20 threads may not get a valid response (see the check_page_validity and get_request functions for documentation and more). Only those threads are terminated safely, while the others continue to produce valid output.
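
The decisions above can be sketched roughly as follows. This is not the code in scraper.py: the function names match the ones mentioned above, but their signatures, the retry count, the validity check, and the worker and scrape_all helpers are assumptions. To keep the sketch runnable without a network or third-party packages, the page fetcher is passed in as a callable and the built-in ConnectionError stands in for requests.exceptions.ConnectionError.

```python
import threading

MAX_RETRIES = 3  # assumed value; the real retry count is not stated here


def get_request(fetch, url):
    """Return fetch(url), or None if the connection fails."""
    try:
        return fetch(url)
    except ConnectionError:
        # Stand-in for requests.exceptions.ConnectionError.
        return None


def check_page_validity(page):
    """Hypothetical check: treat any non-empty page as valid."""
    return bool(page)


def get_page_content(fetch, url, retries=MAX_RETRIES):
    """Return a valid page, or None if none could be fetched."""
    for _ in range(retries):
        page = get_request(fetch, url)
        if page is None:
            return None  # connection error: pass the None along
        if check_page_validity(page):
            return page
    return None  # no valid page even after retries


def worker(fetch, url, results, lock):
    page = get_page_content(fetch, url)
    if page is None:
        return  # this thread ends normally; the others keep working
    with lock:
        results[url] = page


def scrape_all(fetch, urls):
    """Fetch every URL in its own thread and collect the valid pages."""
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(fetch, url, results, lock))
        for url in urls
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_fetch(url):
    """Made-up fetcher used only to exercise the failure paths."""
    if url == "bad":
        raise ConnectionError("connection refused")
    if url == "empty":
        return ""  # never passes the validity check
    return "page:" + url


print(scrape_all(fake_fetch, ["a", "bad", "empty", "b"]))
```

In this sketch the two failing URLs simply leave no entry in the results, while the threads for the other URLs finish and report their pages.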

Performance Benchmark

On my network connection (results may vary depending on your connection speed)

Number of pages	Number of products	Time
1	22	2.657 sec
3	126	4.007 sec
7	390	8.094 sec
20	426	12.534 sec

Future Improvements

Write unit tests
Send requests through various different proxies
Items like books and DVDs may have multiple prices; extract all the prices and categorize them into a price dictionary
Add a better way to convert a list of objects into JSON
Handle special characters in the content scraped from Amazon
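
One possible direction for the last two items, sketched under assumptions: a dataclass whose field names follow the attribute table above, serialized with dataclasses.asdict and json.dumps. The Product class, the field types, and the products_to_json helper are hypothetical and are not taken from the package.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Product:
    # Field names follow the attribute table; the types are assumptions.
    url: str
    title: str
    price: Optional[str] = None
    rating: Optional[str] = None
    review_count: Optional[str] = None
    img_url: Optional[str] = None
    bestseller: bool = False
    prime: bool = False
    asin: Optional[str] = None


def products_to_json(products):
    """Serialize a list of Product objects to a JSON string."""
    # ensure_ascii=False keeps characters such as "é" or "™" readable
    # instead of escaping them as \uXXXX sequences.
    return json.dumps([asdict(p) for p in products], ensure_ascii=False, indent=2)


# Made-up product used only for illustration.
items = [Product(url="https://example.com/p/1", title="Café Toaster™", price="19.99")]
print(products_to_json(items))
```

When writing the result to disk, opening the file with encoding="utf-8" keeps the special characters intact across platforms.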

ankushduacodes/Multithreaded-amazon-scraper

Folders and files

Latest commit

History

Repository files navigation

Multithreaded-amazon-scraper

Description

Requirements

Dependencies

Usage

Information fetched

Output Format

Design Decisions

Performance Benchmark

Future Imporvements

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages