This package allows you to search for products on Amazon and scrape useful information about them (price, rating, number of reviews).
- Python 3
- pip3
`pip3 install -r requirements.txt`
- Clone this repo or download it as a zip.
- Open a terminal or cmd in the download folder.
- Run `python3 example.py -w <word you want to search>`.
- The above step will create a .json file (in the same directory as example.py) containing the products that were found.
- For more help, run `python3 example.py --help`.
| Attribute name | Description |
|---|---|
| url | Product URL |
| title | Product title |
| price | Product price |
| rating | Product rating |
| review_count | Number of customer reviews |
| img_url | Image URL |
| bestseller | Whether the product is a best seller |
| prime | Whether the product is supported by Amazon Prime |
| asin | Product ASIN (Amazon Standard Identification Number) |
Output is provided in the form of a JSON file; please refer to products.json as an example file, which was produced with the search word 'toaster'.
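As a quick illustration — assuming the output file is a JSON array of objects carrying the attributes listed above (this snippet is a sketch, not part of the package) — you can inspect it from Python like this:

```python
import json

# Load the scraper output; products.json is the example file mentioned above.
with open("products.json", encoding="utf-8") as f:
    products = json.load(f)

# Print a few fields from the first entries (keys follow the attribute table).
for product in products[:3]:
    print(product["title"], "-", product["price"], "-", product["rating"])
    print("  prime:", product["prime"], "| asin:", product["asin"])
```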
- In scraper.py, retries were added to the `get_page_content` method so a valid connection to Amazon's servers can still be made even if a connection request is denied.
- The `get_request` function returns `None` when a `requests.exceptions.ConnectionError` occurs; the `None` propagates down to the calling functions so the thread terminates normally, instead of abruptly calling `sys.exit()`, which would certainly kill the thread but can lead to a deadlock if the killed thread holds the GIL.
- The `get_page_content` function returns `None` if no valid page is found even after retries, in addition to returning `None` when it gets a `None` response from `get_request`.
- The second and third decisions were made with the multithreaded design in mind: many threads work simultaneously, and one or two out of 10 or 20 may not get a valid response (see `check_page_validity` and `get_request` for documentation and more). In that case, only those threads are terminated safely while the others keep working to produce valid output.
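As a rough sketch of the retry and `None`-propagation decisions above (illustrative only — the bodies below are not the package's actual implementation, and the status-code check stands in for `check_page_validity`):

```python
import requests

MAX_RETRIES = 3  # illustrative value

def get_request(url):
    """Return a response, or None on a connection error so callers can exit cleanly."""
    try:
        return requests.get(url, timeout=10)
    except requests.exceptions.ConnectionError:
        # Returning None lets the calling thread finish normally instead of
        # calling sys.exit(), which the notes above avoid for threads holding the GIL.
        return None

def get_page_content(url):
    """Retry a few times; return the page HTML, or None if no valid page was obtained."""
    for _ in range(MAX_RETRIES):
        response = get_request(url)
        if response is None:
            return None  # connection error: propagate None to the caller
        if response.status_code == 200 and response.text:
            return response.text  # valid page
    return None  # retries exhausted without a valid page
```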
Timings below were measured on my network connection (results may vary depending on your connection speed):
| Number of pages | Number of products | Time (sec) |
|---|---|---|
| 1 | 22 | 2.657 |
| 3 | 126 | 4.007 |
| 7 | 390 | 8.094 |
| 20 | 426 | 12.534 |
- Write unit tests
- Implement sending requests through various different proxies
- Items like books and DVDs may have multiple prices; extract all the prices and categorize them into a price dictionary
- Add a better way to convert the list of product objects into JSON (a possible approach is sketched after this list)
- Handle special characters in the content scraped from Amazon
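For the last two items, one possible approach is a custom `json.JSONEncoder` combined with `ensure_ascii=False`. This is only a sketch; the `Product` class here is a hypothetical stand-in for the scraper's actual product objects:

```python
import json

class Product:
    """Hypothetical stand-in for the scraper's product objects."""
    def __init__(self, url, title, price, rating, review_count,
                 img_url, bestseller, prime, asin):
        self.url = url
        self.title = title
        self.price = price
        self.rating = rating
        self.review_count = review_count
        self.img_url = img_url
        self.bestseller = bestseller
        self.prime = prime
        self.asin = asin

class ProductEncoder(json.JSONEncoder):
    """Serialize Product objects via their attribute dict instead of hand-built dicts."""
    def default(self, obj):
        if isinstance(obj, Product):
            return vars(obj)
        return super().default(obj)

def save_products(products, path="products.json"):
    # ensure_ascii=False writes special characters from the scraped content
    # as-is (UTF-8) instead of escaping them to \uXXXX sequences.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(products, f, cls=ProductEncoder, ensure_ascii=False, indent=2)
```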