- csv_files contains the processed and unprocessed CSV files.
- notebooks contains all the .ipynb files. The notebook used to preprocess the data can be found here.
- scraper contains the scraper.py file, which was used to scrape the data from Amazon.
The goal of this project is to gather information about Data Science related books from Amazon. There are a total of 1351 entries in the csv_files/amazon_data_science_books.csv file.
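As a quick sanity check of the entry count, here is a minimal sketch that uses only Python's standard library. It assumes the file has a single header row and that the script is run from the repository root. `count_entries` is a hypothetical helper and is not part of the repository.

```python
import csv
from pathlib import Path


def count_entries(csv_path):
    """Return the number of data rows (header excluded) in a CSV file."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        # DictReader consumes the first row as the header, so only data rows are counted.
        return sum(1 for _ in csv.DictReader(handle))


if __name__ == "__main__":
    path = Path("csv_files/amazon_data_science_books.csv")
    if path.exists():
        print(f"{path}: {count_entries(path)} entries")
    else:
        print(f"{path} not found; run this from the repository root")
```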
Later, we utilized the scraped data to explore the following statistics and correlations in a Tableau dashboard:
- A doughnut chart showing the number of books published by the top 15 publishers and the others.
- A bar chart of the top 15 publishers by the number of books published
- Average price of books by the top 15 publishers
- Price range of books
- Pages vs Price trend
- Top books by user reviews (rating 4.0 - 5.0)
- Average reviews of the top 15 publishers
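The dashboard itself is built in Tableau, but the aggregation behind the publisher charts can be sketched in plain Python. The example below is only an illustration under stated assumptions: the column names `publisher` and `price` are hypothetical (the real CSV headers may differ), and the sample rows are made up for demonstration, not taken from the dataset.

```python
from collections import Counter, defaultdict


def top_publishers(books, n=15):
    """Return the n most frequent publishers with their counts, plus the count of all other books."""
    counts = Counter(book["publisher"] for book in books)
    top = counts.most_common(n)
    others = sum(counts.values()) - sum(count for _, count in top)
    return top, others


def average_price_by_publisher(books, publishers):
    """Return the mean price per publisher, restricted to the given publisher names."""
    prices = defaultdict(list)
    for book in books:
        if book["publisher"] in publishers:
            prices[book["publisher"]].append(float(book["price"]))
    return {name: sum(values) / len(values) for name, values in prices.items()}


# Made-up rows; "publisher" and "price" are assumed column names.
sample = [
    {"publisher": "Publisher A", "price": "10.0"},
    {"publisher": "Publisher A", "price": "30.0"},
    {"publisher": "Publisher B", "price": "50.0"},
    {"publisher": "Publisher C", "price": "20.0"},
]

top, others = top_publishers(sample, n=2)
averages = average_price_by_publisher(sample, {name for name, _ in top})
print(top, others, averages)
```

The `others` count corresponds to the remaining slice of the doughnut chart, and `averages` to the average price chart.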
Findings and Observations from the Dashboard
Note: Try viewing the Dashboard in Full Screen mode.
- Among the 1324 books (after preprocessing the data), 948 are published by only 15 publishers.
- Packt has published the most books
- Springer has the highest average price
- As the page count increases, the price of a book tends to increase.
- Most books are priced between 14.00 and 60.00 USD
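The last two observations could be checked outside Tableau with a few lines of Python. This is a rough sketch, not the method used for the dashboard: the `pages` and `prices` lists are made-up illustrative values, and the Pearson coefficient is just one possible way to quantify the pages vs price trend.

```python
from math import sqrt


def share_in_range(prices, low=14.0, high=60.0):
    """Fraction of prices that fall within [low, high]."""
    if not prices:
        return 0.0
    return sum(low <= price <= high for price in prices) / len(prices)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative values only, not taken from the scraped data.
pages = [100, 200, 300, 400]
prices = [10.0, 20.0, 30.0, 70.0]

print("share in 14-60 USD:", share_in_range(prices))
print("pages/price correlation:", round(pearson(pages, prices), 3))
```

A coefficient close to 1 would support the observation that longer books tend to cost more; a value near 0 would not.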
You can visit the public dashboard here
First look at the dashboard
Also, try clicking the bars on the bar plots, and see the changes.
- Clone the repo
git clone https://github.com/Tasfiq-K/amazon-data-science-books-analysis.git
- Initialize and activate a virtual environment
If you are running Python 3.4+, you can use the venv module baked into Python:
python -m venv <directory name>
For example, if you name your directory 'venv', run this command:
python -m venv venv
To activate the virtual environment, run:
On Windows
# In cmd.exe
venv\Scripts\activate.bat
# In PowerShell
venv\Scripts\Activate.ps1
On Linux or macOS
$ source venv/bin/activate
- Install dependencies
pip install -r requirements.txt
- Download Webdriver
Download a web driver of your choice. I used geckodriver to drive the Firefox browser. You can download it from here -
- Run the scraper
python scraper.py --geckodriver_path <path_to_geckodriver>
- You will get a file with the following name
amazon_data_science_books.csv
containing all the required fields and data. Alternatively, check the scraped data here
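For readers curious how a flag like `--geckodriver_path` might be handled, here is a minimal sketch using Python's standard `argparse` module. It is an assumption about the interface, not the actual contents of scraper.py: `build_parser` and `parse_args` are hypothetical helpers, and the `drivers/geckodriver` path is a placeholder.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser that accepts the --geckodriver_path flag (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch of the scraper's command-line interface."
    )
    parser.add_argument(
        "--geckodriver_path",
        required=True,
        type=Path,
        help="Path to the geckodriver executable",
    )
    return parser


def parse_args(argv=None):
    """Parse arguments and fail early if the driver file does not exist."""
    args = build_parser().parse_args(argv)
    if not args.geckodriver_path.is_file():
        raise FileNotFoundError(f"geckodriver not found at {args.geckodriver_path}")
    return args


# Parse an explicit argument list instead of sys.argv; the path is a placeholder.
example = build_parser().parse_args(["--geckodriver_path", "drivers/geckodriver"])
print(example.geckodriver_path)
```

Checking that the file exists before launching the browser gives a clearer error than a failure deep inside the driver startup.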
Tableau Public View: https://public.tableau.com/app/profile/tasfiq.kamran/viz/AmazonDataScienceBooksDashboard/AmazonDataScienceBooks