Appstore scraping

While working on a project to raise awareness of air quality, I wanted to know what the main apps in this field look like and what feedback their users give. To get access to a large number of app store reviews, web scraping is the way to go! After some research, Scrapy (https://doc.scrapy.org/en/latest/index.html) looked like the right tool to scrape the data needed for my analysis. Scrapy allows for gentle scraping of the web by creating spiders and sending them out to crawl the app stores.


Take-aways from this project

Learned how to use Scrapy:

  • How to navigate HTML responses using CSS selectors - cf. reviews_to_dict.py
    • How to deal with elements carrying multiple CSS classes to access the right level of data
  • How to clean Unicode text containing '\r\n\t' and leading and trailing whitespace - cf. reviews_to_dict.py
    • Used the .translate() method with the following dictionary: trans_table = {ord(c): None for c in u'\r\n\t'}
    • Used the .strip() method to get rid of leading and trailing whitespace
  • Understood dynamic requests: when a URL contains a '#', the part after that symbol is never sent to the server; the client (web browser) deals with it. I therefore used the browser's developer tools to identify the "real" URL I needed for scraping. Learned about cURL!
  • How to build a spider where the requests have headers and deal with responses in JSON format - cf. with_headers.py
  • How to configure the settings of my spiders to ensure gentle scraping - cf. settings.py
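
The text clean-up and the '#' behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the code from reviews_to_dict.py: the helper names `clean_text` and `url_sent_to_server`, the sample review text and the example.com URL are all assumptions made for the example.

```python
from urllib.parse import urldefrag

# Same table as in the notes above: mapping a character to None makes
# str.translate() delete it.
TRANS_TABLE = {ord(c): None for c in "\r\n\t"}


def clean_text(raw: str) -> str:
    """Drop carriage returns, newlines and tabs, then trim both ends."""
    return raw.translate(TRANS_TABLE).strip()


def url_sent_to_server(url: str) -> str:
    """Return the URL without its '#fragment', which stays in the browser."""
    return urldefrag(url).url


# Hypothetical inputs, for illustration only.
print(clean_text("\r\n\t  Great app,\r\n very accurate!  \t"))
print(url_sent_to_server("https://example.com/app/id123#see-all/reviews"))
```

Deleting the control characters first and stripping afterwards matters here: whitespace that was hidden behind a leading tab or newline only becomes "leading" once those characters are gone.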

Built a spider to scrape specific app features on the appstore related to air quality - using SelectorList in CSS

  • Identify relevant apps
    • (so far, couldn't find an app search menu outside the appstore app...)
    • Not sure this could be automated: after reading robots.txt for the Google store and the App Store, I found "Disallow: /store/search*" and "Disallow: /work/search". Need to check on that!
    • Read about factors that most affect app ranking in the appstore (https://www.mobiloud.com/blog/factors-really-impact-app-store-ranking/): app's title, targeted keywords, number of downloads, and user ratings. I'm surprised that usage is not part of these factors.
  • Build a spider to collect app features and description across a range of apps - cf. app_features.py
    • Using a list of landing pages
    • Handling exceptions, as some features are not present for every app.
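
One way to check the robots.txt question raised above is Python's built-in `urllib.robotparser`. The sketch below is only an assumption about how such a check could look: the robots.txt excerpt is a made-up two-line sample built from the "/work/search" rule quoted above, the example.com URLs are placeholders, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt; the real files are longer and should be read from
# each store's own domain.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /work/search
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Tell whether the given rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/work/search?q=air"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/app/id123"))
```

As far as I can tell, this parser matches rules by path prefix and does not expand `*` wildcards, so a rule such as "Disallow: /store/search*" would probably need a separate check.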

Built a spider to scrape all the app reviews on the appstore for a given country - using dynamic requests with headers and token (JSON response)

  • Built a spider to collect reviews for a specific app (app id and app name are required) - cf. app_reviews.py
    • Understood how reviews are loaded and how to reproduce this loading with a spider. Reviews are loaded 10 at a time when scrolling down in the web browser. Found a third URL, specific to reviews, which is the best one to use for review scraping!
    • Looped on the review pages to define the scraping requests.
    • Checked the best way to collect data from fields that do not exist for all items (e.g. response from the developer): used exception handling (KeyError).
    • Revised the app_review spider to build a loop that stops automatically when there are no more pages to scrape (the next request is launched within the parse function, with the parse function as its callback).
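
The paging and KeyError handling described above can be sketched outside Scrapy as a plain generator. Everything specific here is an assumption: the JSON field names ("reviews", "rating", "body", "developerResponse") are invented for the example and are not the store's real schema, and `fake_fetch` stands in for the HTTP request. In a real spider, the `while` loop would instead be a new request yielded from `parse` with `parse` as its callback.

```python
from typing import Callable, Iterator

REVIEWS_PER_PAGE = 10  # from the notes above: reviews load 10 at a time


def parse_review(raw: dict) -> dict:
    """Flatten one raw review; the developer response is optional."""
    item = {"rating": raw["rating"], "text": raw["body"]}
    try:
        item["developer_response"] = raw["developerResponse"]["body"]
    except KeyError:
        # Most reviews have no reply from the developer.
        item["developer_response"] = None
    return item


def iter_reviews(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Request page after page and stop at the first empty one."""
    offset = 0
    while True:
        reviews = fetch_page(offset).get("reviews", [])
        if not reviews:
            break  # nothing left to scrape
        for raw in reviews:
            yield parse_review(raw)
        offset += REVIEWS_PER_PAGE


def fake_fetch(offset: int) -> dict:
    """Hypothetical stand-in for the JSON endpoint, with two reviews."""
    data = [
        {"rating": 5, "body": "Useful"},
        {"rating": 2, "body": "Crashes",
         "developerResponse": {"body": "Fixed in 1.2"}},
    ]
    return {"reviews": data[offset:offset + REVIEWS_PER_PAGE]}


for review in iter_reviews(fake_fetch):
    print(review)
```

Stopping on the first empty page avoids having to know the total number of reviews in advance.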

Built a simple spider to scrape all the reviews for a specific product on Amazon's French website - using SelectorList in CSS

  • Built a "side" spider to scrape reviews for a given product on Amazon's French website - cf. amazon_reviews.py
    • Recursively follow the link to the next page.

Future Work