This repository has been archived by the owner on Sep 27, 2022. It is now read-only.

Leaflyer: Cannabis Data Scrapper

This repository holds all the code needed to run the automation robot that extracts strain-related information from Leafly.


Update: Leafly now uses an advanced mechanism for detecting web scraping (Cloudflare v2 and reCAPTCHA). This project is therefore no longer supported, as running it would now require bypassing that detection with external services.

If you are interested in the most recent Leafly data dump (20th September 2022), please contact [email protected].

Package Guidelines

Installation

Install the required dependencies using:

pip install -r requirements.txt

(Optional) Download the Data

We have already dumped all of Leafly's data and made it available in both .json and .csv formats. Note that some values may be missing, as Leafly's database is incomplete for lesser-known strains.

The dataset and additional information about it are available on Kaggle.
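
Once downloaded, the dump can be loaded with the standard library alone. The sketch below is a minimal example, assuming the JSON file holds a list of objects (one per strain) and the CSV file has a header row. The `load_strains` and `count_missing` helpers are hypothetical and not part of this repository.

```python
import csv
import json
from pathlib import Path


def load_strains(path):
    """Load strain records from a .json or .csv dump into a list of dicts."""
    path = Path(path)
    if path.suffix == ".json":
        # Assumption: the JSON dump is a list of objects, one per strain.
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    if path.suffix == ".csv":
        # Assumption: the CSV dump has a header row naming the columns.
        with path.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"Unsupported file type: {path.suffix}")


def count_missing(records):
    """Count, per field, how many records have an empty or missing value."""
    fields = {key for record in records for key in record}
    missing = {}
    for field in sorted(fields):
        missing[field] = sum(1 for r in records if r.get(field) in (None, ""))
    return missing
```

For instance, `count_missing(load_strains("strains.json"))` reports how many records lack each field, which helps gauge the gaps mentioned above (`strains.json` is a placeholder file name).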


Usage

Scrape the List of Strains

First, scrape the list of strains (as URLs), which the metadata extraction step takes as input. To do so, use the following script:

python scrap_strains_list.py -h

Note that -h prints the script's help message, which lists the available parameters.

Scrape Strains Metadata

With the list of strains in hand, it is now possible to extract JSON-like information from every URL. To do so, use the following script:

python scrap_strains_data.py -h
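
To illustrate what extracting JSON-like information from a page can look like, here is a minimal sketch using only the standard library. It assumes the page embeds its data in a `<script type="application/json">` tag, a common pattern that is not confirmed for Leafly here. The `JSONScriptParser` class, the `extract_embedded_json` helper, and the sample HTML are hypothetical and do not reproduce `scrap_strains_data.py`.

```python
import json
from html.parser import HTMLParser


class JSONScriptParser(HTMLParser):
    """Collect the text of every <script type="application/json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.payloads.append("".join(self._buffer))
            self._in_json_script = False


def extract_embedded_json(html):
    """Return the parsed JSON payloads embedded in an HTML document."""
    parser = JSONScriptParser()
    parser.feed(html)
    parser.close()
    return [json.loads(payload) for payload in parser.payloads]


# Hypothetical page content, used only to demonstrate the helper.
sample_html = """
<html><body>
<script type="application/json">{"strain": {"name": "Example", "effects": ["relaxed"]}}</script>
</body></html>
"""
print(extract_embedded_json(sample_html)[0]["strain"]["name"])
```

Parsing the embedded payload, when one exists, tends to be more robust than scraping individual HTML elements, since it does not depend on the page layout.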

Bash Script

Instead of invoking each script by hand, it is also possible to use the provided shell script as follows:

./pipeline.sh

This script runs every step of the automation process. The input arguments defined in the script can be changed as needed.
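
The same two-step flow can be sketched in Python for readers who prefer not to use the shell. This is an illustration only: the contents of `pipeline.sh` are not reproduced here, the `build_commands` and `run_pipeline` helpers are hypothetical, and the script arguments are left empty by default because they are documented by each script's `-h` output.

```python
import subprocess
import sys

# The two scripts described above, in the order they need to run.
STEPS = ["scrap_strains_list.py", "scrap_strains_data.py"]


def build_commands(extra_args=None):
    """Build one command per step; extra_args maps a script name to its arguments."""
    extra_args = extra_args or {}
    return [[sys.executable, script, *extra_args.get(script, [])] for script in STEPS]


def run_pipeline(extra_args=None, runner=subprocess.run):
    """Run every step in order, stopping at the first failure."""
    for command in build_commands(extra_args):
        # check=True raises CalledProcessError if a step exits with a non-zero status.
        runner(command, check=True)
```

Passing `runner` as a parameter keeps the sketch easy to dry-run: supplying a function that only prints each command shows what would be executed without starting any process.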


Support

We do our best, but we inevitably make mistakes. If you ever need to report a bug or a problem, or simply want to talk to us, please do so! We will do our best to respond at this repository or at [email protected].