On first launch, the `chromeProfile` folder will be generated, so startup will take some time. When you launch the application, the following popup will appear:

⚠️ If files are missing or it is the first launch, a particular window can open.
- By clicking on the big red button, you will save the information from the page that has the same webpage name (the last one opened).
- For the name of the file, please enter a known extension (`.csv`, `.xlsx` or `.xls`), otherwise it will be saved as a `.csv` by default.
- If you want to change the saving path, click on the grey "In 📁" button.
- You can see the existing templates and add new ones with the following two buttons.
- The console will print warning and information messages.
You need to have Chrome installed on your machine. You need to clone the repo at this link or download the zip file.

You may need to manually change some values in the following files:
| what is in `.config` | example | description |
|---|---|---|
| `[SAVING] SAVE_DATA_PATH` | `C:/Folder/To/Save/the_result.csv` | path where the data of the current website will be saved |
⚠️ If `.config` doesn't exist, it will be created on the next launch and you will be asked to choose the saving path.
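For reference, a minimal `.config` could look like this (a sketch inferred from the table above; adjust the path to your machine):

```ini
[SAVING]
SAVE_DATA_PATH = C:/Folder/To/Save/the_result.csv
```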
| what is in `.env` | example | description |
|---|---|---|
| `DIR_CHROMEAPP_PATH` | `C:/Program Files/Google/Chrome/Application/` | default folder used to launch Chrome in debugging mode |
| `PORT` | `9222` | default port used to launch the new Chrome window |
⚠️ The `PORT` must not be used by another app. Launch `cmd` with admin rights and execute `netstat -a` to see which ports are in use.
💡 To get the `DIR_CHROMEAPP_PATH`:

1. Press Windows + S to open the search bar
2. Search for `Chrome`
3. Right-click on the logo that pops up
4. Click on: Open File Location
5. In the new window, right-click on the `Google Chrome` shortcut file
6. Click on: Open File Location
7. Copy and paste the path of the newly opened folder into the `.env` file

The path should look like this: `C:/Files/Chrome/Application/`.
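With both values set, the `.env` file would look something like this (using the example values from the table above):

```
DIR_CHROMEAPP_PATH=C:/Program Files/Google/Chrome/Application/
PORT=9222
```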
Templates are used to know what information to scrape on which website. You can find in `example.json` an example of a set of pages and rules.
A template has 2 important parts:

- The Template Specific Name
- The Template List of Pages
  - a. The page guideline
  - b. The page rules
  - c. The page basic rule (optional)
The name of the template file, `name.json`, is important as it is the string used to load it.
💡 For example, if you want to scrape data from the website `https://www.scrap-me.com`, you will need to create a `scrap-me.json` template file.
A website can have many different pages. For example, `https://www.scrap-me.com` can have the following pages:

- `https://www.scrap-me.com/profiles`
- `https://www.scrap-me.com/companies`

We can create different scraping rules for each one of them or create a basic rule that applies to every page.
In the template you will find a `"pages"` array that contains all the individual pages as `{}` objects separated by `,`.
Each page has the following two fields:

| variable | type | description |
|---|---|---|
| `fileName` | string | the default name of the file that will be saved for this page of the website |
| `urlSelector` | string | the string in the URL that differentiates this page from the others on the same website |
💡 For example, in the case of the page with the URL `https://www.scrap-me.com/companies`, we can do: `"fileName": "ScrapMe_Companies", "urlSelector": "/companies",`
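Inside the template, that page entry sits in the `"pages"` array; a sketch of how it could look (the `"rules"` array is detailed below):

```json
{
  "pages": [
    {
      "fileName": "ScrapMe_Companies",
      "urlSelector": "/companies",
      "rules": []
    }
  ]
}
```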
The rules are defined in the `"rules"` array of the page. A rule defines how you select one piece of information that you want to save, and under what form and name you save it.

💡 You can add as many rules as you want to save information from the web page.
A rule has the following fields:

| variable | type | description |
|---|---|---|
| `htmlTag` | string | the HTML tag type that Selenium will search for |
| `value` | string | the value of the data that the HTML tag has |
| `saveAs` | string | the name of the column for this information in the CSV |
| `saveType` | string | the saving type that defines the data format |
💡 For example, in the case of the following HTML tag:

```html
<p class="company-title"> Super Company Name </p>
```

We can create the following rule:

```json
{ "htmlTag": "class", "value": "company-title", "saveAs": "Company Name", "saveType": "string" }
```
💡 In the case of the following HTML tag with a link:

```html
<a href="https://www.scrap-me.com/"> Our Website </a>
```

We can create the following rule:

```json
{ "htmlTag": "link", "value": "Our Website", "saveAs": "Company Link", "saveType": "link" }
```
The `htmlTag` field must be one of the following:

- `class`: for `<div class="text container company">...</div>`, the rule `value` could here be `company` or `text`. ⚠️ Only one class can be passed!
- `id`: for `<div id="company-name"> Company Name </div>`, the rule `value` will here be `company-name`.
- `tag`: for `<h1> Company Name </h1>`, the rule `value` will here be `h1`. ⚠️ Only the first corresponding tag will be saved!
- `name`: for `<input name="username" type="text" />`, the rule `value` will here be `username`.
- `link`: for `<a href="https://scrap-me.com/"> A Link </a>`, the rule `value` will here be `A Link`.
- `partialLink`: for `<a href="https://scrap-me.com/"> A Link </a>`, the rule `value` could here be `link` or `A Li`.
- `css`: 💡 a very flexible HTML tag selector. For `<p> Welcome on </p><p> Scrap-Me </p><p> ! </p>`, the rule `value` will here be `p:nth-child(2)` to select `Scrap-Me`.
- `xpath`: 💡 the most flexible HTML tag selector. For `<div class="informations"> <img src="https://img.png" alt="logo" /> <p> Company Name </p> </div>`, the rule `value` will here be `//div[@class='informations']/p`.
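💡 These `htmlTag` names line up with Selenium's locator strategies. Below is a minimal sketch of how a rule could be resolved — an assumption about the internals, not the project's actual code; `find_rule_element` is a hypothetical helper:

```python
from selenium.webdriver.common.by import By

# Assumed mapping from a rule's htmlTag to Selenium's locator strategies.
HTML_TAG_TO_BY = {
    "class": By.CLASS_NAME,
    "id": By.ID,
    "tag": By.TAG_NAME,
    "name": By.NAME,
    "link": By.LINK_TEXT,
    "partialLink": By.PARTIAL_LINK_TEXT,
    "css": By.CSS_SELECTOR,
    "xpath": By.XPATH,
}

def find_rule_element(driver, rule):
    """Locate the first element matching a rule's htmlTag/value pair."""
    return driver.find_element(HTML_TAG_TO_BY[rule["htmlTag"]], rule["value"])
```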
The `saveType` field must be one of the following:

- `string`: to save any type of data
- `link`: to save link data from the `href` attribute
By using `"/"` or `""` as the `urlSelector`, you will create a page basic rule. This means that the corresponding scraping rules will apply to every page of the website, because every URL contains the `/` character.

⚠️ This page has to be at the bottom of the list of pages, so it is the last one applied and only matches when no other page's URL selector does.

You can use this selector when a website doesn't put any specific string in the URL of the page you want to scrape (for example, when it uses a random token or user-id string).
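💡 A catch-all page entry could look like this (a sketch; the `fileName` here is hypothetical):

```json
{
  "fileName": "ScrapMe_Default",
  "urlSelector": "/",
  "rules": []
}
```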
ℹ️ Critical errors will appear in a popup error window and will shut down the program.

| Id | Description | Solution |
|---|---|---|
| #1 | The `.env` file is empty, corrupt, or does not exist in the project root folder | Download the `.env` file from the git repository and replace the old file with the new one in the root folder |
| #2 | The mentioned file is missing in the `assets` folder | Download the `assets` folder from the git repository and replace the old folder with the new one in the root folder |
| #10 | Can't execute the terminal commands to set the chrome.exe path or to open a Chrome debugging instance | Try to launch it manually in your terminal by running this command and see if it works |
| #11 | The Selenium driver doesn't work | Download the latest version of chromedriver and replace the previous `chromedriver.exe` in the `driver` folder |
| #12 | No window found to scrape | Please restart the scraper |
| #13 | All Chrome pages related to the scraper have been closed | Do not close the Chrome pages of the scraper if you want to continue using it |
ℹ️ Warnings will appear in the application console.

| Id | Description | Solution |
|---|---|---|
| #20 | One of the rules is not working properly | Check the MISSING text in the saved `.csv`/`.xlsx`/`.xls` file and change the corresponding `value` or `htmlTag` field in the template file |
| #21 | Name has special characters in it | The name given to the csv file has special characters; please only use letters, numbers, and `-` or `_` |
| #22 | Loading templates error | The template was not found |
| #23 | Loading templates error | The template was not found |
| #24 | The `htmlTag` in one of the rules of the template is not one of the supported html tags | Open the template file of the website you were trying to save and search for the corresponding tag name printed in the console |
| #25 | The save file doesn't have an extension | Please add `.csv`, `.xlsx` or `.xls` at the end of the save file name |
```
├── assets
│   └── ...              # all .png and .ico for the design are there
├── driver
│   ├── driverProfile
│   │   └── ...          # all stuff from Google are there
│   └── chromedriver.exe
├── templates
│   └── ...              # templates for scraping data from a website
├── .config              # file with general configuration
├── .env                 # file with user configuration
├── .gitignore           # file with all the ignored files for git
├── webscraper.exe       # file with the compiled main program
├── webscraper.py        # file with the main program
├── README.md            # file with general information
└── requirements.txt     # file with all the dependencies
```
We are using Python's `Popen` to execute a child program in a new process. We then wait for the execution and kill the subprocess.

The child program will:

- Set the chrome.exe path
- Open a new Chrome window in debugging mode
You can manually execute the child program by running the following command in your terminal, after changing the `DIR_CHROMEAPP_PATH` and `PORT` values:

```
set PATH=%PATH%;DIR_CHROMEAPP_PATH&&chrome.exe --remote-debugging-port=PORT --user-data-dir="C:\TestFolder\ChromeScraperProfile"
```
💡 Look here to see what path and port values you need to set to make it work for you.
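A minimal sketch of that `Popen` logic (assuming the command above is run through `cmd` via `shell=True`; the path, port, and profile folder are the example values from this README):

```python
import subprocess

# Example values from this README -- replace with your own .env settings.
DIR_CHROMEAPP_PATH = "C:/Program Files/Google/Chrome/Application/"
PORT = 9222

# Same command as above: add the Chrome folder to PATH, then start chrome.exe
# in remote-debugging mode with a dedicated user profile folder.
command = (
    f"set PATH=%PATH%;{DIR_CHROMEAPP_PATH}&&"
    f"chrome.exe --remote-debugging-port={PORT} "
    f'--user-data-dir="C:\\TestFolder\\ChromeScraperProfile"'
)

# Popen executes the child program in a new process; shell=True lets cmd
# handle the `set` builtin and the && chaining.
process = subprocess.Popen(command, shell=True)
process.wait()  # wait for the execution...
process.kill()  # ...then kill the subprocess (a no-op if already exited)
```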
You can use Nuitka to do it by running the following command in your terminal:

```
py -m nuitka --standalone --include-data-dir=./assets=assets --include-data-dir=./driver=driver --include-data-dir=./templates=templates --include-data-files=.config=.config --include-data-files=.env=.env --include-data-files=README.md=README.md --enable-plugin=tk-inter --enable-plugin=numpy --include-package-data=selenium --include-package-data=openpyxl --windows-icon-from-ico=./assets/app.ico webscraper.py
```
⚠️ Make sure not to push the `driverProfile` folder and the `.config` file; they will be generated if missing.
- Check that `.config` is auto-generated when an error occurs
- Add the Chrome profile path in the config file; if not defined, a default one will be created and destroyed on close
- Use pypy3
- Use numpy for matrices and use jit on top of it
- Create an installation file for all the dependencies: http://sdz.tdct.org/sdz/creer-une-installation.html