QuentinDstl/p_scrap


Usage

On first launch, the chromeProfile will be generated, so it will take some time. When you launch the application, the following window will appear:

⚠️ If files are missing or it is the first launch, an additional window may open.

  1. By clicking on the big red button, you will save the information from the page with the matching webpage name (the last one opened).
  2. For the name of the file, please enter a known extension (.csv, .xlsx or .xls) or it will be saved as a .csv by default.
  3. If you want to change the saving path, click on the grey "In file-emote" button.
  4. You can see the existing templates and add new ones with the following two buttons.
  5. The console will print warning and information messages.

Installation

You need to have Chrome installed on your machine. Clone the repository or download it as a zip file.

You may need to manually change some values in the following files:

.config file

What is in .config:

| section | variable | example value | description |
|---------|----------|---------------|-------------|
| [SAVING] | SAVE_DATA_PATH | C:/Folder/To/Save/the_result.csv | path where the data of the current website will be saved |

⚠️ If .config doesn't exist, it will be created on the next launch and you will be asked to choose the saving path.
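For reference, a .config with the example path above might look like this (assuming a configparser-style section/key layout; the exact delimiter used by webscraper.py may differ):

```
[SAVING]
SAVE_DATA_PATH = C:/Folder/To/Save/the_result.csv
```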

.env file

What is in .env:

| variable | example value | description |
|----------|---------------|-------------|
| DIR_CHROMEAPP_PATH | C:/Program Files/Google/Chrome/Application/ | default folder used to launch Chrome in debugging mode |
| PORT | 9222 | default port used to launch the new Chrome window |
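A matching .env could look like this (assuming a simple KEY=value dotenv syntax; the separator in the real file may differ):

```
DIR_CHROMEAPP_PATH=C:/Program Files/Google/Chrome/Application/
PORT=9222
```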

⚠️ The PORT must not be used by another app. Launch cmd with admin rights and execute netstat -a to see which ports are in use.

📖 To get the DIR_CHROMEAPP_PATH:

1. Press Windows + S to open the search bar
2. Search for `Chrome`
3. Right-click on the Chrome icon that appears
4. Click on: Open File Location
5. In the new window, right-click on the `Google Chrome` shortcut file
6. Click on: Open File Location
7. Copy and paste the path of the newly opened folder into the `.env` file

The path should look like this: C:/Files/Chrome/Application/.



Features

Templates

Templates define what information to scrape on which website. You can find an example of a set of pages and rules in example.json.

A template has 2 important parts:

  1. The Template Specific Name
  2. The Template List of Pages
    • a. The page guideline
    • b. The page rules
    • c. The page basic rule (optional)

1. Template Specific Name

The name of the template file (name.json) is important, as it is the string used to load it.

📖 For example, if you want to scrape data from the website https://www.scrap-me.com, you will need to create a scrap-me.json template file.
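As a sketch of this naming convention (the real loading logic lives in webscraper.py and may differ), the template name can be derived from the url's host:

```python
from urllib.parse import urlparse

def template_filename(url: str) -> str:
    """Derive the template file name from a website url.

    Sketch of the convention above: https://www.scrap-me.com
    should map to scrap-me.json.
    """
    host = urlparse(url).netloc        # e.g. "www.scrap-me.com"
    parts = host.split(".")
    if parts and parts[0] == "www":    # drop a leading "www"
        parts = parts[1:]
    return parts[0] + ".json"          # keep the site name, drop the TLD

print(template_filename("https://www.scrap-me.com"))  # scrap-me.json
```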


2. The Template List of pages

A website can have many different pages. For example, https://www.scrap-me.com can have the following pages:

  • https://www.scrap-me.com/profiles
  • https://www.scrap-me.com/companies

We can create different scraping rules for each one of them, or create a basic rule that applies to every page.

You will find in the template a "pages" array that contains all the individual pages as {} objects separated by commas.


a. The page guideline

Each page has the following two fields:

| variable | type | description |
|----------|------|-------------|
| fileName | string | the default name of the file that will be saved for this page of the website |
| urlSelector | string | the string in the url that differentiates this page from the others on the same website |

📖 For example, in the case of the page with the url https://www.scrap-me.com/companies, we can do:

"fileName": "ScrapMe_Companies",
"urlSelector": "/companies",

b. The page rules

The rules are defined in the "rules" array of the page.

A rule defines how to select one piece of information on the page, and under what form and name it will be saved.

📖 You can add as many rules as you want to save information from the web page.

A rule has the following fields:

| variable | type | description |
|----------|------|-------------|
| htmlTag | string | the html tag selector type that selenium will search for |
| value | string | the value of the data that the html tag has |
| saveAs | string | the name of the column for this information in the CSV |
| saveType | string | the saving type that defines the data format |

📖 For example, in the case of the following html tag:

<p class="company-title"> Super Company Name </p>

We can create the following rule :

   {
       "htmlTag": "class",
       "value": "company-title",
       "saveAs": "Company Name",
       "saveType": "string"
   }

📖 In the case of the following html tag with a link:

<a href="https://www.scrap-me.com/"> Our Website </a>

We can create the following rule :

   {
       "htmlTag": "link",
       "value": "Our Website",
       "saveAs": "Company Link",
       "saveType": "link"
   }
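Putting the pieces together, a hypothetical scrap-me.json for the /companies page could combine the guideline and the two rules above (check the exact top-level layout against example.json):

```json
{
    "pages": [
        {
            "fileName": "ScrapMe_Companies",
            "urlSelector": "/companies",
            "rules": [
                {
                    "htmlTag": "class",
                    "value": "company-title",
                    "saveAs": "Company Name",
                    "saveType": "string"
                },
                {
                    "htmlTag": "link",
                    "value": "Our Website",
                    "saveAs": "Company Link",
                    "saveType": "link"
                }
            ]
        }
    ]
}
```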

html tags:

  • class :

    <div class="text container company">...</div>

    the rule value could here be company or text

    ⚠️ Only one class can be passed!

  • id

    <div id="company-name"> Company Name </div>

    the rule value will here be company-name

  • tag

    <h1> Company Name </h1>

    the rule value will here be h1

    ⚠️ Only the first corresponding tag will be saved!

  • name

    <input name="username" type="text" />

    the rule value will here be username

  • link

    <a href="https://scrap-me.com/"> A Link </a>

    the rule value will here be A Link

  • partialLink

    <a href="https://scrap-me.com/"> A Link </a>

    the rule value could here be link or A Li

  • css

    📖 a very flexible html tag selector:

    <p> Welcome on </p><p> Scrap-Me </p><p> ! </p>

    the rule value will here be p:nth-child(2) to select the Scrap-Me

  • xpath

    📖 the most flexible html tag selector:

    <div class="informations">
        <img src="https://img.png" alt="logo" />
        <p> Company Name </p>
    </div>

    the rule value will here be //div[@class='informations']/p

    🚩 Get more Information on Xpath or Use Xpath Extension

saving types:

  • string

    to save any type of data

  • link

    to save link data from href tag
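The two lists above can be turned into a small validation helper. This is a hypothetical sketch mirroring warning #24, not the actual check in webscraper.py:

```python
# Allowed values taken from the "html tags" and "saving types" lists above.
VALID_HTML_TAGS = {"class", "id", "tag", "name", "link",
                   "partialLink", "css", "xpath"}
VALID_SAVE_TYPES = {"string", "link"}

def validate_rule(rule: dict) -> list:
    """Return a list of warning messages for an invalid rule."""
    warnings = []
    if rule.get("htmlTag") not in VALID_HTML_TAGS:
        warnings.append("#24: unknown htmlTag %r" % rule.get("htmlTag"))
    if rule.get("saveType") not in VALID_SAVE_TYPES:
        warnings.append("unknown saveType %r" % rule.get("saveType"))
    return warnings

# A misspelled htmlTag produces a warning instead of silently failing.
print(validate_rule({"htmlTag": "klass", "value": "company-title",
                     "saveAs": "Company Name", "saveType": "string"}))
```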


c. The page basic rule

By using "/" or "" as the urlSelector you will create a page basic rule.

This means that the corresponding scraping rules will apply to every page of the website, because every url contains the / character.

⚠️ This rule has to be at the bottom of the list of pages, so it is applied last, only when no other page's url selector matches.

This selector is useful when a website doesn't use any specific string in the url for the page you want to scrape (for example, when the url contains a random token or a user id).
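The matching described above can be sketched as a first-match scan over the pages list, which is why the basic rule must come last (hypothetical code, assuming urlSelector is matched as a plain substring):

```python
def pick_page(url, pages):
    """Return the first page whose urlSelector appears in the url."""
    for page in pages:
        if page["urlSelector"] in url:
            return page
    return None

pages = [
    {"urlSelector": "/companies", "fileName": "ScrapMe_Companies"},
    {"urlSelector": "/profiles", "fileName": "ScrapMe_Profiles"},
    {"urlSelector": "/", "fileName": "ScrapMe_Default"},  # basic rule last
]

# A url with a random token falls through to the basic rule.
print(pick_page("https://www.scrap-me.com/companies", pages)["fileName"])
```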



Error Messages

Critical Errors

📖 Critical errors will appear in a popup error window and will shut down the program.

| Id | Description | Solution |
|----|-------------|----------|
| #1 | The .env file is empty, corrupt or does not exist in the project root folder | Download the .env file from the git repository and replace the old file with the new one in the root folder |
| #2 | The mentioned file is missing in the assets folder | Download the assets folder from the git repository and replace the old folder with the new one in the root folder |
| #10 | Can't execute the terminal commands to set the chrome.exe path or to open a Chrome debugging instance | Try to launch it manually in your terminal by running this command and see if it works |
| #11 | The selenium driver doesn't work | Download the latest version of chromedriver and replace the previous chromedriver.exe in the driver folder |
| #12 | No window found to scrape | Restart the scraper |
| #13 | All Chrome pages related to the scraper have been closed | Do not close the Chrome pages of the scraper if you want to continue using it |

Warnings

📖 Warnings will appear in the application console.

| Id | Description | Solution |
|----|-------------|----------|
| #20 | One of the rules is not working properly | Check for the MISSING text in the saved .csv/.xlsx/.xls file and change the corresponding value or htmlTag in the template file |
| #21 | Name has special characters in it | The name given to the csv file has special characters; please only use letters, numbers and - or _ |
| #22 | Loading templates error | The template was not found |
| #23 | Loading templates error | The template was not found |
| #24 | The htmlTag in one of the rules of the template is not a valid html tag | Open the template file of the website you were trying to save and search for the tag name that was printed in the console |
| #25 | The save file doesn't have an extension | Please add .csv, .xlsx or .xls at the end of the save file name |


How everything works together

├── assets
│   └── ...     # all .png and .ico for the design are there
├── driver
│   ├── driverProfile
│   │   └── ... # all stuff from Google is there
│   └── chromedriver.exe
├── templates
│   └── ...     # templates for scraping data from a website
├── .config     # file with general configuration
├── .env        # file with user configuration
├── .gitignore  # file with all the ignored files for git
├── webscraper.exe   # file with the compiled main program
├── webscraper.py    # file with the main program
├── README.md   # file with general information
└── requirements.txt # file with all the dependencies

Opening a Chrome Debugging Instance

We use Python's Popen to execute a child program in a new process. We then wait for the execution to finish and kill the subprocess.

The child program will :

  1. Set the chrome driver path
  2. Open a new Chrome window in debugging mode

You can manually execute the child program by running the following command in your terminal, after replacing the DIR_CHROMEAPP_PATH and PORT values:

set PATH=%PATH%;DIR_CHROMEAPP_PATH&&chrome.exe --remote-debugging-port=PORT --user-data-dir="C:\TestFolder\ChromeScraperProfile"
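For illustration, the command above can be built from the .env values like this (a hypothetical helper; the actual string assembled by webscraper.py may differ slightly):

```python
def debug_chrome_command(chrome_dir, port, profile_dir):
    """Build the Windows shell command that opens Chrome in debugging mode."""
    return (f'set PATH=%PATH%;{chrome_dir}&&chrome.exe '
            f'--remote-debugging-port={port} --user-data-dir="{profile_dir}"')

cmd = debug_chrome_command("C:/Program Files/Google/Chrome/Application/",
                           9222, r"C:\TestFolder\ChromeScraperProfile")
print(cmd)
# The scraper would hand a command like this to subprocess.Popen(cmd, shell=True),
# wait for it, and then kill the subprocess.
```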

📖 See the .env section above for the path and port values you need to set.

Converting the .py to .exe

You can use Nuitka by running the following command in your terminal:

py -m nuitka --standalone --include-data-dir=./assets=assets --include-data-dir=./driver=driver --include-data-dir=./templates=templates --include-data-files=.config=.config --include-data-files=.env=.env --include-data-files=README.md=README.md --enable-plugin=tk-inter --enable-plugin=numpy --include-package-data=selenium --include-package-data=openpyxl --windows-icon-from-ico=./assets/app.ico webscraper.py

⚠️ Make sure not to push the driverProfile folder and the .config file; they will be generated if missing.



TODO List

Logic Improvements

  • Check that .config is auto-generated when an error occurs
  • Add the chrome profile path in the config file; if not defined, a default one will be created and destroyed on close

Speed Improvements

  • use pypy3
  • use numpy for matrix and use jit on top of it

Installation Improvements

  • Create an installation file for all the dependencies http://sdz.tdct.org/sdz/creer-une-installation.html

About

Static Webscraper in Python using selenium
