This project is a web scraper built with Flask, BeautifulSoup, Selenium, ScrapeGraphAI, and PostgreSQL. Users send a search query to a Flask endpoint, and the scraped results are stored in a PostgreSQL database.
- Clone the repository:
  git clone https://github.com/dehyabi/py-scraper.git
  cd py-scraper
- Choose your scraping tool:
  - beautifulsoup-headless: Uses BeautifulSoup for scraping without opening a browser.
  - selenium-headless: Uses Selenium to scrape without opening a browser.
  - scrapegraphai-headless: Uses ScrapeGraphAI for scraping without opening a browser (requires an OpenAI API key).
  For example, to use beautifulsoup-headless, change into its directory:
  cd beautifulsoup-headless
- Set up the environment:
  - Create a .env file in the root directory and add your database connection details.
  - Example:
    DATABASE_NAME=your_database_name
    DATABASE_USER=your_database_user
    DATABASE_PASSWORD=your_database_password
    DATABASE_HOST=localhost
    DATABASE_PORT=5432
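As a rough illustration of how the .env values above could be turned into connection settings, here is a minimal sketch that uses only the standard library. The helper names `load_env_file` and `database_settings` are hypothetical and are not taken from the repository, which may load its configuration differently (for example through a dotenv package).

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict.

    Blank lines, comment lines, and lines without '=' are skipped.
    A missing file yields an empty dict.
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def database_settings(env):
    """Map the DATABASE_* variables to common PostgreSQL connection keywords.

    Host and port fall back to localhost:5432 (an assumption of this sketch).
    """
    return {
        "dbname": env["DATABASE_NAME"],
        "user": env["DATABASE_USER"],
        "password": env["DATABASE_PASSWORD"],
        "host": env.get("DATABASE_HOST", "localhost"),
        "port": int(env.get("DATABASE_PORT", "5432")),
    }
```

Calling `database_settings(load_env_file())` returns a dict whose keys match the keyword arguments that a driver such as psycopg2 accepts in `connect()`. Whether the project uses that driver is an assumption here.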
- Create a virtual environment:
  python3 -m venv venv
  source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
- Test the database connection:
  python3 test-db.py
- Run the application:
  flask run
- Test the search endpoint with curl:
  curl -X POST http://127.0.0.1:5000/search -H "Content-Type: application/json" -d '{"query": "technology"}'
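The same request can also be sent from Python. The sketch below mirrors the curl command using only the standard library. The function names are hypothetical, and because the shape of the response body is not documented here, the response is returned as raw text rather than parsed.

```python
import json
import urllib.request


def build_search_request(query, base_url="http://127.0.0.1:5000"):
    """Build the POST /search request with a JSON body, as in the curl command."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_search(query, base_url="http://127.0.0.1:5000"):
    """Send the request and return (HTTP status code, raw response text).

    Requires the Flask application to be running at base_url.
    """
    request = build_search_request(query, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status, response.read().decode("utf-8")
```

With the application running, `send_search("technology")` issues the same call as the curl example.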
- To connect to PostgreSQL, use the following command:
  sudo -u postgres psql -d your_database_name
- You can view the inserted data with:
  SELECT * FROM table_name;
  Example of inserted data:
  -[ RECORD 1 ]-------------------------------------------------
  id    | 1
  title | Ultracapacitors: why, how, and where is the technology
Note: The database setup and commands may vary depending on your database system.
Check the logs for information on the operations performed by the application.
* Running on http://127.0.0.1:5000
2025-03-27 05:49:20,478 - INFO - Press CTRL+C to quit
2025-03-27 05:49:50,642 - INFO - Received search query: technology
2025-03-27 05:49:50,642 - INFO - Connecting to the database to insert file...
2025-03-27 05:49:50,679 - INFO - Fetching data from: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=technology
2025-03-27 05:49:52,352 - INFO - Data fetched successfully.
2025-03-27 05:49:52,484 - INFO - file inserted successfully.
2025-03-27 05:49:52,484 - INFO - Scraped data inserted into the database.
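The timestamped lines above follow the pattern "time - level - message". One way to produce that layout with Python's `logging` module is sketched below. The format string is inferred from the sample output, and the logger name `scraper` is a placeholder, so neither should be read as the project's actual configuration.

```python
import logging

# Inferred from the sample log lines; the repository may configure this differently.
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"


def configure_logging(level=logging.INFO):
    """Attach a root handler that uses LOG_FORMAT and return a named logger.

    basicConfig does nothing if the root logger already has handlers.
    """
    logging.basicConfig(level=level, format=LOG_FORMAT)
    return logging.getLogger("scraper")  # placeholder logger name
```

After `logger = configure_logging()`, a call such as `logger.info("Received search query: %s", query)` emits a line in the same layout as the sample.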
This project is licensed under the MIT License. See the LICENSE file for details.