Py-Scraper

Py-Scraper is a web scraper built with Flask, BeautifulSoup, Selenium, ScrapeGraphAI, and PostgreSQL. It accepts a search query over HTTP, scrapes matching results, and stores them in a PostgreSQL database.

Setup Instructions

Clone the repository:

```
git clone https://github.com/dehyabi/py-scraper.git
cd py-scraper
```

Choose a scraping tool. Each one lives in its own directory:
- beautifulsoup-headless: Uses BeautifulSoup for scraping without opening a browser.
- selenium-headless: Uses Selenium to scrape without opening a browser.
- scrapegraphai-headless: Uses ScrapeGraphAI for scraping without opening a browser (requires an OpenAI API key).

For example, to use beautifulsoup-headless, change into its directory:
```
cd beautifulsoup-headless
```

Set up the environment:

Create a .env file in the root directory and add your database connection details.

Example:

```
DATABASE_NAME=your_database_name
DATABASE_USER=your_database_user
DATABASE_PASSWORD=your_database_password
DATABASE_HOST=localhost
DATABASE_PORT=5432
```
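
How the application reads these values is not shown in this README. As a minimal sketch, assuming a plain KEY=VALUE file, the settings could be loaded with the standard library alone. The function names `load_env_file` and `database_settings` are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def database_settings(env):
    """Map the DATABASE_* keys from the example onto connection settings."""
    return {
        "dbname": env["DATABASE_NAME"],
        "user": env["DATABASE_USER"],
        "password": env["DATABASE_PASSWORD"],
        # Host and port fall back to the values used in the example above.
        "host": env.get("DATABASE_HOST", "localhost"),
        "port": int(env.get("DATABASE_PORT", "5432")),
    }
```

Calling `database_settings(load_env_file())` raises `KeyError` if a required key is missing, which surfaces a misconfigured .env file early.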

Create a virtual environment:

```
python3 -m venv venv
source venv/bin/activate
```

Install dependencies:
```
pip install -r requirements.txt
```
Test the database connection:
```
python3 test-db.py
```
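
The contents of test-db.py are not shown here. A connection check of this kind might look like the sketch below, which assumes psycopg2 as the PostgreSQL driver (an assumption, not confirmed by this README) and uses a hypothetical `check_connection` helper.

```python
def check_connection(settings, timeout=3):
    """Return (ok, message) after attempting a trivial query."""
    try:
        # Assumed driver; the project's requirements.txt may list another.
        import psycopg2

        conn = psycopg2.connect(connect_timeout=timeout, **settings)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        finally:
            conn.close()
        return True, "connection OK"
    except Exception as exc:
        # Covers a missing driver, bad credentials, or an unreachable server.
        return False, f"connection failed: {exc}"


if __name__ == "__main__":
    # Placeholder values mirroring the .env example.
    ok, message = check_connection({
        "dbname": "your_database_name",
        "user": "your_database_user",
        "password": "your_database_password",
        "host": "localhost",
        "port": 5432,
    })
    print(message)
```

Returning a message instead of raising keeps the script usable as a quick smoke test before starting Flask.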
Run the application:
```
flask run
```

Test the search endpoint with curl:

```
curl -X POST http://127.0.0.1:5000/search -H "Content-Type: application/json" -d '{"query": "technology"}'
```
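
The same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_search_request` and `search` are hypothetical, and because the response format is not documented here, the body is returned as raw text.

```python
import json
import urllib.request


def build_search_request(query, base_url="http://127.0.0.1:5000"):
    """Build the POST /search request that the curl command sends."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, base_url="http://127.0.0.1:5000", timeout=30):
    """Send the request and return (status code, response body as text)."""
    request = build_search_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the Flask application to be running locally.
    print(search("technology"))
```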

Database Interaction

To connect to PostgreSQL, use the following command:
```
sudo -u postgres psql -d your_database_name
```

You can view the inserted data with:

```
SELECT * FROM table_name;
```

Example of inserted data:

```
-[ RECORD 1 ]-------------------------------------------------
id    | 1
title | Ultracapacitors: why, how, and where is the technology
```

Note: The database setup and commands may vary depending on your operating system and how PostgreSQL is installed.

Success Logs

The application logs each operation it performs. A successful search produces output like this:

```
 * Running on http://127.0.0.1:5000
2025-03-27 05:49:20,478 - INFO - Press CTRL+C to quit
2025-03-27 05:49:50,642 - INFO - Received search query: technology
2025-03-27 05:49:50,642 - INFO - Connecting to the database to insert file...
2025-03-27 05:49:50,679 - INFO - Fetching data from: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=technology
2025-03-27 05:49:52,352 - INFO - Data fetched successfully.
2025-03-27 05:49:52,484 - INFO - file inserted successfully.
2025-03-27 05:49:52,484 - INFO - Scraped data inserted into the database.
```
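
The timestamped lines above match what Python's standard `logging` module emits with the format string below, and the fetched URL matches a URL-encoded query against Google Scholar. The sketch reproduces one such line; it is inferred from the log output rather than taken from the project's code, and `LOG_FORMAT` and `build_scholar_url` are hypothetical names.

```python
import logging
from urllib.parse import urlencode

# Produces lines such as "2025-03-27 05:49:50,679 - INFO - <message>".
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"


def build_scholar_url(query):
    """Build the search URL seen in the log, encoding the query safely."""
    params = {"hl": "en", "as_sdt": "0,5", "q": query}
    return "https://scholar.google.com/scholar?" + urlencode(params)


logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)
logging.info("Fetching data from: %s", build_scholar_url("technology"))
```

Using `urlencode` rather than string concatenation keeps queries containing spaces or punctuation valid in the URL.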

License

This project is licensed under the MIT License. See LICENSE.md for details.
