A Python project that crawls random Wikipedia pages, extracts information from each page (title, content, categories, and edit history: user and date of change), and stores it in a MySQL database.
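As a rough illustration of the crawling step (not necessarily the exact logic in wikipedia-crawler.py), the sketch below fetches a random article with requests and pulls out the title, body text, and category names with BeautifulSoup; the edit-history extraction is omitted. The Special:Random URL and the CSS selectors are assumptions based on Wikipedia's standard page layout.

```python
# Minimal sketch: fetch a random Wikipedia article and extract the title,
# plain-text content, and category names (history handling omitted).
import requests
from bs4 import BeautifulSoup

RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"


def fetch_random_page():
    response = requests.get(RANDOM_URL)  # follows the redirect to a random article
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    title = soup.find(id="firstHeading").get_text(strip=True)
    content = "\n".join(
        p.get_text(" ", strip=True) for p in soup.select("#mw-content-text p")
    )
    categories = [
        a.get_text(strip=True) for a in soup.select("#mw-normal-catlinks ul li a")
    ]
    return title, content, categories


if __name__ == "__main__":
    title, content, categories = fetch_random_page()
    print(title, categories)
```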
- Python 3.6.5 (pip 9.0.3)
- MySQL (I'm using version 5.7.22-0ubuntu0.16.04.1)
- All libraries used are in the Python Standard Library except:
  - BeautifulSoup4 - 4.6.0 (pip install beautifulsoup4)
  - pymysql - 0.8.1
  - requests - 2.18.4
- Create the necessary tables and the additional fields in them:
  - Run bash create_wiki_tables.sh
- Create the stored procedures:
  - Run bash create_wiki_procedures.sh
- Edit the file config.json with your MySQL connection settings (see the sketch after these steps):
  "mysql": {
      "host": " ",
      "user": " ",
      "password": " ",
      "database": " "
  }
- Run wikipedia-crawler.py:
  - python3 wikipedia-crawler.py
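For context, here is a minimal sketch of how the crawler might read config.json and talk to MySQL through pymysql. The stored procedure name add_wiki_page and the argument list are placeholders invented for this example; the real procedures are whatever create_wiki_procedures.sh defines.

```python
# Sketch only: load the MySQL settings from config.json and call a stored
# procedure via pymysql. "add_wiki_page" is a hypothetical procedure name.
import json

import pymysql

with open("config.json") as f:
    cfg = json.load(f)["mysql"]

connection = pymysql.connect(
    host=cfg["host"],
    user=cfg["user"],
    password=cfg["password"],
    database=cfg["database"],
    charset="utf8mb4",
)
try:
    with connection.cursor() as cursor:
        # Hypothetical call: pass the extracted page data to a stored procedure.
        cursor.callproc("add_wiki_page", ("Some title", "Some content", "Some category"))
    connection.commit()
finally:
    connection.close()
```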