Merge branch 'master' of https://github.com/fferegrino/zeldaKG
fferegrino committed Jun 16, 2018
2 parents 3a7b8b0 + a8d387a commit bf23669
16 changes: 8 additions & 8 deletions README.md

@@ -9,17 +9,17 @@
a TLOZ inspired knowledge graph.
```

- Step 0: Gather a lot of wiki pages (check if you can use a tool like [HTTrack](https://www.httrack.com/)); in this case, I downloaded a copy of the whole [Zeldapedia](http://zelda.wikia.com/wiki/) and [Zelda Wiki](https://zelda.gamepedia.com/Main_Page).
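A mirroring tool like HTTrack essentially maps each page URL to a local file. A minimal sketch of that mapping (the `host-dir/page.html` layout is an assumed convention here, not necessarily HTTrack's exact output scheme):

```python
from urllib.parse import urlsplit, unquote

def url_to_local_path(url):
    """Map a wiki page URL to a plausible local mirror path.

    The host-directory/page.html layout is an illustrative assumption,
    not necessarily what HTTrack produces out of the box.
    """
    parts = urlsplit(url)
    page = unquote(parts.path).lstrip("/")
    return f"{parts.netloc}/{page}.html"

print(url_to_local_path("http://zelda.wikia.com/wiki/Link"))
# -> zelda.wikia.com/wiki/Link.html
```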

- Step 1: If you did not configure your crawlers/copiers correctly, the previous step might have left you with a lot of useless pages, such as user pages, templates, or even forum pages. The purpose of this step is to reduce the number of files to be processed by filtering out documents whose names start with "User_", "Category_Zeldapedians_", "Message_Wall_", and similar prefixes. In this cleaning stage, the real content of each page (on Wikia, that is the `article` tag) is extracted and the surrounding site template is discarded.
- Here are the **[Zeldapedia notebook](html_cleaning/zelda.wikia.ipynb)** and the **[Zelda Wiki notebook](html_cleaning/zelda.gamepedia.ipynb)**.
- Download the "clean" data here: [zelda-wikia2-clean.zip](https://github.com/fferegrino/zeldaKG/releases/download/data/zelda-wikia2-clean.zip) and [zelda-gamepedia-clean.zip](https://github.com/fferegrino/zeldaKG/releases/download/data/zelda-gamepedia-clean.zip)
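The filtering and `article` extraction described above can be sketched with the standard library alone (the prefix list beyond the three named above, and the sample HTML, are illustrative assumptions):

```python
from html.parser import HTMLParser
from pathlib import Path

# Prefixes named above; "Forum_" and "Template_" are assumed additions.
SKIP_PREFIXES = ("User_", "Category_Zeldapedians_", "Message_Wall_", "Forum_", "Template_")

def is_content_page(filename):
    """True for files that look like real article pages."""
    return not Path(filename).name.startswith(SKIP_PREFIXES)

class ArticleExtractor(HTMLParser):
    """Keep only the markup found inside <article> tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting level of open <article> tags
        self.chunks = []  # pieces of the extracted content
    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
        elif self.depth:
            self.chunks.append(self.get_starttag_text())
    def handle_endtag(self, tag):
        if tag == "article":
            self.depth -= 1
        elif self.depth:
            self.chunks.append(f"</{tag}>")
    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

extractor = ArticleExtractor()
extractor.feed("<body><nav>site menu</nav><article><p>Link is the hero.</p></article></body>")
article_html = "".join(extractor.chunks)  # the templated site chrome is gone
```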

- Step 2: Information Extraction
- Title-Link relationship: extract the relationship between each file and the title of the article it represents, storing it in two dataframes. **[Title-Link relationship notebook](relation_extraction/title-link-relationship.ipynb)**.
- Infobox extraction: extract *raw* relationships between entities found in the infobox of each page. The relationships are generated as JSON objects that are interpreted in the next step. **[for gamepedia](relation_extraction/infobox_extraction.gamepedia.ipynb)** and **[for wikia](relation_extraction/infobox_extraction.wikia.ipynb)**.
- Merge infobox sources: in this step we extract information from the infoboxes, such as Gender, Race, Appearances, and many more. **[Merge sources notebook](relation_extraction/merge_sources_infoboxes.ipynb)**.
- Text extraction using spaCy: the text of each article is analysed with the spaCy package to extract *raw* relationships between a `Resource` and names, in the notebook **[text_extraction](relation_extraction/text_extraction.ipynb)**; these are then processed again to ground them to relationships between `Resource`s that exist in our graph, which happens in **[text_extraction_processing](relation_extraction/text_extraction_processing.ipynb)**.
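The *raw* relationship objects produced from an infobox can be pictured like this (the `subject`/`predicate`/`object` key names are an assumed schema for illustration, not necessarily the one used in the notebooks):

```python
import json

def infobox_to_relations(page_title, infobox):
    """Turn an infobox's field/value pairs into raw relationship records.

    `infobox` is assumed to already be parsed into a dict of field -> value;
    the subject/predicate/object keys are an illustrative schema.
    """
    return [
        {"subject": page_title, "predicate": field, "object": value}
        for field, value in infobox.items()
    ]

raw = infobox_to_relations("Link", {"Race": "Hylian", "Gender": "Male"})
print(json.dumps(raw, indent=2))
```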

- Step 3: Insertion into Neo4j
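One way to picture this step is turning each grounded relationship into a Cypher `MERGE`, so repeated runs stay idempotent. The `Resource` label follows the text above, but the `title` property and the uppercase relationship types are assumptions:

```python
def to_cypher(predicate):
    """Build an idempotent Cypher statement for one grounded relationship.

    The statement is meant to be run with {subject, object} parameters;
    relationship types cannot contain spaces, hence the underscores.
    """
    rel_type = predicate.upper().replace(" ", "_")
    return (
        "MERGE (a:Resource {title: $subject}) "
        "MERGE (b:Resource {title: $object}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )

stmt = to_cypher("appears in")
```

Each generated statement would then be executed through a Neo4j client (e.g. the official Python driver or py2neo) with the subject and object titles passed as parameters.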
