Turkish-Wikipedia-Based-Knowledge-Graph

This repository contains a project that constructs a Knowledge Graph from a Turkish Wikipedia dump, using both the unstructured article text and the information boxes. It was developed during the inzva AI Projects #6 event by a group of 4 researchers.

Resources that we used

We mainly used two repositories and combined them into a single pipeline for constructing the knowledge graph. The first is Radboud Entity Linker, a modular entity linker. The second is an unofficial implementation of the paper "Language Models are Open Knowledge Graphs".

DiaParser for Dependency Parsing

For dependency parsing, we used DiaParser. It did not have a pre-trained parser for Turkish, so we trained a new one on the UD_Turkish-BOUN dataset. The dataset contains 7803 sentences for training, 979 sentences for development, and 979 sentences for testing.

Results

Model | UAS on Dev | LAS on Dev | UAS on Test | LAS on Test
bert-base-turkish-cased | 83.20% | 74.83% | 83.05% | 75.41%
electra-base-turkish-discriminator | 84.22% | 75.64% | 83.53% | 75.87%
convbert-base-turkish-cased | 83.12% | 74.86% | 82.55% | 75.21%
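
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how UAS (unlabeled attachment score) and LAS (labeled attachment score) are commonly computed. It is not the evaluation code used in this project; the function name `attachment_scores` and the toy sentence are illustrative assumptions, and details such as punctuation handling are left out.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for aligned gold and predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS counts tokens whose predicted head is correct.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the dependency label to match.
    arc_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * arc_hits / total


# Toy 4-token sentence (illustrative, not taken from UD_Turkish-BOUN).
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "amod"), (2, "obl")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f}% LAS={las:.2f}%")
```

In the toy sentence three of four heads are correct but only two of those also carry the correct label, which is why LAS is never higher than UAS.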

Our dependency parser model is available through the DiaParser library.

WikiExtractor

This script takes a Wikipedia dump as input and produces files such as
wiki_redirects.txt,
wiki_name_id_map.txt,
wiki_disambiguation.txt.

You can find the WikiExtractor script here.
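
As a rough illustration of how a redirect file can be used downstream, the sketch below resolves a page title to its canonical target. The file layout is an assumption: we assume one `source<TAB>target` pair per line, which may not match the actual format of wiki_redirects.txt. The helper names `load_redirects` and `resolve` are hypothetical.

```python
from typing import Dict, Iterable


def load_redirects(lines: Iterable[str], sep: str = "\t") -> Dict[str, str]:
    """Parse 'source<sep>target' lines into a dict (assumed file layout)."""
    redirects: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or sep not in line:
            continue  # skip blank or malformed lines
        source, target = line.split(sep, 1)
        redirects[source.strip()] = target.strip()
    return redirects


def resolve(title: str, redirects: Dict[str, str], max_hops: int = 10) -> str:
    """Follow redirect chains until a non-redirect title, a cycle, or max_hops."""
    seen = {title}
    for _ in range(max_hops):
        target = redirects.get(title)
        if target is None or target in seen:
            break
        seen.add(target)
        title = target
    return title


# Illustrative in-memory sample; with a real file, pass open(path, encoding="utf-8").
sample = ["Atatürk\tMustafa Kemal Atatürk\n", "M. K. Atatürk\tAtatürk\n"]
redirects = load_redirects(sample)
canonical = resolve("M. K. Atatürk", redirects)
print(canonical)
```

The `seen` set guards against redirect cycles, which do occasionally occur in dumps.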

Wikipedia2Vec

from wikipedia2vec import Wikipedia2Vec

# Load the trained model from disk.
wiki2vec = Wikipedia2Vec.load('wikipedia2vec_trained')

# Five nearest neighbours of the entity 'Atatürk'.
wiki2vec.most_similar(wiki2vec.get_entity('Atatürk'), 5)
# Output:
# [(<Entity Mustafa Kemal Atatürk>, 0.9999999), (<Word atatürk>, 0.9274426), (<Word kemal>, 0.782923), (<Entity Kategori:Mustafa Kemal Atatürk>, 0.77045125), (<Entity Yardım:Açıklamalı sayfa>, 0.7423448)]

# Five nearest neighbours of the entity 'Fatih Terim'.
wiki2vec.most_similar(wiki2vec.get_entity('Fatih Terim'), 5)
# Output:
# [(<Entity Fatih Terim>, 1.0), (<Entity Şenol Güneş>, 0.7102364), (<Entity Müfit Erkasap>, 0.6819058), (<Entity Abdullah Avcı>, 0.67471796), (<Word hiddink>, 0.6672677)]

We used Wikipedia2Vec to obtain page embeddings.
Total number of word occurrences: 457850145
Hyperparameters: window=5, iteration=10, negative=15
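
The scores in the outputs above behave like cosine similarities (the query entity itself scores about 1.0). The following is a small, library-free sketch of that kind of nearest-neighbour ranking. It is not Wikipedia2Vec's implementation; the toy vectors and the function names `cosine` and `most_similar` are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], count: int = 5
) -> List[Tuple[str, float]]:
    """Rank all items by cosine similarity to the query, highest first."""
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:count]


# Toy 2-dimensional embeddings (real models use hundreds of dimensions).
toy = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
ranked = most_similar(toy["a"], toy, count=2)
print(ranked)
```

Because words and entities share one embedding space, a ranking like this can return both kinds of item, as in the outputs above.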

You can access the official Wikipedia2Vec page here.
You can access the 2021 Turkish Wikipedia dump here.
The trained binary file will be shared soon.

POS

We trained a Part of Speech tagging model on top of the BERTurk language model.

Model Parameters

Batch size : 8
Epoch : 10
Maximum sequence length : 128

Dataset

We used the UD Turkish IMST dataset to train, validate, and test our model.

Results

The results are shown below:

Precision | Recall | F1 | Loss
95.94 | 96.04 | 95.99 | 0.1625
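
As a quick consistency check, F1 is the harmonic mean of precision and recall, and the reported values agree with that. The sketch below only reproduces the arithmetic; it makes no assumption about how precision and recall were averaged during evaluation, and the function name `f1_score` is ours.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the POS tagger, in percent.
pos_f1 = f1_score(95.94, 96.04)
print(round(pos_f1, 2))
```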

Model link

You can access our BERT Part of Speech tagging model here.

NER

We trained a Named Entity Recognition model on top of the ConvBERTurk language model.

Model Parameters

Batch size : 32
Epoch : 5
Maximum sequence length : 512

Dataset

We used the XTREME dataset to train, validate, and test our model. We trained the ConvBERT model on the merged train and extra files, and the results below were obtained on the validation file.

Results

The results are shown below:

Precision | Recall | F1 | Loss
95.83 | 96.84 | 96.33 | 0.0665
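
NER scores are often computed at the entity level rather than per token. The sketch below shows one common way to do that, assuming BIO-style tags such as `B-PER` and `I-PER`. Both the tagging scheme and the entity-level scoring are assumptions here; this README does not state how the numbers above were produced. The names `bio_spans` and `span_prf` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        continues = tag.startswith("I-") and kind == tag[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((kind, start, i))
        if tag.startswith(("B-", "I-")):  # a stray I- tag opens a new span
            start, kind = i, tag[2:]
        else:
            start, kind = None, None
    return spans


def span_prf(gold_tags: List[str], pred_tags: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 (exact span and type match)."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the model finds the person but misses the location.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
p, r, f = span_prf(gold, pred)
print(p, r, round(f, 4))
```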

Model link

You can access our ConvBERT Named Entity Recognition model here.

Wikipedia Information Box Relation Extraction

Figure: information box relations extracted from the Tarkan Wikipedia page.
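
To give an idea of what information box relation extraction can look like, here is a minimal sketch that turns `| key = value` lines of infobox wikitext into (subject, relation, object) triples. It is not the extraction code of this project. The field names in the sample (`ad`, `doğum_yeri`, `meslek`) and the function name `infobox_triples` are hypothetical, and real infoboxes need more careful handling of nested templates and multi-line values.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Matches lines of the form "| key = value".
FIELD = re.compile(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")
# Matches [[Target]] and [[Target|Label]], keeping the visible text.
WIKILINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")


def infobox_triples(subject: str, wikitext: str) -> List[Triple]:
    """Extract one triple per non-empty infobox field."""
    triples: List[Triple] = []
    for line in wikitext.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        relation = match.group(1)
        value = WIKILINK.sub(r"\1", match.group(2))  # strip link markup
        if value:  # ignore fields left blank
            triples.append((subject, relation, value))
    return triples


# Illustrative wikitext with hypothetical field names.
sample = """{{Infobox
| ad = Tarkan
| doğum_yeri = [[Alzey]], [[Batı Almanya]]
| meslek =
}}"""
triples = infobox_triples("Tarkan", sample)
for triple in triples:
    print(triple)
```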

Lemmatization

We lemmatized words using a combination of Zeyrek and Turkish Lemmatizer.
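
One simple way to combine two lemmatizers is a fallback chain: try the first, and use the second only when the first returns nothing. The sketch below illustrates that idea with stand-in lookup tables instead of the real libraries. How the two tools were actually combined in this project is not described here, so treat the strategy and the name `combine` as assumptions.

```python
from typing import Callable, Iterable, Optional

Lemmatizer = Callable[[str], Optional[str]]


def combine(lemmatizers: Iterable[Lemmatizer]) -> Callable[[str], str]:
    """Build a lemmatizer that returns the first non-empty result in order."""
    chain = list(lemmatizers)

    def lemmatize(word: str) -> str:
        for lemmatizer in chain:
            result = lemmatizer(word)
            if result:
                return result
        return word  # no lemmatizer knew the word: keep the surface form

    return lemmatize


# Stand-in lookup tables; in practice these would wrap the real lemmatizers.
primary = {"kitaplar": "kitap"}.get
secondary = {"evlerde": "ev"}.get
lemmatize = combine([primary, secondary])
print(lemmatize("kitaplar"), lemmatize("evlerde"), lemmatize("xyz"))
```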

Adjective, Adverb, Verb Corpus

We used Turkish WordNet and the trnlp GitHub repository to collect adjectives, adverbs, and verbs. You can access Turkish WordNet here, and the trnlp repository here.

Count based on POS

Turkish WordNet Count based on POS

Adjective Count | Adverb Count | Verb Count
10092 | 2325 | 13274

trnlp Count based on POS

Adjective Count | Adverb Count | Verb Count
8456 | 1416 | 9788

Total

Adjective Count | Adverb Count | Verb Count
18548 | 3741 | 23062
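
The totals are the plain sums of the two per-source counts, which the short check below reproduces. The variable names are ours; the numbers are the ones reported above.

```python
from collections import Counter

# Counts per part of speech, as reported for each source.
wordnet = Counter(adjective=10092, adverb=2325, verb=13274)
trnlp = Counter(adjective=8456, adverb=1416, verb=9788)

# Counter addition sums the values key by key.
total = wordnet + trnlp
print(dict(total))
```

Since the totals match a simple sum, words that appear in both sources would be counted twice; deduplicating would require the word lists themselves rather than the counts.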

Presentation

https://www.youtube.com/watch?v=25fUKX36Nx4