GitHub - BupyeongHealer/Bio-Text-Mining: Text-mining using PubMed abstracts for analyzing significant genes.

Text mining from PubMed abstracts

The goal of this research is to extract thalassemia-associated genes from PubMed abstracts using a text-mining strategy.

Development Environment

IDE : PyCharm
Package : Biopython
Python Version : 3.6 & 3.7

Installation

Install the Biopython package:

pip install biopython

If an older version of Biopython is already installed, upgrade it:

pip install biopython --upgrade

Overview

0. Settings
1. Load the KEGG genes file
2. Download the abstracts from PubMed
3. Text mining the abstracts
4. Scoring the selected words
5. Discussion

0. Settings

from Bio import Entrez
import math
import time
import random

time.sleep(random.randint(1, 3))

1. Load the KEGG genes file

kegg_names = {}
name_kegg = {}

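# Each line is parsed as: <kegg_id> TAB <name1>, <name2>, ...; <description>
# A raw string keeps the backslashes in the Windows path literal.
with open(r'C:\Users\PARK\genes.txt', 'r') as f:
    for line in f:
        t1 = line.split(';')[0]
        t2 = t1.split('\t')
        if len(t2) < 2:
            continue  # skip blank or malformed lines
        kegg_id = t2[0]
        kegg_names[kegg_id] = []
        for name in t2[1].split(','):
            name = name.strip()
            kegg_names[kegg_id].append(name)
            name_kegg[name] = kegg_id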
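To make the expected input concrete, here is a minimal sketch of the same parsing applied to two in-memory lines. The helper name `parse_gene_lines`, the `ex:` identifiers, and the sample lines are illustrative assumptions; they are not taken from the repository's genes file.

```python
def parse_gene_lines(lines):
    """Build the two lookup tables from KEGG-style gene lines.

    Assumed line layout: <kegg_id> TAB <name1>, <name2>, ...; <description>
    """
    kegg_names = {}  # kegg_id -> list of gene names
    name_kegg = {}   # gene name -> kegg_id
    for line in lines:
        fields = line.split(';')[0].split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        kegg_id = fields[0]
        names = [name.strip() for name in fields[1].split(',')]
        kegg_names[kegg_id] = names
        for name in names:
            name_kegg[name] = kegg_id
    return kegg_names, name_kegg


# Hypothetical sample lines, for illustration only
sample = [
    'ex:1\tGENEA, GA; example gene A\n',
    'ex:2\tGENEB, GB; example gene B\n',
]
kegg_names, name_kegg = parse_gene_lines(sample)
print(kegg_names['ex:1'])  # ['GENEA', 'GA']
print(name_kegg['GB'])     # ex:2
```

The first name listed for each ID is the one later used as the label in the score table.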

2. Download the abstracts from PubMed

disease = 'Thalassemia'.upper()

print('Download abstracts...')

Entrez.email = '[email protected]'

handle = Entrez.esearch(db='pubmed', term=disease, retmax=10000)
record = Entrez.read(handle)

downloaded_abstracts = []

for cnt, pubmed_id in enumerate(record['IdList'], start=1):
    print(cnt, '/', len(record['IdList']))
    # rettype must be 'abstract' (singular) to request the plain-text abstract
    handle = Entrez.efetch(db='pubmed', id=pubmed_id, retmode='text', rettype='abstract')
    downloaded_abstracts.append(handle.read())
    handle.close()
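A loop of up to 10,000 requests can fail partway through on a transient network error. The following is a minimal sketch of one way to pause and retry a single download, using the `time` and `random` modules imported in the settings. The helper name `fetch_with_retry` and its parameters are assumptions for illustration and are not part of the repository or of Biopython.

```python
import random
import time


def fetch_with_retry(fetch, pubmed_id, retries=3, sleep=time.sleep):
    """Call fetch(pubmed_id); on a network error, wait 1-3 s and try again."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(pubmed_id)
        except OSError:  # urllib's URLError and HTTPError derive from OSError
            if attempt == retries:
                raise  # give up after the last attempt
            sleep(random.randint(1, 3))
```

Inside the download loop it could be called as `fetch_with_retry(lambda i: Entrez.efetch(db='pubmed', id=i, retmode='text', rettype='abstract').read(), pubmed_id)`. Passing the fetch function as an argument keeps the helper independent of the network, so it can be tried out with a stand-in function first.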

3. Text mining the abstracts

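keywords_in_abstract = []
for ab in downloaded_abstracts:
    keyword_box = []
    # Treat periods as separators, then split on any whitespace (including line breaks)
    words = ab.replace('.', ' ').split()
    for w in words:
        if w.upper() == disease:
            # The disease term is matched case-insensitively
            keyword_box.append(w.upper())
        elif w in name_kegg:
            # Gene names are matched case-sensitively and stored as KEGG IDs
            keyword_box.append(name_kegg[w])

    keywords_in_abstract.append(keyword_box)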
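As an illustration of what this step keeps and what it drops, here is a small sketch that applies the same matching rules to one made-up sentence. The function name `extract_keywords`, the `ex:` identifiers, and the sentence are hypothetical; the sketch splits on any whitespace.

```python
def extract_keywords(abstract, disease, name_kegg):
    """Return the disease term and KEGG IDs found in one abstract, in order."""
    keyword_box = []
    for w in abstract.replace('.', ' ').split():
        if w.upper() == disease:
            keyword_box.append(disease)
        elif w in name_kegg:
            keyword_box.append(name_kegg[w])
    return keyword_box


# Hypothetical mapping and sentence, for illustration only
name_kegg = {'GENEA': 'ex:1', 'GENEB': 'ex:2'}
text = 'GENEA was studied in thalassemia. Unlike GENEB, geneb is ignored.'
print(extract_keywords(text, 'THALASSEMIA', name_kegg))  # ['ex:1', 'THALASSEMIA']
```

`GENEB,` is missed because only periods are stripped before matching, and `geneb` is missed because gene names are compared case-sensitively. Both are limitations of exact token matching.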

4. Scoring the selected words

def probability(abstracts, keywords):
    # Fraction of abstracts whose keyword list contains every term in `keywords`
    total = len(abstracts)
    if total == 0:
        return 0.0  # nothing was downloaded, so avoid dividing by zero

    count = sum(1 for words in abstracts if all(t in words for t in keywords))
    return count / total

print('Calculating MI....')

scores = {}

p_disease = probability(keywords_in_abstract, [disease])
for kegg_id in kegg_names:
    p_gene = probability(keywords_in_abstract, [kegg_id])
    p_gene_disease = probability(keywords_in_abstract, [kegg_id, disease])

    if p_gene != 0 and p_disease != 0 and p_gene_disease != 0:
        mi = math.log2(p_gene_disease / (p_gene * p_disease))
        scores[kegg_names[kegg_id][0]] = mi

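# A raw string keeps the backslashes in the Windows path literal.
with open(r'C:\Users\PARK\result.txt', 'w') as f2:
    # Write genes from the highest to the lowest score
    for key in sorted(scores, key=scores.get, reverse=True):
        f2.write(key + '\t' + str(scores[key]) + '\n')
        print(key + '\t' + str(scores[key]))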
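The score computed above is log2(p(gene, disease) / (p(gene) * p(disease))), which is usually called pointwise mutual information. The following worked example is a sketch on a toy set of eight keyword lists; the `ex:` identifiers, the data, and the helper name `pmi` are made up for illustration.

```python
import math


def probability(abstracts, keywords):
    """Fraction of abstracts whose keyword list contains every given term."""
    hits = sum(1 for words in abstracts if all(t in words for t in keywords))
    return hits / len(abstracts)


def pmi(abstracts, gene, disease):
    """log2 of the observed co-occurrence over the co-occurrence expected by chance."""
    p_gene = probability(abstracts, [gene])
    p_disease = probability(abstracts, [disease])
    p_both = probability(abstracts, [gene, disease])
    if p_gene == 0 or p_disease == 0 or p_both == 0:
        return None  # the logarithm is undefined, so the gene gets no score
    return math.log2(p_both / (p_gene * p_disease))


D = 'THALASSEMIA'
# Toy data: the disease appears in 4 of 8 lists, ex:1 in 2, ex:2 in 4
toy = [[D, 'ex:1'], [D, 'ex:1'], [D, 'ex:2'], [D],
       ['ex:2'], ['ex:2'], ['ex:2'], []]
print(pmi(toy, 'ex:1', D))  # 1.0   (0.25 / (0.25 * 0.5) = 2)
print(pmi(toy, 'ex:2', D))  # -1.0  (0.125 / (0.5 * 0.5) = 0.5)
print(pmi(toy, 'ex:3', D))  # None  (gene never appears)
```

A score of 1.0 means the pair co-occurs twice as often as independence would predict, and -1.0 means half as often.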

5. Discussion

The text-mining code collected 741 distinct genes: 541 received a positive score and 200 received a negative score.
A positive score means that the gene and the disease appear together in abstracts more often than would be expected if they occurred independently.
A negative score means that they appear together less often than expected, so the gene tends to occur in abstracts without the disease term.
When a gene and the disease co-occur in abstracts, we can conclude that there may be a significant association between the gene and the disease.
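Counts like these can be recomputed from the tab-separated result file. The following is a minimal sketch, assuming each line has the form gene, tab, score as written in the scoring step; the function name `tally_scores` and the sample lines are hypothetical.

```python
def tally_scores(lines):
    """Count positive, negative and zero scores in 'gene<TAB>score' lines."""
    counts = {'positive': 0, 'negative': 0, 'zero': 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        _gene, score = line.rsplit('\t', 1)
        value = float(score)
        if value > 0:
            counts['positive'] += 1
        elif value < 0:
            counts['negative'] += 1
        else:
            counts['zero'] += 1
    return counts


# Hypothetical result lines, for illustration only
sample = ['GENEA\t1.0\n', 'GENEB\t-1.0\n', 'GENEC\t0.25\n']
print(tally_scores(sample))  # {'positive': 2, 'negative': 1, 'zero': 0}
```

To run it on the real output, pass an open file, for example `with open(r'C:\Users\PARK\result.txt') as f: print(tally_scores(f))`.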
