
Text-mining using PubMed abstracts for analyzing significant genes.


BupyeongHealer/Bio-Text-Mining


Text mining from PubMed abstracts

The goal of this project is to extract thalassemia-associated genes from PubMed abstracts using a text-mining strategy.

Development Environment

  • IDE : PyCharm
  • Package : Biopython
  • Python Version : 3.6 & 3.7

Installation

  1. Install the Biopython package.
pip install biopython
  2. If an older Biopython is already installed, upgrade it.
pip install biopython --upgrade

Overview

0. Settings
1. Load the KEGG genes file
2. Download the abstracts from PubMed
3. Text mining the abstracts
4. Scoring the selected words
5. Discussion

0. Settings

from Bio import Entrez
import math
import time
import random

time.sleep(random.randint(1, 3))

1. Load the KEGG genes file

kegg_names = {}
name_kegg = {}

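# Each line is assumed to look like: <KEGG id><TAB><name1>, <name2>, ...; <description>
with open('C:\\Users\\PARK\\genes.txt', 'r') as f:
    for line in f:
        t1 = line.split(';')[0]  # drop the description after ';'
        t2 = t1.split('\t')
        kegg_id = t2[0]
        kegg_names[kegg_id] = []
        for name in t2[1].split(','):
            name = name.strip()
            kegg_names[kegg_id].append(name)
            name_kegg[name] = kegg_id  # reverse lookup: gene name -> KEGG id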
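
The loader above implies that each line of genes.txt holds a KEGG ID, a tab, a comma-separated list of gene names, and then a semicolon followed by a description. Below is a minimal sketch of that parsing applied to a single line, assuming this layout. The sample line and the function name parse_kegg_line are illustrative and are not taken from the actual file.

```python
def parse_kegg_line(line):
    """Split one 'id<TAB>name1, name2; description' line into (id, [names])."""
    # Everything after the first ';' is treated as free-text description.
    head = line.split(';')[0]
    kegg_id, names_field = head.split('\t')[:2]
    names = [name.strip() for name in names_field.split(',')]
    return kegg_id, names


# Hypothetical sample line in the assumed format.
sample = 'hsa:3043\tHBB, CD113t-C, beta-globin; hemoglobin subunit beta\n'
kegg_id, names = parse_kegg_line(sample)
print(kegg_id, names)
```

Every alias in the returned list maps back to the same KEGG ID, which is why the two dictionaries kegg_names and name_kegg are filled together.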

2. Download the abstracts from PubMed

disease = 'Thalassemia'.upper()

print('Download abstracts...')

Entrez.email = '[email protected]'

handle = Entrez.esearch(db='pubmed', term=disease, retmax=10000)
record = Entrez.read(handle)

downloaded_abstracts = []

cnt = 0
for pubmed_id in record['IdList']:
    cnt = cnt + 1

    print(cnt, '/', len(record['IdList']))
    abstract = Entrez.efetch(db='pubmed', id=pubmed_id, retmode='text', rettype='abstract').read()
    downloaded_abstracts.append(abstract)
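
The loop above issues one request per PubMed ID, which can mean up to 10,000 requests. One possible way to reduce that is to ask for several IDs per request. Below is a sketch of the batching logic only, assuming a caller-supplied fetch function that takes a comma-separated ID string. The names chunked, fetch_in_batches and fake_fetch, and the batch size, are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(pubmed_ids, fetch, batch_size=200):
    """Call `fetch` once per batch of IDs and collect the returned texts."""
    results = []
    for batch in chunked(pubmed_ids, batch_size):
        results.append(fetch(','.join(batch)))
    return results


# Stand-in fetcher so the sketch runs offline. With Biopython one might pass:
#   lambda ids: Entrez.efetch(db='pubmed', id=ids, retmode='text',
#                             rettype='abstract').read()
def fake_fetch(ids):
    return 'abstracts for ' + ids


texts = fetch_in_batches(['1', '2', '3', '4', '5'], fake_fetch, batch_size=2)
print(texts)
```

With batching, each returned text holds several abstracts, so it would have to be split into individual abstracts before step 3, which expects one entry per abstract.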

3. Text mining the abstracts

keywords_in_abstract = []
for ab in downloaded_abstracts:
    keyword_box = []
    # split() with no argument splits on any whitespace, including newlines
    words = ab.replace('.', ' ').split()
    for w in words:
        if w.upper() == disease:
            keyword_box.append(w.upper())
        elif w in name_kegg:
            # store the KEGG id so that all aliases of a gene count as one keyword
            keyword_box.append(name_kegg[w])

    keywords_in_abstract.append(keyword_box)
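
Splitting on whitespace leaves commas and parentheses attached to words, so a gene symbol written as "HBB," or "(HBA1)" would not match a key in name_kegg. Below is a sketch of one possible alternative that uses a regular expression to pull out alphanumeric tokens. The helper name tokenize, the pattern, and the sample sentence are assumptions for illustration.

```python
import re

# Letters and digits, optionally joined by internal hyphens (e.g. "beta-globin").
TOKEN_PATTERN = re.compile(r'[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*')


def tokenize(text):
    """Return the alphanumeric tokens of `text`, dropping surrounding punctuation."""
    return TOKEN_PATTERN.findall(text)


sample = 'Mutations in HBB, (HBA1) and beta-globin cause thalassemia.'
tokens = tokenize(sample)
print(tokens)
```

The matching is still case-sensitive for gene names, as in the original loop, so "hbb" would not be recognised as "HBB".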

4. Scoring the selected words

def probability(abstracts, keywords):
    # Fraction of abstracts whose keyword list contains every term in `keywords`.
    total = len(abstracts)
    if total == 0:
        return 0.0

    count = 0
    for words in abstracts:
        if all(t in words for t in keywords):
            count = count + 1

    return count / total

print('Calculating MI....')

scores = {}

p_disease = probability(keywords_in_abstract, [disease])
for kegg_id in kegg_names:
    p_gene = probability(keywords_in_abstract, [kegg_id])
    p_gene_disease = probability(keywords_in_abstract, [kegg_id, disease])

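    # Genes that never co-occur with the disease are skipped, since log2(0) is undefined.
    if p_gene != 0 and p_disease != 0 and p_gene_disease != 0:
        # Pointwise mutual information: log2(P(gene, disease) / (P(gene) * P(disease)))
        mi = math.log2(p_gene_disease / (p_gene * p_disease))
        scores[kegg_names[kegg_id][0]] = mi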

with open('C:\\Users\\PARK\\result.txt', 'w') as f2:
    # Write genes from the highest to the lowest score.
    for key in sorted(scores, key=scores.__getitem__, reverse=True):
        f2.write(key + '\t' + str(scores[key]) + '\n')
        print(key + '\t' + str(scores[key]))
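
To make the score easier to interpret, here is a small worked example on made-up data. It is only a sketch: the eight toy abstracts and the gene labels geneA and geneB are invented, and the helper score is a hypothetical wrapper around the same log2 ratio used above.

```python
import math


def probability(abstracts, keywords):
    """Fraction of abstracts whose keyword list contains every given keyword."""
    hits = sum(1 for words in abstracts if all(k in words for k in keywords))
    return hits / len(abstracts)


# Eight made-up abstracts, already reduced to keyword lists.
toy = [
    ['THALASSEMIA', 'geneA'],
    ['THALASSEMIA', 'geneA'],
    ['THALASSEMIA', 'geneB'],
    ['THALASSEMIA'],
    ['geneB'],
    ['geneB'],
    ['geneB'],
    [],
]


def score(gene, disease='THALASSEMIA'):
    """log2 of observed co-occurrence over the co-occurrence expected by chance."""
    p_joint = probability(toy, [gene, disease])
    p_expected = probability(toy, [gene]) * probability(toy, [disease])
    return math.log2(p_joint / p_expected)


score_a = score('geneA')  # always appears with the disease
score_b = score('geneB')  # mostly appears without the disease
print(score_a, score_b)
```

Here geneA co-occurs with the disease twice as often as chance would predict (0.25 observed against 0.125 expected), giving a score of 1. geneB co-occurs half as often as expected (0.125 against 0.25), giving a score of -1.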

5. Discussion

The text-mining code collected 741 distinct genes: 541 received a positive score and 200 received a negative score.
A positive score means that the gene and the disease appear together in an abstract more often than would be expected if they occurred independently.
A negative score means the opposite: the gene and the disease co-occur less often than expected, so the gene tends to appear in abstracts without the disease term.
When a gene and the disease consistently appear together in abstracts, this suggests that there may be a meaningful association between them.
