Here we use natural language processing to identify medical conditions listed in patient history notes.
Physicians write patient notes to document the history of a patient's complaint, exam findings, possible diagnoses, and care. Learning and assessing these skills requires feedback from other doctors.
Until recently, a part of clinical skills exams for medical students involved interacting with standardized patients and taking notes. These notes were later scored by trained physician raters in a labor-intensive process.
This project seeks to use natural language processing to identify specific clinical concepts in patient notes.
- Develop an automated method to map clinical concepts from an exam rubric to the various ways in which these concepts are expressed in clinical patient notes written by medical students.
DELIVERABLES:
- A well-documented Jupyter notebook that contains a report of your analysis, and a link to that notebook.
- A slideshow suitable for a general audience that summarizes your findings.
- Include well-labeled visualizations in your slides.
- Link to the team Trello board.
- A presentation. Each team member should present a portion of the presentation.
- What clinical conditions are present for the 10 standardized patients?
- On average, how many conditions do students correctly label?
- What words or phrases are tied to specific patients and conditions?
- How do we predict multiple outcomes in a multiclass classification process?
- What other NLP libraries could be useful besides NLTK? (e.g., spaCy)
- What other pre-trained medical NLP models would be good? (e.g., BioBERT)
- What deep learning approaches would be appropriate for answering these questions?
- There will be patient- and condition-specific words in the notes corresponding to the target conditions.
- Bigrams and higher-order n-grams could be good modeling features (see the sketch below).
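As a quick illustration of the n-gram hypothesis, the sketch below pulls bigrams and trigrams out of a sample sentence with NLTK. The sentence is made up for illustration and is not taken from the dataset.

```python
import nltk
from nltk import ngrams, word_tokenize

# nltk.download("punkt")  # one-time download for the tokenizer

# Illustrative note fragment, not taken from the dataset.
text = "Patient reports chest pain and shortness of breath"
tokens = word_tokenize(text.lower())

# Candidate modeling features: bigrams and trigrams.
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print(bigrams[:3])  # [('patient', 'reports'), ('reports', 'chest'), ('chest', 'pain')]
```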
Feature | Datatype | Definition |
---|---|---|
pn_num | int | A unique identifier for each patient note |
case_num | int | A unique identifier for the clinical case a patient note represents |
pn_history | string | The text of the encounter as recorded by the test taker |
feature_num | int | A unique identifier for each feature |
feature_text | string | A description of the feature |
id | int | A unique identifier for each patient note / feature pair |
annotation | string | The text within the patient note indicating a feature |
location | string | The character spans indicating the location of each annotation within the note |
original | string | The raw text as recorded by the test taker |
clean | string | The cleaned version of the raw text, tokenized, with stopwords removed |
stemmed | string | The cleaned text with words reduced to their root forms (stemming) |
lemmatized | string | The cleaned text with words converted to their meaningful base form using context (lemmatization) |
- Read the `README.md`.
- Clone this repository to your local environment:
  - `git clone [email protected]:codeup-nlp-capstone/nlp-capstone.git`
- Open the `tf_env.yml` file and follow the instructions to create an environment with the proper libraries.
- Download the `csv` files from the Kaggle website and ensure they are in your repository directory:
  - `features.csv`
  - `patient_notes.csv`
  - `test.csv`
  - `train.csv`
- Run the final report notebook.
- Acquire Data
- Download from Kaggle.
- Prepare Data: use NLTK to (a minimal cleaning sketch follows the split steps below):
- Convert text to lower case.
- Remove any accented, non-ASCII characters.
- Remove any special characters.
- Lemmatize words.
- Remove stopwords.
- Produce a dataframe with the original and cleaned text.
- Split data:
- Train (700), test (300).
- We will use k-fold cross validation in lieu of having another out-of-sample split.
- Separate X/y: features / target (clinical feature).
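A minimal sketch of the preparation and split steps above, assuming NLTK, pandas, and scikit-learn are installed and the Kaggle csv files are in the working directory. `test_size=0.3` reproduces the 70/30 proportion; the exact 700/300 counts depend on how many notes are sampled.

```python
import re
import unicodedata

import nltk
import pandas as pd
from sklearn.model_selection import train_test_split

# One-time downloads for lemmatization and the stopword list.
# nltk.download("wordnet"); nltk.download("stopwords")

lemmatizer = nltk.stem.WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words("english"))

def clean(text: str) -> str:
    """Lowercase, strip accents and special characters, lemmatize, and drop stopwords."""
    text = unicodedata.normalize("NFKD", text.lower()).encode("ascii", "ignore").decode("utf-8")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(lemmatizer.lemmatize(w) for w in text.split() if w not in stopwords)

# Assumes patient_notes.csv has been downloaded from Kaggle into the working directory.
notes = pd.read_csv("patient_notes.csv")
notes["clean"] = notes["pn_history"].apply(clean)

# 70/30 split, stratified by case so each of the ten cases appears in both splits.
train, test = train_test_split(notes, test_size=0.3, random_state=42, stratify=notes["case_num"])
```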
- Explore
- Look at the clinical features represented.
- Separate the overall corpus into word lists for each patient and each clinical feature.
- Look at the frequency of words in each list to determine unique words or words with high predictive value.
- Visualize the most frequent words by patient (a word-frequency sketch follows this list).
- Examine bigrams and other n-grams.
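Continuing from the cleaning sketch above (it assumes the `train` dataframe with a `clean` column), one way to build the per-case word lists and frequency counts:

```python
import pandas as pd

# Word frequencies per case, built from the cleaned training notes.
words_by_case = (
    train.groupby("case_num")["clean"]
    .apply(lambda notes: pd.Series(" ".join(notes).split()).value_counts())
)

# Ten most frequent cleaned words for case 0; the same lists drive the
# per-patient visualizations and the bigram comparisons.
print(words_by_case.loc[0].head(10))
```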
- Model: the MVP will be a model that determines the case number from the patient note (a minimal TF-IDF modeling sketch follows this section).
- Feature engineering
- Determine Term Frequency (TF).
- Determine Inverse Document Frequency (IDF).
- Calculate the TF-IDF index.
- Build models
- Logistic regression
- Decision trees
- Random Forest
- KNN
- Tune models
- Use k-fold cross validation to check for overtraining.
- Evaluate models
- Figure out the baseline.
- Determine the most important evaluation metric.
- Compare confusion matrices of different models to see which performs best.
- Test the best model on out-of-sample data.
- Deep Learning models: RNN, LSTM, and other neural network approaches to classification.
- Feature engineering
- Attempt to label features from the training data notes.
- Make a separate model for each condition, limiting the number of features.
- Find a way to get multiple classification outputs from the models.
- Look into feature extraction using classical approaches.
- Look into model improvement through model stacking or model ensembling.
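A minimal sketch of the MVP classical model, assuming the `train`/`test` split from the earlier sketch: TF-IDF features feeding a logistic regression, checked with k-fold cross validation. The same `Pipeline` pattern works for the decision tree, random forest, and KNN models, and fitted pipelines can later be combined with scikit-learn's `StackingClassifier` or `VotingClassifier` for the ensembling idea above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# TF-IDF features (unigrams + bigrams) feeding a logistic regression classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# k-fold cross validation on the training split to check for overtraining.
scores = cross_val_score(model, train["clean"], train["case_num"], cv=5, scoring="accuracy")
print(f"cv accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit on the full training split, then evaluate once on the held-out test split.
model.fit(train["clean"], train["case_num"])
print(f"test accuracy: {model.score(test['clean'], test['case_num']):.3f}")
```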
- Set up an environment for SciSpaCy.
- Shortfalls with word-count vectorization:
- e.g., "Patient denies pain, shortness of breath, etc., etc., ...": a long list of things that the patient does NOT have.
- Need context-dependent word embeddings.
- Look into Gensim word embeddings.
- Use spaCy, and specifically SciSpaCy, for part-of-speech tagging, looking to identify conditions, drugs, and other relevant features (a minimal sketch follows this list).
- Consider training a different model for each case.
- Look at LSTMs and multilabel, multiclass classification.
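A minimal SciSpaCy sketch, assuming `scispacy` and the pretrained `en_core_sci_sm` model have been installed per the scispacy documentation; the example sentence is illustrative only.

```python
import spacy

# Requires: pip install scispacy, plus the en_core_sci_sm model from the scispacy releases.
nlp = spacy.load("en_core_sci_sm")

# Illustrative note fragment, not taken from the dataset.
doc = nlp("17 yo male presents with chest pain, denies shortness of breath or fever.")

# SciSpaCy tags biomedical entities and parts of speech out of the box.
print([(ent.text, ent.label_) for ent in doc.ents])
print([(token.text, token.pos_) for token in doc])
```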
- Refine best visuals.
- Create slides.
- Divide presentation.
- Rehearse.
- Deliver
- Classical models perform well on simple tasks like assigning a note to one of the ten possible cases based on its content.
- These approaches fail at more complex tasks like extracting the 143 different clinical concepts from the notes.
- Concepts can be expressed in many different ways with many different words.
- Spelling is inconsistent in these notes, with many misspellings.
- TF-IDF and other word-count based approaches are insufficient.
- More sophisticated methods for word embedding that capture context are needed.
- SpaCy
- SciSpaCy
- BioBERT
- Each of these additional deep-learning approaches adds new, useful information, but none of them alone solves the problem of extracting and identifying clinical concepts from patient notes.
- Our final model will likely be an ensemble of different models employing these techniques.
- Classical techniques with word counts can be used to accurately predict simple classifications.
- These techniques are insufficient for identifying clinical concepts in notes because there are so many different ways to express the same concept in words.
- Deep learning approaches that look at word context improve models.
- Some of these approaches have been trained specifically on biomedical data, and those perform best.
- Add BioBERT (a minimal embedding sketch follows this list).
- Try different word embeddings for deep-learning models that use LSTMs.
- Create ensemble models incorporating these different embeddings.
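A minimal sketch of pulling context-dependent BioBERT embeddings with the Hugging Face `transformers` library. The checkpoint name `dmis-lab/biobert-base-cased-v1.1` is one published BioBERT release and is used here as an assumption, not a project decision; the resulting token vectors could feed the LSTM or ensemble models described above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Pretrained BioBERT weights from the Hugging Face hub (assumed checkpoint id).
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Context-dependent embeddings for an illustrative note fragment.
inputs = tokenizer("Patient denies chest pain or shortness of breath.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per subword token: shape (1, num_tokens, hidden_size).
print(outputs.last_hidden_state.shape)
```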