Here we use natural language processing to identify medical conditions listed in patient history notes.
Physicians write patient notes to document the history of a patient's complaint, exam findings, possible diagnoses, and care. Learning and assessing these skills requires feedback from other doctors.
Until recently, a part of clinical skills exams for medical students involved interacting with standardized patients and taking notes. These notes were later scored by trained physician raters in a labor-intensive process.
This project seeks to use natural language processing to identify specific clinical concepts in patient notes.
- Develop an automated method to map clinical concepts from an exam rubric to the various ways in which these concepts are expressed in clinical patient notes written by medical students.
DELIVERABLES:
- A well-documented Jupyter notebook that contains a report of your analysis, and a link to that notebook.
- A slideshow suitable for a general audience that summarizes your findings.
- Include well-labeled visualizations in your slides.
- Link to the team Trello board.
- A presentation. Each team member should present a portion of the presentation.
- What clinical conditions are present for the 10 standardized patients?
- On average, how many conditions do students correctly label?
- What words or phrases are tied to specific patients and conditions?
- How do we predict multiple outcomes in a multiclass classification process?
- What other NLP libraries could be useful besides NLTK? (e.g., spaCy)
- What other pre-trained medical NLP models would be good? (e.g., BioBERT)
- What deep learning approaches would be appropriate for answering these questions?
- There will be patient- and condition-specific words in the notes corresponding to the target conditions.
- Bigrams and higher-order n-grams could be good modeling features (see the sketch below).
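As a quick illustration of the n-gram hypothesis, the sketch below pulls bigrams and trigrams out of a sample sentence with NLTK. The sentence is made up for illustration and is not taken from the dataset.

```python
import nltk
from nltk import ngrams, word_tokenize

# nltk.download("punkt")  # one-time download for the tokenizer

# Illustrative note fragment, not taken from the dataset.
text = "Patient reports chest pain and shortness of breath"
tokens = word_tokenize(text.lower())

# Candidate modeling features: bigrams and trigrams.
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print(bigrams[:3])  # [('patient', 'reports'), ('reports', 'chest'), ('chest', 'pain')]
```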
Feature | Datatype | Definition |
---|---|---|
pn_num | int | A unique identifier for each patient note |
case_num | int | A unique identifier for the clinical case a patient note represents |
pn_history | string | The text of the encounter as recorded by the test taker |
feature_num | int | A unique identifier for each feature |
feature_text | string | A description of the feature |
id | int | A unique identifier for each patient note / feature pair |
annotation | string | The text within the patient note indicating a feature |
location | string | The character spans indicating the location of each annotation within the note |
original | string | The raw text as recorded by the test taker |
clean | string | The cleaned version of the raw text, tokenized, with stopwords removed |
stemmed | string | The cleaned text with words reduced to their root forms (stemming) |
lemmatized | string | The cleaned text with words converted to their meaningful base form using context (lemmatization) |
- Read the `README.md`.
- Clone this repository to your local environment:
  - `git clone [email protected]:codeup-nlp-capstone/nlp-capstone.git`
- Open the `tf_env.yml` file and follow the instructions to create an environment with the proper libraries.
- Download the `csv` files from the Kaggle website and ensure they are in your repository directory:
  - `features.csv`
  - `patient_notes.csv`
  - `test.csv`
  - `train.csv`
- Run the final report notebook.
- Acquire Data
- Download from Kaggle.
- Prepare Data: use NLTK to (a minimal cleaning sketch follows the split steps below):
- Convert text to lower case.
- Remove any accented, non-ASCII characters.
- Remove any special characters.
- Lemmatize words.
- Remove stopwords.
- Produce a dataframe with the original and cleaned text.
- Split data:
- Train (700), test (300).
- We will use k-fold cross validation in lieu of having another out-of-sample split.
- Separate X/y: features / target (clinical feature).
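A minimal sketch of the preparation and split steps above, assuming NLTK, pandas, and scikit-learn are installed and the Kaggle csv files are in the working directory. `test_size=0.3` reproduces the 70/30 proportion; the exact 700/300 counts depend on how many notes are sampled.

```python
import re
import unicodedata

import nltk
import pandas as pd
from sklearn.model_selection import train_test_split

# One-time downloads for lemmatization and the stopword list.
# nltk.download("wordnet"); nltk.download("stopwords")

lemmatizer = nltk.stem.WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words("english"))

def clean(text: str) -> str:
    """Lowercase, strip accents and special characters, lemmatize, and drop stopwords."""
    text = unicodedata.normalize("NFKD", text.lower()).encode("ascii", "ignore").decode("utf-8")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(lemmatizer.lemmatize(w) for w in text.split() if w not in stopwords)

# Assumes patient_notes.csv has been downloaded from Kaggle into the working directory.
notes = pd.read_csv("patient_notes.csv")
notes["clean"] = notes["pn_history"].apply(clean)

# 70/30 split, stratified by case so each of the ten cases appears in both splits.
train, test = train_test_split(notes, test_size=0.3, random_state=42, stratify=notes["case_num"])
```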
- Explore
- Look at the clinical features represented.
- Separate the overall corpus into word lists for each patient and each clinical feature.
- Look at the frequency of words in each list to determine unique words or words with high predictive value.
- Visualize the most frequent words by patient (a word-frequency sketch follows this list).
- Examine bigrams and other n-grams.
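Continuing from the cleaning sketch above (it assumes the `train` dataframe with a `clean` column), one way to build the per-case word lists and frequency counts:

```python
import pandas as pd

# Word frequencies per case, built from the cleaned training notes.
words_by_case = (
    train.groupby("case_num")["clean"]
    .apply(lambda notes: pd.Series(" ".join(notes).split()).value_counts())
)

# Ten most frequent cleaned words for case 0; the same lists drive the
# per-patient visualizations and the bigram comparisons.
print(words_by_case.loc[0].head(10))
```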
- Model: the MVP will be a model that determines the case number from the patient note (a minimal TF-IDF modeling sketch follows this section).
- Feature engineering
- Determine Term Frequency (TF).
- Determine Inverse Document Frequency (IDF).
- Calculate the TF-IDF index.
- Build models
- Logistic regression
- Decision trees
- Random Forest
- KNN
- Tune models
- Use k-fold cross validation to check for overtraining.
- Evaluate models
- Figure out the baseline.
- Determine the most important evaluation metric.
- Compare confusion matrices of different models to see which performs best.
- Test the best model on out-of-sample data.
- Deep Learning models: RNN, LSTM, and other neural network approaches to classification.
- Feature engineering
- Attempt to label features from the training data notes.
- Make a separate model for each condition, limiting the number of features.
- Find a way to get multiple classification outputs from the models.
- Look into feature extraction using classical approaches.
- Look into model improvement through model stacking or model ensembling.
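A minimal sketch of the MVP classical model, assuming the `train`/`test` split from the earlier sketch: TF-IDF features feeding a logistic regression, checked with k-fold cross validation. The same `Pipeline` pattern works for the decision tree, random forest, and KNN models, and fitted pipelines can later be combined with scikit-learn's `StackingClassifier` or `VotingClassifier` for the ensembling idea above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# TF-IDF features (unigrams + bigrams) feeding a logistic regression classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# k-fold cross validation on the training split to check for overtraining.
scores = cross_val_score(model, train["clean"], train["case_num"], cv=5, scoring="accuracy")
print(f"cv accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit on the full training split, then evaluate once on the held-out test split.
model.fit(train["clean"], train["case_num"])
print(f"test accuracy: {model.score(test['clean'], test['case_num']):.3f}")
```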
- Set up an environment for SciSpaCy.
- Shortfalls with word-count vectorization:
- e.g., "Patient denies pain, shortness of breath, etc., etc., ...": a long list of things that the patient does NOT have.
- Need context-dependent word embeddings.
- Look into Gensim word embeddings.
- Use spaCy, and specifically SciSpaCy, for part-of-speech tagging, looking to identify conditions, drugs, and other relevant features (a minimal sketch follows this list).
- Consider training a different model for each case.
- Look at LSTMs and multilabel, multiclass classification.
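A minimal SciSpaCy sketch, assuming `scispacy` and the pretrained `en_core_sci_sm` model have been installed per the scispacy documentation; the example sentence is illustrative only.

```python
import spacy

# Requires: pip install scispacy, plus the en_core_sci_sm model from the scispacy releases.
nlp = spacy.load("en_core_sci_sm")

# Illustrative note fragment, not taken from the dataset.
doc = nlp("17 yo male presents with chest pain, denies shortness of breath or fever.")

# SciSpaCy tags biomedical entities and parts of speech out of the box.
print([(ent.text, ent.label_) for ent in doc.ents])
print([(token.text, token.pos_) for token in doc])
```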
- Refine best visuals.
- Create slides.
- Divide presentation.
- Rehearse.
- Deliver
- Classical models perform well on simple tasks like assigning a note to one of the ten possible cases based on its content.
- These approaches fail at more complex tasks like extracting the 143 different clinical concepts from the notes.
- Concepts can be expressed in many different ways with many different words.
- Spelling is inconsistent in these notes, with many misspellings.
- TF-IDF and other word-count based approaches are insufficient.
- More sophisticated methods for word embedding that capture context are needed.
- SpaCy
- SciSpaCy
- BioBERT
- Each of these additional deep-learning approaches adds new, useful information, but none of them alone solves the problem of extracting and identifying clinical concepts from patient notes.
- Our final model will likely be an ensemble of different models employing these techniques.
- Classical techniques with word counts can be used to accurately predict simple classifications.
- These techniques are insufficient for identifying clinical concepts in notes because there are so many different ways to express the same concept in words.
- Deep learning approaches that look at word context improve models.
- Some of these approaches have been trained specifically on biomedical data, and those perform best.
- Add BioBERT (a minimal embedding sketch follows this list).
- Try different word embeddings for deep-learning models that use LSTMs.
- Create ensemble models incorporating these different embeddings.
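A minimal sketch of pulling context-dependent BioBERT embeddings with the Hugging Face `transformers` library. The checkpoint name `dmis-lab/biobert-base-cased-v1.1` is one published BioBERT release and is used here as an assumption, not a project decision; the resulting token vectors could feed the LSTM or ensemble models described above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Pretrained BioBERT weights from the Hugging Face hub (assumed checkpoint id).
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Context-dependent embeddings for an illustrative note fragment.
inputs = tokenizer("Patient denies chest pain or shortness of breath.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per subword token: shape (1, num_tokens, hidden_size).
print(outputs.last_hidden_state.shape)
```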