About • Data • Installation • Method • How To Use • Credits • Code Structure • Results • License
The project is accomplished by team OrloviETF with members:
Igor Pavlovic - @Igzi
Jelisaveta Aleksic - @AleksicJelisaveta
Natasa Jovanovic - @natasa-jovanovic
This repository contains the work done for our second Machine Learning course project, completed in collaboration with the Laboratory for Biomolecular Modelling under the mentorship of Lucien Krapp. The goal of our effort is to use amino acid sequences to predict the melting temperature (Tm) of proteins, which is an important characteristic that indicates a protein's thermal stability.
The initial approach was to use the training data publicly available from Kaggle's Novozymes Enzyme Stability Prediction competition. The competition organizers later made updates to this file, and the final training file used is train_updated.
One part of the experiments was conducted by directly using sequences as inputs to our methods and Tm values as outputs. However, based on this discussion, we incorporated knowledge-based data preprocessing to obtain [TO-DO].
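As a minimal sketch of the sequence-in / Tm-out setup, assuming the Kaggle competition's column names (`protein_sequence`, `tm`) and using invented toy rows in place of `train_updated`:

```python
import io

import pandas as pd

# Toy stand-in for the training CSV; the real file follows the
# Kaggle schema with columns such as protein_sequence and tm.
csv = io.StringIO(
    "seq_id,protein_sequence,tm\n"
    "0,MKKLV,48.2\n"
    "1,MAHTS,61.5\n"
)
df = pd.read_csv(csv)

sequences = df["protein_sequence"].tolist()  # model inputs (amino acid strings)
targets = df["tm"].to_numpy()                # regression targets (Tm)
print(sequences, targets)
```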
Installation may depend on your task. The general steps are the following:

1. (Optional) Create and activate a new environment using `conda` or `venv` (+ `pyenv`).

   a. `conda` version:

   ```shell
   # create env
   conda create -n project_env python=PYTHON_VERSION

   # activate env
   conda activate project_env
   ```

   b. `venv` (+ `pyenv`) version:

   ```shell
   # create env
   ~/.pyenv/versions/PYTHON_VERSION/bin/python3 -m venv project_env

   # alternatively, using default python version
   python3 -m venv project_env

   # activate env
   source project_env/bin/activate
   ```

2. Install all required packages:

   ```shell
   pip install -r requirements.txt
   ```
We implemented two main approaches for this problem.

- **Pretrained ESM 2 model**: Use the pretrained ESM 2 model from Hugging Face to generate sequence embeddings, then fine-tune by training a neural network on these embeddings to predict Tm values.
- **Carbonara architecture**: Use the Carbonara architecture to embed our data (precisely, the output of the penultimate layer) and:
  - obtain features used to train a neural network, or
  - use an RNN model on the embeddings directly.
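The embedding-plus-regression-head idea behind the first approach can be sketched in plain NumPy; the sequence length, embedding dimension, and random weights below are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-residue embeddings for one protein, shaped
# (sequence_length, embedding_dim), as a model like ESM 2 would produce.
seq_len, emb_dim, hidden = 120, 320, 64
token_embeddings = rng.normal(size=(seq_len, emb_dim))

# Mean-pool over residues to get one fixed-size vector per protein.
pooled = token_embeddings.mean(axis=0)  # shape: (emb_dim,)

# A one-hidden-layer regression head; weights are random here,
# whereas in the project a network is trained to predict Tm.
W1 = rng.normal(scale=0.01, size=(emb_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(hidden, 1))
b2 = np.zeros(1)

h = np.maximum(pooled @ W1 + b1, 0.0)  # ReLU activation
tm_pred = (h @ W2 + b2).item()         # scalar Tm prediction
print(pooled.shape, tm_pred)
```

Mean pooling is one common way to reduce variable-length sequences to a fixed-size feature vector; the actual pooling and head architecture used in the notebooks may differ.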
To train the ESM model, run the following command:

```shell
python3 scripts/run.py
```

Alternatively, you can run the `models/esm.ipynb` notebook.
To run the Carbonara models, you need to retrieve the Carbonara outputs used to train the model from this link and store them inside the root folder. Afterward, simply run the `models/carbonara_simple.ipynb` or `models/carbonara_rnn.ipynb` notebooks to reproduce the results.
```
├── metrics and plots: folder containing metrics and plots of our models
├── models
│   ├── carbonara_embeddings.ipynb: notebook to process Carbonara features and extract embeddings
│   ├── carbonara_rnn.ipynb: RNN model based on Carbonara embeddings
│   ├── carbonara_simple.ipynb: MLP model based on Carbonara embeddings
│   ├── data_exploration.ipynb: notebook for data exploration
│   ├── esm.ipynb: model based on the ESM output
│   └── evaluate_models.ipynb: computes the relevant metrics and plots the results
├── predictions: folder containing model predictions on the validation dataset
├── scripts
│   ├── datasets.py: definition of the ProteinDataset used by the ESM model
│   ├── esm.py: definition of the ESM model we used
│   ├── evaluate.py: code to evaluate the performance of the ESM model
│   ├── run.py: Python script to run and evaluate the ESM model
│   └── train.py: code for training the ESM model
├── CS_433_Class_Project_2.pdf: a report of the project
├── README.md
├── requirements.txt
├── test.csv: csv file containing test data
├── train.csv: csv file containing training data
├── train_wildtype_groups.csv: csv file containing grouped training data
└── train_no_wildtype.csv: csv file containing ungrouped training data
```
The table below shows the results obtained for Model 1 (ESM) and the best Model 2 (Carbonara MLP with pLDDT factor).
| Model     | PCC         | SCC         | RMSE       | MAE       |
|-----------|-------------|-------------|------------|-----------|
| ESM       | 0.77 ± 0.01 | 0.56 ± 0.01 | 7.7 ± 0.1  | 5.6 ± 0.1 |
| Carbonara | 0.49 ± 0.01 | 0.37 ± 0.01 | 11.6 ± 0.2 | 8.7 ± 0.3 |
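For reference, the four metrics above (Pearson correlation, Spearman correlation, RMSE, MAE) can be computed as in the NumPy sketch below; the example arrays are invented, and this is not the project's actual evaluation code. The Spearman implementation here ranks via double `argsort` and so ignores ties, which is adequate for an illustration:

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation: covariance normalized by the standard deviations.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    # Spearman correlation: Pearson correlation of the rank-transformed values.
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))

def rmse(y_true, y_pred):
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.mean(np.abs(d)))

# Invented Tm values for illustration only.
y_true = np.array([45.0, 52.3, 60.1, 48.7, 55.0])
y_pred = np.array([47.1, 50.9, 58.4, 50.2, 56.5])
print(pearson(y_true, y_pred), spearman(y_true, y_pred),
      rmse(y_true, y_pred), mae(y_true, y_pred))
```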
The estimated training time for 5 epochs of the ESM model is approximately 30 minutes on a workstation with a dedicated GPU. In comparison, the estimated training times for 100 epochs of the Carbonara MLP and Carbonara RNN models are roughly 2 minutes and 30 minutes, respectively.
This repository is based on a heavily modified fork of the pytorch-template and asr_project_template repositories.