About • Data • Installation • Method • How To Use • Credits • Code Structure • Results • License
The project is accomplished by team OrloviETF with members:
Igor Pavlovic - @Igzi
Jelisaveta Aleksic - @AleksicJelisaveta
Natasa Jovanovic - @natasa-jovanovic
This repository contains the work done for our second Machine Learning course project, completed in collaboration with the Laboratory for Biomolecular Modelling under the mentorship of Lucien Krapp. The goal of our effort is to use amino acid sequences to predict the melting temperature (Tm) of proteins, which is an important characteristic that indicates a protein's thermal stability.
The initial approach was to use the training data publicly available from Kaggle's Novozymes Enzyme Stability Prediction competition. The competition organizers later made updates to this file, and the final training file used is train_updated.
One part of the experiments was conducted by directly using sequences as inputs to our methods and Tm values as outputs. However, based on this discussion, we incorporated knowledge-based data preprocessing to obtain [TO-DO].
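As a minimal sketch of the sequence-in / Tm-out setup, assuming the Kaggle competition's column names (`protein_sequence`, `tm`) and using invented toy rows in place of `train_updated`:

```python
import io

import pandas as pd

# Toy stand-in for the training CSV; the real file follows the
# Kaggle schema with columns such as protein_sequence and tm.
csv = io.StringIO(
    "seq_id,protein_sequence,tm\n"
    "0,MKKLV,48.2\n"
    "1,MAHTS,61.5\n"
)
df = pd.read_csv(csv)

sequences = df["protein_sequence"].tolist()  # model inputs (amino acid strings)
targets = df["tm"].to_numpy()                # regression targets (Tm)
print(sequences, targets)
```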
Installation may depend on your task. The general steps are the following:

1. (Optional) Create and activate a new environment using `conda` or `venv` (+ `pyenv`).

   a. `conda` version:

   ```shell
   # create env
   conda create -n project_env python=PYTHON_VERSION

   # activate env
   conda activate project_env
   ```

   b. `venv` (+ `pyenv`) version:

   ```shell
   # create env
   ~/.pyenv/versions/PYTHON_VERSION/bin/python3 -m venv project_env

   # alternatively, using default python version
   python3 -m venv project_env

   # activate env
   source project_env/bin/activate
   ```

2. Install all required packages:

   ```shell
   pip install -r requirements.txt
   ```
We implemented two main approaches for this problem.

- **Pretrained ESM 2 model**: Use the pretrained ESM 2 model from Hugging Face to generate sequence embeddings, then fine-tune by training a neural network on these embeddings to predict Tm values.
- **Carbonara architecture**: Use the Carbonara architecture to embed our data (precisely, the output of the penultimate layer) and:
  - obtain features used to train a neural network, or
  - use an RNN model on the embeddings directly.
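The embedding-plus-regression-head idea behind the first approach can be sketched in plain NumPy; the sequence length, embedding dimension, and random weights below are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-residue embeddings for one protein, shaped
# (sequence_length, embedding_dim), as a model like ESM 2 would produce.
seq_len, emb_dim, hidden = 120, 320, 64
token_embeddings = rng.normal(size=(seq_len, emb_dim))

# Mean-pool over residues to get one fixed-size vector per protein.
pooled = token_embeddings.mean(axis=0)  # shape: (emb_dim,)

# A one-hidden-layer regression head; weights are random here,
# whereas in the project a network is trained to predict Tm.
W1 = rng.normal(scale=0.01, size=(emb_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(hidden, 1))
b2 = np.zeros(1)

h = np.maximum(pooled @ W1 + b1, 0.0)  # ReLU activation
tm_pred = (h @ W2 + b2).item()         # scalar Tm prediction
print(pooled.shape, tm_pred)
```

Mean pooling is one common way to reduce variable-length sequences to a fixed-size feature vector; the actual pooling and head architecture used in the notebooks may differ.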
To train the ESM model, run the following command:

```shell
python3 scripts/run.py
```

Alternatively, you can run the `models/esm.ipynb` notebook.
To run the Carbonara models, you need to retrieve the Carbonara outputs used to train the model from this link and store them inside the root folder. Afterward, simply run the `models/carbonara_simple.ipynb` or `models/carbonara_rnn.ipynb` notebooks to reproduce the results.
```
├── metrics and plots: folder containing metrics and plots of our models
├── models
│   ├── carbonara_embeddings.ipynb: notebook to process Carbonara features and extract embeddings
│   ├── carbonara_rnn.ipynb: RNN model based on Carbonara embeddings
│   ├── carbonara_simple.ipynb: MLP model based on Carbonara embeddings
│   ├── data_exploration.ipynb: notebook for data exploration
│   ├── esm.ipynb: model based on the ESM output
│   └── evaluate_models.ipynb: computes the relevant metrics and plots the results
├── predictions: folder containing model predictions on the validation dataset
├── scripts
│   ├── datasets.py: definition of the ProteinDataset used by the ESM model
│   ├── esm.py: definition of the ESM model we used
│   ├── evaluate.py: code to evaluate the performance of the ESM model
│   ├── run.py: Python script to run and evaluate the ESM model
│   └── train.py: code for training the ESM model
├── CS_433_Class_Project_2.pdf: a report of the project
├── README.md
├── requirements.txt
├── test.csv: csv file containing test data
├── train.csv: csv file containing training data
├── train_wildtype_groups.csv: csv file containing grouped training data
└── train_no_wildtype.csv: csv file containing ungrouped training data
```
The table below shows the results obtained for Model 1 (ESM) and the best Model 2 (Carbonara MLP with pLDDT factor).
| Model     | PCC         | SCC         | RMSE       | MAE       |
|-----------|-------------|-------------|------------|-----------|
| ESM       | 0.77 ± 0.01 | 0.56 ± 0.01 | 7.7 ± 0.1  | 5.6 ± 0.1 |
| Carbonara | 0.49 ± 0.01 | 0.37 ± 0.01 | 11.6 ± 0.2 | 8.7 ± 0.3 |
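For reference, the four metrics above (Pearson correlation, Spearman correlation, RMSE, MAE) can be computed as in the NumPy sketch below; the example arrays are invented, and this is not the project's actual evaluation code. The Spearman implementation here ranks via double `argsort` and so ignores ties, which is adequate for an illustration:

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation: covariance normalized by the standard deviations.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    # Spearman correlation: Pearson correlation of the rank-transformed values.
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))

def rmse(y_true, y_pred):
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    d = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.mean(np.abs(d)))

# Invented Tm values for illustration only.
y_true = np.array([45.0, 52.3, 60.1, 48.7, 55.0])
y_pred = np.array([47.1, 50.9, 58.4, 50.2, 56.5])
print(pearson(y_true, y_pred), spearman(y_true, y_pred),
      rmse(y_true, y_pred), mae(y_true, y_pred))
```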
The estimated training time for 5 epochs of the ESM model is approximately 30 minutes on a workstation with a dedicated GPU. In comparison, the estimated training times for 100 epochs of the Carbonara MLP and Carbonara RNN models are roughly 2 minutes and 30 minutes, respectively.
This repository is based on a heavily modified fork of the pytorch-template and asr_project_template repositories.