Skip to content

CS-433/ml-project-2-orlovietf

Repository files navigation

Protein Melting Temperature Prediction

AboutDataInstallationMethodHow To UseCreditsCode StructureResultsLicense

Team

The project is accomplished by team OrloviETF with members:

Igor Pavlovic - @Igzi

Jelisaveta Aleksic - @AleksicJelisaveta

Natasa Jovanovic - @natasa-jovanovic

About

This repository contains the work done for our second Machine Learning course project, which was completed in conjunction with the Laboratory for Biomolecular Modelling under the mentorship of Lucien Krapp. The goal of our effort is to use amino acid sequences to predict the melting temperature (Tm) of proteins, which an important characteristic that indicates a protein's thermal stability.

Data

The initial approach was to use the train data publicly available for Kaggle's competition Novozymes Enzyme Stability Prediction. There were updates on this file made by the competition organizers and the final train file used is train_updated.

One part of the experiments was conducted by directly using sequences as inputs to our methods and values of tm as outputs. However, based on this discussion, we incorporated the knowledge based data preprocessing to obtain [TO-DO].

Installation

Installation may depend on your task. The general steps are the following:

  1. (Optional) Create and activate new environment using conda or venv (+pyenv).

    a. conda version:

    # create env
    conda create -n project_env python=PYTHON_VERSION
    
    # activate env
    conda activate project_env

    b. venv (+pyenv) version:

    # create env
    ~/.pyenv/versions/PYTHON_VERSION/bin/python3 -m venv project_env
    
    # alternatively, using default python version
    python3 -m venv project_env
    
    # activate env
    source project_env
  2. Install all required packages

    pip install -r requirements.txt

Method

We implemented two main approaches for this problem.

  • Pretrained ESM 2 Model: Utilize the pretrained ESM 2 model from Hugging Face to generate sequence embeddings. These embeddings are further fine-tuned by training a neural network to predict Tm values.

  • Carbonara Architecture: Used the Carbonara architecture embeddings from our data (precisely, the output from the penultimate layer) and:

    1. Obtained features used to train a neural network.
    2. Used RNN model on embeddings directly.

How To Use

To train an esm model, run the following command:

python3 scripts/run.py

Alternatively, you can also run the models/esm.ipynb notebook.

To run the Carbonara models, you need to retieve the carbonara outputs used to train the model from this link, and store it inside the root folder. Afterward simply run the models/carbonara_simpl.ipynb or models/carbonara_rnn.ipynb notebooks to reproduce the results.

Code Structure

├── metricks and plots:  Folder containing metrics and plots of our models
├── models
    ├── carbonara_embeddings.ipynb: notebok to process carbonara features and extract embeddings
    ├── carbonara_rnn.ipynb: rnn model based on carbonara embeddings
    ├── carbonara_simple.ipyng: MLP model based on carbonara embeddings
    ├── data_exploration.ipynb: notebook for data explotation
    ├── esm.ipynb: model based on the ESM output
    ├── evaluate_models.ipynb: computes the relevant metrics and plots the results
├── predictions: Folder containing model predictions on the validation dataset
├── scripts
    ├── datasets.py: definition ProteinDataset used by the ESM model
    ├── esm.py: defitinion of the esm model we used
    ├── evaluate.py: code to evaluete the performance of the esm model
    ├── run.py: python script to run and evaluate the ESM model
    ├── train.py: code for training the esm model
├── CS_433_Class_Project_2.pdf: a report of the project.
├── README.md
├── requirements.txt
├── test.csv: csv file containg test data
├── train.csv: csv file containing training data
├── train_wildtype_groups.csv: csv file containing grouped trained data
├── train_no_wildtype.csv: csv file containing training data which are not grouped

Results

The table below shows the results obtained for Model 1 (ESM) and the best Model 2 (Carbonara MLP with pLDDT factor).

Model PCC SCC RMSE MAE
ESM 0.77 ± 0.01 0.56 ± 0.01 7.7 ± 0.1 5.6 ± 0.1
Carbonara 0.49 ± 0.01 0.37 ± 0.01 11.6 ± 0.2 8.7 ± 0.3

The estimated training time for 5 epochs of the ESM model is approximately 30 minutes on a workstation with a dedicated GPU. In comparison, the estimated training times for 100 epochs of the Carbonara MLP and Carbonara RNN models are roughly 2 minutes and 30 minutes, respectively.

Credits

This repository is based on a heavily modified fork of pytorch-template and asr_project_template repositories.

License

License

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •