Getting Started: A Basic Tutorial

nadavbra edited this page Jul 30, 2016 · 36 revisions

Foreword

This short tutorial helps you install ASAP and start using it as quickly as possible by walking through a basic use of the framework. It won't cover all of ASAP's features, and it won't give you fully optimized results. You will see, however, that the package can provide meaningful results even with minimal tuning.

Installation

Dependencies

ASAP is dependent on the following Python packages:

  • Biopython
  • NumPy
  • Pandas
  • scikit-learn

Tip: getting these packages is very easy using Anaconda.

Installing ASAP

To install ASAP, simply clone it directly from GitHub:

git clone https://github.com/ddofer/asap.git /choose/your/favorite/location

Make sure that the py/ sub-directory is in your PYTHONPATH whenever you use ASAP.
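
If you would rather not change PYTHONPATH globally, one alternative is to extend sys.path at the top of your script. This is only a minimal sketch: the clone location ~/asap is a placeholder, so substitute the path you passed to git clone.

```python
import os
import sys

# Placeholder: replace with the directory you cloned the repository into.
asap_root = os.path.expanduser('~/asap')

# The py/ sub-directory is the one that must be importable.
asap_py_dir = os.path.join(asap_root, 'py')

# Add it once, ahead of the other entries, so "from asap import *" can find it.
if asap_py_dir not in sys.path:
    sys.path.insert(0, asap_py_dir)
```

This affects only the current Python process; setting PYTHONPATH in your shell profile remains the more permanent option.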

Testing

Open your Python command-line interpreter and enter:

from asap import *

Getting to know ASAP - a basic tutorial

To demonstrate the power of ASAP, we will try our luck with predicting serine phosphorylation sites. Inside the data/phosphoserine sub-directory, we have a toy dataset of phosphorylated proteins in the human proteome, taken from UniProt. As the sole purpose of this dataset is to provide an example for this tutorial, we didn't put a lot of effort into creating and validating it, so it might be incomplete. Hence we do not recommend using it outside of this tutorial.

For convenience, I will assume that you save all files to your home directory; feel free to use any other paths instead.

Step 1 - Extract windows and features

Using ASAP requires having a dataset of annotated sequences in the LF format. Inside data/phosphoserine there are two files: annotated_seqs.lf, which contains the full dataset of 4758 sequences, and annotated_seqs_demo.lf, which contains only the first 1013 records. Because we want things to run in a reasonable amount of time, we will use the latter:

annotated_seqs_file = open('<path_to_asap>/data/phosphoserine/annotated_seqs_demo.lf', 'rb')

Choose where to save the extracted windows with features:

import os
csv_output_file = open(os.path.expanduser('~/window_features.csv'), 'wb')  # open() does not expand '~' by itself

If you haven't done so already, now is a good time to import ASAP:

from asap import *

First, we need to define the window extraction parameters. We will use the defaults (and hence the default features), with one exception: we will filter out every window that is not centered on a serine. We know in advance that only serine residues can be phosphoserine sites, so extracting windows centered on the other 19 amino acids would be a waste of CPU time and disk storage.

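def windows_filter(window):
    # Keep only windows whose central ("hot") residue is a serine.
    # window_extraction_params is looked up when the filter is called, so defining it below is fine.
    return window.get_aa_seq()[window_extraction_params.window_hot_index] == 'S'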

window_extraction_params = WindowExtractionParams(windows_filter = windows_filter)
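
To make the idea of a serine-centered window concrete, here is a stand-alone sketch on plain strings. It does not use ASAP at all: the helper names, the window size of 5 and the '_' padding character are illustrative assumptions, not ASAP's actual defaults or API.

```python
def extract_windows(seq, window_size=5, pad_char='_'):
    """Return one (center_index, window) pair per residue of seq.

    window_size and pad_char are illustrative choices, not ASAP's defaults.
    """
    assert window_size % 2 == 1, 'window_size must be odd'
    half = window_size // 2
    # Pad both ends so that every residue can sit at the center of a window.
    padded = pad_char * half + seq + pad_char * half
    return [(i, padded[i:i + window_size]) for i in range(len(seq))]


def is_serine_centered(window):
    # The "hot" residue is the middle position of the window.
    return window[len(window) // 2] == 'S'


windows = extract_windows('MVLSPADKS')
serine_windows = [(i, w) for i, w in windows if is_serine_centered(w)]
print(serine_windows)  # [(3, 'VLSPA'), (8, 'DKS__')]
```

Of the nine windows in this toy sequence, only the two centered on a serine survive the filter, which is the same kind of reduction the windows_filter function above gives during extraction.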

That's it. You are ready to extract:

extract_windows_from_file(annotated_seqs_file, extract_annotations = True, csv_output_file = csv_output_file, window_extraction_params = window_extraction_params)

It should run for 10-20 minutes. It will produce a large (163MB) CSV file containing 68807 windows, each having 735 features.

When it's done, don't forget to close open files:

annotated_seqs_file.close()
csv_output_file.close()

ASAP is designed as a feature engineering framework, but you may use it only for extracting features, later processing the output CSV file with your own preferred pipeline (you can also use the method get_windows_data). However, ASAP can further help you to apply standard ML pipelines on top of the extracted features, obtaining decent results with minimal effort. That's what we are going to do in the rest of the tutorial.

Step 2 - Train and estimate a model

We will now use the extracted windows with features to train and evaluate an ML model using a standard cross-validation (CV) procedure.

We will use a simple linear SVM model:

from sklearn.svm import LinearSVC
# 'balanced' replaces the older 'auto' option, which was removed in scikit-learn 0.19.
classifiers = [LinearSVC(class_weight = 'balanced')]

Load the CSV file you have just generated, using Pandas:

import pandas as pd
windows_data_frame = pd.read_csv('~/window_features.csv')

We will use 3-fold cross-validation:

window_classifier, performance = train_window_classifier(windows_data_frame, classifiers = classifiers, n_folds = 3)
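
For readers new to cross-validation, the following sketch shows what splitting into 3 folds means: every sample is used for testing exactly once and for training in the other rounds. It is a plain, unshuffled k-fold written for illustration; k_fold_indices is a hypothetical helper, and ASAP's own splitting strategy may differ (for example, it may shuffle or stratify).

```python
def k_fold_indices(n_samples, n_folds=3):
    """Yield (train_indices, test_indices) for each fold, in order."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(7, n_folds=3):
    print('train: %s test: %s' % (train, test))
```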

Let it run; it has some data to crunch. When it finishes, you will have a trained classifier (window_classifier) and a list of performance metrics (performance, composed of: F1 score, AUC, sensitivity, precision, specificity and a confusion matrix).
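
As a reminder of how most of these metrics relate to the confusion matrix, here is a small sketch using their standard definitions. The counts are made up for illustration and are not results from this tutorial, and binary_metrics is a hypothetical helper, not part of ASAP. AUC is left out because it depends on the classifier's ranking scores rather than on a single confusion matrix.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary metrics; assumes every denominator is non-zero."""
    sensitivity = tp / float(tp + fn)   # also called recall
    precision = tp / float(tp + fp)
    specificity = tn / float(tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        'sensitivity': sensitivity,
        'precision': precision,
        'specificity': specificity,
        'f1': f1,
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=60, fp=240, tn=660, fn=40)
for name in sorted(metrics):
    print('%s = %.2f' % (name, metrics[name]))
```

With rare positives, as in this toy example, sensitivity can be reasonable while precision stays low, because even a modest false-positive rate produces many more false positives than true positives.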

You can see that even with these default settings, we got non-trivial results (AUC = 75%, sensitivity = 63%, precision = 21%). As noted at the beginning, the focus of this tutorial is to get you familiar with the basics of ASAP, not to obtain fully optimized results. In particular, we used only a part of the full (potentially noisy) dataset, default settings and no classifier tuning.

You can now create a peptide predictor, and use Pickle to save it for later use:

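import os
import pickle

peptide_predictor = PeptidePredictor(window_classifier, window_extraction_params)

# open() does not expand '~', so resolve the home directory explicitly.
with open(os.path.expanduser('~/peptide_predictor.pkl'), 'wb') as predictor_dump_file: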
    pickle.dump(peptide_predictor, predictor_dump_file)
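
Loading the predictor back in a later session is the mirror image of saving it. The sketch below shows the round trip with a stand-in object and a temporary file so that it runs without ASAP; with the real predictor you would open the path you saved to above instead.

```python
import os
import pickle
import tempfile

# Stand-in for the real peptide_predictor so this sketch runs without ASAP.
stand_in_predictor = {'name': 'demo predictor'}

# A temporary location; in the tutorial this would be your saved .pkl path.
dump_path = os.path.join(tempfile.mkdtemp(), 'peptide_predictor.pkl')

with open(dump_path, 'wb') as dump_file:
    pickle.dump(stand_in_predictor, dump_file)

# Later (possibly in another session): read the object back.
with open(dump_path, 'rb') as dump_file:
    restored = pickle.load(dump_file)

print(restored == stand_in_predictor)  # True
```

Pickle stores a reference to an object's class rather than the class itself, so ASAP's py/ directory must be importable in the session that loads the real predictor.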

Step 3 - Use your trained model

After training a classifier, you can use your derived peptide predictor to easily predict annotations for new sequences:

peptide_predictor.predict_annotations('MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR')

You will immediately get a binary output string (composed of 1s and 0s), indicating the predicted label for each of the input sequence's residues (phosphoserine or not).
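
Since the output has one character per residue, it is easy to turn into a list of predicted positions. The sketch below assumes exactly that one-to-one layout; the label string is made up for illustration, and predicted_positions is a hypothetical helper rather than part of ASAP.

```python
def predicted_positions(seq, labels):
    """Return (1-based position, residue) for every residue labeled '1'."""
    if len(seq) != len(labels):
        raise ValueError('expected one label per residue')
    return [(i + 1, aa)
            for i, (aa, label) in enumerate(zip(seq, labels))
            if label == '1']


# Made-up labels for a short toy sequence, not real predictor output.
toy_seq = 'MVLSPADKS'
toy_labels = '000100000'
print(predicted_positions(toy_seq, toy_labels))  # [(4, 'S')]
```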

Summary

I hope that this short tutorial helped you get started with ASAP. Although we covered only the basics, ASAP is a highly configurable framework. To learn about its full feature repertoire, use Python's built-in dir and help functions to read the documentation of the various module components.
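
As a quick illustration of that workflow, the sketch below lists a module's public names with dir and reads one docstring. It uses the standard-library json module as a stand-in so that it runs without ASAP installed; with ASAP on your path you would import asap instead.

```python
import json as module  # stand-in; with ASAP installed, use "import asap as module"

# dir() lists everything the module exposes; skip private names.
public_names = [name for name in dir(module) if not name.startswith('_')]
print(public_names)

# In an interactive interpreter, help(module.dumps) shows the full documentation.
# Here we just print the first line of the docstring.
first_line = (module.dumps.__doc__ or '').strip().splitlines()[0]
print(first_line)
```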

To learn more about the underlying algorithm, please read our paper "ASAP: A Machine-Learning Framework for Local Protein Properties". If you found our work useful for your research, please cite it.

Contact us

For any issue/request, feel free to contact us: Nadav Brandes ([email protected]) and Dan Ofer (ddofer "at" gmail.com).