diff --git a/docs/additional_code.html b/docs/additional_code.html
index 052e6863..ed544344 100644
--- a/docs/additional_code.html
+++ b/docs/additional_code.html
@@ -80,19 +80,7 @@ alphapept
- - + +
diff --git a/docs/chem.html b/docs/chem.html
index c7d40843..f2060ec4 100644
--- a/docs/chem.html
+++ b/docs/chem.html
@@ -145,19 +145,7 @@ alphapept
- - + +
@@ -328,7 +316,7 @@

Chem

-

This notebook contains all chemistry-related functionality. Here, a major part is functionality to generate isotope patterns and the averagine model. This is required for feature finding, where an isotope pattern is validated by being compared to its averagine model. We use the data structure Isotopes from constants to handle Isotopes. Next, we define the class IsotopeDistribution to calculate isotope distributions for a given Isotope. The core of the calculation is the function fast_add, which allows the fast estimation of isotope distributions.
To calculate the isotope distribution for the averagine model, we define the function get_average_formula, which calculates the amino acid composition of the averagine molecule for the given mass.

+

This notebook contains all chemistry-related functionality. Here, a major part is functionality to generate isotope patterns and the averagine model. This is required for feature finding, where an isotope pattern is validated by being compared to its averagine model. We use the data structure Isotopes from constants to handle Isotopes. Next, we define the class IsotopeDistribution to calculate isotope distributions for a given Isotope. The core of the calculation is the function fast_add, which allows the fast estimation of isotope distributions.
To calculate the isotope distribution for the averagine model, we define the function get_average_formula, which calculates the amino acid composition of the averagine molecule for the given mass.

@@ -348,7 +336,7 @@

IsotopeDistributions<

The implementation is a Python port of the Java version of the proteomicore - implementation from Johan Teleman

In brief, the approach avoids expanding polynomial expressions by combining precalculated patterns of hypothetical atom clusters and pruning low-intensity peaks.
To calculate the isotope distribution for a given isotope, we define the function dict_to_dist, which accepts a dictionary of amino acids and returns the isotope distribution.


-

source

+

source

dict_to_dist

@@ -359,7 +347,7 @@

dict_to_dist

Args: counted_AA (Dict): Numba-typed dict with counts of atoms. isotopes (Dict): Numba-typed lookup dict with isotopes.

Returns: IsotopeDistribution: The calculated isotope distribution for the chemical compound.


-

source

+

source

numba_bin

@@ -370,7 +358,7 @@

numba_bin

Args: decimal (int): Decimal number.

Returns: list: Number in binary.


-

source

+

source

fast_add

@@ -428,9 +416,9 @@

IsotopeDistribution

Averagine

-

The averagine model is based on Senko et al. We define the function get_average_formula to calculate a dictionary with averagine masses.

+

The averagine model is based on Senko et al. We define the function get_average_formula to calculate a dictionary with averagine masses.


-

source

+

source

get_average_formula

@@ -453,9 +441,9 @@

get_average_formula{C: 13, H: 24, N: 4, O: 4, S: 0}

-

To directly calculate the isotope distribution of a molecule mass based on the averagine model, we define the wrapper function mass_to_dist.

+

To directly calculate the isotope distribution of a molecule mass based on the averagine model, we define the wrapper function mass_to_dist.


-

source

+

source

mass_to_dist

@@ -499,9 +487,9 @@

mass_to_dist

Mass Calculations

-

calculate_mass: This function calculates the precursor mass from the monoisotopic m/z and the charge.

+

calculate_mass: This function calculates the precursor mass from the monoisotopic m/z and the charge.
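
The relation between m/z, charge, and neutral mass can be sketched as follows. This is a minimal illustration for positive charge states, not the documented implementation; the name calculate_mass_sketch and the constant M_PROTON are assumptions introduced here.

```python
# Proton mass in Da. A physical constant, not a value taken from this page.
M_PROTON = 1.00727646688


def calculate_mass_sketch(mono_mz: float, charge: int) -> float:
    """Neutral precursor mass from the monoisotopic m/z and a positive charge."""
    # Each charge adds one proton, so remove `charge` proton masses.
    return mono_mz * charge - charge * M_PROTON
```

For a doubly charged ion at m/z 500, this gives 1000 Da minus two proton masses.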


-

source

+

source

calculate_mass

diff --git a/docs/constants.html b/docs/constants.html
index 1fb3c02f..f8642f02 100644
--- a/docs/constants.html
+++ b/docs/constants.html
@@ -144,19 +144,7 @@ alphapept
- - + +
@@ -339,7 +327,7 @@

Amino Acids

Mass dict

A numba-compatible mass dictionary. It is created from modifications.tsv; change that file to allow custom modifications.


-

source

+

source

get_mass_dict

diff --git a/docs/contributing.html b/docs/contributing.html
index a5309526..151f609a 100644
--- a/docs/contributing.html
+++ b/docs/contributing.html
@@ -80,19 +80,7 @@ alphapept
- - + +
@@ -404,7 +385,7 @@

Useful tricks:


-

source

+

source

diff --git a/docs/display.html b/docs/display.html
index 16aad837..13a2f900 100644
--- a/docs/display.html
+++ b/docs/display.html
@@ -81,19 +81,7 @@ alphapept
- - + +
@@ -256,7 +244,7 @@

Display

Sequence coverage

Calculate the coverage of a target protein sequence by peptides in a given list


-

source

+

source

calculate_sequence_coverage

diff --git a/docs/export.html b/docs/export.html
index 260b56a7..3efe77ce 100644
--- a/docs/export.html
+++ b/docs/export.html
@@ -143,19 +143,7 @@ alphapept
- - + +
@@ -324,7 +312,7 @@

MaxQuant output file

Sequence Notation

Dictionary to be able to convert AlphaPept sequence notation to MaxQuant


-

source

+

source

ap_to_mq_sequence

@@ -332,7 +320,7 @@

ap_to_mq_sequence

Converts AlphaPept sequence format to MaxQuant format. Returns sequence_naked, len_sequence, modifications_, mq_sequence.


-

source

+

source

remove_mods

@@ -354,7 +342,7 @@

evidence.txt

for _ in evidence.columns: print(f"mq_dict_evidence['{_}'] =")
-

source

+

source

prepare_ap_results

diff --git a/docs/fasta.html b/docs/fasta.html
index 7a8ee418..1e4a3b97 100644
--- a/docs/fasta.html
+++ b/docs/fasta.html
@@ -145,19 +145,7 @@ alphapept
- - + +
@@ -404,9 +392,9 @@

FASTA

Currently, numba has only limited string support. A lot of the functions are therefore Python-native.

Cleaving

-

We use regular expressions to find potential cleavage sites for cleaving and write the wrapper cleave_sequence to use it.

+

We use regular expressions to find potential cleavage sites for cleaving and write the wrapper cleave_sequence to use it.
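
A minimal sketch of regex-based cleaving is shown here. It assumes a simplified trypsin rule (cleave after K or R unless followed by P) rather than the pattern stored in alphapept.constants.protease_dict; the names TRYPSIN_SKETCH and cleave_sketch are hypothetical.

```python
import re

# Simplified trypsin rule (an assumption): cleave after K or R unless P follows.
TRYPSIN_SKETCH = r"[KR](?!P)"


def cleave_sketch(sequence, n_missed_cleavages=0, pep_length_min=1, pep_length_max=50):
    """Return peptides between cleavage sites, allowing missed cleavages."""
    # Collect the end position of every cleavage site plus both sequence ends.
    cuts = [0] + [m.end() for m in re.finditer(TRYPSIN_SKETCH, sequence)]
    if cuts[-1] != len(sequence):
        cuts.append(len(sequence))

    peptides = []
    for i in range(len(cuts) - 1):
        # Joining j - i consecutive pieces corresponds to j - i - 1 missed cleavages.
        for j in range(i + 1, min(i + 2 + n_missed_cleavages, len(cuts))):
            peptide = sequence[cuts[i]:cuts[j]]
            if pep_length_min <= len(peptide) <= pep_length_max:
                peptides.append(peptide)
    return peptides
```

With one missed cleavage allowed, neighbouring pieces are also emitted as a joined peptide, which is why the length filter is applied only after joining.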


-

source

+

source

cleave_sequence

@@ -416,7 +404,7 @@

cleave_sequence

Cleave a sequence with a given protease. Filters to have a minimum and maximum length. Args: sequence (str): the given (protein) sequence. n_missed_cleavages (int): the number of max missed cleavages. protease (str): the protease/enzyme name, the regular expression can be found in alphapept.constants.protease_dict. pep_length_min (int): min peptide length. pep_length_max (int): max peptide length. Returns: list (of str): cleaved peptide sequences with missed cleavages.


-

source

+

source

get_missed_cleavages

@@ -440,7 +428,7 @@

get_missed_cleavages<

Counting missed and internal cleavages

The following are helper functions to retrieve the number of missed cleavages and internal cleavage sites for each sequence.


-

source

+

source

count_internal_cleavages

@@ -449,7 +437,7 @@

count_internal_cl

Counts the number of internal cleavage sites for a given sequence and protease Args: sequence (str): the given (peptide) sequence. protease (str): the protease/enzyme name, the regular expression can be found in alphapept.constants.protease_dict. Returns: int (0 or 1): if the sequence is from internal cleavage.


-

source

+

source

count_missed_cleavages

@@ -473,9 +461,9 @@

count_missed_cleava

Parsing

-

Peptides are composed of amino acids, which are written in capital letters, e.g. PEPTIDE. To distinguish modifications, they are written in lowercase, as in PEPTIoxDE, and can be of arbitrary length. For a modified amino acid (AA), the modification precedes the letter of the amino acid. Decoys are indicated with an underscore. Therefore, the parse function splits after _. When parsing, the peptide string is converted into a numba-compatible list, like so: PEPoxTIDE -> [P, E, P, oxT, I, D, E]. This lets us use the mass_dict from alphapept.constants to directly determine the masses of the corresponding amino acids.

+

Peptides are composed of amino acids, which are written in capital letters, e.g. PEPTIDE. To distinguish modifications, they are written in lowercase, as in PEPTIoxDE, and can be of arbitrary length. For a modified amino acid (AA), the modification precedes the letter of the amino acid. Decoys are indicated with an underscore. Therefore, the parse function splits after _. When parsing, the peptide string is converted into a numba-compatible list, like so: PEPoxTIDE -> [P, E, P, oxT, I, D, E]. This lets us use the mass_dict from alphapept.constants to directly determine the masses of the corresponding amino acids.
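
The parsing rule can be illustrated with a plain-Python sketch. It returns a regular list rather than a numba-typed one, and it assumes that everything from the first underscore onward is a decoy tag to be dropped; parse_sketch is a hypothetical name.

```python
def parse_sketch(peptide: str) -> list:
    """Split a peptide string into amino acids with their lowercase modification prefixes."""
    # Assumption: the part after an underscore is a decoy tag and is discarded.
    if "_" in peptide:
        peptide = peptide.split("_")[0]

    parsed, mod = [], ""
    for char in peptide:
        if char.isupper():
            # An uppercase letter closes the current (possibly empty) modification prefix.
            parsed.append(mod + char)
            mod = ""
        else:
            mod += char
    return parsed
```

Because the modification stays attached to its amino acid, each list element can serve directly as a lookup key in a mass dictionary.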


-

source

+

source

list_to_numba

@@ -483,7 +471,7 @@

list_to_numba

Convert Python list to numba.typed.List Args: a_list (list): Python list. Return: List (numba.typed.List): Numba typed list.


-

source

+

source

parse

@@ -503,9 +491,9 @@

parse

Decoy

-

The decoy strategy employed is a pseudo-reversal of the peptide sequence, keeping only the terminal amino acid and reversing the rest. Additionally, we can call the functions swap_KR and swap_AL, which swap the respective AAs. The function swap_KR will only swap terminal AAs. The swapping functions only work if the AA is not modified.

+

The decoy strategy employed is a pseudo-reversal of the peptide sequence, keeping only the terminal amino acid and reversing the rest. Additionally, we can call the functions swap_KR and swap_AL, which swap the respective AAs. The function swap_KR will only swap terminal AAs. The swapping functions only work if the AA is not modified.
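
For an unmodified peptide, the pseudo-reversal can be sketched in one line. This illustration assumes the retained terminal amino acid is the C-terminal one and ignores modifications and the decoy tag; pseudo_reverse_sketch is a hypothetical name.

```python
def pseudo_reverse_sketch(peptide: str) -> str:
    """Reverse everything except the last (C-terminal) amino acid of an unmodified peptide."""
    # Slicing keeps this safe for empty strings and single-letter peptides.
    return peptide[:-1][::-1] + peptide[-1:]
```

Keeping the C-terminal residue in place preserves the cleavage-site residue, so the decoy still looks like a product of the same protease.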


-

source

+

source

add_decoy_tag

@@ -513,7 +501,7 @@

add_decoy_tag

Adds a ’_decoy’ tag to a list of peptides


-

source

+

source

get_decoys

@@ -523,7 +511,7 @@

get_decoys

Wrapper to get decoys for lists of peptides Args: peptide_list (list): the list of peptides to be reversed. pseudo_reverse (bool): If True, reverse the peptide but keep the C-terminal amino acid; otherwise reverse the whole peptide. (Default: False) AL_swap (bool): replace A with L, and vice versa. (Default: False) KR_swap (bool): replace K with R at the C-terminal, and vice versa. (Default: False) Returns: list (of str): a list of decoy peptides


-

source

+

source

swap_AL

@@ -533,7 +521,7 @@

swap_AL

Swaps A with L. Note: Only if AA is not modified. Args: peptide (str): peptide.

Returns: str: peptide with swapped ALs.


-

source

+

source

swap_KR

@@ -544,7 +532,7 @@

swap_KR

Args: peptide (str): peptide.

Returns: str: peptide with swapped KRs.


-

source

+

source

get_decoy_sequence

@@ -582,7 +570,7 @@

Modifications

Fixed Modifications

Fixed modifications are implemented by passing a list with modified AAs that should be replaced. As an AA is only one letter, the remainder is the modification.


-

source

+

source

add_fixed_mods

@@ -613,7 +601,7 @@

add_fixed_mods

Variable Modifications

To employ variable modifications, we loop through each variable modification and each position of the peptide and add them to the peptide list. For each iteration in get_isoforms, one more variable modification will be added.
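
One step of this loop, adding a single variable modification at each eligible position, can be sketched as follows. The sketch works on an already parsed peptide and is not the documented add_variable_mod; add_one_variable_mod_sketch is a hypothetical name.

```python
def add_one_variable_mod_sketch(parsed_pep, mods_variable_dict):
    """Return all peptide strings that carry exactly one additional variable modification."""
    isoforms = []
    for pos, aa in enumerate(parsed_pep):
        # Only unmodified AAs that have an entry in the dict can be modified.
        if aa in mods_variable_dict:
            modified = parsed_pep[:pos] + [mods_variable_dict[aa]] + parsed_pep[pos + 1:]
            isoforms.append("".join(modified))
    return isoforms
```

Applying the same step again to each result would yield the doubly modified forms, which is why limits on the number of isoforms and modifications are useful.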


-

source

+

source

get_isoforms

@@ -623,7 +611,7 @@

get_isoforms

Function to generate modified forms (with variable modifications) for a given peptide - returns a list of modified forms. The original sequence is included in the list. Args: mods_variable_dict (dict): Dictionary with modifications. The key is AA, and value is the modified form (e.g. oxM). peptide (str): the peptide sequence to generate modified forms. isoforms_max (int): max number of modified forms to generate per peptide. n_modifications_max (int, optional): max number of variable modifications per peptide. Returns: list (of str): the list of peptide forms for the given peptide


-

source

+

source

add_variable_mod

@@ -644,9 +632,9 @@

add_variable_mod

['AMAMA', 'AoxMAMA', 'AMAoxMA']
-

Lastly, we define the wrapper add_variable_mods so that the functions can be called for lists of peptides and a list of variable modifications.

+

Lastly, we define the wrapper add_variable_mods so that the functions can be called for lists of peptides and a list of variable modifications.


-

source

+

source

add_variable_mods

@@ -676,7 +664,7 @@

Terminal Mo

Additionally, if we want to have a terminal modification on any AA, we indicate this with ^.


-

source

+

source

add_fixed_mods_terminal

@@ -686,7 +674,7 @@

add_fixed_mods_ter

Wrapper to add fixed mods on sequences and lists of mods Args: peptides (list of str): peptide list. mods_fixed_terminal (list of str): list of fixed terminal mods. Raises: “Invalid fixed terminal modification {mod}” exception for the given mod. Returns: list (of str): list of peptides with modification added.


-

source

+

source

add_fixed_mod_terminal

@@ -713,9 +701,9 @@

add_fixed_mod_termi

Terminal Modifications - Variable

-

Lastly, to handle terminal variable modifications, we use the function add_variable_mods_terminal. As the modification can only be at the terminal end, this function only adds a peptide where the terminal end is modified.

+

Lastly, to handle terminal variable modifications, we use the function add_variable_mods_terminal. As the modification can only be at the terminal end, this function only adds a peptide where the terminal end is modified.


-

source

+

source

get_unique_peptides

@@ -724,7 +712,7 @@

get_unique_peptides

Function to return unique elements from list. Args: peptides (list of str): peptide list. Returns: list (of str): list of peptides (unique).


-

source

+

source

add_variable_mods_terminal

@@ -743,9 +731,9 @@

add_variable_mo

Generating Peptides

-

Lastly, we put all the functions into a wrapper generate_peptides. It will accept a peptide and a dictionary with settings so that we can get all modified peptides.

+

Lastly, we put all the functions into a wrapper generate_peptides. It will accept a peptide and a dictionary with settings so that we can get all modified peptides.


-

source

+

source

check_peptide

@@ -754,7 +742,7 @@

check_peptide

Check if the peptide contains non-AA letters. Args: peptide (str): peptide sequence. AAs (set): the set of legal amino acids. See alphapept.constants.AAs Returns: bool: True if all letters in the peptide are in AAs, otherwise False


-

source

+

source

generate_peptides

@@ -796,12 +784,12 @@

generate_peptides

Mass Calculations

-

Using the mass_dict from constants and being able to parse sequences with parse, one can simply look up the masses for each modified or unmodified amino acid and add everything up.

+

Using the mass_dict from constants and being able to parse sequences with parse, one can simply look up the masses for each modified or unmodified amino acid and add everything up.

Precursor

To calculate the mass of the neutral precursor, we start with the mass of an \(H_2O\) and add the masses of all amino acids of the sequence.
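
As a minimal sketch of this sum, assume a small dictionary with standard monoisotopic masses and an "H2O" key for water. The dictionary contents and the name get_precmass_sketch are illustrative assumptions, not the mass_dict from constants.

```python
# Standard monoisotopic masses in Da, listed for illustration only.
MASS_DICT_SKETCH = {"G": 57.02146372, "A": 71.03711379, "H2O": 18.01056468}


def get_precmass_sketch(parsed_pep, mass_dict):
    """Neutral precursor mass: one water plus the residue masses of all amino acids."""
    mass = mass_dict["H2O"]
    for aa in parsed_pep:
        mass += mass_dict[aa]
    return mass
```

For the dipeptide GA, the sketch adds the two residue masses to one water, giving about 146.0691 Da.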


-

source

+

source

get_precmass

@@ -818,11 +806,11 @@

get_precmass

Fragments

-

Likewise, we can calculate the masses of the fragment ions. We employ two functions: get_fragmass and get_frag_dict.

-

get_fragmass is a fast, numba-compatible function that calculates the fragment masses and returns an array indicating whether the ion-type was b or y.

-

get_frag_dict instead is not numba-compatible and hence a bit slower. It returns a dictionary with the respective ion and can be used for plotting theoretical spectra.

+

Likewise, we can calculate the masses of the fragment ions. We employ two functions: get_fragmass and get_frag_dict.

+

get_fragmass is a fast, numba-compatible function that calculates the fragment masses and returns an array indicating whether the ion-type was b or y.

+

get_frag_dict instead is not numba-compatible and hence a bit slower. It returns a dictionary with the respective ion and can be used for plotting theoretical spectra.
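
The dictionary-based variant can be sketched in plain Python for singly charged b- and y-ions. The ion labels, the masses, and the name get_frag_dict_sketch are assumptions made for illustration; the documented functions may encode ion types differently.

```python
# Illustrative constants in Da (standard values, not taken from this page).
M_PROTON = 1.00727646688
MASS_DICT_SKETCH = {"G": 57.02146372, "A": 71.03711379, "H2O": 18.01056468}


def get_frag_dict_sketch(parsed_pep, mass_dict):
    """Map ion labels such as 'b1' or 'y2' to singly charged fragment m/z values."""
    frag_dict = {}
    n = len(parsed_pep)
    for i in range(1, n):
        # b-ion: the first i residues plus one proton.
        b_mass = sum(mass_dict[aa] for aa in parsed_pep[:i]) + M_PROTON
        # y-ion: the last i residues plus water plus one proton.
        y_mass = sum(mass_dict[aa] for aa in parsed_pep[n - i:]) + mass_dict["H2O"] + M_PROTON
        frag_dict[f"b{i}"] = b_mass
        frag_dict[f"y{i}"] = y_mass
    return frag_dict
```

A useful sanity check is that complementary ions (b1 with y2 for a tripeptide) sum to the neutral precursor mass plus two protons.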


-

source

+

source

get_fragmass

@@ -840,7 +828,7 @@

get_fragmass


-

source

+

source

get_frag_dict

@@ -894,16 +882,16 @@

get_frag_dict

Spectra

-

The function get_spectrum returns a tuple with the following content:

+

The function get_spectrum returns a tuple with the following content:

  • precursor mass
  • peptide sequence
  • fragmasses
  • fragtypes
-

Likewise, get_spectra returns a list of tuples. We employ a list of tuples here as this way, we can sort them easily by precursor mass.

+

Likewise, get_spectra returns a list of tuples. We employ a list of tuples here as this way, we can sort them easily by precursor mass.


-

source

+

source

get_spectra

@@ -913,7 +901,7 @@

get_spectra

Get neutral peptide mass, fragment masses and fragment types for a list of peptides Args: peptides (list of str): the (modified) peptide list. mass_dict (numba.typed.Dict): key is the amino acid or modified amino acid, and the value is the mass. Raises: Unknown exception and pass. Returns: list of Tuple[float, str, np.ndarray(np.float64), np.ndarray(np.int8)]: See get_spectrum.


-

source

+

source

get_spectrum

@@ -933,10 +921,10 @@

get_spectrum

Reading FASTA

-

To read FASTA files, we use the SeqIO module from the Biopython library. SeqIO parses the file lazily as an iterator, so we read one FASTA entry after another until StopIteration is reached; this is implemented in read_fasta_file. Additionally, we define the function read_fasta_file_entries that simply counts the number of FASTA entries.

-

All FASTA entries that contain AAs which are not in the mass_dict can be checked with check_sequence and will be ignored.

+

To read FASTA files, we use the SeqIO module from the Biopython library. SeqIO parses the file lazily as an iterator, so we read one FASTA entry after another until StopIteration is reached; this is implemented in read_fasta_file. Additionally, we define the function read_fasta_file_entries that simply counts the number of FASTA entries.

+

All FASTA entries that contain AAs which are not in the mass_dict can be checked with check_sequence and will be ignored.


-

source

+

source

check_sequence

@@ -944,7 +932,7 @@

check_sequence

Checks whether a sequence from a FASTA entry contains valid AAs Args: element (dict): fasta entry of the protein information. AAs (set): a set of amino acid letters. verbose (bool): logging the invalid amino acids. Returns: bool: False if the protein sequence contains non-AA letters, otherwise True.


-

source

+

source

read_fasta_file_entries

@@ -953,7 +941,7 @@

read_fasta_file_en

Function to count entries in fasta file Args: fasta_filename (str): fasta. Returns: int: number of entries.


-

source

+

source

read_fasta_file

@@ -978,9 +966,9 @@

read_fasta_file

Peptide Dictionary

-

To store peptides efficiently, we rely on the Python dictionary. The idea is to have a dictionary with peptides as keys and indices to proteins as values. This way, one can quickly look up which protein a peptide belongs to. The function add_to_pept_dict uses a regular Python dictionary to add peptides and stores the indices of the originating proteins as a list. If a peptide is already present in the dictionary, the index is appended to its list. The function returns a list of added_peptides, which were not yet present in the dictionary. One can use the function merge_pept_dicts to merge multiple peptide dicts.

+

To store peptides efficiently, we rely on the Python dictionary. The idea is to have a dictionary with peptides as keys and indices to proteins as values. This way, one can quickly look up which protein a peptide belongs to. The function add_to_pept_dict uses a regular Python dictionary to add peptides and stores the indices of the originating proteins as a list. If a peptide is already present in the dictionary, the index is appended to its list. The function returns a list of added_peptides, which were not yet present in the dictionary. One can use the function merge_pept_dicts to merge multiple peptide dicts.
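
The bookkeeping described here can be sketched as follows. The sketch mutates the dictionary in place and returns only the newly added peptides; the signature and the name add_to_pept_dict_sketch are assumptions and may differ from the documented function.

```python
def add_to_pept_dict_sketch(pept_dict, new_peptides, protein_index):
    """Record protein_index for each peptide and return the peptides seen for the first time."""
    added_peptides = []
    for peptide in new_peptides:
        if peptide in pept_dict:
            # Choice made for this sketch: do not store the same protein index twice.
            if protein_index not in pept_dict[peptide]:
                pept_dict[peptide].append(protein_index)
        else:
            pept_dict[peptide] = [protein_index]
            added_peptides.append(peptide)
    return added_peptides
```

Returning only the new peptides means downstream steps, such as spectrum generation, need to process each unique peptide once.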


-

source

+

source

add_to_pept_dict

@@ -1003,7 +991,7 @@

add_to_pept_dict


-

source

+

source

merge_pept_dicts

@@ -1024,9 +1012,9 @@

merge_pept_dicts

Generating a database

-

To wrap everything up, we employ two functions, generate_database and generate_spectra. The first one reads a FASTA file and generates a list of peptides, as well as the peptide dictionary and an ordered FASTA dictionary to be able to look up the protein indices later. For the callback, we first read the whole FASTA file to determine the total number of entries in the FASTA file. For a typical FASTA file of 30 MB with 40k entries, this should take less than a second. The progress of the digestion is monitored by processing the FASTA entries one by one. The function generate_spectra then calculates precursor masses and fragment ions. Here, we split the total number of sequences into 1000 steps to be able to track progress with the callback.

+

To wrap everything up, we employ two functions, generate_database and generate_spectra. The first one reads a FASTA file and generates a list of peptides, as well as the peptide dictionary and an ordered FASTA dictionary to be able to look up the protein indices later. For the callback, we first read the whole FASTA file to determine the total number of entries in the FASTA file. For a typical FASTA file of 30 MB with 40k entries, this should take less than a second. The progress of the digestion is monitored by processing the FASTA entries one by one. The function generate_spectra then calculates precursor masses and fragment ions. Here, we split the total number of sequences into 1000 steps to be able to track progress with the callback.


-

source

+

source

generate_fasta_list

@@ -1034,7 +1022,7 @@

generate_fasta_list

Function to generate a database from a fasta file Args: fasta_paths (str or list of str): fasta path or a list of fasta paths. callback (function, optional): callback function. Returns: fasta_list (list of dict): list of protein entry dict {id:str, name:str, description:str, sequence:str}. fasta_dict (dict{int:dict}): the key is the protein id, the value is the protein entry dict.


-

source

+

source

generate_database

@@ -1044,7 +1032,7 @@

generate_database

Function to generate a database from a fasta file Args: mass_dict (dict): not used, will be removed in the future. fasta_paths (str or list of str): fasta path or a list of fasta paths. callback (function, optional): callback function. Returns: to_add (list of str): non-redundant (modified) peptides to be added. pept_dict (dict{str:list of int}): the key is peptide sequence, and the value is protein id list indicating where the peptide is from. fasta_dict (dict{int:dict}): the key is the protein id, the value is the protein entry dict {id:str, name:str, description:str, sequence:str}.


-

source

+

source

generate_spectra

@@ -1056,9 +1044,9 @@

generate_spectra

Parallelized version

-

To speed up spectra generation, one can use the parallelized version. The function generate_database_parallel reads an entire FASTA file and splits it into multiple blocks. Each block will be processed, and the generated pept_dicts will be merged.

+

To speed up spectra generation, one can use the parallelized version. The function generate_database_parallel reads an entire FASTA file and splits it into multiple blocks. Each block will be processed, and the generated pept_dicts will be merged.


-

source

+

source

blocks

@@ -1066,7 +1054,7 @@

blocks

Helper function to create blocks from a given list Args: l (list): the list n (int): size per block Returns: Generator: List with split elements


-

source

+

source

block_idx

@@ -1075,7 +1063,7 @@

block_idx

Helper function to split length into blocks Args: len_list (int): list length. block_size (int, optional, default 1000): size per block. Returns: list[(int, int)]: list of (start, end) positions of blocks.
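
A sketch of such a block index helper is given below. It assumes half-open (start, end) pairs where end is exclusive, which may differ from the documented function; block_idx_sketch is a hypothetical name.

```python
def block_idx_sketch(len_list, block_size=1000):
    """Return (start, end) pairs that cover range(len_list) in blocks of block_size."""
    # The last block is shortened so that it never runs past len_list.
    return [
        (start, min(start + block_size, len_list))
        for start in range(0, len_list, block_size)
    ]
```

Each pair can be used directly as a slice, so every worker receives a contiguous chunk and only the last chunk may be smaller.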


-

source

+

source

generate_database_parallel

@@ -1084,7 +1072,7 @@

generate_databa

Function to generate a database from a fasta file in parallel. Args: settings: alphapept settings. Returns: list: theoretical spectra. See generate_spectra() dict: peptide dict. See add_to_pept_dict() dict: fasta_dict. See generate_fasta_list()


-

source

+

source

digest_fasta_block

@@ -1095,12 +1083,12 @@

digest_fasta_block

Parallel search on large files

-

In some cases (e.g., a lot of modifications or very large FASTA files), it will not be useful to save the database as it will consume too much memory. Here, we use the function search_parallel from search. It creates theoretical spectra on the fly and directly searches against them. As we cannot create a pept_dict here, we need to create one from the search results. For this, we group peptides by their FASTA index and generate a lookup dictionary that can be used as a pept_dict.

+

In some cases (e.g., a lot of modifications or very large FASTA files), it will not be useful to save the database as it will consume too much memory. Here, we use the function search_parallel from search. It creates theoretical spectra on the fly and directly searches against them. As we cannot create a pept_dict here, we need to create one from the search results. For this, we group peptides by their FASTA index and generate a lookup dictionary that can be used as a pept_dict.

Note that we are passing the settings argument here. Search results should be stored in the corresponding path in the *.hdf file.


-

source

+

source