This notebook contains all chemistry-related functionality. A major part of it is the functionality to generate isotope patterns and the averagine model. This is required for feature finding, where an isotope pattern is validated by comparing it to its averagine model. We use the data structure Isotopes from constants to handle isotopes. Next, we define the class IsotopeDistribution to calculate isotope distributions for a given isotope. The core of the calculation is the function fast_add, which allows the fast estimation of isotope distributions. To calculate the isotope distribution for the averagine model, we define the function get_average_formula, which calculates the amino acid composition of the averagine molecule for the given mass.
In brief, the approach avoids expanding polynomial expressions by combining precalculated patterns of hypothetical atom clusters and pruning low-intensity peaks. To calculate the isotope distribution for a given isotope, we define the function dict_to_dist, which accepts a dictionary of amino acids and returns the isotope distribution.
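As a rough illustration of this idea (a minimal sketch, not the alphapept implementation), two isotope distributions given as mass offsets and relative intensities can be combined by convolution, with negligible peaks pruned afterwards; the pruning threshold here is a made-up value:

```python
import numpy as np

PRUNE_LEVEL = 1e-6  # hypothetical pruning threshold

def add_distributions(m1, i1, m2, i2):
    """Convolve two isotope distributions (mass offsets, relative intensities)."""
    masses = (m1[:, None] + m2[None, :]).ravel()
    ints = (i1[:, None] * i2[None, :]).ravel()
    order = np.argsort(masses)                 # sort combined peaks by mass offset
    masses, ints = masses[order], ints[order]
    keep = ints > PRUNE_LEVEL * ints.max()     # prune negligible peaks
    return masses[keep], ints[keep]

# a carbon-like two-isotope pattern combined with itself (i.e. two atoms)
m, i = np.array([0.0, 1.00336]), np.array([0.989, 0.011])
print(add_distributions(m, i, m, i))
```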
Cleave a sequence with a given protease. Filters the resulting peptides by minimum and maximum length. Args: sequence (str): the given (protein) sequence. n_missed_cleavages (int): the number of max missed cleavages. protease (str): the protease/enzyme name, the regular expression can be found in alphapept.constants.protease_dict. pep_length_min (int): min peptide length. pep_length_max (int): max peptide length. Returns: list (of str): cleaved peptide sequences with missed cleavages.
Counts the number of internal cleavage sites for a given sequence and protease. Args: sequence (str): the given (peptide) sequence. protease (str): the protease/enzyme name, the regular expression can be found in alphapept.constants.protease_dict. Returns: int (0 or 1): if the sequence is from internal cleavage.
Peptides are composed of amino acids, which are written in capital letters - PEPTIDE. To distinguish modifications, they are written in lowercase, such as PEPTIoxDE, and can be of arbitrary length. For a modified amino acid (AA), the modification precedes the letter of the amino acid. Decoys are indicated with an underscore; therefore, the parse function splits after _. When parsing, the peptide string is converted into a numba-compatible list, like so: PEPoxTIDE -> [P, E, P, oxT, I, D, E]. This allows us to use the mass_dict from alphapept.constants to directly determine the masses of the corresponding amino acids.
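A minimal sketch of this parsing logic (not alphapept.fasta.parse itself) could look as follows:

```python
# Lowercase letters are collected as the modification prefix of the next
# uppercase amino acid; a trailing decoy marker "_" is split off first.
def parse_peptide(peptide: str) -> list:
    peptide = peptide.split("_")[0]    # drop everything after the decoy marker
    parsed, mod = [], ""
    for char in peptide:
        if char.islower():
            mod += char                # accumulate modification letters
        else:
            parsed.append(mod + char)  # modification precedes the AA letter
            mod = ""
    return parsed

assert parse_peptide("PEPoxTIDE") == ["P", "E", "P", "oxT", "I", "D", "E"]
```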
The decoy strategy employed is a pseudo-reversal of the peptide sequence, keeping only the terminal amino acid and reversing the rest. Additionally, we can call the functions swap_KR and swap_AL, which will swap the respective AAs. The function swap_KR will only swap terminal AAs. The swapping functions only work if the AA is not modified.
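An illustrative sketch of both operations, assuming the parsed-list representation from above (not the actual alphapept functions):

```python
def pseudo_reverse(parsed: list) -> list:
    """Reverse the sequence but keep the C-terminal amino acid in place."""
    return parsed[:-1][::-1] + parsed[-1:]

def swap_KR(parsed: list) -> list:
    """Swap an unmodified C-terminal K for R and vice versa."""
    if parsed[-1] == "K":
        parsed = parsed[:-1] + ["R"]
    elif parsed[-1] == "R":
        parsed = parsed[:-1] + ["K"]
    return parsed

print(pseudo_reverse(list("PEPTIDE")))   # ['D', 'I', 'T', 'P', 'E', 'P', 'E']
print(swap_KR(list("PEPTIDEK")))         # last element becomes 'R'
```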
Wrapper to get decoys for lists of peptides. Args: peptide_list (list): the list of peptides to be reversed. pseudo_reverse (bool): If True, reverse the peptide but keep the C-terminal amino acid; otherwise reverse the whole peptide. (Default: False) AL_swap (bool): replace A with L, and vice versa. (Default: False) KR_swap (bool): replace K with R at the C-terminal, and vice versa. (Default: False) Returns: list (of str): a list of decoy peptides
Fixed modifications are implemented by passing a list with modified AAs that should be replaced. As an AA is only one letter, the remainder is the modification.
To employ variable modifications, we loop through each variable modification and each position of the peptide and add them to the peptide list. For each iteration in get_isoforms, one more variable modification will be added.
Function to generate modified forms (with variable modifications) for a given peptide - returns a list of modified forms. The original sequence is included in the list. Args: mods_variable_dict (dict): Dictionary with modifications. The key is the AA, and the value is the modified form (e.g. oxM). peptide (str): the peptide sequence to generate modified forms. isoforms_max (int): max number of modified forms to generate per peptide. n_modifications_max (int, optional): max number of variable modifications per peptide. Returns: list (of str): the list of peptide forms for the given peptide
Wrapper to add fixed mods on sequences and lists of mods Args: peptides (list of str): peptide list. mods_fixed_terminal (list of str): list of fixed terminal mods. Raises: “Invalid fixed terminal modification {mod}” exception for the given mod. Returns: list (of str): list of peptides with modification added.
Lastly, to handle terminal variable modifications, we use the function add_variable_mods_terminal. As the modification can only be at the terminal end, this function only adds a peptide where the terminal end is modified.
Finally, we put all the functions into a wrapper, generate_peptides. It accepts a peptide and a dictionary with settings so that we can get all modified peptides.
Check if the peptide contains non-AA letters. Args: peptide (str): peptide sequence. AAs (set): the set of legal amino acids. See alphapept.constants.AAs. Returns: bool: True if all letters in the peptide are contained in AAs, otherwise False
Using the mass_dict from constants and being able to parse sequences with parse, one can simply look up the masses for each modified or unmodified amino acid and add everything up.
Precursor
To calculate the mass of the neutral precursor, we start with the mass of an \(H_2O\) and add the masses of all amino acids of the sequence.
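A minimal sketch of this calculation, with a small stand-in for alphapept.constants.mass_dict (approximate monoisotopic residue masses, for illustration only):

```python
mass_dict = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
             "I": 113.08406, "D": 115.02694, "H2O": 18.01056}

def get_precursor_mass(parsed_peptide: list) -> float:
    """Neutral precursor mass = mass of water + sum of residue masses."""
    return mass_dict["H2O"] + sum(mass_dict[aa] for aa in parsed_peptide)

print(get_precursor_mass(list("PEPTIDE")))  # ~799.36 Da
```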
Likewise, we can calculate the masses of the fragment ions. We employ two functions: get_fragmass and get_frag_dict.
get_fragmass is a fast, numba-compatible function that calculates the fragment masses and returns an array indicating whether the ion type is b or y.
get_frag_dict, in contrast, is not numba-compatible and hence a bit slower. It returns a dictionary with the respective ions and can be used for plotting theoretical spectra.
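The underlying b-/y-ion arithmetic can be sketched as follows (singly protonated fragments; the proton mass and the small mass_dict are illustrative stand-ins, not the alphapept constants):

```python
import numpy as np

# approximate monoisotopic residue masses, for illustration only
mass_dict = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
             "I": 113.08406, "D": 115.02694, "H2O": 18.01056}
PROTON = 1.00728

def frag_dict(parsed_peptide):
    masses = np.array([mass_dict[aa] for aa in parsed_peptide])
    frags = {}
    for i, m in enumerate(np.cumsum(masses)[:-1], start=1):
        frags[f"b{i}"] = m + PROTON                     # b-ions from the N-terminus
    for i, m in enumerate(np.cumsum(masses[::-1])[:-1], start=1):
        frags[f"y{i}"] = m + mass_dict["H2O"] + PROTON  # y-ions from the C-terminus
    return frags

print(frag_dict(list("PEPTIDE")))
```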
Get neutral peptide mass, fragment masses and fragment types for a list of peptides. Args: peptides (list of str): the (modified) peptide list. mass_dict (numba.typed.Dict): key is the amino acid or modified amino acid, and the value is the mass. Raises: Unknown exception and pass. Returns: list of Tuple[float, str, np.ndarray(np.float64), np.ndarray(np.int8)]: See get_spectrum.
To read FASTA files, we use the SeqIO module from the Biopython library. Reading is implemented as a generator in read_fasta_file, so that we read one FASTA entry after another until StopIteration is reached. Additionally, we define the function read_fasta_file_entries, which simply counts the number of FASTA entries.
All FASTA entries that contain AAs which are not in the mass_dict are detected with check_sequence and will be ignored.
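A sketch of the entry-by-entry reading with Biopython (the file name is a placeholder):

```python
from Bio import SeqIO

def iterate_fasta(path):
    """Yield one FASTA entry after another until the file is exhausted."""
    for record in SeqIO.parse(path, "fasta"):   # SeqIO.parse iterates lazily over entries
        yield {"id": record.id,
               "description": record.description,
               "sequence": str(record.seq)}

n_entries = sum(1 for _ in iterate_fasta("example.fasta"))   # simple entry count
print(n_entries)
```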
Checks whether a sequence from a FASTA entry contains valid AAs. Args: element (dict): fasta entry of the protein information. AAs (set): a set of amino acid letters. verbose (bool): logging the invalid amino acids. Returns: bool: False if the protein sequence contains non-AA letters, otherwise True.
In order to efficiently store peptides, we rely on the Python dictionary. The idea is to have a dictionary with peptides as keys and indices to proteins as values. This way, one can quickly look up which protein a peptide belongs to. The function add_to_pept_dict uses a regular Python dictionary, allows adding peptides, and stores the indices of the originating proteins as a list. If a peptide is already present in the dictionary, the list is appended to. The function returns a list of added_peptides that were not yet present in the dictionary. One can use the function merge_pept_dicts to merge multiple peptide dicts.
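A minimal sketch of this peptide-dictionary idea (not the exact alphapept function signature):

```python
def add_to_pept_dict(pept_dict: dict, peptides: list, fasta_index: int):
    added_peptides = []
    for peptide in peptides:
        if peptide in pept_dict:
            pept_dict[peptide].append(fasta_index)   # known peptide: add another protein index
        else:
            pept_dict[peptide] = [fasta_index]       # new peptide
            added_peptides.append(peptide)
    return pept_dict, added_peptides

pept_dict = {}
pept_dict, new = add_to_pept_dict(pept_dict, ["PEPTIDE", "ELVIS"], fasta_index=0)
pept_dict, new = add_to_pept_dict(pept_dict, ["PEPTIDE"], fasta_index=1)
print(pept_dict)   # {'PEPTIDE': [0, 1], 'ELVIS': [0]}
```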
To wrap everything up, we employ two functions, generate_database and generate_spectra. The first one reads a FASTA file and generates a list of peptides, as well as the peptide dictionary and an ordered FASTA dictionary, to be able to look up the protein indices later. For the callback, we first read the whole FASTA file to determine the total number of entries. For a typical FASTA file of 30 MB with 40k entries, this should take less than a second. The progress of the digestion is monitored by processing the FASTA file entry by entry. The function generate_spectra then calculates precursor masses and fragment ions. Here, we split the total number of sequences into 1000 steps to be able to track progress with the callback.
Function to generate a database from a fasta file Args: fasta_paths (str or list of str): fasta path or a list of fasta paths. callback (function, optional): callback function. Returns: fasta_list (list of dict): list of protein entry dict {id:str, name:str, description:str, sequence:str}. fasta_dict (dict{int:dict}): the key is the protein id, the value is the protein entry dict.
Function to generate a database from a fasta file Args: mass_dict (dict): not used, will be removed in the future. fasta_paths (str or list of str): fasta path or a list of fasta paths. callback (function, optional): callback function. Returns: to_add (list of str): non-redundant (modified) peptides to be added. pept_dict (dict{str:list of int}): the key is peptide sequence, and the value is protein id list indicating where the peptide is from. fasta_dict (dict{int:dict}): the key is the protein id, the value is the protein entry dict {id:str, name:str, description:str, sequence:str}.
To speed up spectrum generation, one can use the parallelized version. The function generate_database_parallel reads an entire FASTA file and splits it into multiple blocks. Each block is processed, and the generated pept_dicts are merged.
Helper function to split length into blocks. Args: len_list (int): list length. block_size (int, optional, default 1000): size per block. Returns: list[(int, int)]: list of (start, end) positions of blocks.
Function to generate a database from a fasta file in parallel. Args: settings: alphapept settings. Returns: list: theoretical spectra. See generate_spectra() dict: peptide dict. See add_to_pept_dict() dict: fasta_dict. See generate_fasta_list()
In some cases (e.g., a lot of modifications or very large FASTA files), it will not be useful to save the database as it will consume too much memory. Here, we use the function search_parallel from search. It creates theoretical spectra on the fly and directly searches against them. As we cannot create a pept_dict here, we need to create one from the search results. For this, we group peptides by their FASTA index and generate a lookup dictionary that can be used as a pept_dict.
Note that we are passing the settings argument here. Search results should be stored in the corresponding path in the *.hdf file.
This allows us to not only easily store connections between centroids but also perform a quick lookup for the delta of an existing connection. Note that it also only stores the best connection for each centroid. To extract the connected centroids, we can use np.where(results >= 0). This implementation allows getting millions of connections within seconds.
As we also allow gaps, meaning that we can have connections between, e.g., Scan 0 and Scan 2, we make the aforementioned matrix multidimensional, so that the first matrix stores the connections for no gap, the second matrix the connections with a gap of 1, and so on.
Wrapper function to call connect_centroids_unidirection
Args: rowwise_peaks (np.ndarray): Length of centroids with respect to the row borders. row_borders (np.ndarray): Row borders of the centroids array. centroids (np.ndarray): Array containing the centroids data. max_gap (int): Maximum gap when connecting centroids. centroid_tol (float): Centroid tolerance.
Args: x (np.ndarray): Index to datapoint. Note that this uses the performance_function, so one passes an ndarray. row_borders (np.ndarray): Row borders of the centroids array. connections (np.ndarray): Connections matrix to store the connections. scores (np.ndarray): Score matrix to store the connection scores. centroids (np.ndarray): 1D Array containing the masses of the centroids data. max_gap (int): Maximum gap when connecting centroids. centroid_tol (float): Centroid tolerance.
We wrap the centroid connections in the function connect_centroids. This function converts the connections into a usable array.
Args: rowwise_peaks (np.ndarray): Indexes for centroids. row_borders (np.ndarray): Row borders (for indexing). centroids (np.ndarray): Centroid data. max_gap: Maximum gap. centroid_tol: Centroid tol for matching centroids. Returns: np.ndarray: From index. np.ndarray: To index. float: Median score. float: Std deviation of the score.
Args: x (np.ndarray): Input index. Note that we are using the performance function so this is a range. from_idx (np.ndarray): From index. to_idx (np.ndarray): To index.
Args: query_data (dict): Data structure containing the query data. max_gap (int): Maximum gap when connecting centroids. centroid_tol (float): Centroid tolerance.
Returns: hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. path_node_cnt (int): Number of elements in this path. score_median (float): Median score. score_std (float): Std deviation of the score.
Args: centroids (np.ndarray): 1D Array containing the masses of the centroids. from_idx (np.ndarray): From index. to_idx (np.ndarray): To index. hill_length_min (int): Minimum hill length.
Returns: hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. path_node_cnt (int): Number of elements in this path.
Args: x (np.ndarray): Input index. Note that we are using the performance function so this is a range. path_starts (np.ndarray): Array that stores the starts of the paths. forwards (np.ndarray): Forward array. out_hill_data (np.ndarray): Array containing the indices to hills. out_hill_ptr (np.ndarray): Array containing the bounds to out_hill_data.
Args: x (np.ndarray): Input index. Note that we are using the performance function so this is a range. path_starts (np.ndarray): Array that stores the starts of the paths. forward (np.ndarray): Array that stores forward information. path_cnt (np.ndarray): Reporting array to count the paths.
Args: x (np.ndarray): Input index. Note that we are using the performance function so this is a range. forward (np.ndarray): Array to report forward connection. backward (np.ndarray): Array to report backward connection. path_starts (np.ndarray): Array to report path starts.
When having a hill with two or more maxima, we would like to split it at the minimum position. For this, we use a recursive approach. First, the minimum of a hill is detected. A hill is split at this minimum if the smaller of the surrounding maxima is at least the factor hill_split_level larger than the minimum. For each split, the process is repeated.
Args: hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. int_data (np.ndarray): Array containing the intensity to each centroid. hill_split_level (float): Split level for hills. window (int): Smoothing window.
Returns: np.ndarray: Array containing the bounds to the hill_data with splits.
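A conceptual sketch of this recursive splitting, operating on a plain intensity profile rather than on alphapept's hill_ptrs/hill_data structures (the split level used here is illustrative):

```python
import numpy as np

def split_hill(intensity: np.ndarray, hill_split_level: float = 1.3) -> list:
    """Recursively split an intensity profile at internal minima."""
    if len(intensity) < 3:
        return [intensity]
    min_pos = int(np.argmin(intensity[1:-1])) + 1       # internal minimum
    minimum = intensity[min_pos]
    left_max = intensity[:min_pos].max()
    right_max = intensity[min_pos + 1:].max()
    # split only if the smaller surrounding maximum clearly exceeds the minimum
    if min(left_max, right_max) > hill_split_level * minimum:
        return (split_hill(intensity[:min_pos], hill_split_level)
                + split_hill(intensity[min_pos:], hill_split_level))
    return [intensity]

print(split_hill(np.array([1., 5., 9., 2., 8., 6., 1.])))   # two hills
```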
To filter hills, we define a minimum length hill_min_length. All peaks below the threshold hill_peak_min_length are accepted as is. For longer hills, the intensities at the start and the end are compared to the maximum intensity. The hills are accepted if the ratios of the maximum raw intensity to the smoothed intensities at the beginning and at the end are larger than hill_peak_factor.
Args: hill_data (np.ndarray): Array containing the indices to hills. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. int_data (np.ndarray): Array containing the intensity to each centroid. hill_check_large (int, optional): Length criterion for when a hill is considered large. Defaults to 40. window (int, optional): Smoothing window. Defaults to 1.
Returns: np.ndarray: Filtered hill data. np.ndarray: Filtered hill points.
The calculation of hill statistics for a single hill is implemented in get_hill_stats. To calculate the hill stats for a list of hills, we can call the wrapper get_hill_data.
Args: query_data (dict): Data structure containing the query data. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. hill_nboot_max (int): Maximum number of bootstrap comparisons. hill_nboot (int): Number of bootstrap comparisons
Returns: np.ndarray: Hill stats. np.ndarray: Sortindex. np.ndarray: Upper index. np.ndarray: Scan index. np.ndarray: Hill data. np.ndarray: Hill points.
Args: stats (np.ndarray): Stats array that contains summary statistics of hills. hill_data (np.ndarray): Array containing the indices to hills. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data.
Returns: np.ndarray: Filtered hill data. np.ndarray: Filtered hill points. np.ndarray: Filtered hill stats.
After obtaining summary statistics of hills, the next step is to check whether they belong together to form an isotope pattern. For this, we check whether it is possible that they are neighbors in an isotope pattern, e.g., one having a 12C atom that has been replaced by a 13C version. The detailed criterion for the check is implemented in check_isotope_pattern and is as follows:
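A plausible form of this criterion, reconstructed from the quantities described below (to be verified against check_isotope_pattern), is:

\[ \left| \Delta m - \frac{\Delta M}{z} \right| \leq \sqrt{\left(\frac{\Delta S}{z}\right)^{2} + \Delta m_{1}^{2} + \Delta m_{2}^{2}} \]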
The left side contains \(\Delta m\), the difference of the precise mass estimates from the summary statistics, and \(\Delta M = 1.00286864\), which is the mass difference between the 13C peak and the monoisotopic peak in an averagine molecule of 1500 Da mass, divided by the charge \(z\).
The right side contains \(\Delta S = 0.0109135\), which is the maximum shift that a sulphur atom can cause (\(\Delta S = 2m(^{13}C) - 2m(^{12}C) - m(^{34}S) + m(^{32}S)\)) and \(\Delta {m_{1}}\) and \(\Delta {m_{2}}\), which are the bootstrapped mass standard deviations.
The intensities of two hills are only compared if both have an intensity value in a particular scan. Otherwise, the intensity is set to zero. Additionally, an overlap of at least three elements is required.
Now having two criteria to check whether hills could, in principle, belong together, we define the wrapper functions extract_edge and get_edges to extract the connected hills. To minimize the number of comparisons we need to perform, we only compare the hills that overlap in time (i.e., the start of one hill rt_min needs to be before the end of the other hill rt_max) and are less than the sum of \(\Delta M\) and \(\Delta S\) apart.
To extract all hills that belong together, we again rely on the NetworkX package to extract the connected components.
Correlates two edges and flags whether they should be kept.
Args: idx (np.ndarray): Input index. Note that we are using the performance function so this is a range. to_keep (np.ndarray): Array with indices which edges should be kept. sortindex_ (np.ndarray): Sortindex to access the hills from stats. pre_edges (np.ndarray): Array with pre edges. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. int_data (np.ndarray): Array containing the intensity to each centroid. scan_idx (np.ndarray): Array containing the scan index for a centroid. cc_cutoff (float): Cutoff value for what is considered correlating.
The extracted pre-isotope patterns may not be consistent because their pair-wise mass differences may not correspond to the same charge. To extract isotope patterns from pre-isotope patterns, we need to ensure that they are consistent for a single charge.
To do this, we start with the 100 most intense peaks from a pre-isotope pattern to be used as seeds. For each seed and charge, we then try to extract the longest consistent isotope pattern. To check whether a hill is consistent with the seed, we employ a modified checking criterion (check_isotope_pattern_directed), as follows:
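A plausible form of this directed criterion, assembled from the quantities described below (to be verified against check_isotope_pattern_directed), is:

\[ \left| m_{j} - m - j\,\frac{\Delta M}{z} \right| \leq \sqrt{\left(\frac{\Delta S}{z}\right)^{2} + \Delta m^{2} + \Delta m_{j}^{2}} \]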
Here \(m\) is the mass of a seed peak, and \(m_{j}\) refers to a peak relative to the seed. \(j\) refers to the peaks to the left or right (negative or positive index) within the pattern. \(j\) needs to run over consecutive values so that gaps are not allowed. Besides this consistency check, two hills are also checked to have a cosine correlation of at least 0.6.
Programmatically, this is implemented in grow_trail and grow. These functions use a recursive approach that adds matching hills to the seed on the left and right side until no more hills can be added.
Args: seed (int): Seed index. pattern (np.ndarray): Pre isotope pattern. stats (np.ndarray): Stats array that contains summary statistics of hills. charge_range (List): Charge range. iso_mass_range (float): Mass range for checking isotope patterns. sortindex_ (np.ndarray): Sortindex to access the hills from stats. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. int_data (np.ndarray): Array containing the intensity to each centroid. scan_idx (np.ndarray): Array containing the scan index for a centroid. cc_cutoff (float): Cutoff value for what is considered correlating.
Args: seed (int): Seed position. pattern (np.ndarray): Isotope pattern. stats (np.ndarray): Stats array that contains summary statistics of hills. charge (int): Charge. iso_mass_range (float): Mass range for checking isotope patterns. sortindex_ (np.ndarray): Sortindex to access the hills from stats. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. int_data (np.ndarray): Array containing the intensity to each centroid. scan_idx (np.ndarray): Array containing the scan index for a centroid. cc_cutoff (float): Cutoff value for what is considered correlating.
Args: trail (List): List of hills belonging to a pattern. seed (int): Seed position. direction (int): Direction in which to grow the trail. relative_pos (int): Relative position. index (int): Index. stats (np.ndarray): Stats array that contains summary statistics of hills. pattern (np.ndarray): Isotope pattern. charge (int): Charge. iso_mass_range (float): Mass range for checking isotope patterns. sortindex_ (np.ndarray): Sortindex to access the hills from stats. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. int_data (np.ndarray): Array containing the intensity to each centroid. scan_idx (np.ndarray): Array containing the scan index for a centroid. cc_cutoff (float): Cutoff value for what is considered correlating.
Returns: List: List of hills belonging to a pattern.
Check if two masses could belong to the same isotope pattern.
Args: mass1 (float): Mass of the first pattern. mass2 (float): Mass of the second pattern. delta_mass1 (float): Delta mass of the first pattern. delta_mass2 (float): Delta mass of the second pattern. charge (int): Charge. index (int): Index (unused). iso_mass_range (float): Isotope mass ranges. Returns: bool: Flag if two isotope patterns belong together.
Args: pattern (np.ndarray): Pre isotope pattern. sorted_hills (np.ndarray): Hills, sorted. centroids (np.ndarray): 1D Array containing the masses of the centroids. hill_data (np.ndarray): Array containing the indices to hills.
The extraction of the longest consistent isotope pattern is implemented in isolate_isotope_pattern. Here, three additional checks for an isotope pattern are implemented.
The first one is truncate. Here, one checks whether the seed position has a minimum to its left or right side. If a minimum is found, the isotope pattern is cut off at this position.
The second one is a mass filter. If the seed has a mass smaller than 1000, the intensity maximum is detected, and all smaller masses are discarded. This reflects the averagine distribution for small masses, where no minimum on the left side can be found.
The third one is check_averagine, which relies on pattern_to_mz and cosine_averagine. It is used to ensure that the extracted isotope pattern has a cosine correlation of at least 0.6 with the averagine isotope pattern of the same mass.
After the longest consistent isotope pattern is found, the hills are removed from the pre-isotope pattern, and the process is repeated until no more isotope patterns can be extracted from the pre-isotope patterns.
Args: int_one (np.ndarray): Intensity of the first hill. int_two (np.ndarray): Intensity of the second hill. spec_one (np.ndarray): Scan numbers of the first hill. spec_two (np.ndarray): Scan numbers of the second hill.
Args: pre_pattern (np.ndarray): Pre isotope pattern. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. int_data (np.ndarray): Array containing the intensity to each centroid. scan_idx (np.ndarray): Array containing the scan index for a centroid. stats (np.ndarray): Stats array that contains summary statistics of hills. sortindex_ (np.ndarray): Sortindex to access the hills from stats. iso_mass_range (float): Mass range for checking isotope patterns. charge_range (List): Charge range. averagine_aa (Dict): Dict containing averagine masses. isotopes (Dict): Dict containing isotopes. iso_n_seeds (int): Number of seeds. cc_cutoff (float): Cutoff value for what is considered correlating. iso_split_level (float): Split level when isotopes are split.
Returns: np.ndarray: Array with the best pattern. int: Charge of the best pattern.
Wrapper function to iterate over pre_isotope_patterns.
Args: pre_isotope_patterns (list): List of pre-isotope patterns. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. int_data (np.ndarray): Array containing the intensity to each centroid. scan_idx (np.ndarray): Array containing the scan index for a centroid. stats (np.ndarray): Stats array that contains summary statistics of hills. sortindex_ (np.ndarray): Sortindex to access the hills from stats. averagine_aa (Dict): Dict containing averagine masses. isotopes (Dict): Dict containing isotopes. iso_charge_min (int, optional): Minimum isotope charge. Defaults to 1. iso_charge_max (int, optional): Maximum isotope charge. Defaults to 6. iso_mass_range (float, optional): Mass search range. Defaults to 5. iso_n_seeds (int, optional): Number of isotope seeds. Defaults to 100. cc_cutoff (float, optional): Cutoff for correlation. Defaults to 0.6. iso_split_level (float, optional): Isotope split level. Defaults to 1.3. callback (Union[Callable, None], optional): Callback function for progress. Defaults to None. Returns: list: List of isotope patterns. np.ndarray: Iso idx. np.ndarray: Array containing isotope charges.
Args: idx (np.ndarray): Input index. Note that we are using the performance function so this is a range. isotope_patterns (list): List containing isotope patterns (indices to hills). isotope_charges (list): List with charges assigned to the isotope patterns. iso_idx (np.ndarray): Index to isotope pattern. stats (np.ndarray): Stats array that contains summary statistics of hills. sortindex_ (np.ndarray): Sortindex to access the hills from stats. hill_ptrs (np.ndarray): Array containing the bounds to the hill_data. hill_data (np.ndarray): Array containing the indices to hills. int_data (np.ndarray): Array containing the intensity to each centroid. rt_ (np.ndarray): Array with retention time information for each scan. rt_idx (np.ndarray): Lookup array to match centroid idx to rt. results (np.ndarray): Recordarray with isotope pattern summary statistics. lookup_idx (np.ndarray): Lookup array for each centroid.
Args: feature_path (str): Path to the feature file from Bruker FF (.features-file). feature_table (pd.DataFrame): Pandas DataFrame containing the features. query_data (dict): Data structure containing the query data.
Returns: pd.DataFrame: DataFrame containing features information.
Map MS1 features to MS2 based on rt and mz. If ccs is included, it is also added. Args: feature_table (pd.DataFrame): Pandas DataFrame with features. query_data (dict): Data structure containing the query data. map_mz_range (float, optional): Mapping range for mz (Da). Defaults to 1. map_rt_range (float, optional): Mapping range for rt (min). Defaults to 0.5. map_mob_range (float, optional): Mapping range for mobility (%). Defaults to 0.3. map_n_neighbors (int, optional): Maximum number of neighbors to be extracted. Defaults to 5. search_unidentified (bool, optional): Flag to perform search on features that have no isotope pattern. Defaults to False.
Args: settings (dict): A dictionary with settings how to process the data. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. pept_dict (dict): A dictionary with peptides. Defaults to None. fasta_dict (dict): A dictionary with fasta sequences. Defaults to None. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. pept_dict (dict): A dictionary with peptides. Defaults to None. fasta_dict (dict): A dictionary with fasta sequences. Defaults to None. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Args: settings (dict): A dictionary with settings how to process the data. progress (bool): Track progress. Defaults to False. logger_set (bool): If False, reset the default logger. Defaults to False. settings_parsed (bool): If True, reparse the settings. Defaults to False. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None. callback_overall (callable): Same as callback, but for the overall progress. Defaults to None. callback_task (callable): Same as callback, but for the task progress. Defaults to None. logfile (str): The name of a logfile. Defaults to None.
Args: ms_data (alphapept.io.MS_Data_File): An MS_Data file which has been fully identified and quantified. fields (list): A list with column names to calculate summary statistics.
Returns: dict: A dictionary with summary statistics.
All workflow functions can be called with the command line interface (CLI). To implement this CLI, we use the click package.
In brief, click allows creating a CLI with minimal effort by simply adding decorators to already defined functions. These decorators create a help text for each function and describe all their parameters. Functions that are decorated with click can be added to a central run_cli function to be incorporated into the CLI automatically.
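A hedged illustration of this click pattern; the command and option names below are made up for demonstration and are not the actual AlphaPept CLI:

```python
import click

@click.group()
def run_cli():
    """Central entry point that collects all decorated commands."""

@run_cli.command("database")
@click.argument("settings_file", type=click.Path(exists=True))
def database(settings_file):
    """Create a theoretical database from a settings file."""
    click.echo(f"Creating database with settings from {settings_file}")

if __name__ == "__main__":
    run_cli()
```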
While AlphaPept allows modular execution of individual steps to process MS data, it is common for these steps to be combined and to reuse multiple parameters. We therefore opt to use a single YAML settings file containing all parameters in dictionary format as a single argument, instead of providing all parameters individually to each function.
To read Thermo files, AlphaPept uses the pyrawfilereader package, a Python implementation of the commonly used rawfilereader tool. By using this pure Python implementation, Thermo files can be read without having to install MSFileReader.
The user can pass an additional flag use_profile_ms1. This will then use the profile data, which is not already centroided, and perform centroiding. Note that this will lead to slightly different intensities, as the vendor-centroided data uses the apex intensity whereas the centroiding algorithm uses the summed intensity.
To access Bruker files, AlphaPept relies on the external timsdata library from Bruker (available in the alphapept\ext folder; licenses apply). Unfortunately, these libraries are only available on Windows and Linux. As a result, reading raw data is not possible on macOS. However, once raw data is converted to .ms_data.hdf output, other workflow steps (besides feature finding) work without problems on macOS.
Returns: tuple: A dictionary with all the raw data and a string with the acquisition_date_time
For ccs (i.e., ion mobility) values, we need additional functions from the Bruker library. As the live feature-finder might not be able to determine some charge values, it is intended to perform this calculation at a later stage once we have charge values from the post-processing feature finder.
To access .mzML files, we rely on the pyteomics package. When using the mzML format for a search, peak picking (data centroiding) should be applied to all MS levels of the data.
Args: filename (str): The name of a .mzml file. n_most_abundant (int): The maximum number of peaks to retain per MS2 spectrum. callback (callable): A function that accepts a float between 0 and 1 as progress. Defaults to None.
Returns: tuple: A dictionary with all the raw data, a string with the acquisition_date_time and a string with the vendor.
Args: input_dict (dict): A dictionary obtained by iterating over a Pyteomics mzml.read function.
Returns: tuple: The rt, masses, intensities, ms_order, prec_mass, mono_mz, charge arrays retrieved from the input_dict. If the ms level in the input dict does not equal 2, the charge, mono_mz and prec_mass will be equal to 0.
Importing raw data frequently results in profile data. When having profile data, AlphaPept first needs to perform centroiding to use the data properly. For this, it needs to search for local maxima (“peaks”) of the intensity as a function of m/z, using the function get_peaks. A peak is described by three points: the start, the center, and the end. The function accepts an intensity array and calculates the delta (gradient) between consecutive data points to determine the start, center, and end.
Three or more data points: Gaussian estimation of the center position.
For the Gaussian estimation, only the three central points are used to fit a Gaussian peak shape. The Gaussian is then approximated with the logarithm.
Args: peak (tuple): A triplet of the form (start, center, end) mz_array (np.ndarray): An array with mz values. int_array (np.ndarray): An array with intensity values.
Returns: float: The gaussian estimate of the center.
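A generic reconstruction of this estimate (not the exact alphapept code): fit a parabola to the logarithm of the three intensities and return its vertex:

```python
import numpy as np

def gaussian_center(mz, intensity):
    """Estimate the peak center from the three (m/z, intensity) points around the apex."""
    x1, x2, x3 = mz
    l1, l2, l3 = np.log(intensity)                # a Gaussian becomes a parabola in log space
    numerator = l1 * (x2**2 - x3**2) + l2 * (x3**2 - x1**2) + l3 * (x1**2 - x2**2)
    denominator = 2 * (l1 * (x2 - x3) + l2 * (x3 - x1) + l3 * (x1 - x2))
    return numerator / denominator                # vertex of the fitted parabola

print(gaussian_center(np.array([500.00, 500.01, 500.02]),
                      np.array([80.0, 100.0, 90.0])))
```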
Trimming spectra to retain the n most intense peaks
get_most_abundant: In order to save spectra in a more memory-efficient form, we only keep the n most abundant peaks. This allows us to save data in a fast, accessible matrix format.
get_local_intensity: This calculates the local intensity to get local maxima.
Args: mass (np.ndarray): An array with mz values. intensity (np.ndarray): An array with intensity values. n_max (int): The maximum number of peaks to retain. Setting n_max to -1 returns all peaks. window (int): Use local maximum in a window
Returns: tuple: the filtered mass and intensity arrays.
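A minimal numpy sketch in the spirit of get_most_abundant (illustrative, without the local-intensity window):

```python
import numpy as np

def most_abundant(mass, intensity, n_max):
    if n_max == -1 or len(mass) <= n_max:
        return mass, intensity                    # keep all peaks
    keep = np.argsort(intensity)[-n_max:]         # indices of the n most intense peaks
    keep = np.sort(keep)                          # restore the original mass order
    return mass[keep], intensity[keep]

mz = np.array([100.0, 150.0, 200.0, 250.0])
ints = np.array([5.0, 50.0, 10.0, 40.0])
print(most_abundant(mz, ints, n_max=2))           # keeps the peaks at 150.0 and 250.0
```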
get_local_intensity: Calculate the local intensity for a spectrum.
Args: intensity (np.ndarray): An array with intensity values. window (int): Window size. Returns: np.ndarray: local intensity
For saving, we are currently relying on the hdf-container (see below).
While we could, in principle, store the mz and int arrays as lists of variable length, this would come with a performance penalty. We therefore create an array with the dimensions of the n most abundant peaks by the number of spectra with the function list_to_numpy_f32 and fill the unoccupied cells with -1. This improves access times at the cost of additional disk space.
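A sketch of this padding idea (illustrative, not list_to_numpy_f32 itself):

```python
import numpy as np

def to_padded_matrix(spectra, n_most_abundant):
    """Pack variable-length peak lists into one float32 matrix padded with -1."""
    matrix = -np.ones((n_most_abundant, len(spectra)), dtype=np.float32)
    for column, peaks in enumerate(spectra):
        peaks = np.asarray(peaks, dtype=np.float32)[:n_most_abundant]
        matrix[: len(peaks), column] = peaks      # unoccupied cells stay -1
    return matrix

print(to_padded_matrix([[100.1, 200.2], [150.0, 250.5, 300.7]], n_most_abundant=4))
```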
Args: value (type): The name of the data to write. If the value is pd.DataFrame, a dataset_name must be provided. group_name (str): The group where to write data. If no group_name is provided, write directly to the root group. Defaults to None. dataset_name (str): If no dataset_name is provided, create a new group with value as name. The dataset where to write data. Defaults to None. attr_name (str): The attr where to write data. Defaults to None. overwrite (bool): Overwrite pre-existing data and truncate existing groups. If False, ignore the is_overwritable flag of this HDF_File. Defaults to None. dataset_compression (str): The compression type to use for datasets. Defaults to None. swmr (bool): Open files in swmr mode. Defaults to False.
Raises: IOError: When the object is read-only. KeyError: When the group_name or attr_name does not exist. ValueError: When trying to overwrite something while overwriting is disabled.
Based on the generic HDF_File, a subclass that acts as an MS data container can be implemented. This class should contain all (centroided) fragment ions, all their coordinates, and all the metadata.
Testing of the MS_Data_File container includes reading and writing from different file formats.
While that HDF data structure could be used directly, it is often easier to read it and return a query_data dictionary similar to those that are returned by the readers of Thermo, Bruker, mzML and mzXML raw data.
To use all the above functionality from a workflow with several parameters, the following functions are defined. These functions also allow parallel processing.
Wrapper function to search labels on an ms_file and write results to the peptide_fdr of the file.
Args: file_name (str): Path to ms_file. label (NamedTuple): Label with channels, mod_name and masses. reporter_frag_tol (float): Fragment tolerance for search. ppm (bool): Flag to use ppm instead of Dalton.
In contrast to the relative correction, an absolute correction is sometimes more applicable. Consider the case of retention time. Here, one would not expect a relative offset but rather an absolute one. As an example, consider a lag time of 0.5 minutes. This would be constant for all retention times and would not differ, e.g., for later retention times.
Args: table_1 (pd.DataFrame): Dataframe with precursor data. table_2 (pd.DataFrame): Dataframe with precursor data. offset_dict (dict): Dictionary with column names and how the distance should be calculated. calib (bool): Flag to indicate that distances should be calculated on calibrated columns. Defaults to False.
Raises: KeyError: If either table_1 or table_2 is not indexed by precursor
Args: table_1 (pd.DataFrame): Dataframe with data. delta (pd.Series): Series containing the offset. offset_dict (dict): Dictionary with column names and how the distance should be calculated.
Raises: NotImplementedError: If the type of conversion is not implemented.
Args: deltas (pd.DataFrame): Distances from each dataset to another. filenames (list): The filenames of the datasets that were compared. weights (np.ndarray, optional): Distances can be weighted by their number of shared elements. Defaults to None. n_jobs (optional): Number of processes to be used. Defaults to None (=1).
Args: combos (list): A list containing tuples of filenames that should be compared. calib (bool): Boolean flag to indicate distance should be calculated on calibrated data. callback (Callable): A callback function to track progress.
Returns: pd.DataFrame: Dataframe containing the deltas of the files np.ndarray: Numpy array containing the weights of each comparison (i.e. number of shared elements) dict: Offset dictionary which was used for comparing.
Transfer MS2 identifications to similar MS1 features.
For “match-between-runs” we start with aligning datasets. To create a reference for matching, we combine all datasets of a matching group. When using the default settings, the matching group consists of all files. We then group the dataset by precursor and calculate its average properties (rt, mz, mobility). By combining several files, we are further able to calculate a standard deviation. This allows us to know where, and with what deviation, we would expect an MS1 feature with the corresponding identification. This is our matching reference. In the matching step, we go through each dataset individually and check if there are precursors in the reference that were not identified in this dataset. We then perform a nearest-neighbor lookup to find whether any MS1 features exist in close proximity to the reference. The distance metric we use is normalized by the median standard deviation. Lastly, we assess the confidence of a transferred identification by using the Mahalanobis distance.
Numba functions are by default set to use nogil=True and nopython=True, unless explicitly defined otherwise. Cuda functions are by default set to use device=True, unless explicitly defined otherwise.
Args: compilation_mode (str): The compilation mode to use. Will be checked with is_valid_compilation_mode. If None, the global COMPILATION_MODE will be used as soon as the function is decorated for static compilation. If DYNAMIC_COMPILATION_ENABLED, the function will always be compiled at runtime and thus by default returns a Python function. Static recompilation can be enforced by reimporting a module containing the function with importlib.reload(imported_module). If COMPILATION_MODE is Python and not DYNAMIC_COMPILATION_ENABLED, no compilation will be used. Default is None. **decorator_kwargs: Keyword arguments that will be passed to numba.jit or cuda.jit compilation decorators.
Returns: callable: A decorated function that is compiled.
Args: compilation_mode (str): The compilation mode to use. Will be checked with is_valid_compilation_mode. Default is None. enable_dynamic_compilation (bool): Enable dynamic compilation. If enabled, code will generally be slower and no other functions can be called from within a compiled function anymore, as they are compiled at runtime. WARNING: Enabling this is strongly disadvised in almost all cases! Default is None.
Testing yields the expected results:
Next, we define the ‘performance_function’ decorator to take full advantage of both compilation and parallelization for maximal performance. Note that a ‘performance_function’ cannot return values. Instead, it should store results in provided buffer arrays.
A decorator to compile a given function and allow multithreading over multiple indices.
NOTE: This should only be used on functions that are compilable. Functions that are to be decorated need to have an index argument as their first argument. If an iterable is provided to the decorated function, the original (compiled) function will be applied to all elements of this iterable. The most efficient way to provide iterables is with ranges, but numpy arrays work as well. Functions cannot return values; results should be stored in buffer arrays inside the function instead.
Args: worker_count (int): The number of workers to use for multithreading. If None, the global MAX_WORKER_COUNT is used at runtime. Default is None. compilation_mode (str): The compilation mode to use. Will be forwarded to the compile_function decorator. **decorator_kwargs: Keyword arguments that will be passed to numba.jit or cuda.jit compilation decorators.
Returns: callable: A decorated function that is compiled and parallelized.
We test this function with a simple smoothing algorithm.
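Based on the description above, such a test could look roughly as follows; the exact decorator invocation and import path are assumptions, and the window size is arbitrary:

```python
import numpy as np
from alphapept.performance import performance_function   # import path assumed

@performance_function
def smooth(index, in_array, out_array):
    window = 2                                     # illustrative window size
    start = max(0, index - window)
    end = min(len(in_array), index + window + 1)
    out_array[index] = in_array[start:end].mean()  # write the result into the buffer array

values = np.random.rand(1000)
smoothed = np.empty_like(values)
smooth(range(len(values)), values, smoothed)       # the range is distributed over threads
```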
Finally, we also provide functionality to use multiprocessing instead of multithreading.
NOTE: There are some inherent limitations with the number of processes that Python can spawn. As such, no process Pool should use more than 50 processes.
To test the delayed normalization approach, we create an in silico test dataset with a known ground truth. We therefore know which systematic changes exist between the samples, and we employ different solvers to recover the normalization parameters.
Args: mu (float): mean of ND. sigma (float): standard deviation of ND. grid (np.ndarray): input array np.int[:]. For each element of the array, the probability density is calculated.
Returns: np.ndarray: probability density array, np.float[:].
Args: timepoint (float): coordinate of the peak apex. sigma (float): standard deviation of the gaussian. n_runs (int): number of points along which the density is calculated.
Returns: np.ndarray: probability density array, np.float[:].
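A minimal sketch of the density calculation described by these arguments, assuming a standard Gaussian form; the function name gaussian is illustrative.
import numpy as np

def gaussian(mu: float, sigma: float, grid: np.ndarray) -> np.ndarray:
    # Probability density of a normal distribution evaluated on the grid.
    norm = 1 / (sigma * np.sqrt(2 * np.pi))
    return norm * np.exp(-0.5 * ((grid - mu) / sigma) ** 2)

density = gaussian(mu=10.0, sigma=2.0, grid=np.arange(21))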
Protein intensity profiles are constructed for each protein individually. All possible protein fold changes between the samples are derived from the median peptide fold changes. Subsequently, pseudointensities are chosen such that the fold changes between the pseudointensities ideally reconstruct the actually observed fold changes. Similar to the delayed normalization, this is formulated as a quadratic minimization, which we solve with the SLSQP solver.
Codewise, we start with simulating in-silico test data to serve as ground truth for assessing solvers for the optimization problem. For the algorithmic optimization, we define the function get_protein_ratios, which allows quick calculation of the protein ratios. Next, we define an error function triangle_error that we use for the optimization problem. Lastly, we have several wrapper functions to access the functions.
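As an illustration of the quadratic minimization, the following sketch recovers pseudointensities from pairwise fold changes with SciPy's SLSQP solver; the simplified squared-error objective stands in for triangle_error, and the example ratios are made up.
import numpy as np
from scipy.optimize import minimize

observed_ratios = np.array([2.0, 2.0, 4.0])   # made-up fold changes: s1/s2, s2/s3, s1/s3
pairs = [(0, 1), (1, 2), (0, 2)]              # sample index pairs for each ratio

def squared_ratio_error(intensities):
    # Sum of squared differences between reconstructed and observed fold changes.
    err = 0.0
    for (i, j), ratio in zip(pairs, observed_ratios):
        err += (intensities[i] / intensities[j] - ratio) ** 2
    return err

result = minimize(squared_ratio_error, x0=np.ones(3), method='SLSQP',
                  bounds=[(1e-6, None)] * 3)
pseudointensities = result.x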
In-silico test data
Create a simulated input dataset of peptide intensities.
Args: df (pd.DataFrame): Feature table by alphapept. minimum_ratios (int): Minimum number of peptide ratios necessary to derive a protein ratio. field (str): The field containing the quantitative peptide information (i.e. precursor intensities). callback ([type], optional): Callback function. Defaults to None.
Returns: pd.DataFrame: table containing the LFQ intensities of each protein in each sample.
The fragment mass calibration is based on the identified fragment_ions (i.e., b-hits and y-hits). For each hit, we calculate the offset to its theoretical mass. The correction is then applied by taking the median offset in ppm and applying it globally.
Args: df (pd.DataFrame): Input dataframe that contains identified peptides (w/o outliers). features (pd.DataFrame): Features dataframe for which the masses are calibrated. cols (list): List of input columns for the calibration. target (str): Target column on which offset is calculated. scaling_dict (dict): A dictionary that contains how scaling operations are applied. calib_n_neighbors (int): Number of neighbors for calibration.
Returns: np.ndarray: A numpy array with calibrated masses.
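A minimal sketch of the global correction described above (median offset in ppm, applied to all measured masses); the function and argument names are illustrative.
import numpy as np

def apply_global_ppm_offset(theoretical_mz: np.ndarray, measured_mz: np.ndarray) -> np.ndarray:
    # Offset of every hit relative to its theoretical mass, in ppm.
    offset_ppm = (measured_mz - theoretical_mz) / theoretical_mz * 1e6
    median_offset = np.median(offset_ppm)
    # Apply the median offset globally to correct the measured masses.
    return measured_mz / (1 + median_offset / 1e6)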
Args: df (pd.DataFrame): Input dataframe that contains identified peptides. features (pd.DataFrame): Features dataframe for which the masses are calibrated. outlier_std (float, optional): Range in standard deviations for outlier removal. Defaults to 3. calib_n_neighbors (int, optional): Number of neighbors used for regression. Defaults to 100. calib_mz_range (int, optional): Scaling factor for mz range. Defaults to 20. calib_rt_range (float, optional): Scaling factor for rt_range. Defaults to 0.5. calib_mob_range (float, optional): Scaling factor for mobility range. Defaults to 0.3. **kwargs: Arbitrary keyword arguments so that settings can be passed as a whole.
Returns: corrected_mass (np.ndarray): The calibrated mass. y_hat_std (float): The standard deviation of the precursor offset after calibration.
Scatter plot colored by 2D histogram. Adapted from https://stackoverflow.com/questions/20105364/how-can-i-make-a-scatter-plot-colored-by-density-in-matplotlib
Another way to calibrate the fragment and precursor masses is by directly comparing them to a previously generated theoretical mass database. Here, peaks in the distribution of databases are used to align the experimental masses.
The filtering functions are essential base functions for scoring in AlphaPept. They make sure that only the ‘best precursor per spectrum’ and the ‘best spectrum per precursor’ is used.
Recall from the search that, when features are present, raw_idx refers to the actual index from the raw data. Otherwise it is query_data.
For filtering, we have several functions. When applying a score, we first use filter_score and then filter_precursor. filter_score keeps the best score per experimental spectrum. First, we rank by score for each query_idx. As we have multiple hits for each experimental spectrum from the search, we only want to keep the best one.
When performing feature finding, we assign multiple possible features to each experimental spectrum. The idea here is that a spectrum could originate from various precursors. To disentangle these psms we can use the following modes:
single: This mode will only keep one feature per experimental spectrum (the one with the highest score and the closest distance). Each feature can only occur once.
multiple: Allow multiple features per experimental spectrum. Each feature can only occur once.
filter_precursor is intended for the case that a precursor (charge + sequence) occurs more than once. Only the one with the highest score will be kept.
Args: df (pd.DataFrame): psms table of search results from alphapept. mode (str, optional): string specifying which mode to use for psms filtering. The two options are ‘single’ and ‘multiple’. ‘single’ will only keep one feature per experimental spectrum. ‘multiple’ will allow multiple features per experimental spectrum. In either option, each feature can only occur once. Defaults to ‘multiple’.
Returns: pd.DataFrame: table containing the filtered psms results.
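The core idea of the two filters can be sketched with pandas as follows; the column names mirror those mentioned above, but the exact implementation in alphapept may differ.
import pandas as pd

def keep_best_per_spectrum(psms: pd.DataFrame) -> pd.DataFrame:
    # filter_score idea: rank by score within each query_idx and keep the best hit.
    psms = psms.sort_values('score', ascending=False)
    return psms.drop_duplicates(subset='query_idx', keep='first')

def keep_best_per_precursor(psms: pd.DataFrame) -> pd.DataFrame:
    # filter_precursor idea: keep the highest-scoring occurrence of each
    # precursor (sequence + charge combination).
    psms = psms.sort_values('score', ascending=False)
    return psms.drop_duplicates(subset='precursor', keep='first')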
The employed FDR strategy is based on a classical target-decoy competition approach. The procedure works as follows: 1. Consider only the best scoring target or decoy PSM per spectrum. 2. Sort all PSMs by decreasing scores. 3. Estimate the FDR as #decoys / #targets, where #targets (#decoys) is the number of positive target (decoy) PSMs at a given score threshold t (i.e. PSMs with scores higher than t). 4. Convert the estimated FDR to q-values by selecting the minimum FDR at which the identification could be made, i.e. the lowest score threshold t that could be set to include an identification without increasing the number of false positives. 5. Report the set of target PSMs with q-values smaller or equal to the selected fdr_level.
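Steps 2-4 of this procedure can be sketched as follows, assuming a 'score' column and a boolean 'decoy' column; this is an illustration of the idea, not the cut_fdr implementation itself.
import numpy as np
import pandas as pd

def add_q_values(psms: pd.DataFrame) -> pd.DataFrame:
    # Sort by decreasing score and estimate the FDR as #decoys / #targets at
    # every possible score threshold.
    psms = psms.sort_values('score', ascending=False).copy()
    n_targets = np.cumsum(~psms['decoy'].values)
    n_decoys = np.cumsum(psms['decoy'].values)
    psms['fdr'] = n_decoys / np.maximum(n_targets, 1)
    # q-value: the minimum FDR at which this PSM would still be accepted.
    psms['q_value'] = psms['fdr'][::-1].cummin()[::-1]
    return psms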
Informative literature describing and discussing different FDR estimation approaches for shotgun proteomics can be found here (the implemented strategy in alphapept is referred to as T-TDC in this article): > Keich, Uri et al. “Improved False Discovery Rate Estimation Procedure for Shotgun Proteomics.” Journal of proteome research vol. 14,8 (2015): 3148-61. https://pubs.acs.org/doi/10.1021/acs.jproteome.5b00081
Simulation of random scores for 50,000 measurements (corresponding to spectra). We simulate decoys, true targets and false targets, assuming a false target ratio (pi0) of 0.8 and a mean score difference of 3.5.
Simulated score distribution for a separate target and decoy database search:
Simulated score distribution for a corresponding concatenated target-decoy database search with target-decoy competition:
Application of the cut_fdr function to the simulated target-decoy competition dataset saved in TDC:
cval, cut_TDC = cut_fdr(TDC, fdr_level=0.01)
Global FDR
The cut_global_fdr function has two specific applications: 1. Estimate q-values on the peptide and protein level. The concept here is based on selecting the best scoring precursor per peptide (or protein) to then estimate the FDR by target-decoy competition using the cut_fdr function. 2. Estimate q-values across an entire dataset on either precursor, peptide or protein level. The concept here is based on selecting the best scoring precursor, peptide or protein signal across an entire dataset to then estimate the FDR by target-decoy competition using the cut_fdr function.
This strategy was extensively tested and discussed in the following publications:
Nesvizhskii, Alexey I. “A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics.” Journal of proteomics vol. 73,11 (2010): 2092-123. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2956504/
Gupta, Nitin, and Pavel A Pevzner. “False discovery rates of protein identifications: a strike against the two-peptide rule.” Journal of proteome research vol. 8,9 (2009): 4173-81. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3398614/
Args: data (pd.DataFrame): psms table of search results from alphapept. analyte_level (str, optional): string specifying the analyte level to apply the fdr threshold. Options include: ‘precursor’, ‘sequence’, ‘protein_group’ and ‘protein’. Defaults to ‘sequence’. fdr_level (float, optional): fdr level that should be used for filtering. The value should lie between 0 and 1. Defaults to 0.01. plot (bool, optional): flag to enable plot. Defaults to ‘True’.
Returns: pd.DataFrame: df with filtered results
Similar to the sequence-level simulations, we can simulate score distributions for peptides belonging to proteins. In our simulation we assume a Poisson distribution for the number of peptides per protein, centered at 4 peptides.
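For example, the per-protein peptide counts can be drawn as follows (illustrative, not the exact simulation code):
import numpy as np

rng = np.random.default_rng(42)
# Number of peptides per simulated protein, Poisson-distributed around 4.
peptides_per_protein = rng.poisson(lam=4, size=1000)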
Application of the cut_global_fdr function to the simulated protein-level target-decoy competition dataset saved in TDC_prot:
Args: df (pd.DataFrame): psms table of search results from alphapept. fdr_level (float, optional): fdr level that should be used for filtering. The value should lie between 0 and 1. Defaults to 0.01.
Returns: pd.DataFrame: psms table with an extra ‘score’ column for the generic score, filtered for no feature or precursor to be assigned multiple times.
Args: df (pd.DataFrame): psms table of search results from alphapept. fdr_level (float, optional): fdr level that should be used for filtering. The value should lie between 0 and 1. Defaults to 0.01.
Returns: pd.DataFrame: psms table with an extra ‘score’ column for x_tandem, filtered for no feature or precursor to be assigned multiple times.
score_psms uses the specified score and applies the cut_fdr function to filter PSMs at the specified fdr_level. filter_score and filter_precursor are applied to only report the best PSM per acquired spectrum and the best signal per precursor (i.e. sequence + charge combination).
get_ML_features extracts additional scoring metrics for the machine learning, including the number of amino acids per precursor, the number of missed cleavages and the logarithmic number of times the same peptide occurs in the set of PSMs
train_RF trains a random forest classifier for scoring all PSMs. For this, we use the scikit-learn library.
Next, a grid search is initialized for testing the hyperparameter space (max_depth and max_leaf_nodes) of the random forest classifier by a 5-fold cross-validation using GridSearchCV.
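A minimal sketch of such a grid search, using the default parameter grids listed in the train_RF arguments; the feature matrix X and target/decoy labels y are placeholders, and the scaling step is an assumption.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 5)           # placeholder scoring features
y = np.random.randint(0, 2, 200)     # placeholder target/decoy labels

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('clf', RandomForestClassifier(random_state=42))])
param_grid = {'clf__max_depth': [5, 25, 50],
              'clf__max_leaf_nodes': [150, 200, 250]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)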
If plot is enabled, a figure illustrating the weights of each feature is produced.
Finally, the function returns the trained random forest classifier for subsequent application to the entire set of PSMs or for transferring it to a different dataset.
score_ML applies a classifier trained by train_RF to a complete set of PSMs. It calls the cut_fdr function and filters for the specified fdr_level. filter_score and filter_precursor are applied to only report the best PSM per acquired spectrum and the best signal per precursor (i.e. sequence + charge combination).
Args: df (pd.DataFrame): psms table of search results from alphapept. trained_classifier (GridSearchCV): GridSearchCV object returned by train_RF. features (list): list with features returned by train_RF. Defaults to ‘None’.
Returns: pd.DataFrame: psms table with an extra ‘score’ column from the trained_classifier by ML, filtered for no feature or precursor to be assigned multiple times.
Args: df (pd.DataFrame): psms table of search results from alphapept. trained_classifier (GridSearchCV): GridSearchCV object returned by train_RF. features (list): list with features returned by train_RF. Defaults to None. fdr_level (float, optional): fdr level that should be used for filtering. The value should lie between 0 and 1. Defaults to 0.01. plot (bool, optional): flag to enable plot. Defaults to True.
Returns: pd.DataFrame: filtered df with psms within fdr
Args: df (pd.DataFrame): psms table of search results from alphapept. exclude_features (list, optional): list with column names of features to exclude for ML. Defaults to [‘precursor_idx’,‘fragment_ion_idx’,‘fasta_index’,‘feature_rank’,‘raw_rank’,‘score_rank’,‘db_idx’, ‘feature_idx’, ‘precursor’, ‘query_idx’, ‘raw_idx’,‘sequence’,‘decoy’,‘sequence_naked’,‘target’]. train_fdr_level (float, optional): Only targets below the train_fdr_level cutoff are considered for training the classifier. Defaults to 0.1. ini_score (str, optional): Initial score to select psms set for semi-supervised learning. Defaults to ‘x_tandem’. min_train (int, optional): Minimum number of psms in the training set. Defaults to 1000. test_size (float, optional): Fraction of psms used for testing. Defaults to 0.8. max_depth (list, optional): List of clf__max_depth parameters to test in the grid search. Defaults to [5,25,50]. max_leaf_nodes (list, optional): List of clf__max_leaf_nodes parameters to test in the grid search. Defaults to [150,200,250]. n_jobs (int, optional): Number of jobs to use for parallelizing the gridsearch. Defaults to -1, which in GridSearchCV corresponds to ‘use all available cores’. scoring (str, optional): Scoring method for the gridsearch. Defaults to ‘accuracy’. plot (bool, optional): flag to enable plot. Defaults to ‘False’. random_state (int, optional): Random state for initializing the RandomForestClassifier. Defaults to 42.
Returns: [GridSearchCV, list]: GridSearchCV: GridSearchCV object with trained RandomForestClassifier. list: list of features used for training the classifier.
[1] Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc 11, 2301–2319 (2016). https://doi.org/10.1038/nprot.2016.136
Args: data (pd.DataFrame): psms table of scored and filtered search results from alphapept. pept_dict (dict): A dictionary mapping peptide indices to the originating proteins as a list. fasta_dict (dict): A dictionary with fasta sequences.
Returns: pd.DataFrame: alphapept results table now including protein level information.
Function to perform protein grouping by the razor approach. This function calls assign_proteins and get_shared_proteins. ToDo: implement callback for solving. Each protein is indicated with a p -> protein index.
Args: data (pd.DataFrame): psms table of scored and filtered search results from alphapept. pept_dict (dict): A dictionary mapping peptide indices to the originating proteins as a list. fasta_dict (dict): A dictionary with fasta sequences. decoy (bool, optional): Defaults to False. callback (bool, optional): Defaults to None.
Returns: pd.DataFrame: alphapept results table now including protein level information.
get_shared_proteins
Args: data (pd.DataFrame): psms table of scored and filtered search results from alphapept, appended with n_possible_proteins. found_proteins (dict): dictionary mapping psms indices to proteins. pept_dict (dict): dictionary mapping peptide indices to the originating proteins as a list.
Returns: dict: dictionary mapping peptides to razor proteins
Args: data (pd.DataFrame): psms table of scored and filtered search results from alphapept. pept_dict (dict): dictionary that matches peptide sequences to proteins
Returns: pd.DataFrame: psms table of search results from alphapept appended with the number of matched proteins. dict: dictionary mapping psms indices to proteins.
Apply protein grouping on all files in an experiment. This function will load all dataframes (peptide_fdr level) and perform protein grouping.
Args: settings: (dict): Settings file for the experiment pept_dict: (dict): A peptide dictionary. fast_dict: (dict): A FASTA dictionary. callback: (Callable): Optional callback.
To efficiently compare two spectra, we use a pointer-based approach. We start with two sorted arrays: query_frag, which contains the m/z positions of the experimental query spectrum, and db_frag, which contains the fragment masses of the database entry it is compared against. The two pointers compare each m/z position with each other and check whether they are within a certain tolerance frag_tol. Depending on their delta, either of the pointers is advanced. The function returns an array named hits that is the same length as the database spectrum and encodes the hit positions.
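A minimal, pure-Python sketch of this two-pointer comparison (the compiled version and the exact encoding of hits may differ):
import numpy as np

def compare_frags_sketch(query_frag: np.ndarray, db_frag: np.ndarray, frag_tol: float) -> np.ndarray:
    hits = np.zeros(len(db_frag), dtype=np.int64)
    q, d = 0, 0
    while q < len(query_frag) and d < len(db_frag):
        delta = query_frag[q] - db_frag[d]
        if abs(delta) <= frag_tol:   # match within tolerance
            hits[d] = q + 1          # record the (1-based) query peak that matched
            q += 1
            d += 1
        elif delta < 0:              # query peak is lighter, advance the query pointer
            q += 1
        else:                        # database peak is lighter, advance the database pointer
            d += 1
    return hits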
To compare multiple spectra against a database, we first need some helper functions. First, we need a conversion function to convert from Dalton masses to ppm, which is implemented in the ppm_to_dalton function.
To minimize the search space, we typically only compare spectra with precursors in the same mass range as defined by prec_tol. To look up the limits for search, we define the function get_idxs, which is a wrapper to the fast searchsorted method from NumPy.
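The two helpers can be sketched as follows; the conversion shown here turns a ppm tolerance into Dalton at a given mass, and the exact direction and signatures of the real functions may differ.
import numpy as np

def ppm_to_dalton_sketch(mass: float, ppm_tol: float) -> float:
    # Absolute tolerance in Dalton corresponding to a relative tolerance in ppm.
    return mass / 1e6 * ppm_tol

def get_idxs_sketch(db_masses: np.ndarray, query_masses: np.ndarray, prec_tol: float) -> tuple:
    # For each query precursor, find the index range of database entries whose
    # mass falls within the precursor tolerance (db_masses must be sorted).
    lower = np.searchsorted(db_masses, query_masses - prec_tol, side='left')
    upper = np.searchsorted(db_masses, query_masses + prec_tol, side='right')
    return lower, upper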
The actual search takes place in compare_spectrum_parallel, which utilizes the performance decorator from the performance notebook. Here we save the top matching spectra for each query spectrum. Note that for code compilation reasons, the code of the previously defined function compare_frags is duplicated in here.
To conveniently perform peptide-spectrum matches on multiple datasets, we define a wrapper get_psms that returns the PSMs when handing over query_data and db_data.
The basic fragment comparison only counts the number of hits and matched intensity fraction when comparing a theoretical spectrum to an experimental one. Based on this metric, we can drastically reduce the number of candidates one wants to analyze for an in-depth comparison, which requires additional features. The following section describes several functions which extract parameters to compare spectrum matches better.
Frag Delta
frag_delta subtracts the experimental fragment masses from the theoretical fragment masses for each hit.
intensity_fraction calculates the fraction of matched intensity. This refers to the intensity of all hits compared to the intensity of all peaks in the query spectrum.
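Using the hits encoding from the comparison sketch above, the matched intensity fraction could be computed like this (illustrative only):
import numpy as np

def intensity_fraction_sketch(query_int: np.ndarray, hits: np.ndarray) -> float:
    # Intensity of all matched query peaks divided by the total query intensity.
    matched = hits[hits > 0] - 1   # back to 0-based query indices
    return query_int[matched].sum() / query_int.sum()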
To have an efficient data format to store PSMs in the search, we use numpy recarrays and define the utility functions add_column and remove_column to append and remove data.
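A small sketch of appending a column to a numpy record array, in the spirit of the add_column utility; the helper used here is numpy's own recfunctions module, and the column names are illustrative.
import numpy as np
import numpy.lib.recfunctions as rfn

psms = np.recarray((3,), dtype=[('score', np.float64), ('query_idx', np.int64)])
psms['score'] = [0.1, 0.5, 0.3]
psms['query_idx'] = [0, 1, 2]

# Append a new column (e.g. the number of b-ion hits) to the record array.
b_hits = np.array([4, 7, 2])
psms_extended = rfn.append_fields(psms, 'b_hits', b_hits, usemask=False)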
In the score-function we use the pre-filtered PSMs to extract additional columns for scoring such as the offset from theoretical to experimental precursor or the number of b- and y-ion hits.
Args: query_frag (np.ndarray): Array with query fragments. query_int (np.ndarray): Array with query intensities. db_frag (np.ndarray): Array with database fragments. db_int (np.ndarray): Array with database intensities. frag_type (np.ndarray): Array with fragment types. mtol (float): Mass tolerance. ppm (bool): Flag to use ppm instead of Dalton. losses (list): List of losses.
Returns: np.ndarray: NumPy array that stores ion information.
Args: psms (np.recarray): Recordarray containing PSMs. query_masses (np.ndarray): Array with query masses. query_masses_raw (np.ndarray): Array with raw query masses. query_frags (np.ndarray): Array with frag types of the query data. query_ints (np.ndarray): Array with fragment intensities from the query. query_indices (np.ndarray): Array with indices to the query data. db_masses (np.ndarray): Array with database masses. db_frags (np.ndarray): Array with fragment masses. frag_types (np.ndarray): Array with fragment types. mtol (float): Mass tolerance. db_indices (np.ndarray): Array with indices to the database array. ppm (bool): Flag to use ppm instead of Dalton. psms_dtype (list): List describing the dtype of the PSMs record array. db_ints (np.ndarray, optional): Array with database intensities. Defaults to None. parallel (bool, optional): Flag to use parallel processing. Defaults to False.
Returns: np.recarray: Recordarray containing PSMs with additional columns. np.ndarray: NumPy array containing ion information.
Args: to_process (tuple): Tuple containing an index to the file and the experiment settings. callback (Callable, optional): Callback function to indicate progress. Defaults to None. parallel (bool, optional): Flag to use parallel processing. Defaults to False. first_search (bool, optional): Flag to indicate this is the first search. Defaults to True.
Returns: Union[bool, str]: Returns True if the search was successful, otherwise returns a string containing the Exception.
Args: settings (dict): Settings file containing the experimental definitions. calibration (Union[list, None], optional): List of calibrated offsets. Defaults to None. fragment_calibration (Union[list, None], optional): List of calibrated fragment offsets. Defaults to None. callback (Union[Callable, None], optional): Callback function. Defaults to None.