Hi @biomadeira @stuartmac
We currently have two low-level functions for parsing UniProt variants. The first is _fetch_uniprot_variants, which is quick and dirty. The second is something I had written earlier but never added to the code base because of its complexity. It first uses the UniProt guidelines to parse the text, then applies regexes to the GFF annotation to extract the variant IDs, reference/mutated residues, disease name, is_cancer and is_germline. The biggest difference between the two functions is that the first returns multiple rows for the same residue, while the second returns one row per residue. I added a parameter, group_residues, to map_gff_features_to_sequence so that it supports this behaviour as well. I think we can keep both functions and use them as different engines in the select_variant function.
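To make the "one row per residue" idea concrete, here is a minimal sketch of the grouping step in plain Python. It is not the actual group_residues implementation: the function name group_by_residue, the row keys (position, variant_id, ref, alt) and the variant IDs are all hypothetical and only illustrate the shape of the output.

```python
from collections import defaultdict


def group_by_residue(rows):
    """Collapse per-variant rows into one record per residue position.

    `rows` is a list of dicts with hypothetical keys:
    position, variant_id, ref, alt.
    """
    grouped = defaultdict(lambda: {"variant_ids": [], "mutations": []})
    for row in rows:
        entry = grouped[row["position"]]
        entry["variant_ids"].append(row["variant_id"])
        entry["mutations"].append(row["ref"] + "->" + row["alt"])
    return dict(grouped)


# Made-up rows, shaped like a "multiple rows per residue" result.
rows = [
    {"position": 5, "variant_id": "VAR_X1", "ref": "Q", "alt": "H"},
    {"position": 5, "variant_id": "VAR_X2", "ref": "Q", "alt": "R"},
    {"position": 7, "variant_id": "VAR_X3", "ref": "D", "alt": "N"},
]
grouped = group_by_residue(rows)
```

With this shape, either engine could feed select_variant: the ungrouped rows are the first engine's output, and the grouped dict mirrors the second.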
Now the issue: some proteins have multi-residue variants (e.g. see http://www.uniprot.org/uniprot/P04637.gff and look for VAR_047158, which spans two residues). In general we only look at SNPs (single nucleotide polymorphisms). So we need to decide: do we treat this as a missense variant at both residues 29 and 30, at only one of them, or do we ignore such variants altogether?
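One way to keep this decision open is to make it a policy argument, so the choice can be changed later without touching the parser. The sketch below is only an illustration: assign_residues and the policy names "all", "first" and "ignore" are hypothetical and not part of the current code.

```python
def assign_residues(start, end, policy="all"):
    """Return the residue positions a variant should be mapped to.

    start, end: inclusive feature coordinates from the GFF line.
    policy (hypothetical names):
      "all"    - map a multi-residue variant to every residue it spans
      "first"  - map it to the first residue only
      "ignore" - drop multi-residue variants entirely
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    if start == end:
        # Single-residue variants are unaffected by the policy.
        return [start]
    if policy == "all":
        return list(range(start, end + 1))
    if policy == "first":
        return [start]
    if policy == "ignore":
        return []
    raise ValueError("unknown policy: " + repr(policy))


# VAR_047158 spans residues 29-30 in P04637.
both = assign_residues(29, 30, policy="all")
first_only = assign_residues(29, 30, policy="first")
dropped = assign_residues(29, 30, policy="ignore")
```

Whichever default we pick, single-residue variants behave the same under all three policies, so the existing SNP handling would not change.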