Design decision in _fetch_uniprot_variants #17

tbrittoborges · 2016-02-22T12:47:19Z

We currently have two low-level functions for parsing UniProt variants. The first, _fetch_uniprot_variants, is quick and dirty. The second is something I wrote earlier but left out of the code base because of its complexity. It first uses the UniProt guidelines to parse the text, then uses regular expressions to parse the GFF annotation, extracting the variant IDs, reference/mutated residues, disease name, is_cancer and is_germline. The biggest difference between the two is that the first returns multiple rows for the same residue, while the second returns one row per residue. I added a parameter, group_residues, to map_gff_features_to_sequence to add this feature to the function. I think we can keep both functions and use them as different engines in the select_variant function.
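To make the row-layout difference concrete, here is a minimal sketch of collapsing "multiple rows per residue" into "one row per residue". It is not the code in the repository: the function name group_rows_by_residue, the dict-based row layout, and the variant IDs are all hypothetical, and the real group_residues option may behave differently.

```python
from collections import defaultdict


def group_rows_by_residue(rows):
    """Collapse variant rows so that each residue position appears once.

    `rows` is an iterable of dicts with a 'residue' key plus annotation
    keys (a hypothetical layout). Values from rows that share a residue
    are collected into lists, in input order.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for key, value in row.items():
            if key != "residue":
                grouped[row["residue"]][key].append(value)
    # One output row per residue, sorted by position.
    return [
        {"residue": residue, **dict(fields)}
        for residue, fields in sorted(grouped.items())
    ]


# Placeholder data, not real UniProt records.
rows = [
    {"residue": 72, "variant_id": "VAR_A", "to_aa": "R"},
    {"residue": 72, "variant_id": "VAR_B", "to_aa": "H"},
    {"residue": 47, "variant_id": "VAR_C", "to_aa": "S"},
]
grouped_rows = group_rows_by_residue(rows)
```

With the placeholder rows, the two entries for residue 72 end up in a single row whose fields are lists, which is roughly the shape the second parser produces.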

Now the issue: some proteins have multi-residue variants (e.g. in http://www.uniprot.org/uniprot/P04637.gff, look for VAR_047158, which spans two residues). In general we only look at SNPs (single nucleotide polymorphisms). So we need to decide: is this a missense variant at residues 29 and 30, or at just 29 or just 30? Or do we ignore such variants?
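The options could be expressed as a policy switch, sketched below. The function variant_positions and the policy names "expand", "first" and "skip" are hypothetical and only illustrate the choices; none of them exists in the code base, and the start/end values are assumed to come from the GFF start and end columns.

```python
def variant_positions(start, end, policy="expand"):
    """Return the residue positions a variant is assigned to.

    start, end: 1-based inclusive coordinates of the variant.
    policy (hypothetical names):
      "expand" - annotate every residue in the span
      "first"  - annotate only the first residue of the span
      "skip"   - ignore variants that span more than one residue
    Single-residue variants are returned unchanged under every policy.
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    if start == end:
        return [start]
    if policy == "expand":
        return list(range(start, end + 1))
    if policy == "first":
        return [start]
    if policy == "skip":
        return []
    raise ValueError(f"unknown policy: {policy!r}")


# A variant spanning residues 29 and 30, as in the example above.
expanded = variant_positions(29, 30, policy="expand")
first_only = variant_positions(29, 30, policy="first")
skipped = variant_positions(29, 30, policy="skip")
```

Making the policy an explicit argument would let select_variant expose the decision to the caller instead of hard-coding one answer.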

biomadeira added the question label Aug 28, 2017
