Design decision in _fetch_uniprot_variants #17

Open
tbrittoborges opened this issue Feb 22, 2016 · 0 comments

Hi @biomadeira @stuartmac

We currently have two low-level functions for parsing UniProt variants. The first is _fetch_uniprot_variants, which is quick and dirty. The second is something I wrote earlier but left out of the code base because of its complexity. It first parses the text following the UniProt guidelines, then uses regexes to parse the GFF annotation, extracting the variant ids, reference/mutated residues, disease name, is_cancer and is_germline. The biggest difference between the two is that the first can return multiple rows for the same residue, while the second returns one row per residue. I added a parameter, group_residues, to map_gff_features_to_sequence so that it can also produce the one-row-per-residue layout. I think we can keep both functions and use them as different engines in the select_variant function.
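To make the difference between the two layouts concrete, here is a minimal sketch of the grouping step. It is not the actual implementation of group_residues; the row fields, the variant ids and the helper name group_by_residue are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows in the "multiple rows per residue" layout:
# one row per (residue, variant) pair, so residue 10 appears twice.
rows = [
    {"residue": 10, "variant_id": "VAR_A", "mutated": "K"},
    {"residue": 10, "variant_id": "VAR_B", "mutated": "Q"},
    {"residue": 12, "variant_id": "VAR_C", "mutated": "L"},
]


def group_by_residue(rows):
    """Collapse the rows so each residue appears once, with list-valued fields."""
    grouped = defaultdict(lambda: {"variant_id": [], "mutated": []})
    for row in rows:
        entry = grouped[row["residue"]]
        entry["variant_id"].append(row["variant_id"])
        entry["mutated"].append(row["mutated"])
    return dict(grouped)


# "One row per residue" layout: residue number -> lists of annotations.
grouped = group_by_residue(rows)
```

With this shape, a residue hit by several variants keeps all of them in one entry instead of being repeated across rows.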

Now the issue: some proteins have multi-residue variants (e.g. see http://www.uniprot.org/uniprot/P04637.gff and look for VAR_047158, which spans two residues). In general we only look at SNPs (single nucleotide polymorphisms). So we need to decide: is this a missense variant at residues 29 and 30, or at just 29 or just 30? Or do we ignore such variants?
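The options could be sketched as a small policy switch. This is only an illustration of the choices, assuming we have the feature's start and end positions from the GFF; the function name assign_residues and the policy names are hypothetical and not part of the code base.

```python
def assign_residues(start, end, policy="expand"):
    """Return the residue numbers a variant feature should be mapped to.

    policy (hypothetical names):
      "expand": map the variant to every residue in the span
      "first":  map the variant to the first residue only
      "skip":   ignore multi-residue variants entirely
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    if start == end:
        # Single-residue variants are unaffected by the policy.
        return [start]
    if policy == "expand":
        return list(range(start, end + 1))
    if policy == "first":
        return [start]
    if policy == "skip":
        return []
    raise ValueError("unknown policy: {!r}".format(policy))


# A variant spanning residues 29 and 30, as in the example above.
both = assign_residues(29, 30, policy="expand")
only_first = assign_residues(29, 30, policy="first")
ignored = assign_residues(29, 30, policy="skip")
```

Whichever policy we pick, exposing it as a parameter would let both engines handle these variants the same way.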
