-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate Overall Mondo Stats #8570
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@twhetzel
Thank you for these statistics.
I added a few comments, mostly to get confirmation on a few points, though there are a few questions too.
Would it be possible to get these numbers in a table, please? It takes careful reading to make sure we understand what the numbers refer to. (if it is too complicated, we can deal in the immediate)
- OMIM | ||
- Orphanet | ||
- ICD10CM | ||
- UMLS (as provided by MedGen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we count the UMLS?
Counting how many are given to MedGen is not representative of how many are in the source itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's where the xref value is from UMLS. I don't recall why a mention of MedGen is here from my notes when I talked to Nico about this in our 1:1. All of the information on how the data was generated and what query was run is included in this PR.
- DOID | ||
- NCIT (neoplasm branch) --> special processing needed | ||
- OMIM | ||
- Orphanet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the "whole" orphanet or focused on the Orphanet disease and disease subtypes?
We should count only the one that are in the orphanet disease and disease subtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, it is currently all of Orphanet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is updated now.
- GARD | ||
- MedGen | ||
- NANDO | ||
- NORD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't UMLS be included here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this might be a better place since that data is in the EMC pipeline. However, UMLS was listed as a Source twice in the document. Was there a reason for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just move UMLS to EMC for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, UMLS is managed externally by MedGen
1071 | ||
?cancerClassCount | ||
4710 | ||
?hereditaryClassCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How was hereditary class counted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
children of MONDO_0003847
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be correct as long as all diseases that have a 'has material basis in germline mutation in' annotation are also inferred to be "hereditary". I am assuming that this is the case, but I would like to confirm. Do we have some logic to infer this? (tagging @matentzn too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can run a query to check this.
When I worked with Nicole to get stats about Mondo for the Epilepsy workshop, the definition of hereditary diseases we used was the same as here: "all unique non-obsolete classes (asserted or inferred) that are children of MONDO:0003847 'hereditary disease' and have a Mondo CURIE".
All of this past work was reviewed in these PRs:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is resolved based on the Slack comment on Feb. 5, 2025 from Sabrina:
correct, no issues with the query for the "hereditary" disease. I can follow-up
with Nico/Nicole/Chris about the rules we have around this classification
|
||
#### Results - Number of Sources for Synonyms | ||
Number of Sources Count of Synonyms | ||
0 12389 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it mean that we have 12389 synonyms that do not have any provenance/x-ref?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, there are 12389 exact synonyms without provenance/xref for Mondo terms that are children of 'human disease'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a lot. We should check where these came from. Could this be a consequence of the synonym sync?
Is it possible to check if it is? (outside of the PR, it is unrelated)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will go back to an earlier Mondo release and get the number of synonyms without provenance to compare.
@twhetzel, I think it would be fantastic if these numbers could be generated and updated in our document with each release. |
- NANDO | ||
- NORD | ||
|
||
#### Results for Externally Managed Content |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please confirm what these numbers are showing? Is this the number of Mondo terms that have x-ref to gard?
For MedGen, is that the number of terms that have x-ref to MedGen? How about the number of UMLS x-ref that we get though MedGen?
For Nando: is this the number of Mondo terms with at least one x-ref to NANDO? or the number of NANDO xref?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the count of Mondo terms from the inferred hierarchy that are children of 'human disease' that have an xref to each of these sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more comment
@sabrinatoro I believe I have addressed all comments and if additional query changes are needed. Let's talk more about which of all these stats would be useful to include for the monthly Mondo releases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, but I am missing a single, executable pipeline like sh run.sh make metrics
to generate all numbers - and then there is this new request by @sabrinatoro to generate a specific table (this may need some python to collate tables).
- GARD | ||
- MedGen | ||
- NANDO | ||
- NORD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just move UMLS to EMC for clarity
@matentzn a review of the queries would have been useful since there were comments that some numbers were incorrect in the grant, although I do not know which numbers were in question and not all numbers in the grant were added or generated by me. There was never an intent for this work to be created as a workflow. That will be done in a different PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@twhetzel
I haven't reviewed the sparql queries.
I reviewed the README document, and I added some comments.
I am still having trouble understanding the numbers I am seeing, specifically for the Alignment part.
What is the total number of terms at the source, and what it the total number these terms that are in Mondo? are they line 55 and line 93?
If so, I am still confused by the NCIT numbers. I still see "3804" for what I think is the total number of NCIT name (line 58). If this number represents the total number of terms at the source, this is not correct.
I am just commenting on the PR because I don't know whether these numbers are correct/make sense or if changes are requested.
|
||
$ wc data/all_mondo_classes_with_ncit_xref_neoplasm-branch.tsv | ||
3804 20177 229560 data/all_mondo_classes_with_ncit_xref_neoplasm-branch.tsv | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still very confusing to me.
Does it mean that we are counting the Mondo terms which have an equivalent to an NCIT term which is in the NCIT neoplasm branch?
If yes, then its all good.
3804 20177 229560 data/all_mondo_classes_with_ncit_xref_neoplasm-branch.tsv | ||
``` | ||
|
||
#### Results for Alignment with the source - Total number of diseases in sources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This title is confusing. What does "Results for Alignment with the source" mean?
Are these the total number of diseases in the source?
Or is it the total number of diseases in the source which are represented in Mondo as "equivalentTo"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IF this is the total number of terms at the source,
- NCIT is too low
- UMLS is the same as in the next query. How is UMLS counted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's being counted is defined earlier in the README. The top-level section headings are following what was in the gDoc.
3452 18507 210236 data/output/all_mondo_classes_with_ncit_xref_neoplasm-branch-equivalentTo.tsv | ||
``` | ||
|
||
#### Results for Alignment with the source - Number of terms from the sources that are in Mondo (as Mondo:equivalent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the difference with the results above?
For instance, NCIT numbers are still very close to each other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, UMLS is exactly the same as in the query above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is being counted in this section is defined here in the README, just a few lines up.
- GARD | ||
- MedGen | ||
- NANDO | ||
- NORD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, UMLS is managed externally by MedGen
1071 | ||
?cancerClassCount | ||
4710 | ||
?hereditaryClassCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be correct as long as all diseases that have a 'has material basis in germline mutation in' annotation are also inferred to be "hereditary". I am assuming that this is the case, but I would like to confirm. Do we have some logic to infer this? (tagging @matentzn too)
|
||
#### Results - Number of Sources for Synonyms | ||
Number of Sources Count of Synonyms | ||
0 12389 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a lot. We should check where these came from. Could this be a consequence of the synonym sync?
Is it possible to check if it is? (outside of the PR, it is unrelated)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didnt review the queries in detail, but the numbers look what I expect, and I think we can move forward with this.
I would kinda like a single make goal "metrics_dependencies" or something that has all of the queries in them that need to be run.
Generate Mondo Stats as requested here. See the README for stats and how they were generated.
@sabrinatoro As future work, it would be helpful to know if any of these stats should be generated as report(s) for the Mondo releases. This was mentioned at the very end of the Tech call.