Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate Overall Mondo Stats #8570

Open
wants to merge 15 commits into
base: master
Choose a base branch
from
Open

Generate Overall Mondo Stats #8570

wants to merge 15 commits into from

Conversation

twhetzel
Copy link
Collaborator

@twhetzel twhetzel commented Jan 14, 2025

Generate Mondo Stats as requested here. See the README for stats and how they were generated.

@sabrinatoro As future work, it would be helpful to know if any of these stats should be generated as report(s) for the Mondo releases. This was mentioned at the very end of the Tech call.

@twhetzel twhetzel marked this pull request as draft January 14, 2025 18:15
@twhetzel twhetzel requested review from matentzn and removed request for nicolevasilevsky January 15, 2025 01:23
@twhetzel twhetzel marked this pull request as ready for review January 15, 2025 01:25
Copy link
Collaborator

@sabrinatoro sabrinatoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twhetzel
Thank you for these statistics.
I added a few comments, mostly to get confirmation on a few points, though there are a few questions too.

Would it be possible to get these numbers in a table, please? It takes careful reading to make sure we understand what the numbers refer to. (if it is too complicated, we can deal in the immediate)

- OMIM
- Orphanet
- ICD10CM
- UMLS (as provided by MedGen)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we count the UMLS?
Counting how many are given to MedGen is not representative of how many are in the source itself.

Copy link
Collaborator Author

@twhetzel twhetzel Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's where the xref value is from UMLS. I don't recall why a mention of MedGen is here from my notes when I talked to Nico about this in our 1:1. All of the information on how the data was generated and what query was run is included in this PR.

- DOID
- NCIT (neoplasm branch) --> special processing needed
- OMIM
- Orphanet
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the "whole" orphanet or focused on the Orphanet disease and disease subtypes?
We should count only the one that are in the orphanet disease and disease subtypes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it is currently all of Orphanet.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is updated now.

- GARD
- MedGen
- NANDO
- NORD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't UMLS be included here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this might be a better place since that data is in the EMC pipeline. However, UMLS was listed as a Source twice in the document. Was there a reason for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just move UMLS to EMC for clarity

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, UMLS is managed externally by MedGen

1071
?cancerClassCount
4710
?hereditaryClassCount
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was hereditary class counted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

children of MONDO_0003847

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be correct as long as all diseases that have a 'has material basis in germline mutation in' annotation are also inferred to be "hereditary". I am assuming that this is the case, but I would like to confirm. Do we have some logic to infer this? (tagging @matentzn too)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can run a query to check this.

When I worked with Nicole to get stats about Mondo for the Epilepsy workshop, the definition of hereditary diseases we used was the same as here: "all unique non-obsolete classes (asserted or inferred) that are children of MONDO:0003847 'hereditary disease' and have a Mondo CURIE".

All of this past work was reviewed in these PRs:

Copy link
Collaborator Author

@twhetzel twhetzel Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is resolved based on the Slack comment on Feb. 5, 2025 from Sabrina:

correct, no issues with the query for the "hereditary" disease. I can follow-up 
with Nico/Nicole/Chris about the rules we have around this classification


#### Results - Number of Sources for Synonyms
Number of Sources Count of Synonyms
0 12389
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that we have 12389 synonyms that do not have any provenance/x-ref?

Copy link
Collaborator Author

@twhetzel twhetzel Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there are 12389 exact synonyms without provenance/xref for Mondo terms that are children of 'human disease'

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a lot. We should check where these came from. Could this be a consequence of the synonym sync?
Is it possible to check if it is? (outside of the PR, it is unrelated)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will go back to an earlier Mondo release and get the number of synonyms without provenance to compare.

@sabrinatoro
Copy link
Collaborator

@twhetzel, I think it would be fantastic if these numbers could be generated and updated in our document with each release.

- NANDO
- NORD

#### Results for Externally Managed Content
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please confirm what these numbers are showing? Is this the number of Mondo terms that have x-ref to gard?

For MedGen, is that the number of terms that have x-ref to MedGen? How about the number of UMLS x-ref that we get though MedGen?

For Nando: is this the number of Mondo terms with at least one x-ref to NANDO? or the number of NANDO xref?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the count of Mondo terms from the inferred hierarchy that are children of 'human disease' that have an xref to each of these sources.

Copy link
Collaborator

@sabrinatoro sabrinatoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more comment

@twhetzel
Copy link
Collaborator Author

@sabrinatoro I believe I have addressed all comments and if additional query changes are needed. Let's talk more about which of all these stats would be useful to include for the monthly Mondo releases.

@twhetzel twhetzel requested a review from sabrinatoro January 25, 2025 16:10
Copy link
Member

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, but I am missing a single, executable pipeline like sh run.sh make metrics to generate all numbers - and then there is this new request by @sabrinatoro to generate a specific table (this may need some python to collate tables).

- GARD
- MedGen
- NANDO
- NORD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just move UMLS to EMC for clarity

@twhetzel twhetzel requested a review from matentzn February 4, 2025 17:20
@twhetzel
Copy link
Collaborator Author

twhetzel commented Feb 4, 2025

@matentzn a review of the queries would have been useful since there were comments that some numbers were incorrect in the grant, although I do not know which numbers were in question and not all numbers in the grant were added or generated by me. There was never an intent for this work to be created as a workflow. That will be done in a different PR.

Copy link
Collaborator

@sabrinatoro sabrinatoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twhetzel
I haven't reviewed the sparql queries.
I reviewed the README document, and I added some comments.

I am still having trouble understanding the numbers I am seeing, specifically for the Alignment part.

What is the total number of terms at the source, and what it the total number these terms that are in Mondo? are they line 55 and line 93?

If so, I am still confused by the NCIT numbers. I still see "3804" for what I think is the total number of NCIT name (line 58). If this number represents the total number of terms at the source, this is not correct.

I am just commenting on the PR because I don't know whether these numbers are correct/make sense or if changes are requested.


$ wc data/all_mondo_classes_with_ncit_xref_neoplasm-branch.tsv
3804 20177 229560 data/all_mondo_classes_with_ncit_xref_neoplasm-branch.tsv
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still very confusing to me.
Does it mean that we are counting the Mondo terms which have an equivalent to an NCIT term which is in the NCIT neoplasm branch?
If yes, then its all good.

3804 20177 229560 data/all_mondo_classes_with_ncit_xref_neoplasm-branch.tsv
```

#### Results for Alignment with the source - Total number of diseases in sources
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This title is confusing. What does "Results for Alignment with the source" mean?

Are these the total number of diseases in the source?
Or is it the total number of diseases in the source which are represented in Mondo as "equivalentTo"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IF this is the total number of terms at the source,

  • NCIT is too low
  • UMLS is the same as in the next query. How is UMLS counted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's being counted is defined earlier in the README. The top-level section headings are following what was in the gDoc.

3452 18507 210236 data/output/all_mondo_classes_with_ncit_xref_neoplasm-branch-equivalentTo.tsv
```

#### Results for Alignment with the source - Number of terms from the sources that are in Mondo (as Mondo:equivalent)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the difference with the results above?

For instance, NCIT numbers are still very close to each other

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, UMLS is exactly the same as in the query above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is being counted in this section is defined here in the README, just a few lines up.

- GARD
- MedGen
- NANDO
- NORD
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, UMLS is managed externally by MedGen

1071
?cancerClassCount
4710
?hereditaryClassCount
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be correct as long as all diseases that have a 'has material basis in germline mutation in' annotation are also inferred to be "hereditary". I am assuming that this is the case, but I would like to confirm. Do we have some logic to infer this? (tagging @matentzn too)


#### Results - Number of Sources for Synonyms
Number of Sources Count of Synonyms
0 12389
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a lot. We should check where these came from. Could this be a consequence of the synonym sync?
Is it possible to check if it is? (outside of the PR, it is unrelated)

@sabrinatoro sabrinatoro removed their assignment Feb 4, 2025
@twhetzel twhetzel requested a review from sabrinatoro February 19, 2025 05:26
Copy link
Member

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didnt review the queries in detail, but the numbers look what I expect, and I think we can move forward with this.

I would kinda like a single make goal "metrics_dependencies" or something that has all of the queries in them that need to be run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants