Add anndata factory #255
base: main
Conversation
Yeah 🥳 Fantastic!
- I think how to handle duplicated protein groups is a good question - is this expected to happen? Otherwise I would raise a warning/error, drop them, and use the strategy first while pivoting (see the sketch below).
- For some downstream analyses it might be good to consider additional information from the psm files (e.g. gene names). Would it be possible to add additional metadata (e.g. a list of extra columns) to the .obs and .var attributes?
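A minimal, self-contained sketch of the first strategy mentioned above, using pandas pivot_table directly; the column names follow the snippets later in this thread, not necessarily the factory's internals:

import pandas as pd

# toy psm-style frame with a duplicated protein group in run_01
psm_df = pd.DataFrame(
    {
        "raw_name": ["run_01", "run_01", "run_02"],
        "proteins": ["P12345", "P12345", "P12345"],
        "intensity": [1.0, 3.0, 2.0],
    }
)

# 'first' keeps the first reported intensity per (raw file, protein group)
matrix = psm_df.pivot_table(
    index="raw_name", columns="proteins", values="intensity", aggfunc="first"
)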
alphabase/anndata/anndata_factory.py (Outdated)
index=PsmDfCols.RAW_NAME,
columns=PsmDfCols.PROTEINS,
values=PsmDfCols.INTENSITY,
aggfunc=np.nanmean,  # how to aggregate intensities for same protein in same raw file TODO first?
Are there scenarios in which the same protein occurs multiple times in a file? I tested the diann_test_input_mDIA.tsv with the DiannReader class and did not find any.
I think aggregating by the mean might be dangerous. One could add a test on whether there are duplicates, and at least raise a warning:

import warnings

duplicated_proteins = self._psm_df[PsmDfCols.PROTEINS].duplicated()
if duplicated_proteins.sum() > 0:
    warnings.warn(f"{duplicated_proteins.sum()} duplicated protein groups")

Alternatively, this could be an optional argument agg_duplicates: Literal["mean", "drop", "raise"], with "raise" raising a ValueError, "drop" dropping the duplicated entries, and "mean" aggregating.
user-defined agg_duplicates => https://github.com/orgs/MannLabs/projects/20/views/1?pane=issue&itemId=88563720
if missing_cols:
    raise ValueError(f"Missing required columns: {missing_cols}")

self._psm_df = psm_df
Would it be possible to add optional metadata columns to the .obs and .var attributes by passing obs_columns: Optional[Union[str, List[str]]] and var_columns: Optional[Union[str, List[str]]] to the factory class?
This would add to the complexity, as one would have to validate that the columns are in the data frame, but other than that one could just use .pivot_table while passing the list of columns.
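A minimal sketch of what the .var side of this could look like, assuming a hypothetical var_columns argument and the column names used elsewhere in this thread; this is not the factory's actual implementation, and column validation is omitted:

from typing import List, Optional

import anndata as ad
import pandas as pd


def to_anndata(psm_df: pd.DataFrame, var_columns: Optional[List[str]] = None) -> ad.AnnData:
    # wide matrix: raw files as observations (.obs), protein groups as variables (.var)
    matrix = psm_df.pivot_table(
        index="raw_name", columns="proteins", values="intensity", aggfunc="mean"
    )
    adata = ad.AnnData(matrix)
    if var_columns:
        # one metadata row per protein group, e.g. gene names
        var_df = (
            psm_df[["proteins", *var_columns]]
            .drop_duplicates(subset="proteins")
            .set_index("proteins")
        )
        adata.var = adata.var.join(var_df)
    return adata

The .obs side would work analogously with columns that are constant per raw file.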
requirements.txt (Outdated)
@@ -1,3 +1,4 @@
anndata==0.11.1
👍
the problem is that in order to use 0.11.1 we would need to drop support for Python 3.8 (the latest anndata versions that still support it are 0.10.9, 0.11.0rc1, 0.11.0rc2) .. would be an argument for moving this module out of alphabase
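If keeping Python 3.8 support is a hard requirement, one possible compromise (just a sketch, with the pins taken from the versions mentioned above) would be environment markers in requirements.txt:

# requirements.txt sketch: newer anndata only where the interpreter allows it
anndata==0.10.9; python_version < "3.9"
anndata==0.11.1; python_version >= "3.9"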
Really cool! Is it intended that the intensity column has to be passed explicitly for the DIA-NN reader? When I run the following code, it fails due to a missing argument for the intensity column.

url = r"https://datashare.biochem.mpg.de/s/Hk41INtwBvBl0kP/download?path=%2F&files=diann_1.9.0_report_head.tsv"

with tempfile.TemporaryDirectory() as temp_dir:
    file_path = DataShareDownloader(
        url=url, output_dir=temp_dir
    ).download()
    factory = AnnDataFactory.from_files(
        file_paths=file_path,
        reader_type="diann"
    )
#> ValueError: Missing required columns: ['intensity']

In contrast, this code works very well.

factory = AnnDataFactory.from_files(
    file_paths=file_path,
    reader_type="diann",
    intensity_column="PG.MaxLFQ"
)
# Works

Edit: Ah, I see that this is related to the configuration in alphabase/constants/const_files/psm_reader.yaml.
From a user perspective: it might be helpful to see all available readers directly from the anndata factory, e.g. by having a small class method that returns a list of available readers.
Edit: Nevermind, I see that this is already implemented.
'uniprot_ids': 'Protein.Ids'
'genes': 'Genes'
'scan_num': 'MS2.Scan'
'score': 'CScore'
'fdr': 'Q.Value'
# 'intensity': "PG.MaxLFQ"
From the diann repo, README.md / Output / Main Report:
"MaxLFQ means normalised protein quantity calculated using the MaxLFQ algorithm - it is strongly recommended to use these MaxLFQ quantities and not the regular quantities (also reported by DIA-NN)"
So I think this is the generally accepted quantity to use.
added it to psm_reader.yaml, updated the example in the notebook showing how to use custom columns
'sequence': 'Stripped.Sequence'
'charge': 'Precursor.Charge'
'rt': 'RT'
'rt_start': 'RT.Start'
'rt_stop': 'RT.Stop'
'ccs': 'CCS'
'mobility': ['IM','IonMobility']
'proteins': 'Protein.Names' # Protein.Group ?
which one to use here, @GeorgWa @vbrennsteiner?
and: if we change it, this would be a breaking change .. how to deal with that?
It seems like the difference is whether the UniProt names (Protein.Names) or potentially different identifiers are used, but to me it sounds like the information is the same.
From the official DIA-NN docs:
- Protein.Group - inferred proteins. See the description of the Protein inference GUI setting and the --relaxed-prot-inf option.
  --relaxed-prot-inf instructs DIA-NN to use a very heuristical protein inference algorithm (similar to the one used by FragPipe and many other software tools), wherein DIA-NN aims to make sure that no protein is present simultaneously in multiple protein groups. This mode (i) is recommended for method optimisation & benchmarks, (ii) might be convenient for gene set enrichment analysis and related kinds of downstream processing. However the alternative protein inference strategy of DIA-NN is more reliable for differential expression analyses (this is one of the advantages of DIA-NN). Equivalent to the 'Heuristic protein inference' GUI setting, which is activated by default since DIA-NN 1.8.1
- Protein.Ids - all proteins matched to the precursor in the library or, in case of library-free search, in the sequence database
- Protein.Names - names (UniProt names) of the proteins in the Protein.Group
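One way to avoid a breaking change would be to keep the current default mapping and make the choice configurable. Below is only a sketch reusing file_path from the snippet earlier in this thread; the protein_id_column parameter is hypothetical (modelled on the existing intensity_column override) and not part of the current API:

# default stays 'proteins' -> 'Protein.Names'; opt in to protein groups explicitly
factory = AnnDataFactory.from_files(
    file_paths=file_path,
    reader_type="diann",
    intensity_column="PG.MaxLFQ",
    protein_id_column="Protein.Group",  # hypothetical parameter, not in the current API
)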
Add first version of anndata conversion.