Add anndata factory #255

mschwoer · 2024-11-22T15:19:35Z

Add first version of anndata conversion.

lucas-diedrich

Yeah 🥳 Fantastic!

I think how to handle duplicated protein groups is a good question. Is this expected to happen? Otherwise, I would raise a warning/error, drop them, and use the aggregation strategy "first" while pivoting.
For some downstream analyses, it might be good to consider additional information from the PSM files (e.g. gene names). Would it be possible to add additional metadata to the metadata attributes (e.g. as a list of columns for the .obs and .var attributes)?

lucas-diedrich · 2024-11-22T15:39:27Z

alphabase/anndata/anndata_factory.py

+            index=PsmDfCols.RAW_NAME,
+            columns=PsmDfCols.PROTEINS,
+            values=PsmDfCols.INTENSITY,
+            aggfunc=np.nanmean,  # how to aggregate intensities for same protein in same raw file TODO first?


Are there scenarios in which the same protein occurs multiple times in a file? I tested the diann_test_input_mDIA.tsv with the DiannReader class and did not find any.

I think aggregating by the mean might be dangerous. One could add a test on whether there are duplicates, and at least raise a warning.

duplicated_proteins = self._psm_df[PsmDfCols.PROTEINS].duplicated()
if duplicated_proteins.sum() > 0:
    warnings.warn(f"{duplicated_proteins.sum()} duplicated protein groups")

Alternatively, this could be an optional argument agg_duplicates: Literal["mean", "drop", "raise"], with "raise" raising a ValueError, "drop" dropping the duplicated entries, and "mean" aggregating them by their mean.

user-defined agg_duplicates => https://github.com/orgs/MannLabs/projects/20/views/1?pane=issue&itemId=88563720
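
A minimal sketch of how the proposed agg_duplicates option could behave, for illustration only. The function name pivot_intensities and the literal column names "raw_name", "proteins" and "intensity" are assumptions standing in for the PsmDfCols constants; this is not the factory's actual implementation. Here a duplicate means a (raw file, protein group) pair that occurs more than once.

```python
from typing import Literal

import pandas as pd


def pivot_intensities(
    psm_df: pd.DataFrame,
    agg_duplicates: Literal["mean", "drop", "raise"] = "raise",
) -> pd.DataFrame:
    """Pivot long-format intensities into a raw_name x proteins matrix.

    Hypothetical helper: column names are assumed, not taken from alphabase.
    """
    if agg_duplicates not in ("mean", "drop", "raise"):
        raise ValueError(f"Unknown agg_duplicates option: {agg_duplicates!r}")

    keys = ["raw_name", "proteins"]
    # Mark every row whose (raw file, protein group) pair occurs more than once.
    is_duplicate = psm_df.duplicated(subset=keys, keep=False)

    if is_duplicate.any():
        if agg_duplicates == "raise":
            raise ValueError(
                f"{int(is_duplicate.sum())} rows belong to duplicated protein groups"
            )
        if agg_duplicates == "drop":
            # Remove all copies of a duplicated pair, not only the later ones.
            psm_df = psm_df[~is_duplicate]

    # With unique pairs the mean is the value itself; with "mean" it aggregates.
    return psm_df.pivot_table(
        index="raw_name", columns="proteins", values="intensity", aggfunc="mean"
    )
```

Defaulting to "raise" makes silent averaging an explicit choice; whether "drop" should remove all copies or keep the first one is a design decision this sketch does not settle.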

lucas-diedrich · 2024-11-22T15:47:47Z

alphabase/anndata/anndata_factory.py

+        if missing_cols:
+            raise ValueError(f"Missing required columns: {missing_cols}")
+
+        self._psm_df = psm_df


Would it be possible to add optional metadata columns to the .obs and .var attributes by passing obs_columns: Optional[Union[str, List[str]]] and var_columns: Optional[Union[str, List[str]]] to the factory class?

This would add to the complexity, since one would have to validate that the columns are in the data frame, but other than that one could just use .pivot_table while passing the list of columns.

=> https://github.com/orgs/MannLabs/projects/20/views/1?pane=issue&itemId=88563842
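
A rough sketch of the validation step mentioned above, under stated assumptions. The helper collect_metadata and the example column names are hypothetical, and instead of .pivot_table it keeps the first row per index entry via drop_duplicates, which is a swapped-in, simpler technique. The resulting frame has the shape one would expect for an .obs or .var table (one row per observation or variable).

```python
from typing import List, Optional, Union

import pandas as pd


def collect_metadata(
    psm_df: pd.DataFrame,
    index_column: str,
    metadata_columns: Optional[Union[str, List[str]]] = None,
) -> pd.DataFrame:
    """Return one metadata row per unique value of index_column.

    Hypothetical helper, not part of alphabase.
    """
    # Normalize the argument to a list of column names.
    if metadata_columns is None:
        columns: List[str] = []
    elif isinstance(metadata_columns, str):
        columns = [metadata_columns]
    else:
        columns = list(metadata_columns)

    # Validate that every requested column exists before touching the data.
    missing = [col for col in columns if col not in psm_df.columns]
    if missing:
        raise ValueError(f"Missing metadata columns: {missing}")

    # Keep the first row seen for each index entry, then index and sort by it.
    return (
        psm_df[[index_column, *columns]]
        .drop_duplicates(subset=index_column)
        .set_index(index_column)
        .sort_index()
    )
```

Taking the first row silently hides conflicting metadata for the same entry, so a stricter variant might check that each group holds a single unique value.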

lucas-diedrich · 2024-11-25T14:37:25Z

requirements.txt

@@ -1,3 +1,4 @@
+anndata==0.11.1


The problem is that in order to use 0.11.1 we would need to drop support for Python 3.8 (latest supported versions are 0.10.9, 0.11.0rc1, 0.11.0rc2). This would be an argument for moving this module out of alphabase.

review-notebook-app · 2024-11-26T10:25:10Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

lucas-diedrich · 2025-01-21T08:52:07Z

Really cool! Is it intended that the alphabase.psm_reader.dia_psm_reader.DiannReader does not have a registered intensity column?

When I run the following code, it fails due to a missing argument for the intensity column.

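import tempfile

from alphabase.anndata.anndata_factory import AnnDataFactory
from alphabase.tools.data_downloader import DataShareDownloader  # import path assumed

url = r"https://datashare.biochem.mpg.de/s/Hk41INtwBvBl0kP/download?path=%2F&files=diann_1.9.0_report_head.tsv"
with tempfile.TemporaryDirectory() as temp_dir:
    # Download the test report into a temporary directory
    file_path = DataShareDownloader(url=url, output_dir=temp_dir).download()

    # No intensity column is passed, so the reader's default mapping is used
    factory = AnnDataFactory.from_files(
        file_paths=file_path,
        reader_type="diann",
    )
#> ValueError: Missing required columns: ['intensity']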

In contrast, this code works very well.

factory = AnnDataFactory.from_files(
    file_paths=file_path,
    reader_type="diann",
    intensity_column="PG.MaxLFQ",
)
# Works

Edit: Ah, I see that this is related to the configuration in the alphabase/constants/const_files/psm_reader.yaml.
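
To illustrate why the first call fails and the second succeeds, here is a simplified sketch of how a configured column mapping and a user-supplied override might be combined. The function resolve_columns and the REQUIRED list are hypothetical and do not mirror alphabase's internals; the mapping entries are the ones visible in the psm_reader.yaml excerpt of this PR.

```python
from typing import Dict, List, Optional

# Subset of the DIA-NN mapping shown in psm_reader.yaml (no 'intensity' entry yet).
DEFAULT_MAPPING: Dict[str, str] = {
    "proteins": "Protein.Names",
    "genes": "Genes",
    "fdr": "Q.Value",
}

# Assumed list of columns the factory needs; illustrative only.
REQUIRED: List[str] = ["proteins", "intensity"]


def resolve_columns(
    default_mapping: Dict[str, str],
    required: List[str],
    intensity_column: Optional[str] = None,
) -> Dict[str, str]:
    """Merge a user override into the default mapping and validate it."""
    mapping = dict(default_mapping)  # copy so the defaults stay untouched
    if intensity_column is not None:
        mapping["intensity"] = intensity_column

    missing_cols = [name for name in required if name not in mapping]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")
    return mapping
```

Calling resolve_columns(DEFAULT_MAPPING, REQUIRED) raises the ValueError, while passing intensity_column="PG.MaxLFQ" returns a complete mapping; adding a default 'intensity' entry to the configuration would make the override optional.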

lucas-diedrich · 2025-01-21T08:55:36Z

From a user perspective: it might be helpful to see all available readers directly from the anndata factory, e.g. by having a small class method that returns a list of available readers (something like get_available_readers). That would basically wrap the alphabase.psm_reader.psm_reader_provider.reader_dict attribute.

Edit: Never mind, I see that this is implemented in the load function.

lucas-diedrich · 2025-01-21T09:09:25Z

alphabase/constants/const_files/psm_reader.yaml

    'uniprot_ids': 'Protein.Ids'
    'genes': 'Genes'
    'scan_num': 'MS2.Scan'
    'score': 'CScore'
    'fdr': 'Q.Value'
+#    'intensity': "PG.MaxLFQ"


From the diann-repo

README.md/Output/Main Report

MaxLFQ means normalised protein quantity calculated using the MaxLFQ algorithm - it is strongly recommended to use these MaxLFQ quantities and not the regular quantities (also reported by DIA-NN)

So I think this is the generally accepted quantity to use.

added it to psm_reader.yaml, updated the example in the notebook showing how to use custom columns

mschwoer · 2025-01-21T10:55:11Z

alphabase/constants/const_files/psm_reader.yaml

    'sequence': 'Stripped.Sequence'
    'charge': 'Precursor.Charge'
    'rt': 'RT'
    'rt_start': 'RT.Start'
    'rt_stop': 'RT.Stop'
    'ccs': 'CCS'
    'mobility': ['IM','IonMobility']
-    'proteins': 'Protein.Names'
+    'proteins': 'Protein.Names' # Protein.Group ?


which one to use here @GeorgWa @vbrennsteiner ?

and: if we change it, this would be a breaking change .. how to deal with that?

It seems like the difference is whether the Uniprot Names (Protein.Names) or potentially different names are utilized, but to me, it sounds like the information is the same.

From the official DIANN Docs:

Protein.Group - inferred proteins. See the description of the Protein inference GUI setting and the --relaxed-prot-inf option.

--relaxed-prot-inf instructs DIA-NN to use a very heuristical protein inference algorithm (similar to the one used by FragPipe and many other software tools), wherein DIA-NN aims to make sure that no protein is present simultaneously in multiple protein groups. This mode (i) is recommended for method optimisation & benchmarks, (ii) might be convenient for gene set enrichment analysis and related kinds of downstream processing. However the alternative protein inference strategy of DIA-NN is more reliable for differential expression analyses (this is one of the advantages of DIA-NN). Equivalent to the 'Heuristic protein inference' GUI setting, which is activated by default since DIA-NN 1.8.1

Protein.Ids - all proteins matched to the precursor in the library or, in case of library-free search, in the sequence database

Protein.Names - names (UniProt names) of the proteins in the Protein.Group

mschwoer requested review from GeorgWa, lucas-diedrich and vbrennsteiner November 22, 2024 15:19

lucas-diedrich reviewed Nov 22, 2024

lucas-diedrich approved these changes Nov 22, 2024

mschwoer force-pushed the add_anndata_factory branch from 36f0ce4 to c19e9ce Compare November 25, 2024 14:01

lucas-diedrich reviewed Nov 25, 2024

lucas-diedrich approved these changes Nov 25, 2024

mschwoer force-pushed the add_anndata_factory branch from f6bb454 to 427e64a Compare November 25, 2024 16:06

mschwoer changed the base branch from refactor_readers_XI to add_alphadia_reader November 25, 2024 16:09

mschwoer force-pushed the add_alphadia_reader branch from 3e745b3 to 239e2ef Compare November 26, 2024 09:35

mschwoer added 6 commits November 26, 2024 10:35

add first anndata conversion

fe00974

add first anndata conversion

24572c1

add anndata dependency

0d2042c

add possibility to add custom column mapping

1a84472

nice error message on unsupported readers

542cb1b

refactor and add unit tests

8ce3ce5

mschwoer force-pushed the add_anndata_factory branch from 427e64a to 8ce3ce5 Compare November 26, 2024 09:35

mschwoer added 2 commits November 26, 2024 10:58

add tutorial notebook

1c71385

add anndata tests

d51c7ed

mschwoer added 2 commits November 26, 2024 11:29

use only first protein

40a44bf

add reference data

6fdc7ee

mschwoer requested a review from lucas-diedrich November 26, 2024 10:36

mschwoer marked this pull request as ready for review November 26, 2024 10:36

mschwoer added 2 commits November 26, 2024 15:38

add potentially correct columns

1af8e6f

use potentially correct columns

b7e9a90

Base automatically changed from add_alphadia_reader to development January 9, 2025 16:27

lucas-diedrich reviewed Jan 21, 2025

mschwoer added 4 commits January 21, 2025 11:37

add intensity column to diann output

e833839

adapt test references

6200f12

Merge branch 'development' into add_anndata_factory

75e3ef4

# Conflicts: # requirements.txt # tests/integration/test_psm_readers.py

fix merge conflicts

131e2de

lucas-diedrich approved these changes Jan 21, 2025

mschwoer commented Jan 21, 2025

reorder

1e6b3d6

lucas-diedrich approved these changes Jan 21, 2025

mschwoer changed the base branch from development to main January 22, 2025 09:43
