Add two step classifier #431

Draft · wants to merge 23 commits into main
Conversation

@anna-charlotte (Contributor) commented Jan 16, 2025

Adding a two-step classifier that combines a logistic regression with a subsequent neural network classifier, inspired by DIA-NN. With this we aim to increase sensitivity, particularly in samples with only a few peptides present, such as single-cell samples. A rough sketch of the idea follows the step list below.

Steps:

  • add a TwoStepClassifier
  • add a parameter to the config to enable the two-step classifier
  • adjust the call in FDRManager.fit_predict(), where the classifier is trained
  • add tests
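
For context, a rough sketch of the intended flow. This is not this PR's API: the first FDR cutoff is replaced here by a simple score quantile, and the classifier settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


def two_step_fit_predict(
    df: pd.DataFrame, x_cols: list[str], y_col: str, keep_quantile: float = 0.5
) -> np.ndarray:
    """Illustrative two-step flow: a cheap first classifier pre-filters candidates,
    a more expressive second classifier is then trained on the retained subset."""
    first_clf = LogisticRegression(max_iter=1000)
    first_clf.fit(df[x_cols], df[y_col])
    scores = first_clf.predict_proba(df[x_cols])[:, 1]

    # stand-in for the first FDR cutoff: keep only the better-scoring candidates
    retained = df[scores >= np.quantile(scores, keep_quantile)]

    # the second classifier sees a cleaner training set, which should help
    # sensitivity in low-input samples
    second_clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500)
    second_clf.fit(retained[x_cols], retained[y_col])

    # final scores for all candidates come from the second classifier
    return second_clf.predict_proba(df[x_cols])[:, 1]
```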

@anna-charlotte marked this pull request as draft January 16, 2025 12:53


def apply_absolute_transformations(df: pd.DataFrame) -> pd.DataFrame:
    df_transformed = df.copy()

Collaborator:

just FYI, this is a very expensive operation memory-wise.
I think we should do this in place here.

Collaborator:

couldn't we just copy the columns that are changed (assuming they are much smaller than the whole df) and merge them to the original df?

Collaborator:

Yes, that would be an option. My favourite would be to move the abs() to the location where the feature is calculated in the first place. But this is a bit complicated for prototyping.
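
A sketch of the column-level, in-place alternative discussed above; `abs_cols` is a hypothetical parameter naming the features to transform, not part of the current function signature:

```python
import pandas as pd


def apply_absolute_transformations(df: pd.DataFrame, abs_cols: list[str]) -> pd.DataFrame:
    # mutate only the affected columns in place instead of copying the whole
    # (potentially multi-GB) dataframe; abs_cols is a hypothetical argument
    for col in abs_cols:
        if col in df.columns:
            df[col] = df[col].abs()
    return df
```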

return matching_rows


def keep_best(

Collaborator:

was this function copied from the old implementation?

@anna-charlotte (Contributor, Author) commented Jan 17, 2025:

> was this function copied from here?

addressed in a previous comment :)

alphadia/fdr.py Outdated

fpr_test, tpr_test, thresholds_test = sklearn.metrics.roc_curve(
    y_test, y_test_proba
)
df.dropna(subset=available_columns, inplace=True)

Collaborator:

could we avoid the inplace operation?

Collaborator:

😄 (after reading @GeorgWa's comments going in the exact opposite direction)

Collaborator:

hahaha, code review classic :D
I think in this case we actually want to do all operations in-place for performance reasons. But we should be explicit about it.

The issue is that precursor df is a very large dataframe (multiple GBs) and doing copies here and there could easily lead to OOM errors.
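
If in-place is the intended behaviour, it can be made explicit, e.g. as a sketch (function name and signature are illustrative):

```python
import pandas as pd


def drop_na_rows_inplace(df: pd.DataFrame, columns: list[str]) -> None:
    """Drop rows with missing values in `columns`.

    Intentionally mutates `df` in place: the precursor dataframe can be several
    GB, so we avoid creating a copy.
    """
    df.dropna(subset=columns, inplace=True)
```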

logger = logging.getLogger()


def keep_best(

Collaborator:

+1 for reorganizing code, but this makes it hard to spot any changes :-)
would it be a large effort to move this back to fdr.py for now and do the reordering (=just moving) later (or: before) in a dedicated PR?

Collaborator:

To my understanding the functions were copied as they are from

def keep_best(

So there should be no changes here?

Contributor (Author):

yes exactly, it was just moved over due to a circular import issue. That one has been resolved now, so I moved it back to alphadia/fdr.py for now :)

Contributor (Author):

@GeorgWa I noticed there are duplicates of the functions keep_best(), fdr_to_q_values() and some more in alphadia/fdrx/stats.py. Is that on purpose? If so, why do we have those?


@classmethod
def _update_classifier(
    cls, classifier, df_, x_cols, y_col, fdr, group_columns

Collaborator:

docstrings missing ;-) (also at other places)
also, I would avoid underscore prefixes in variable names unless they are really required (private variables, small-scoped clashes with built-ins, named ignores..)

Collaborator:

@GeorgWa correct me if I'm wrong, but df = df.copy() would be okay in such a case?

Collaborator:

In this implementation we need the copy, good point!
I think we should only create a copy to store the indices for calculating the FDR etc. and not duplicate the features though.

Contributor (Author):

yeah, let me check all the .copy()'s and remove the redundant ones;)
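
As a sketch of the "copy only what the FDR calculation needs" idea (function and column names here are placeholders, not the PR's code):

```python
import pandas as pd


def fdr_view(df: pd.DataFrame, y_col: str, score_col: str) -> pd.DataFrame:
    # copy only the columns needed downstream for q-value estimation,
    # rather than duplicating the full feature matrix
    return df[[y_col, score_col]].copy()
```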

@@ -107,6 +125,347 @@ def from_state_dict(self, state_dict: dict):
"""


class TwoStepClassifier(Classifier):

Collaborator:

could all this new code be moved to a new module?

Collaborator:

great point! let's roll out an fdr module.

Do you think we should do this in a separate PR @mschwoer to prevent overloading of this PR with moving of 'old' functionality?

Collaborator:

new code I would put already into the new module .. moving old code would be a separate PR (cf #431 (comment))

Contributor (Author):

I moved the TwoStepClassifier and LogisticRegressionModel to a new module, which I called fdr_analysis for now, as it was name-clashing with the fdr.py file otherwise.

Collaborator:

could you move it to fdrx?

X = df_[x_cols]
y = df_[y_col]
df = df_.copy()
if hasattr(classifier, "fitted") and classifier.fitted:

Contributor (Author):

yes, you're right :)

Comment on lines 767 to +770
  if classifier_hash not in self.classifier_store:
      classifier = deepcopy(self.classifier_base)
-     classifier.from_state_dict(torch.load(os.path.join(path, file)))
+     with contextlib.suppress(Exception):
+         classifier.from_state_dict(torch.load(os.path.join(path, file)))

Contributor (Author):

@GeorgWa What is this alphadia/constants/classifier/fa9945ae23db872d.pth file that we are loading here, some pretrained model? Shall I store a similar file for the two-step-classifier?

Collaborator:

Yes exactly 👍🏻 We will do the same with the two step classifier eventually

@anna-charlotte (Contributor, Author):

@GeorgWa Should I add an e2e or performance test for the two step classifier, or just the unit tests?

@mschwoer (Collaborator) left a comment:

LGTM!

from .logistic_regression import LogisticRegressionClassifier
from .two_step_classifier import TwoStepClassifier

__all__ = ["LogisticRegressionClassifier", "TwoStepClassifier"]

Collaborator:

do we need the logic in this file? (especially the __all__ -> we're not using that idiom anywhere else in alphadia)

Comment on lines +39 to +45
self.first_classifier = first_classifier
self.second_classifier = second_classifier
self.first_fdr_cutoff = first_fdr_cutoff
self.second_fdr_cutoff = second_fdr_cutoff

self.min_precursors_for_update = min_precursors_for_update
self.train_on_top_n = train_on_top_n

Collaborator:

could those be private? (check also LogisticRegression)

Comment on lines +105 to +106
f"Stop training after iteration {i}, "
f"due to decreasing target count ({current_target_count} < {best_precursor_count})"

Collaborator:

(nit) "Stopping .." .. ".. decreased .."

df_filtered, df, x_cols, y_col, group_columns
)
else:
break

Collaborator:

would it be worth to log something here?

Collaborator:

and log if we reached max_iterations? (could use the for ... else pattern here)
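
A sketch of the suggested `for ... else` pattern; `run_iteration` is a hypothetical stand-in for one training round:

```python
import logging

logger = logging.getLogger(__name__)

max_iterations = 5
best_precursor_count = 0

for i in range(max_iterations):
    current_target_count = run_iteration()  # hypothetical helper, one training round
    if current_target_count < best_precursor_count:
        logger.info("Stopping after iteration %d: target count decreased", i)
        break
    best_precursor_count = current_target_count
else:
    # the else branch only runs if the loop completed without hitting `break`
    logger.info("Reached the maximum of %d iterations", max_iterations)
```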


return best_result

def preprocess_data(self, df: pd.DataFrame, x_cols: list[str]) -> pd.DataFrame:

Collaborator:

please check all methods for being potentially private

@@ -722,6 +766,7 @@ def fit_predict(
raise ValueError(f"Invalid decoy_strategy: {decoy_strategy}")

self.is_fitted = True
# n_precursor = len(psm_df[psm_df["qval"] <= 0.01])

Collaborator:

please delete

@@ -776,7 +821,8 @@ def load_classifier_store(self, path: None | str = None):

  if classifier_hash not in self.classifier_store:
      classifier = deepcopy(self.classifier_base)
-     classifier.from_state_dict(torch.load(os.path.join(path, file)))
+     with contextlib.suppress(Exception):

Collaborator:

isn't it dangerous to suppress Exceptions here?
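
One way to narrow the suppression, as a sketch; the exception types to catch are an assumption and would need to be checked against the actual failure mode:

```python
import logging
import os

import torch

logger = logging.getLogger(__name__)


def load_state_dict_or_warn(classifier, path: str, file: str) -> bool:
    """Sketch: only swallow the expected failure modes and leave a log trace."""
    try:
        classifier.from_state_dict(torch.load(os.path.join(path, file)))
        return True
    except (RuntimeError, KeyError) as e:  # assumed exception types, to be confirmed
        logger.warning("Could not load classifier state from %s: %s", file, e)
        return False
```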

@@ -597,6 +627,10 @@ def __init__(
self.feature_columns = feature_columns
self.classifier_store = defaultdict(list)
self.classifier_base = classifier_base
self.enable_two_step_classifier = isinstance(

Collaborator:

(nit) is_two_step_classifier (strictly speaking, it's already enabled)

Comment on lines +46 to +47
x : np.array, dtype=float
Data of shape (n_samples, n_features).

Collaborator:

at other places you are using capital X .. please choose one ;-) (check also what the rest of the code uses)

(and adapt also x_scaled)

"""
self._fitted = state_dict["_fitted"]

if self.fitted:

Collaborator:

please check if this should rather be if self._fitted:? if not, add a comment which deconfuses me :-)
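
If `fitted` is a read-only property backed by `_fitted`, the two spellings are equivalent; a minimal sketch of that pattern (whether the class actually defines such a property would need to be checked):

```python
class LogisticRegressionClassifier:
    def __init__(self):
        self._fitted = False

    @property
    def fitted(self) -> bool:
        # public read-only view of the private flag, so `if self.fitted:`
        # and `if self._fitted:` behave identically
        return self._fitted
```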
