Add two step classifier #431

Draft · wants to merge 23 commits into main
Conversation

@anna-charlotte (Contributor) commented Jan 16, 2025

Adding a two-step classifier that combines a logistic regression with a subsequent neural network classifier, inspired by DIA-NN. With this we aim to increase sensitivity, particularly in samples with only a few peptides present, such as single-cell samples. A rough sketch of the idea follows the step list below.

Steps:

  • add a TwoStepClassifier
  • add a parameter to the config to enable the two-step classifier
  • adjust the call in FDRManager.fit_predict(), where the classifier is trained
  • add tests
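
For context, a rough sketch of the intended flow. This is not this PR's API: the first FDR cutoff is replaced here by a simple score quantile, and the classifier settings are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier


def two_step_fit_predict(
    df: pd.DataFrame, x_cols: list[str], y_col: str, keep_quantile: float = 0.5
) -> np.ndarray:
    """Illustrative two-step flow: a cheap first classifier pre-filters candidates,
    a more expressive second classifier is then trained on the retained subset."""
    first_clf = LogisticRegression(max_iter=1000)
    first_clf.fit(df[x_cols], df[y_col])
    scores = first_clf.predict_proba(df[x_cols])[:, 1]

    # stand-in for the first FDR cutoff: keep only the better-scoring candidates
    retained = df[scores >= np.quantile(scores, keep_quantile)]

    # the second classifier sees a cleaner training set, which should help
    # sensitivity in low-input samples
    second_clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500)
    second_clf.fit(retained[x_cols], retained[y_col])

    # final scores for all candidates come from the second classifier
    return second_clf.predict_proba(df[x_cols])[:, 1]
```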

@anna-charlotte marked this pull request as draft January 16, 2025 12:53


def apply_absolute_transformations(df: pd.DataFrame) -> pd.DataFrame:
    df_transformed = df.copy()

Collaborator:

just FYI, this is a very expensive operation memory-wise.
I think we should do this in place here.

Collaborator:

couldn't we just copy the columns that are changed (assuming they are much smaller than the whole df) and merge them to the original df?

Collaborator:

Yes, that would be an option. My favourite would be to move the abs() to the location where the feature is calculated in the first place. But this is a bit complicated for prototyping.
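
A sketch of the column-level, in-place alternative discussed above; `abs_cols` is a hypothetical parameter naming the features to transform, not part of the current function signature:

```python
import pandas as pd


def apply_absolute_transformations(df: pd.DataFrame, abs_cols: list[str]) -> pd.DataFrame:
    # mutate only the affected columns in place instead of copying the whole
    # (potentially multi-GB) dataframe; abs_cols is a hypothetical argument
    for col in abs_cols:
        if col in df.columns:
            df[col] = df[col].abs()
    return df
```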

return matching_rows


def keep_best(

Collaborator:

was this function copied from the old implementation?

@anna-charlotte (Contributor, Author) commented Jan 17, 2025:

> was this function copied from here?

addressed in a previous comment :)

alphadia/fdr.py Outdated

fpr_test, tpr_test, thresholds_test = sklearn.metrics.roc_curve(
    y_test, y_test_proba
)
df.dropna(subset=available_columns, inplace=True)

Collaborator:

could we avoid the inplace operation?

Collaborator:

😄 (after reading @GeorgWa's comments going in the exact opposite direction)

Collaborator:

hahaha, code review classic :D
I think in this case we actually want to do all operations in-place for performance reasons. But we should be explicit about it.

The issue is that precursor df is a very large dataframe (multiple GBs) and doing copies here and there could easily lead to OOM errors.
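
If in-place is the intended behaviour, it can be made explicit, e.g. as a sketch (function name and signature are illustrative):

```python
import pandas as pd


def drop_na_rows_inplace(df: pd.DataFrame, columns: list[str]) -> None:
    """Drop rows with missing values in `columns`.

    Intentionally mutates `df` in place: the precursor dataframe can be several
    GB, so we avoid creating a copy.
    """
    df.dropna(subset=columns, inplace=True)
```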

logger = logging.getLogger()


def keep_best(

Collaborator:

+1 for reorganizing code, but this makes it hard to spot any changes :-)
would it be a large effort to move this back to fdr.py for now and do the reordering (=just moving) later (or: before) in a dedicated PR?

Collaborator:

To my understanding the functions were copied as they are from

def keep_best(

So there should be no changes here?

Contributor (Author):

yes exactly, it was just moved over due to a circular import issue. That one has been resolved now, so I moved it back to alphadia/fdr.py for now :)

Contributor (Author):

@GeorgWa I noticed there are duplicates of the functions keep_best(), fdr_to_q_values() and some more in alphadia/fdrx/stats.py. Is that on purpose? If so, why do we have those?


@classmethod
def _update_classifier(
    cls, classifier, df_, x_cols, y_col, fdr, group_columns

Collaborator:

docstrings missing ;-) (also at other places)
also, I would avoid underscore prefixes in variable names unless they are really required (private variables, small-scoped clashes with built-ins, named ignores..)

Collaborator:

@GeorgWa correct me if I'm wrong, but df = df.copy() would be okay in such a case?

Collaborator:

In this implementation we need the copy, good point!
I think we should only create a copy to store the indices for calculating the FDR etc. and not duplicate the features though.

Contributor (Author):

yeah, let me check all the .copy()'s and remove the redundant ones;)
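
As a sketch of the "copy only what the FDR calculation needs" idea (function and column names here are placeholders, not the PR's code):

```python
import pandas as pd


def fdr_view(df: pd.DataFrame, y_col: str, score_col: str) -> pd.DataFrame:
    # copy only the columns needed downstream for q-value estimation,
    # rather than duplicating the full feature matrix
    return df[[y_col, score_col]].copy()
```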

@@ -107,6 +125,347 @@ def from_state_dict(self, state_dict: dict):
"""


class TwoStepClassifier(Classifier):

Collaborator:

could all this new code be moved to a new module?

Collaborator:

great point! let's roll out an fdr module.

Do you think we should do this in a separate PR @mschwoer to prevent overloading of this PR with moving of 'old' functionality?

Collaborator:

new code I would put already into the new module .. moving old code would be a separate PR (cf #431 (comment))

Contributor (Author):

I moved the TwoStepClassifier and LogisticRegressionModel to a new module, which I called fdr_analysis for now, as it was name-clashing with the fdr.py file otherwise.

Collaborator:

could you move it to fdrx?

X = df_[x_cols]
y = df_[y_col]
df = df_.copy()
if hasattr(classifier, "fitted") and classifier.fitted:

Contributor (Author):

yes, you're right :)

Comment on lines 767 to +770
  if classifier_hash not in self.classifier_store:
      classifier = deepcopy(self.classifier_base)
-     classifier.from_state_dict(torch.load(os.path.join(path, file)))
+     with contextlib.suppress(Exception):
+         classifier.from_state_dict(torch.load(os.path.join(path, file)))

Contributor (Author):

@GeorgWa What is this alphadia/constants/classifier/fa9945ae23db872d.pth file that we are loading here, some pretrained model? Shall I store a similar file for the two-step-classifier?

Collaborator:

Yes exactly 👍🏻 We will do the same with the two step classifier eventually

@anna-charlotte (Contributor, Author):

@GeorgWa Should I add an e2e or performance test for the two step classifier, or just the unit tests?

@mschwoer (Collaborator) left a comment:

LGTM!

from .logistic_regression import LogisticRegressionClassifier
from .two_step_classifier import TwoStepClassifier

__all__ = ["LogisticRegressionClassifier", "TwoStepClassifier"]

Collaborator:

do we need the logic in this file? (especially the __all__ -> we're not using that idiom anywhere else in alphadia)

Comment on lines +39 to +45
self.first_classifier = first_classifier
self.second_classifier = second_classifier
self.first_fdr_cutoff = first_fdr_cutoff
self.second_fdr_cutoff = second_fdr_cutoff

self.min_precursors_for_update = min_precursors_for_update
self.train_on_top_n = train_on_top_n

Collaborator:

could those be private? (check also LogisticRegression)

Comment on lines +105 to +106
f"Stop training after iteration {i}, "
f"due to decreasing target count ({current_target_count} < {best_precursor_count})"

Collaborator:

(nit) "Stopping .." .. ".. decreased .."

df_filtered, df, x_cols, y_col, group_columns
)
else:
break

Collaborator:

would it be worth to log something here?

Collaborator:

and log if we reached max_iterations? (could use the for ... else pattern here)
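
A sketch of the suggested `for ... else` pattern; `run_iteration` is a hypothetical stand-in for one training round:

```python
import logging

logger = logging.getLogger(__name__)

max_iterations = 5
best_precursor_count = 0

for i in range(max_iterations):
    current_target_count = run_iteration()  # hypothetical helper, one training round
    if current_target_count < best_precursor_count:
        logger.info("Stopping after iteration %d: target count decreased", i)
        break
    best_precursor_count = current_target_count
else:
    # the else branch only runs if the loop completed without hitting `break`
    logger.info("Reached the maximum of %d iterations", max_iterations)
```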


return best_result

def preprocess_data(self, df: pd.DataFrame, x_cols: list[str]) -> pd.DataFrame:

Collaborator:

please check all methods for being potentially private

@@ -722,6 +766,7 @@ def fit_predict(
raise ValueError(f"Invalid decoy_strategy: {decoy_strategy}")

self.is_fitted = True
# n_precursor = len(psm_df[psm_df["qval"] <= 0.01])

Collaborator:

please delete

@@ -776,7 +821,8 @@ def load_classifier_store(self, path: None | str = None):

  if classifier_hash not in self.classifier_store:
      classifier = deepcopy(self.classifier_base)
-     classifier.from_state_dict(torch.load(os.path.join(path, file)))
+     with contextlib.suppress(Exception):

Collaborator:

isn't it dangerous to suppress Exceptions here?
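
One way to narrow the suppression, as a sketch; the exception types to catch are an assumption and would need to be checked against the actual failure mode:

```python
import logging
import os

import torch

logger = logging.getLogger(__name__)


def load_state_dict_or_warn(classifier, path: str, file: str) -> bool:
    """Sketch: only swallow the expected failure modes and leave a log trace."""
    try:
        classifier.from_state_dict(torch.load(os.path.join(path, file)))
        return True
    except (RuntimeError, KeyError) as e:  # assumed exception types, to be confirmed
        logger.warning("Could not load classifier state from %s: %s", file, e)
        return False
```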

@@ -597,6 +627,10 @@ def __init__(
self.feature_columns = feature_columns
self.classifier_store = defaultdict(list)
self.classifier_base = classifier_base
self.enable_two_step_classifier = isinstance(

Collaborator:

(nit) is_two_step_classifier (strictly speaking, it's already enabled)

Comment on lines +46 to +47
x : np.array, dtype=float
Data of shape (n_samples, n_features).

Collaborator:

at other places you are using capital X .. please choose one ;-) (check also what the rest of the code uses)

(and adapt also x_scaled)

"""
self._fitted = state_dict["_fitted"]

if self.fitted:

Collaborator:

please check if this should rather be if self._fitted:? if not, add a comment which deconfuses me :-)
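
If `fitted` is a read-only property backed by `_fitted`, the two spellings are equivalent; a minimal sketch of that pattern (whether the class actually defines such a property would need to be checked):

```python
class LogisticRegressionClassifier:
    def __init__(self):
        self._fitted = False

    @property
    def fitted(self) -> bool:
        # public read-only view of the private flag, so `if self.fitted:`
        # and `if self._fitted:` behave identically
        return self._fitted
```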
