Add two step classifier #431
base: main
Conversation
alphadia/fdrexperimental.py
Outdated
def apply_absolute_transformations(df: pd.DataFrame) -> pd.DataFrame:
    df_transformed = df.copy()
just FYI, this is a very expensive operation memory-wise.
I think we should do this in place here.
couldn't we just copy the columns that are changed (assuming they are much smaller than the whole df) and merge them to the original df?
Yes, that would be an option. My favourite would be to move the abs() to the location where the feature is calculated in the first place. But this is a bit complicated for prototyping.
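For illustration, a minimal sketch of the column-level alternative discussed above; the column names and the module-level list are placeholders, not the actual alphadia feature columns:

```python
import numpy as np
import pandas as pd

# Hypothetical set of feature columns that need the absolute-value transform.
ABS_FEATURE_COLUMNS = ["delta_rt", "delta_mz"]


def apply_absolute_transformations(df: pd.DataFrame) -> pd.DataFrame:
    """Apply abs() in place, touching only the affected columns instead of df.copy()."""
    for col in ABS_FEATURE_COLUMNS:
        if col in df.columns:
            df[col] = np.abs(df[col])
    return df
```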
alphadia/fdrexperimental.py
Outdated
    return matching_rows


def keep_best(
was this function copied from the old implementation?
was this function copied from here?
addressed in a previous comment :)
alphadia/fdr.py
Outdated
fpr_test, tpr_test, thresholds_test = sklearn.metrics.roc_curve(
    y_test, y_test_proba
)
df.dropna(subset=available_columns, inplace=True)
could we avoid the inplace operation?
😄 (after reading @GeorgWa's comments going in the exact opposite direction)
hahaha, code review classic :D
I think in this case we actually want to do all operations in-place for performance reasons. But we should be explicit about it.
The issue is that the precursor df is a very large dataframe (multiple GBs), and doing copies here and there could easily lead to OOM errors.
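A sketch of what "explicit about it" could look like; the function name and docstring are illustrative, not code from this PR:

```python
import pandas as pd


def drop_na_features_inplace(df: pd.DataFrame, available_columns: list[str]) -> None:
    """Drop rows with missing feature values.

    Deliberately mutates ``df`` in place and returns ``None``: the precursor
    dataframe can be several GB, so returning a modified copy risks OOM.
    """
    df.dropna(subset=available_columns, inplace=True)
```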
alphadia/fdr_utils.py
Outdated
logger = logging.getLogger()


def keep_best(
+1 for reorganizing code, but this makes it hard to spot any changes :-)
would it be a large effort to move this back to fdr.py for now and do the reordering (= just moving) later (or: before) in a dedicated PR?
To my understanding the functions were copied as they are from def keep_best( (line 171 in da99596), so there should be no changes here?
yes exactly, it was just moved over due to a circular import issue. That one has been resolved now, so I moved it back to alphadia/fdr.py for now :)
@GeorgWa I noticed there are duplicates of the functions keep_best(), fdr_to_q_values() and some more in alphadia/fdrx/stats.py. Is that on purpose? If so, why do we have those?
alphadia/fdrexperimental.py
Outdated
@classmethod
def _update_classifier(
    cls, classifier, df_, x_cols, y_col, fdr, group_columns
docstrings missing ;-) (also at other places)
also, I would avoid underscore prefixes in variable names unless they are really required (private variables, small-scoped clashes with built-ins, named ignores..)
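For illustration, one possible shape for the requested docstring with non-underscored parameter names; the signature, types and descriptions here are assumptions, not the PR's actual code:

```python
import pandas as pd


class TwoStepClassifier:
    # Sketch only; the real class carries more state and methods.

    @classmethod
    def _update_classifier(
        cls,
        classifier,
        df: pd.DataFrame,
        x_cols: list[str],
        y_col: str,
        fdr_cutoff: float,
        group_columns: list[str],
    ):
        """Retrain ``classifier`` on ``df`` and evaluate it at ``fdr_cutoff``.

        Parameters
        ----------
        classifier : Classifier
            Classifier instance to update.
        df : pd.DataFrame
            Precursor dataframe holding features and labels.
        x_cols : list[str]
            Names of the feature columns.
        y_col : str
            Name of the target/decoy label column.
        fdr_cutoff : float
            FDR threshold used to count identified precursors.
        group_columns : list[str]
            Columns used to group precursors before FDR estimation.
        """
```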
@GeorgWa correct me if I'm wrong, but df = df.copy() would be okay in such a case?
In this implementation we need the copy, good point!
I think we should only create a copy to store the indices for calculating the FDR etc. and not duplicate the features though.
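A sketch of copying only what the FDR calculation needs instead of the full feature matrix; the "proba" and "decoy" column names are assumptions for illustration:

```python
import pandas as pd


def extract_fdr_columns(df: pd.DataFrame, score_column: str = "proba") -> pd.DataFrame:
    """Return a small copy with only the score and decoy columns (index preserved),
    so downstream FDR/q-value code never duplicates the multi-GB feature data."""
    return df[[score_column, "decoy"]].copy()
```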
yeah, let me check all the .copy()'s and remove the redundant ones ;)
alphadia/fdrexperimental.py
Outdated
@@ -107,6 +125,347 @@ def from_state_dict(self, state_dict: dict):
    """


class TwoStepClassifier(Classifier):
could all this new code be moved to a new module?
great point! let's roll out an fdr module.
Do you think we should do this in a separate PR @mschwoer to prevent overloading this PR with moving of 'old' functionality?
new code I would put already into the new module .. moving old code would be a separate PR (cf #431 (comment))
I moved the TwoStepClassifier and LogisticRegressionModel to a new module, which I called fdr_analysis for now, as it was name-clashing with the fdr.py file otherwise.
could you move it to fdrx?
alphadia/fdrexperimental.py
Outdated
X = df_[x_cols]
y = df_[y_col]
df = df_.copy()
if hasattr(classifier, "fitted") and classifier.fitted:
yes, you're right :)
  if classifier_hash not in self.classifier_store:
      classifier = deepcopy(self.classifier_base)
-     classifier.from_state_dict(torch.load(os.path.join(path, file)))
+     with contextlib.suppress(Exception):
+         classifier.from_state_dict(torch.load(os.path.join(path, file)))
@GeorgWa What is this alphadia/constants/classifier/fa9945ae23db872d.pth file that we are loading here, some pretrained model? Shall I store a similar file for the two-step-classifier?
Yes exactly 👍🏻 We will do the same with the two step classifier eventually
@GeorgWa Should I add an e2e or performance test for the two step classifier, or just the unit tests?
LGTM!
from .logistic_regression import LogisticRegressionClassifier
from .two_step_classifier import TwoStepClassifier

__all__ = ["LogisticRegressionClassifier", "TwoStepClassifier"]
do we need the logic in this file? (especially the __all__ -> we're not using that idiom anywhere else in alphadia)
self.first_classifier = first_classifier
self.second_classifier = second_classifier
self.first_fdr_cutoff = first_fdr_cutoff
self.second_fdr_cutoff = second_fdr_cutoff

self.min_precursors_for_update = min_precursors_for_update
self.train_on_top_n = train_on_top_n
could those be private? (check also LogisticRegression)
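What "private" could look like here; a sketch under the assumption that these attributes are only used internally, so whether any of them need a public property is a design choice for the PR:

```python
class TwoStepClassifier:
    def __init__(
        self,
        first_classifier,
        second_classifier,
        first_fdr_cutoff: float,
        second_fdr_cutoff: float,
        min_precursors_for_update: int,
        train_on_top_n: int,
    ):
        # Underscore prefixes mark these as internal state of the classifier.
        self._first_classifier = first_classifier
        self._second_classifier = second_classifier
        self._first_fdr_cutoff = first_fdr_cutoff
        self._second_fdr_cutoff = second_fdr_cutoff
        self._min_precursors_for_update = min_precursors_for_update
        self._train_on_top_n = train_on_top_n
```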
f"Stop training after iteration {i}, " | ||
f"due to decreasing target count ({current_target_count} < {best_precursor_count})" |
(nit) "Stopping .." .. ".. decreased .."
        df_filtered, df, x_cols, y_col, group_columns
    )
else:
    break
would it be worth logging something here?
and log if we reached max_iterations? (could use the for ... else pattern here)
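A sketch of the for ... else idea: the else branch runs only when the loop finished without a break, which is exactly the "reached max_iterations" case. The callable and names here are placeholders, not the PR's code:

```python
import logging

logger = logging.getLogger(__name__)


def train_iteratively(update_step, max_iterations: int) -> None:
    for i in range(max_iterations):
        improved = update_step(i)  # placeholder for the real training iteration
        if not improved:
            logger.info("Stopping training after iteration %d due to decreased target count", i)
            break
    else:
        # Only reached when no `break` occurred, i.e. all iterations ran.
        logger.info("Stopping training after reaching max_iterations=%d", max_iterations)
```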
    return best_result

def preprocess_data(self, df: pd.DataFrame, x_cols: list[str]) -> pd.DataFrame:
please check all methods for being potentially private
@@ -722,6 +766,7 @@ def fit_predict(
    raise ValueError(f"Invalid decoy_strategy: {decoy_strategy}")

    self.is_fitted = True
    # n_precursor = len(psm_df[psm_df["qval"] <= 0.01])
please delete
@@ -776,7 +821,8 @@ def load_classifier_store(self, path: None | str = None):

  if classifier_hash not in self.classifier_store:
      classifier = deepcopy(self.classifier_base)
-     classifier.from_state_dict(torch.load(os.path.join(path, file)))
+     with contextlib.suppress(Exception):
isn't it dangerous to suppress Exceptions here?
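One way to avoid the blanket suppress, reusing the names from the snippet above: catch only the errors an incompatible or corrupted state dict file would realistically raise, and log them so failures stay visible. The chosen exception types are an assumption, not what the PR does:

```python
import logging
import os

import torch

logger = logging.getLogger(__name__)


def load_pretrained(classifier, path: str, file: str):
    """Load a stored state dict into ``classifier`` without hiding failures."""
    # instead of: with contextlib.suppress(Exception): ...
    try:
        classifier.from_state_dict(torch.load(os.path.join(path, file)))
    except (FileNotFoundError, RuntimeError, KeyError) as e:
        logger.warning(f"Could not load pretrained classifier {file}: {e}")
    return classifier
```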
@@ -597,6 +627,10 @@ def __init__(
    self.feature_columns = feature_columns
    self.classifier_store = defaultdict(list)
    self.classifier_base = classifier_base
    self.enable_two_step_classifier = isinstance(
(nit) is_two_step_classifier (strictly speaking, it's already enabled)
x : np.array, dtype=float
    Data of shape (n_samples, n_features).
at other places you are using capital X .. please choose one ;-) (check also what the rest of the code uses, and adapt also x_scaled)
""" | ||
self._fitted = state_dict["_fitted"] | ||
|
||
if self.fitted: |
please check if this should rather be if self._fitted:? if not, add a comment which deconfuses me :-)
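If fitted is a read-only property backed by _fitted, the two spellings are equivalent; a sketch of that pattern, assuming this is how the classifier defines it:

```python
class LogisticRegressionClassifier:
    def __init__(self):
        self._fitted = False

    @property
    def fitted(self) -> bool:
        """Public, read-only view of the private ``_fitted`` flag."""
        return self._fitted

    def from_state_dict(self, state_dict: dict) -> None:
        self._fitted = state_dict["_fitted"]
        if self.fitted:  # same as `if self._fitted:` when `fitted` is a property
            pass  # restore the learned coefficients etc. here
```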
Adding a two-step classifier, consisting of a logistic regression followed by a neural network classifier, inspired by DIA-NN. With this we aim to increase sensitivity, particularly in samples with only a few peptides present, such as single-cell samples.
Steps:
- Add TwoStepClassifier
- Integrate it into FDRManager.fit_predict(), where the classifier is trained
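For readers of the PR, a rough sketch of the two-step idea (a fast logistic regression as a permissive first pass, a neural network retrained on the surviving candidates). The function, cutoffs, compute_qval helper and the sklearn-style predict_proba interface are placeholders, not the exact implementation:

```python
import pandas as pd


def two_step_fit_predict(
    first_classifier,       # e.g. logistic regression
    second_classifier,      # e.g. neural network
    df: pd.DataFrame,
    x_cols: list[str],
    y_col: str,
    first_fdr_cutoff: float,
    second_fdr_cutoff: float,
    compute_qval,           # callable: scored df -> q-value Series
) -> pd.DataFrame:
    # Step 1: the cheap classifier scores all precursors; a generous FDR cutoff
    # keeps true positives from being discarded too early.
    first_classifier.fit(df[x_cols], df[y_col])
    df["proba"] = first_classifier.predict_proba(df[x_cols])[:, 1]
    df_filtered = df[compute_qval(df) <= first_fdr_cutoff].copy()

    # Step 2: the more expensive classifier trains only on the reduced,
    # higher-quality candidate set and produces the final scores.
    second_classifier.fit(df_filtered[x_cols], df_filtered[y_col])
    df_filtered["proba"] = second_classifier.predict_proba(df_filtered[x_cols])[:, 1]
    return df_filtered[compute_qval(df_filtered) <= second_fdr_cutoff].copy()
```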