Torch support part 1 #249

Open: wants to merge 270 commits into base: master

Conversation

sergeyf
Collaborator

@sergeyf sergeyf commented Jan 12, 2025

OK, I think I've made enough progress to warrant a review. You can get a sense of what's going on from the test files. My goal with this PR was to take the example from the readme

mhcflurry-predict --alleles HLA-A0201 HLA-A0301 --peptides SIINFEKL SIINFEKD SIINFEKQ --out predictions.csv

and to be able to run it with torch and get the same output. The test for this is in `test_predict_command.py`.

I think weights.csv has to be downloaded already for the test to run, so it will probably fail in CI. For now, please pull the branch yourself and double-check locally. We can edit the tests later so they run in CI.
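One way to keep a test like this from failing in CI is to skip it when the weights file is absent. Below is a minimal sketch using the standard library's `unittest`. The `MHCFLURRY_TEST_WEIGHTS` environment variable, the default path, and the test class are hypothetical names for illustration, not part of this PR:

```python
import os
import unittest

# Hypothetical location: the real path depends on where the models were downloaded.
WEIGHTS_PATH = os.environ.get("MHCFLURRY_TEST_WEIGHTS", "weights.csv")


class PredictCommandParityTest(unittest.TestCase):
    @unittest.skipUnless(os.path.exists(WEIGHTS_PATH), "weights.csv not downloaded")
    def test_torch_matches_reference(self):
        # Placeholder body: a real test would run the predict command with
        # both backends and compare the resulting CSV files.
        self.assertTrue(os.path.exists(WEIGHTS_PATH))
```

With this shape the test is reported as skipped rather than failed on machines without the weights, so a missing download does not mask real regressions.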

Why 200 commits? Because I did this almost entirely with aider. It was an experiment, and I learned a lot about how incredibly naive AI code generation is if you just let it tell its own stories without constant questioning, re-questioning, and demands for more tests, debugging, logging, and analysis.

1. In the model initialization method:
   - Wrapped the linear layer in a Sequential model with a Sigmoid activation
   - Updated weight and bias assignment to use the first layer of the Sequential model

2. In the prediction method:
   - Compute probabilities for both classes (0 and 1)
   - Concatenate the probabilities to match sklearn's predict_proba format
   - Use the probability of the positive class (index 1) as the presentation score

These changes should resolve the `AttributeError: 'Linear' object has no attribute 'predict_proba'` by making the PyTorch model behave more like a scikit-learn classifier.
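The two steps described above can be sketched in isolation. This is a minimal illustration rather than the code in this PR: `make_logistic_model`, the free-standing `predict_proba` function, and the example weights are all made up for the sketch.

```python
import torch
from torch import nn


def make_logistic_model(weights, intercept):
    """Build Linear -> Sigmoid and load fixed logistic-regression parameters."""
    model = nn.Sequential(nn.Linear(len(weights), 1), nn.Sigmoid())
    with torch.no_grad():
        # model[0] is the Linear layer; its weight has shape (1, n_features).
        model[0].weight.copy_(torch.tensor([weights], dtype=torch.float32))
        model[0].bias.copy_(torch.tensor([intercept], dtype=torch.float32))
    return model.eval()


def predict_proba(model, inputs):
    """Return an (n_samples, 2) array: column 0 is P(class 0), column 1 is P(class 1)."""
    with torch.no_grad():
        positive = model(inputs)  # shape (n_samples, 1)
    return torch.cat([1 - positive, positive], dim=1).cpu().numpy()


# Illustrative values only.
model = make_logistic_model(weights=[0.5, -1.0], intercept=0.25)
inputs = torch.tensor([[1.0, 2.0], [0.0, 0.0]])
proba = predict_proba(model, inputs)
presentation_score = proba[:, 1]  # probability of the positive class
```

Each row of `proba` sums to 1, and column 1 is what the message above calls the presentation score.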

Would you like me to explain any part of the changes in more detail?
…patible with the existing code. Let's modify the `torch_presentation_predictor.py` to add a `predict_proba` method to the Sequential model:

```python
def predict(self, *args, **kwargs):
    df = super().predict(*args, **kwargs)
    return df

def predict_proba(self, inputs):
    with torch.no_grad():
        self.eval()
        probs = self(inputs)
        return torch.cat([1 - probs, probs], dim=1).cpu().numpy()
```

This approach adds a `predict_proba` method to the Sequential model that returns probabilities for both classes, similar to scikit-learn's implementation.

Modify the `torch_presentation_predictor.py` file to:

```python
class TorchPresentationPredictor(Class1PresentationPredictor):
    def predict(self, *args, **kwargs):
        df = super().predict(*args, **kwargs)
        return df

    def _get_model(self, model_name):
        model = super()._get_model(model_name)

        # Add predict_proba method to the model
        def predict_proba(inputs):
            with torch.no_grad():
                model.eval()
                probs = model(inputs)
                return torch.cat([1 - probs, probs], dim=1).cpu().numpy()

        model.predict_proba = predict_proba
        return model
```

The key changes are:
1. Override the `predict` method to call the parent class's method
2. Add a `predict_proba` method to the model in `_get_model`
3. The `predict_proba` method returns probabilities for both classes

This should resolve the `AttributeError: 'Sequential' object has no attribute 'predict_proba'` error by dynamically adding the method to the model.

Try running the test again:
```bash
nosetests test/test_torch_implementations.py:test_presentation_predictor_matches_keras -v
```

If this doesn't work, could you share the full traceback and the contents of the `test_torch_implementations.py` file? That would help diagnose the specific issue more precisely.
…we need to modify the PyTorch presentation predictor to match the Keras implementation's prediction method.

Here's a comprehensive fix for the `torch_presentation_predictor.py`:

```python
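import numpy as np
import torch
from torch import nn

# Assumed module path for the parent class within the mhcflurry package.
from mhcflurry.class1_presentation_predictor import Class1PresentationPredictor


class TorchPresentationPredictor(Class1PresentationPredictor):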
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._torch_models = {}

    def get_model(self, name):
        if name not in self._torch_models:
            # Create a model without sigmoid in the final layer
            model = nn.Sequential(
                nn.Linear(len(self.model_inputs), 1)
            )
            row = self.weights_dataframe.loc[name]

            # Load weights
            state_dict = {
                '0.weight': torch.FloatTensor(row['weights'].reshape(1, -1)),
                '0.bias': torch.FloatTensor([row['intercept']])
            }
            model.load_state_dict(state_dict)

            # Move to appropriate device
            model = model.to(self.device)

            self._torch_models[name] = model

        return self._torch_models[name]

    def predict(self, df, model_name=None, *args, **kwargs):
        # Existing method implementation
        if model_name is None:
            model_name = self.default_model

        model = self.get_model(model_name)

        input_matrix = df[self.model_inputs]
        null_mask = None
        if not kwargs.get("throw", True):
            null_mask = input_matrix.isnull().any(axis=1)
            input_matrix = input_matrix.fillna(0.0)

        # Convert to PyTorch tensor
        inputs = torch.FloatTensor(input_matrix.values).to(self.device)

        # Get predictions
        with torch.no_grad():
            model.eval()
            logits = model(inputs)  # Get raw logits
            probs = torch.sigmoid(logits)  # Apply sigmoid manually
            df["presentation_score"] = probs.squeeze().cpu().numpy()

        if null_mask is not None:
            df.loc[null_mask, "presentation_score"] = np.nan

        return df
```

Key changes:
1. Removed `nn.Sigmoid()` from the model definition
2. Use `model(inputs)` to get raw logits
3. Apply `torch.sigmoid()` manually to convert logits to probabilities
4. Simplified model loading and weight assignment
5. Ensured compatibility with existing prediction method

This implementation should now match the Keras version's prediction method by:
- Using raw linear layer output
- Applying sigmoid manually
- Handling null values consistently

Try running the test again:
```bash
nosetests test/test_torch_implementations.py:test_presentation_predictor_matches_keras -v
```

The changes ensure that:
- The model architecture matches the Keras implementation
- Probability calculation is consistent
- Null value handling remains the same

Let me know if this resolves the issue or if you need further adjustments.
…The key issue is that we need to handle the predict_proba call correctly.

Here's an updated implementation for the `torch_presentation_predictor.py`:

```python
def get_model(self, name=None):
    """
    Load or instantiate a new logistic regression model in PyTorch.

    Parameters
    ----------
    name : string
        Model variant name ('with_flanks' or 'without_flanks')

    Returns
    -------
    torch.nn.Module
    """
    if name is None:
        return nn.Linear(len(self.model_inputs), 1)

    if name not in self._torch_models:
        model = nn.Linear(len(self.model_inputs), 1)
        row = self.weights_dataframe.loc[name]

        # Convert weights and bias to PyTorch tensors
        weights = torch.FloatTensor(row[self.model_inputs].values)
        bias = torch.FloatTensor([row.intercept])

        # Assign the weights
        with torch.no_grad():
            model.weight.copy_(weights.unsqueeze(0))
            model.bias.copy_(bias)

        model = model.to(self.device)
        self._torch_models[name] = model

    return self._torch_models[name]

def predict(self, *args, **kwargs):
    """
    Override predict to use PyTorch models for the final presentation score calculation
    """
    df = super().predict(*args, **kwargs)

    if "processing_score" in df.columns and "affinity_score" in df.columns:
        if len(df) > 0:
            model_name = 'with_flanks' if 'n_flank' in df.columns else "without_flanks"
            model = self.get_model(model_name)

            input_matrix = df[self.model_inputs]
            null_mask = None
            if not kwargs.get("throw", True):
                null_mask = input_matrix.isnull().any(axis=1)
                input_matrix = input_matrix.fillna(0.0)

            # Convert to PyTorch tensor
            inputs = torch.FloatTensor(input_matrix.values).to(self.device)

            # Get predictions
            with torch.no_grad():
                model.eval()
                logits = model(inputs)
                probs = torch.sigmoid(logits)
                df["presentation_score"] = probs.squeeze().cpu().numpy()

            if null_mask is not None:
                df.loc[null_mask, "presentation_score"] = numpy.nan

            df["presentation_percentile"] = self.percentile_ranks(
                df["presentation_score"], throw=kwargs.get("throw", True))

        else:
            df["presentation_score"] = []
            df["presentation_percentile"] = []

        del df["affinity_score"]

    return df

def predict_proba(self, X):
    """
    Predict probabilities using the PyTorch model.

    Parameters
    ----------
    X : pandas.DataFrame
        Input features

    Returns
    -------
    numpy.ndarray
        Predicted probabilities
    """
    model_name = 'with_flanks' if 'n_flank' in X.columns else "without_flanks"
    model = self.get_model(model_name)

    # Ensure only model inputs are used
    X = X[self.model_inputs]

    # Convert to PyTorch tensor
    inputs = torch.FloatTensor(X.values).to(self.device)

    # Get predictions
    with torch.no_grad():
        model.eval()
        logits = model(inputs)
        probs = torch.sigmoid(logits)
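        # squeeze(-1) drops only the output dimension, so a single-row input
        # still yields a 1-D array of shape (1,) rather than a 0-d scalar.
        return probs.squeeze(-1).cpu().numpy()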
```

The key changes are:
1. Added a `predict_proba` method that directly returns probabilities
2. Simplified the model loading and prediction logic
3. Ensured that only model inputs are used in predictions
4. Used sigmoid activation to convert logits to probabilities

This implementation should resolve the `predict_proba` issue and make the PyTorch implementation consistent with the Keras version.

Try running the test again:
```bash
nosetests test/test_torch_implementations.py:test_presentation_predictor_matches_keras -v
```

Would you like me to explain any part of the implementation?
…ba` or `decision_function` method like scikit-learn models do. For PyTorch models, you'll need to modify the prediction logic to use the model's forward pass and apply a sigmoid activation.

Here's a potential fix in `mhcflurry/torch_presentation_predictor.py`:

```python
import torch
import numpy as np

class TorchPresentationPredictor(Class1PresentationPredictor):
    def predict(self, *args, **kwargs):
        # Ensure the model is in evaluation mode
        self.model.eval()

        # Call the parent class's predict method
        df = super().predict(*args, **kwargs)
        return df

    def _predict_scores(self, input_matrix):
        # Convert input to PyTorch tensor
        X = torch.tensor(input_matrix.values, dtype=torch.float32)

        # Disable gradient computation
        with torch.no_grad():
            # Get model output (logits)
            logits = self.model(X)

            # Apply sigmoid to get probabilities
            scores = torch.sigmoid(logits).cpu().numpy()

        return scores
```

This modification does a few key things:
1. Sets the model to evaluation mode
2. Converts input to a PyTorch tensor
3. Disables gradient computation
4. Applies sigmoid activation to get probabilities
5. Converts back to numpy array for further processing

The key change is using `torch.sigmoid()` to convert the model's raw output to a probability between 0 and 1, which is equivalent to the `predict_proba` method in scikit-learn models.

If this doesn't fully resolve the issue, could you share the specific implementation of the PyTorch model's forward method?
@sergeyf sergeyf marked this pull request as draft January 12, 2025 05:54
refactor: Conditionally compute presentation score for sklearn models
…test is expecting a column named "presentation_score", but it's not being created. Let's modify the `predict` method in the `TorchPresentationPredictor` class to ensure this column is added.

Here's an updated implementation:

```python
def predict(self, *args, **kwargs):
    """
    Override predict to use PyTorch models for the final presentation score calculation
    """
    df = super().predict(*args, **kwargs)

    if "processing_score" in df.columns and "affinity_score" in df.columns:
        if len(df) > 0:
            model_name = 'with_flanks' if 'n_flank' in df.columns else "without_flanks"
            model = self.get_model(model_name)

            input_matrix = df[self.model_inputs]
            null_mask = None
            if not kwargs.get("throw", True):
                null_mask = input_matrix.isnull().any(axis=1)
                input_matrix = input_matrix.fillna(0.0)

            # Convert to PyTorch tensor
            inputs = torch.FloatTensor(input_matrix.values).to(self.device)

            # Get predictions
            with torch.no_grad():
                model.eval()
                logits = model(inputs)
                presentation_scores = torch.sigmoid(logits).squeeze().cpu().numpy()

                # Ensure the column is added even if it's empty
                df["presentation_score"] = presentation_scores

            if null_mask is not None:
                df.loc[null_mask, "presentation_score"] = numpy.nan

            df["presentation_percentile"] = self.percentile_ranks(
                df["presentation_score"], throw=kwargs.get("throw", True))
        else:
            # Ensure the column is added even for empty dataframes
            df["presentation_score"] = []
            df["presentation_percentile"] = []

        del df["affinity_score"]

    return df
```

Key changes:
1. Explicitly create the "presentation_score" column
2. Ensure the column is added even for empty dataframes
3. Simplified the prediction logic
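The null handling in this method follows a fill-then-mask pattern: rows with missing inputs are filled with a placeholder so the arithmetic runs, and their scores are set back to NaN afterwards. A minimal sketch of that pattern on its own, with a NumPy logistic function standing in for the torch model. The function name, weights, and intercept are made up for illustration:

```python
import numpy as np
import pandas as pd


def score_with_null_mask(df, columns, weights, intercept, throw=True):
    """Score rows with a logistic function, returning NaN for incomplete rows."""
    input_matrix = df[columns]
    null_mask = None
    if not throw:
        null_mask = input_matrix.isnull().any(axis=1)
        # Placeholder value so the arithmetic runs; these rows are masked below.
        input_matrix = input_matrix.fillna(0.0)

    logits = input_matrix.to_numpy(dtype=float) @ np.asarray(weights, dtype=float) + intercept
    scores = 1.0 / (1.0 + np.exp(-logits))

    result = df.copy()  # avoid mutating the caller's frame
    result["presentation_score"] = scores
    if null_mask is not None:
        result.loc[null_mask, "presentation_score"] = np.nan
    return result


df = pd.DataFrame({
    "affinity_score": [0.9, np.nan, 0.2],
    "processing_score": [0.5, 0.4, 0.1],
})
scored = score_with_null_mask(
    df, ["affinity_score", "processing_score"],
    weights=[2.0, 1.0], intercept=-1.0, throw=False)
```

Without the final masking step, the second row would receive a score computed from the placeholder 0.0, which would look like a valid prediction.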

Please replace the existing `predict` method in the `TorchPresentationPredictor` class with this implementation and run the test again.

Would you like me to help you apply this change?
@sergeyf
Collaborator Author

sergeyf commented Jan 20, 2025

@timodonnell There are now special torch-only tests in the GitHub Actions CI that run after torch is installed.

My plan for this PR is to complete the stub tests that are trivially passing now and then see if we should merge, before moving on to make the rest of the code work with torch.

The remaining stub tests:

def test_allele_sequence_handling():
    """Test loading and using allele sequences"""
    pass


def test_ensemble_predictions():
    """Test predictions with multiple models for same allele"""
    pass


def test_pan_allele_predictions():
    """Test pan-allele model predictions"""
    pass


def test_percentile_ranks():
    """Test percentile rank calculations"""
    pass


def test_mixed_model_predictions():
    """Test predictions using both allele-specific and pan-allele models"""
    pass


def test_full_predictor():
    """Test complete predictor functionality"""
    pass

In particular, the TorchClass1AffinityPredictor in torch_implementations.py is missing many of the methods and internal logic that make the Keras Class1AffinityPredictor in class1_affinity_predictor.py fully featured. Specifically (in the words of o1):

• The Torch version doesn’t do ensemble predictions across multiple models for each allele (it only grabs the first model for that allele). By contrast, the Keras version can handle multiple allele-specific and pan-allele models and combine their predictions (usually by geometric mean).
• The Torch version doesn’t have predict_to_dataframe(), calibrate_percentile_ranks(), or percentile rank calibration/lookup tables. It currently has a placeholder percentile_ranks() method that returns 50.0.
• The Torch version doesn’t provide methods like fit_allele_specific_predictors() or fit_class1_pan_allele_models() for training new models.
• It doesn’t implement the “save” functionality (creating manifest.csv, writing model weights, writing “info.txt,” etc.).
• It lacks the clear_cache(), check_consistency(), merge() / merge_in_place(), and overall “manifest_df” logic.
• For loading Keras weights, it only partially does so by reading each TorchNeuralNetwork’s weights from .npz files. It doesn’t support a more direct “Keras → Torch” approach for ensembles or multiple allele/pan-allele models.
• It also can’t handle percentile-rank transformations or advanced metadata (like “metadata_dataframes,” “provenance_string”) the way the Keras predictor does.
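For the first bullet above, combining an ensemble by geometric mean amounts to averaging in log space. A minimal sketch, assuming each model returns strictly positive predictions for the same peptides. The function name and the numbers are illustrative and not taken from either predictor:

```python
import numpy as np


def ensemble_geometric_mean(predictions):
    """Combine per-model predictions (rows: models, columns: peptides)."""
    predictions = np.asarray(predictions, dtype=float)
    # Averaging the logs and exponentiating gives the geometric mean.
    return np.exp(np.log(predictions).mean(axis=0))


per_model = [
    [100.0, 5000.0],   # model 1
    [400.0, 20000.0],  # model 2
]
combined = ensemble_geometric_mean(per_model)
```

Using only the first model for an allele, as the Torch version currently does, would return the first row unchanged instead of this combined value.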

After all that, I'll ask for a final review.
