
Adding Principal Covariates Classification (PCovC) Code #248


Open

rvasav26 wants to merge 12 commits into main from adding-pcovc-new

Conversation

@rvasav26 (Collaborator) commented May 14, 2025

Adds PCovC code, examples, and a testing suite, plus a PCov base class that both PCovR and PCovC inherit from.
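
The intended structure is roughly sketched below; the base class name `_BasePCov` and its constructor body are assumptions for illustration, not lines from the diff.

# Sketch of the described inheritance (illustrative; only PCovR/PCovC are
# names from the PR, the rest is assumed):
from sklearn.base import BaseEstimator


class _BasePCov(BaseEstimator):
    """Shared PCov machinery: mixing parameter, SVD solver, latent projection."""

    def __init__(self, mixing=0.5, n_components=None, svd_solver="auto"):
        self.mixing = mixing
        self.n_components = n_components
        self.svd_solver = svd_solver


class PCovR(_BasePCov):
    """Supervised part is a linear regressor (existing behavior)."""


class PCovC(_BasePCov):
    """Supervised part is a linear classifier (added in this PR)."""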

Contributor (creator of PR) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

For Reviewer

  • CHANGELOG updated if important change?

📚 Documentation preview 📚: https://scikit-matter--248.org.readthedocs.build/en/248/

@rvasav26 marked this pull request as ready for review May 14, 2025 17:39
@rvasav26 requested review from rosecers and ceriottm May 14, 2025 17:40
@rvasav26 requested a review from cajchristian May 14, 2025 18:48
@rosecers force-pushed the adding-pcovc-new branch from e302dbd to c0a16aa May 14, 2025 19:57
@rvasav26 force-pushed the adding-pcovc-new branch from 3995e16 to df8fa2e May 15, 2025 02:34
@rosecers (Collaborator) left a comment


Overall this is looking very good. There are some standardizing tasks I noted and a couple of style/organization questions I'd like discussed before approving.

Comment on lines 239 to 242
W : numpy.ndarray, shape (n_features, n_properties)
Classification weights, optional when classifier=`precomputed`. If
not passed, it is assumed that the weights will be taken from a
linear classifier fit between X and Y
Collaborator

There should be a note here or above about the relationship between X, Y, Z, and W.

Collaborator (Author)

Left a more detailed note in the "Notes" section of the docstring
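
Roughly, the relationship being documented is the following (a sketch inferred from the docstring hunk above and the decision_function suggestion later in this thread, not the PR's exact wording):

# X: feature matrix, Y: class labels, W: classification weights, and
# Z: the linear classifier's confidence scores, so that Z = X @ W
# (intercept omitted; assumed relationship for illustration):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, Y = load_breast_cancer(return_X_y=True)
W = LogisticRegression(max_iter=5000).fit(X, Y).coef_.T  # shape (n_features, n_properties)
Z = X @ W  # confidence scores that PCovC blends with X when building the projection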

classifier = LogisticRegression()
if W is None:
    W = LogisticRegression().fit(X, Y).coef_.T
W = W.reshape(X.shape[1], -1)
Collaborator

Codecov says this is not covered

Collaborator (Author)

Should be covered now - added a test case to test_pcovc.py's test_precomputed_classification.
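
The added test plausibly has the following shape (a sketch only; the import path and the pxt_ attribute are assumed by analogy with PCovR, not copied from the PR):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skmatter.decomposition import PCovC  # assumed import path

X, Y = load_iris(return_X_y=True)
W = LogisticRegression(max_iter=5000).fit(X, Y).coef_.T

explicit = PCovC(mixing=0.5, n_components=2, classifier="precomputed").fit(X, Y, W=W)
implicit = PCovC(mixing=0.5, n_components=2, classifier="precomputed").fit(X, Y)

# With W omitted, fit() should derive the same weights internally.
np.testing.assert_allclose(explicit.pxt_, implicit.pxt_)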

@rosecers requested a review from PicoCentauri May 16, 2025 16:41
@rvasav26 requested a review from rosecers May 19, 2025 18:56
@PicoCentauri (Collaborator) left a comment

Thanks for this nice new feature!

I took a very quick look. Once the example is a clean Python file I can check more carefully.

Collaborator

This has to be a Python example file like the others so it renders properly on the website. Take a look at the examples folder to see how we format these files. I can also help if there are issues.
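
For reference, examples in the repo are plain .py files with sphinx-gallery cell markers; a minimal skeleton (illustrative; the title underline and cell text here are assumptions) looks like:

# coding: utf-8

"""
PCovC with the Breast Cancer Dataset
====================================
"""

# %%
#
# First, load the dataset.

from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
print(load_breast_cancer().DESCR)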

Collaborator (Author)

Should be working now! I think we will go in and polish these up in a future update, but for now we have two brief examples.

Collaborator

Thanks!

Comment on lines +22 to +30
    self,
    mixing=0.5,
    n_components=None,
    svd_solver="auto",
    tol=1e-12,
    space="auto",
    iterated_power="auto",
    random_state=None,
    whiten=False,
Collaborator

Did we decide if we are starting to use type hints or not?

@rvasav26 (Collaborator, Author) May 23, 2025

This code is taken from the original _pcovr.py; @rosecers or @cajchristian may be able to answer this better.

Collaborator

I think we can have a later conversation about type hints, but the general "scikit-learn does not do it" points to no at this time.

self.fit_svd_solver_ = self.svd_solver
if self.fit_svd_solver_ == "auto":
    # Small problem or self.n_components_ == 'mle', just call full PCA
    if max(X.shape) <= 500 or self.n_components_ == "mle":
Collaborator

Why the value of 500? Does it make sense to hardcode this?

Collaborator (Author)

See above comment
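
For context, the 500 threshold mirrors scikit-learn's PCA "auto" solver heuristic, which the original _pcovr.py adapts; schematically (a hypothetical helper written from scikit-learn's documented behavior, not code from this PR):

def choose_svd_solver(n_samples, n_features, n_components):
    """Hypothetical helper mirroring scikit-learn's PCA 'auto' heuristic."""
    if max(n_samples, n_features) <= 500 or n_components == "mle":
        return "full"        # small problem: an exact full SVD is cheap
    if 1 <= n_components < 0.8 * min(n_samples, n_features):
        return "randomized"  # few components relative to rank: randomized SVD pays off
    return "full"            # otherwise fall back to the exact solver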

Comment on lines +123 to +124
S_sqrt = np.diagflat([np.sqrt(s) if s > self.tol else 0.0 for s in S])
S_sqrt_inv = np.diagflat([1.0 / np.sqrt(s) if s > self.tol else 0.0 for s in S])
Collaborator

Maybe move the list comprehension as extra variable above?
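
For instance, something along these lines (an illustrative refactor, not a committed change):

# Name the thresholded square roots once, then build both diagonals from it:
s_sqrt = np.array([np.sqrt(s) if s > self.tol else 0.0 for s in S])
S_sqrt = np.diagflat(s_sqrt)
S_sqrt_inv = np.diagflat(
    np.divide(1.0, s_sqrt, out=np.zeros_like(s_sqrt), where=s_sqrt > 0.0)
)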

Collaborator (Author)

See above comment

@PicoCentauri (Collaborator) left a comment

Without checking the math carefully, the overall design looks good.

Can you update the CHANGELOG file in the root of the repo with the addition you made? Thanks!

@@ -1,5 +1,5 @@
Principal Covariates Regression (PCovR)
=======================================
Principal Covariates Regression (PCovR) and Classification (PCovC)
Collaborator

Maybe find a shorter name. This is making the sidebar in the docs super large.

[Screenshot, May 27 2025: docs sidebar showing the overly long section title]

Collaborator

How about:

Suggested change
Principal Covariates Regression (PCovR) and Classification (PCovC)
Hybrid Mapping Techniques (PCovR and PCovC)

Collaborator (Author)

Done!

@rvasav26 force-pushed the adding-pcovc-new branch from ddee0d5 to 12a2c39 May 27, 2025 23:45
@PicoCentauri (Collaborator) left a comment

Thanks, LGTM. If @rosecers is also happy, we can merge.

@@ -45,3 +45,9 @@ References
Michele Ceriotti, "Improving Sample and Feature Selection with Principal Covariates
Regression" 2021 Mach. Learn.: Sci. Technol. 2 035038.
https://iopscience.iop.org/article/10.1088/2632-2153/abfe7c.

.. [Jorgensen2025]
Christian Jorgensen, Arthur Y. Lin, and Rose K. Cersonsky,
@cajchristian (Collaborator) May 28, 2025

Missing an author!!!

Suggested change
Christian Jorgensen, Arthur Y. Lin, and Rose K. Cersonsky,
Christian Jorgensen, Arthur Y. Lin, Rhushil Vasavada, and Rose K. Cersonsky,

Comment on lines +33 to +43
print(load_breast_cancer().DESCR)

# %%
#
# Scale Feature Data
# ------------------
#
# Below, we transform the Breast Cancer feature data to have a mean of zero
# and standard deviation of one, while preserving relative relationships
# between feature values.

Collaborator

Can probably delete these lines

Suggested change
print(load_breast_cancer().DESCR)
# %%
#
# Scale Feature Data
# ------------------
#
# Below, we transform the Breast Cancer feature data to have a mean of zero
# and standard deviation of one, while preserving relative relationships
# between feature values.

# coding: utf-8

"""
PCovC with the Breast Cancer Dataset
Collaborator

Suggested change
PCovC with the Breast Cancer Dataset
Comparing PCovC with PCA and LDA

Comment on lines +52 to +54
# We use Principal Component Analysis to reduce the Breast Cancer feature
# data to two features that retain as much information as possible
# about the original dataset.
Collaborator

Can delete this too

Suggested change
# We use Principal Component Analysis to reduce the Breast Cancer feature
# data to two features that retain as much information as possible
# about the original dataset.

Comment on lines +76 to +78
# Here, we use Linear Discriminant Analysis to find a projection
# of the feature data that maximizes separability between
# the benign/malignant classes.
@cajchristian (Collaborator) May 28, 2025

Same

Suggested change
# Here, we use Linear Discriminant Analysis to find a projection
# of the feature data that maximizes separability between
# the benign/malignant classes.


fig, axis = plt.subplots()
axis.scatter(-T_lda[:], np.zeros(len(T_lda[:])), c=y)

Collaborator

Similar to the sections above, let's add a section where we perform solely the PCovC projection of the data. Then we can get rid of the lower for loop.
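
Something like the following would do it (a sketch; X_scaled and the mixing value are assumptions):

# %%
#
# PCovC
# -----
#
# Finally, we project the data with PCovC alone.

pcovc = PCovC(mixing=0.5, n_components=2)
pcovc.fit(X_scaled, y)
T_pcovc = pcovc.transform(X_scaled)

fig, axis = plt.subplots()
axis.scatter(T_pcovc[:, 0], T_pcovc[:, 1], c=y)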

# coding: utf-8

"""
PCovC with the Iris Dataset
Collaborator

Suggested change
PCovC with the Iris Dataset
PCovC Hyperparameter Tuning

Comment on lines +35 to +45
print(load_iris().DESCR)

# %%
#
# Scale Feature Data
# ------------------
#
# Below, we transform the Iris feature data to have a mean of zero and
# standard deviation of one, while preserving relative relationships
# between feature values.

Collaborator

Can delete

Suggested change
print(load_iris().DESCR)
# %%
#
# Scale Feature Data
# ------------------
#
# Below, we transform the Iris feature data to have a mean of zero and
# standard deviation of one, while preserving relative relationships
# between feature values.

Comment on lines +54 to +56
# We use Principal Component Analysis to reduce the Iris feature
# data to two features that retain as much information as possible
# about the original dataset.
Collaborator

delete

Suggested change
# We use Principal Component Analysis to reduce the Iris feature
# data to two features that retain as much information as possible
# about the original dataset.

Collaborator

Need to also include a description of PCovC here - this shows up in the "getting started" page of the docs


__all__ = ["pcovr_covariance", "pcovr_kernel", "PCovR", "KernelPCovR"]
Collaborator

Is this functionality correct? Are pcovr_covariance and pcovr_kernel still callable?
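
If PCovC is meant to be importable from the same module, the export list presumably needs extending, for example (assumed fix, not a line from the diff):

__all__ = ["pcovr_covariance", "pcovr_kernel", "PCovR", "PCovC", "KernelPCovR"]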

Comment on lines +103 to +110
classifier: {`LogisticRegression`, `LogisticRegressionCV`, `LinearSVC`, `LinearDiscriminantAnalysis`,
`RidgeClassifier`, `RidgeClassifierCV`, `SGDClassifier`, `Perceptron`, `precomputed`}, default=None
classifier for computing :math:`{\mathbf{Z}}`. The classifier should be one of
`sklearn.linear_model.LogisticRegression`, `sklearn.linear_model.LogisticRegressionCV`,
`sklearn.svm.LinearSVC`, `sklearn.discriminant_analysis.LinearDiscriminantAnalysis`,
`sklearn.linear_model.RidgeClassifier`, `sklearn.linear_model.RidgeClassifierCV`,
`sklearn.linear_model.SGDClassifier`, or `Perceptron`. If a pre-fitted classifier
is provided, it is used to compute :math:`{\mathbf{Z}}`.
Collaborator

This is not showing up correctly in the docs:
[Screenshot: rendered docs page with the classifier parameter entry garbled]

Comment on lines +384 to +385
def decision_function(self, X=None, T=None):
"""Predicts confidence scores from X or T."""
Collaborator

Should touch this up further. Here's a start

Suggested change
def decision_function(self, X=None, T=None):
    """Predicts confidence scores from X or T."""
r"""Predicts confidence scores from X or T.

.. math::
    \mathbf{Z} = \mathbf{T} \mathbf{P}_{TZ}
               = \mathbf{X} \mathbf{P}_{XT} \mathbf{P}_{TZ}

Parameters
----------
X : ndarray, shape (n_samples, n_features)
    Original data for which we want to get confidence scores, where
    n_samples is the number of samples and n_features is the number
    of features.
T : ndarray, shape (n_samples, n_components)
    Projected data for which we want to get confidence scores, where
    n_samples is the number of samples and n_components is the number
    of components.

Returns
-------
Z
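
For reviewers trying the API, the two call signatures should be equivalent after fitting (hypothetical usage; parameter values assumed):

pcovc = PCovC(mixing=0.5, n_components=2).fit(X, Y)
Z_from_X = pcovc.decision_function(X)                      # Z = X P_XT P_TZ
Z_from_T = pcovc.decision_function(T=pcovc.transform(X))   # Z = T P_TZ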

@cajchristian (Collaborator) left a comment

We're almost there. I think the docs still need some work, but most of my changes are nitpicky. Also, I'm not seeing the examples show up in the section navigation dropdown.

@cajchristian (Collaborator)

Should we also change the PCovR section of the landing page to something hybrid learning-esque?
