Adding Principal Covariates Classification (PCovC) Code #248
Conversation
New PR for exclusively adding PCovC code
Force-pushed from e302dbd to c0a16aa
Force-pushed from 3995e16 to df8fa2e
Overall this is looking very good. There are some standardization tasks I noted and a couple of style/organization questions I'd like to discuss before approving.
src/skmatter/decomposition/_pcovc.py (outdated)
W : numpy.ndarray, shape (n_features, n_properties)
    Classification weights, optional when classifier=`precomputed`. If
    not passed, it is assumed that the weights will be taken from a
    linear classifier fit between X and Y
There should be a note here or above about the relationship between X, Y, Z, and W.
Left a more detailed note in the "Notes" section of the docstring
classifier = LogisticRegression()
if W is None:
    W = LogisticRegression().fit(X, Y).coef_.T
    W = W.reshape(X.shape[1], -1)
Codecov says this is not covered
Should be covered now - added a test case to test_pcovc.py's test_precomputed_classification
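For context, the fallback being tested can be sketched in isolation. This is a minimal sketch, not the actual test case; the toy data below is illustrative, with `X` and `Y` standing in for the names used in the snippet above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy binary classification data standing in for the X, Y above
X, Y = make_classification(n_samples=60, n_features=4, random_state=0)

# When no precomputed W is passed, fall back to the coefficients of a
# linear classifier fit between X and Y, reshaped to (n_features, -1)
W = LogisticRegression().fit(X, Y).coef_.T
W = W.reshape(X.shape[1], -1)
print(W.shape)  # (4, 1) for binary classification
```

The reshape is what the coverage report flagged: for binary problems `coef_` has shape `(1, n_features)`, so the transpose-and-reshape yields a `(n_features, 1)` weight matrix.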
Thanks for this nice new feature!
I took a very quick look. Once the example is a clean Python file I can check more carefully.
This has to be a Python example file like the others so it can be rendered properly on the website. Take a look at the examples folder to see how we format these files. I can also help if there are issues.
Should be working now! I think we will go in and polish these up in a future update, but for now we have two brief examples.
Thanks!
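For reference, the sphinx-gallery layout used in the examples folder looks roughly like the sketch below; the title and section text here are placeholders, not the final example content:

```python
"""
PCovC with the Breast Cancer Dataset
====================================

A one-line summary that sphinx-gallery renders on the website.
"""
# %%
# Load the data
# -------------
# Each ``# %%`` marker starts a new rendered cell.
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
print(X.shape)  # (569, 30)
```

The top-level docstring becomes the page title and summary, and each `# %%` comment block is rendered as prose between executed code cells.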
self,
mixing=0.5,
n_components=None,
svd_solver="auto",
tol=1e-12,
space="auto",
iterated_power="auto",
random_state=None,
whiten=False,
Did we decide if we are starting to use type hints or not?
This code is taken from the original _pcovr.py; @rosecers or @cajchristian may be able to better answer this.
I think we can have a later conversation about type hints, but the general "scikit-learn does not do it" argument points to no at this time.
self.fit_svd_solver_ = self.svd_solver
if self.fit_svd_solver_ == "auto":
    # Small problem or self.n_components_ == 'mle', just call full PCA
    if max(X.shape) <= 500 or self.n_components_ == "mle":
Why the value of 500? Does it make sense to hardcode this?
See above comment
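To make the question concrete, the heuristic under discussion (inherited from scikit-learn's PCA `svd_solver="auto"` logic) can be sketched as a standalone function. This is a simplified sketch for illustration, not the exact production logic; the 500 threshold mirrors the hardcoded value above:

```python
def choose_svd_solver(shape, n_components):
    """Simplified sketch of the 'auto' solver selection discussed above."""
    # Small problem, or n_components == 'mle': a full SVD is cheap enough
    if max(shape) <= 500 or n_components == "mle":
        return "full"
    # Otherwise, when only a small fraction of components is requested,
    # a randomized (truncated) solver is usually much cheaper
    if isinstance(n_components, int) and 1 <= n_components < 0.8 * min(shape):
        return "randomized"
    return "full"

print(choose_svd_solver((400, 30), 2))     # small problem -> "full"
print(choose_svd_solver((10000, 600), 2))  # large, few components -> "randomized"
```

Hardcoding 500 follows scikit-learn's convention, which is presumably why it was kept when the code was copied from _pcovr.py.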
S_sqrt = np.diagflat([np.sqrt(s) if s > self.tol else 0.0 for s in S])
S_sqrt_inv = np.diagflat([1.0 / np.sqrt(s) if s > self.tol else 0.0 for s in S])
Maybe move the list comprehensions into separate variables above?
See above comment
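The suggested refactor might look like the sketch below; `S` and `tol` are stand-ins for the singular values and `self.tol` in the snippet above:

```python
import numpy as np

S = np.array([4.0, 1e-15, 9.0])  # example singular values
tol = 1e-12

# Named intermediates instead of inline list comprehensions:
# singular values below tol are treated as exactly zero
sqrt_vals = [np.sqrt(s) if s > tol else 0.0 for s in S]
inv_sqrt_vals = [1.0 / np.sqrt(s) if s > tol else 0.0 for s in S]

S_sqrt = np.diagflat(sqrt_vals)
S_sqrt_inv = np.diagflat(inv_sqrt_vals)
```

Zeroing the near-singular entries in the inverse avoids dividing by numerically negligible singular values.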
Without checking the math carefully, the overall design looks good.
Can you update the CHANGELOG file in the root of the repo with the addition you made? Thanks.
@@ -1,5 +1,5 @@
Principal Covariates Regression (PCovR)
=======================================
Principal Covariates Regression (PCovR) and Classification (PCovC)
How about:
- Principal Covariates Regression (PCovR) and Classification (PCovC)
+ Hybrid Mapping Techniques (PCovR and PCovC)
Done!
Force-pushed from ddee0d5 to 12a2c39
Thanks, LGTM. If @rosecers is also happy, we can merge.
@@ -45,3 +45,9 @@ References
Michele Ceriotti, "Improving Sample and Feature Selection with Principal Covariates
Regression" 2021 Mach. Learn.: Sci. Technol. 2 035038.
https://iopscience.iop.org/article/10.1088/2632-2153/abfe7c.

.. [Jorgensen2025]
Christian Jorgensen, Arthur Y. Lin, and Rose K. Cersonsky,
Missing an author!!!
- Christian Jorgensen, Arthur Y. Lin, and Rose K. Cersonsky,
+ Christian Jorgensen, Arthur Y. Lin, Rhushil Vasavada, and Rose K. Cersonsky,
print(load_breast_cancer().DESCR)

# %%
#
# Scale Feature Data
# ------------------
#
# Below, we transform the Breast Cancer feature data to have a mean of zero
# and standard deviation of one, while preserving relative relationships
# between feature values.
Can probably delete these lines:
print(load_breast_cancer().DESCR)
# %%
#
# Scale Feature Data
# ------------------
#
# Below, we transform the Breast Cancer feature data to have a mean of zero
# and standard deviation of one, while preserving relative relationships
# between feature values.
# coding: utf-8

"""
PCovC with the Breast Cancer Dataset
- PCovC with the Breast Cancer Dataset
+ Comparing PCovC with PCA and LDA
# We use Principal Component Analysis to reduce the Breast Cancer feature
# data to two features that retain as much information as possible
# about the original dataset.
Can delete this too:
# We use Principal Component Analysis to reduce the Breast Cancer feature
# data to two features that retain as much information as possible
# about the original dataset.
# Here, we use Linear Discriminant Analysis to find a projection
# of the feature data that maximizes separability between
# the benign/malignant classes.
Same:
# Here, we use Linear Discriminant Analysis to find a projection
# of the feature data that maximizes separability between
# the benign/malignant classes.
fig, axis = plt.subplots()
axis.scatter(-T_lda[:], np.zeros(len(T_lda[:])), c=y)
Similar to the sections above, let's add a section where we perform only the PCovC projection of the data. Then we can get rid of the lower for loop.
# coding: utf-8

"""
PCovC with the Iris Dataset
- PCovC with the Iris Dataset
+ PCovC Hyperparameter Tuning
print(load_iris().DESCR)

# %%
#
# Scale Feature Data
# ------------------
#
# Below, we transform the Iris feature data to have a mean of zero and
# standard deviation of one, while preserving relative relationships
# between feature values.
Can delete:
print(load_iris().DESCR)
# %%
#
# Scale Feature Data
# ------------------
#
# Below, we transform the Iris feature data to have a mean of zero and
# standard deviation of one, while preserving relative relationships
# between feature values.
# We use Principal Component Analysis to reduce the Iris feature
# data to two features that retain as much information as possible
# about the original dataset.
Delete:
# We use Principal Component Analysis to reduce the Iris feature
# data to two features that retain as much information as possible
# about the original dataset.
Need to also include a description of PCovC here - this shows up in the "getting started" page of the docs
__all__ = ["pcovr_covariance", "pcovr_kernel", "PCovR", "KernelPCovR"]
Is this functionality correct? Are pcovr_covariance and pcovr_kernel still callable?
classifier: {`LogisticRegression`, `LogisticRegressionCV`, `LinearSVC`, `LinearDiscriminantAnalysis`,
    `RidgeClassifier`, `RidgeClassifierCV`, `SGDClassifier`, `Perceptron`, `precomputed`}, default=None
    classifier for computing :math:`{\mathbf{Z}}`. The classifier should be one of
    `sklearn.linear_model.LogisticRegression`, `sklearn.linear_model.LogisticRegressionCV`,
    `sklearn.svm.LinearSVC`, `sklearn.discriminant_analysis.LinearDiscriminantAnalysis`,
    `sklearn.linear_model.RidgeClassifier`, `sklearn.linear_model.RidgeClassifierCV`,
    `sklearn.linear_model.SGDClassifier`, or `Perceptron`. If a pre-fitted classifier
    is provided, it is used to compute :math:`{\mathbf{Z}}`.
def decision_function(self, X=None, T=None):
    """Predicts confidence scores from X or T."""
Should touch this up further. Here's a start:

def decision_function(self, X=None, T=None):
    r"""Predicts confidence scores from X or T.

    .. math::
        \mathbf{Z} = \mathbf{T} \mathbf{P}_{TZ}
                   = \mathbf{X} \mathbf{P}_{XT} \mathbf{P}_{TZ}

    Parameters
    ----------
    X : ndarray, shape (n_samples, n_features)
        Original data for which we want to get confidence scores, where
        n_samples is the number of samples and n_features is the number
        of features.
    T : ndarray, shape (n_samples, n_components)
        Projected data for which we want to get confidence scores, where
        n_samples is the number of samples and n_components is the number
        of components.

    Returns
    -------
    Z : ndarray
        Confidence scores.
    """
We're almost there. I think the docs still need some work, but most of my changes are nitpicky. Also, I'm not seeing the examples show up in the section navigation dropdown.
Should we also change the PCovR section of the landing page to something hybrid learning-esque?
Adding PCovC code, examples, and testing suite. Adding a PCov base class that both PCovR and PCovC inherit from.
📚 Documentation preview 📚: https://scikit-matter--248.org.readthedocs.build/en/248/