Error when Used with Categorical/Text features #20

Open
ddofer opened this issue Dec 25, 2024 · 2 comments
ddofer commented Dec 25, 2024

Is there any way to use the tool with categorical/text features?
I'm using a fitted CatBoost model (that handles the feature's transformations itself). Since the method works on the shap values of the features, it should not need the raw features to be transformed. Nevertheless, I get an error when trying:

model = CatBoostClassifier(cat_features=categorical_cols, text_features=text_cols)
model.fit(X_train, y_train)

This works fine:

explainer = shap.Explainer(model)
shap_values = explainer.shap_values(Pool(X_train, y_train, cat_features=categorical_cols,text_features=text_cols))
shap.summary_plot(shap_values, X_train)

Running shap_select results in an error:


df_val = pd.concat([X_test, y_test], axis=1)

from shap_select import shap_select
selected_features_df = shap_select(tree_model=clf_model, validation_df=df_val, target="y", task="binary", threshold=0.5)

Output error:


TypeError Traceback (most recent call last)
File _catboost.pyx:2547, in _catboost.get_float_feature()

File _catboost.pyx:1226, in _catboost._FloatOrNan()

File _catboost.pyx:1021, in _catboost._FloatOrNanFromString()

TypeError: Cannot convert 'b'Secondary education teaching professionals'' to float

During handling of the above exception, another exception occurred:

CatBoostError Traceback (most recent call last)
Cell In[57], line 4
1 from shap_select import shap_select
2 # Here model is any model supported by the shap library, fitted on a different (train) dataset
3 # Task can be regression, binary, or multiclass
----> 4 selected_features_df = shap_select(tree_model=clf_model, validation_df=df_val, target="y", task="binary", threshold=0.5)

File /opt/anaconda3/envs/MedRag/lib/python3.11/site-packages/shap_select/select.py:316, in shap_select(tree_model, validation_df, target, feature_names, task, threshold, return_extended_data, alpha)
312 shap_features = create_shap_features(
313 tree_model, validation_df[feature_names], unique_classes
314 )
315 else:
--> 316 shap_features = create_shap_features(tree_model, validation_df[feature_names])
318 # Compute statistical significance of each feature, recursively ablating
319 significance_df = iterative_shap_feature_reduction(
320 shap_features, target, task, alpha
321 )

File /opt/anaconda3/envs/MedRag/lib/python3.11/site-packages/shap_select/select.py:24, in create_shap_features(tree_model, validation_df, classes)
9 def create_shap_features(
10 tree_model: Any, validation_df: pd.DataFrame, classes: List | None = None
11 ) -> pd.DataFrame | Dict[Any, pd.DataFrame]:
12 """
13 Generates SHAP (SHapley Additive exPlanations) values for a given tree-based model on a validation dataset.
14
(...)
22 corresponds to the SHAP values of a feature, and the rows match the index of the validation_df.
23 """
---> 24 explainer = shap.Explainer(tree_model, model_output="raw")(validation_df)
25 shap_values = explainer.values
27 if len(shap_values.shape) == 2:

File /opt/anaconda3/envs/MedRag/lib/python3.11/site-packages/shap/explainers/_tree.py:262, in TreeExplainer.__call__(self, X, y, interactions, check_additivity)
259 feature_names = getattr(self, "data_feature_names", None)
261 if not interactions:
--> 262 v = self.shap_values(X, y=y, from_call=True, check_additivity=check_additivity, approximate=self.approximate)
263 if isinstance(v, list):
264 v = np.stack(v, axis=-1) # put outputs at the end

File /opt/anaconda3/envs/MedRag/lib/python3.11/site-packages/shap/explainers/_tree.py:464, in TreeExplainer.shap_values(self, X, y, tree_limit, approximate, check_additivity, from_call)
462 import catboost
463 if type(X) != catboost.Pool:
--> 464 X = catboost.Pool(X, cat_features=self.model.cat_feature_indices)
465 phi = self.model.original_model.get_feature_importance(data=X, fstr_type='ShapValues')
467 # note we pull off the last column and keep it as our expected_value

File /opt/anaconda3/envs/MedRag/lib/python3.11/site-packages/catboost/core.py:855, in Pool.__init__(self, data, label, cat_features, text_features, embedding_features, embedding_features_data, column_description, pairs, graph, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count, log_cout, log_cerr, data_can_be_none)
849 if isinstance(feature_names, PATH_TYPES):
850 raise CatBoostError(
851 "feature_names must be None or have non-string type when the pool is created from "
852 "python objects."
853 )
--> 855 self._init(data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, graph, weight,
856 group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)
857 elif not data_can_be_none:
858 raise CatBoostError("'data' parameter can't be None")

File /opt/anaconda3/envs/MedRag/lib/python3.11/site-packages/catboost/core.py:1491, in Pool._init(self, data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, graph, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)
1489 if feature_tags is not None:
1490 feature_tags = self._check_transform_tags(feature_tags, feature_names)
-> 1491 self._init_pool(data, label, cat_features, text_features, embedding_features, embedding_features_data, pairs, graph, weight,
1492 group_id, group_weight, subgroup_id, pairs_weight, baseline, timestamp, feature_names, feature_tags, thread_count)

File _catboost.pyx:4339, in _catboost._PoolBase._init_pool()

File _catboost.pyx:4391, in _catboost._PoolBase._init_pool()

File _catboost.pyx:4200, in _catboost._PoolBase._init_features_order_layout_pool()

File _catboost.pyx:3127, in _catboost._set_features_order_data_pd_data_frame()

File _catboost.pyx:2591, in _catboost.create_num_factor_data()

File _catboost.pyx:2549, in _catboost.get_float_feature()

CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=2]="Secondary education teaching professionals": Cannot convert 'b'Secondary education teaching professionals'' to float
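From the traceback, the failure looks like it happens inside shap's TreeExplainer, which rebuilds the Pool with only cat_feature_indices and no text_features, so the text columns get coerced to float. As a point of comparison, here is a minimal sketch of getting the SHAP values straight from CatBoost with a Pool that declares both feature types (same variable names as above; not a fix for shap_select itself, and not verified end-to-end):

from catboost import Pool

# Declare both categorical and text columns so CatBoost does not try to cast the text columns to float
val_pool = Pool(X_test, y_test, cat_features=categorical_cols, text_features=text_cols)

# CatBoost can return SHAP values directly (the same computation shap's TreeExplainer delegates to, per the traceback);
# the last column of the returned matrix is the expected value
phi = model.get_feature_importance(data=val_pool, type="ShapValues")
shap_values, expected_value = phi[:, :-1], phi[:, -1]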

ddofer commented Dec 25, 2024

Addendum: Disabling the use of text_features in CatBoost does seem to solve part of this, although a different fatal error then occurs:


lib/python3.11/site-packages/statsmodels/base/l1_solvers_common.py:71: ConvergenceWarning: QC check did not pass for 189 out of 204 parameters
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
  warnings.warn(message, ConvergenceWarning)
/lib/python3.11/site-packages/statsmodels/base/l1_solvers_common.py:144: ConvergenceWarning: Could not trim params automatically due to failed QC check. Trimming using trim_mode == 'size' will still work.
  warnings.warn(msg, ConvergenceWarning)
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
Cell In[68], line 4
      1 from shap_select import shap_select
      2 # Here model is any model supported by the shap library, fitted on a different (train) dataset
----> 4 selected_features_df = shap_select(tree_model=clf_model, validation_df=df_val, target="y", task="binary", threshold=0.05)

File /lib/python3.11/site-packages/shap_select/select.py:319, in shap_select(tree_model, validation_df, target, feature_names, task, threshold, return_extended_data, alpha)
    316     shap_features = create_shap_features(tree_model, validation_df[feature_names])
    318 # Compute statistical significance of each feature, recursively ablating
--> 319 significance_df = iterative_shap_feature_reduction(
    320     shap_features, target, task, alpha
    321 )
    323 # Add 'Selected' column based on the threshold
    324 significance_df["selected"] = (
    325     significance_df["stat.significance"] < threshold
    326 ).astype(int)

File /lib/python3.11/site-packages/shap_select/select.py:236, in iterative_shap_feature_reduction(shap_features, target, task, alpha)
    233 features_left = True
    234 while features_left:
    235     # Call the original shap_features_to_significance function
--> 236     significance_df = shap_features_to_significance(
    237         shap_features, target, task, alpha
    238     )
    240     # Find the feature with the lowest t-value
    241     min_t_value_row = significance_df.loc[significance_df["t-value"].idxmin()]

File /lib/python3.11/site-packages/shap_select/select.py:211, in shap_features_to_significance(shap_features, target, task, alpha)
    209     result_df = regression_significance(shap_features, target, alpha)
    210 elif task == "binary":
--> 211     result_df = binary_classifier_significance(shap_features, target, alpha)
    212 elif task == "multiclass":
    213     result_df = multi_classifier_significance(shap_features, target, alpha)

File /lib/python3.11/site-packages/shap_select/select.py:70, in binary_classifier_significance(shap_features, target, alpha)
     68 # Fit the logistic regression model that will generate confidence intervals
     69 logit_model = sm.Logit(target, shap_features_with_constant)
---> 70 result = logit_model.fit_regularized(disp=False, alpha=alpha)
     72 # Extract the results
     73 summary_frame = result.summary2().tables[1]

File /lib/python3.11/site-packages/statsmodels/discrete/discrete_model.py:565, in BinaryModel.fit_regularized(self, start_params, method, maxiter, full_output, disp, callback, alpha, trim_mode, auto_trim_tol, size_trim_tol, qc_tol, **kwargs)
    557 @Appender(DiscreteModel.fit_regularized.__doc__)
    558 def fit_regularized(self, start_params=None, method='l1',
    559         maxiter='defined_by_method', full_output=1, disp=1, callback=None,
    560         alpha=0, trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=1e-4,
    561         qc_tol=0.03, **kwargs):
    563     _validate_l1_method(method)
--> 565     bnryfit = super().fit_regularized(start_params=start_params,
    566                                       method=method,
    567                                       maxiter=maxiter,
    568                                       full_output=full_output,
    569                                       disp=disp,
    570                                       callback=callback,
    571                                       alpha=alpha,
    572                                       trim_mode=trim_mode,
    573                                       auto_trim_tol=auto_trim_tol,
    574                                       size_trim_tol=size_trim_tol,
    575                                       qc_tol=qc_tol,
    576                                       **kwargs)
    578     discretefit = L1BinaryResults(self, bnryfit)
    579     return L1BinaryResultsWrapper(discretefit)

File /lib/python3.11/site-packages/statsmodels/discrete/discrete_model.py:402, in DiscreteModel.fit_regularized(self, start_params, method, maxiter, full_output, disp, callback, alpha, trim_mode, auto_trim_tol, size_trim_tol, qc_tol, qc_verbose, **kwargs)
    399 else:
    400     pass  # make a function factory to have multiple call-backs
--> 402 mlefit = super().fit(start_params=start_params,
    403                      method=method,
    404                      maxiter=maxiter,
    405                      full_output=full_output,
    406                      disp=disp,
    407                      callback=callback,
    408                      extra_fit_funcs=extra_fit_funcs,
    409                      cov_params_func=cov_params_func,
    410                      **kwargs)
    412 return mlefit

File /lib/python3.11/site-packages/statsmodels/base/model.py:580, in LikelihoodModel.fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    578 cov_params_func = kwargs.setdefault('cov_params_func', None)
    579 if cov_params_func:
--> 580     Hinv = cov_params_func(self, xopt, retvals)
    581 elif method == 'newton' and full_output:
    582     Hinv = np.linalg.inv(-retvals['Hessian']) / nobs

File /lib/python3.11/site-packages/statsmodels/discrete/discrete_model.py:430, in DiscreteModel.cov_params_func_l1(self, likelihood_model, xopt, retvals)
    428     H_restricted = H[nz_idx[:, None], nz_idx]
    429     # Covariance estimate for the nonzero params
--> 430     H_restricted_inv = np.linalg.inv(-H_restricted)
    431 else:
    432     H_restricted_inv = np.zeros(0)

File /lib/python3.11/site-packages/numpy/linalg/linalg.py:561, in inv(a)
    559 signature = 'D->D' if isComplexType(t) else 'd->d'
    560 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 561 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    562 return wrap(ainv.astype(result_t, copy=False))

File /lib/python3.11/site-packages/numpy/linalg/linalg.py:112, in _raise_linalgerror_singular(err, flag)
    111 def _raise_linalgerror_singular(err, flag):
--> 112     raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix
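The convergence warnings above make me suspect that, with categorical features in play, some of the SHAP columns end up constant or (near-)perfectly collinear, which would leave the Hessian of the regularized logit singular. A quick diagnostic sketch I would run on the SHAP matrix (computed with plain shap as in the first snippet; the helper name is mine, not part of shap_select, and it assumes shap_values is a 2-D array of shape (n_samples, n_features)):

import numpy as np
import pandas as pd

def diagnose_shap_matrix(shap_df: pd.DataFrame) -> None:
    # Constant columns, e.g. all-zero SHAP values for a feature the model never uses
    print("constant columns:", shap_df.columns[shap_df.nunique() <= 1].tolist())
    # Rank deficiency means exact linear dependence between SHAP columns
    values = shap_df.to_numpy(dtype=float)
    print("rank", np.linalg.matrix_rank(values), "of", values.shape[1], "columns")
    # Pairs of near-duplicate SHAP columns
    corr = shap_df.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    print(corr.where(mask).stack().loc[lambda s: s > 0.999])

diagnose_shap_matrix(pd.DataFrame(shap_values, columns=X_train.columns))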

ddofer changed the title from "Error when Used with Categorical/Text features (Catboost)" to "Error when Used with Categorical/Text features" on Dec 25, 2024
ddofer commented Dec 25, 2024

The singular-matrix error also happens with sklearn's HistGradientBoostingClassifier.
It occurs when using categorical features. (Converting the features to Categorical dtype and encoding them as ordinals beforehand made no difference; the same singular-matrix error appears.)

Without them it works.
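For reference, the preparation was roughly the following (sketch from memory, with illustrative column names):

from sklearn.ensemble import HistGradientBoostingClassifier

# Ordinal-encode the categorical columns before fitting
for col in categorical_cols:
    X_train[col] = X_train[col].astype("category").cat.codes
    X_test[col] = X_test[col].astype("category").cat.codes

# Tell the sklearn model which columns are categorical (by position)
clf_model = HistGradientBoostingClassifier(
    categorical_features=[X_train.columns.get_loc(c) for c in categorical_cols]
)
clf_model.fit(X_train, y_train)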
