
Major revision to simplify; in particular, to remove obsfit, obspredict, obstransform #30

Merged 27 commits into dev on Oct 2, 2024

Conversation

@ablaom (Member) commented May 19, 2024

This is a substantial revision aimed at simplification; its most important goal is to provide a cleaner opt-out of the obs business.

The methods obsfit, obspredict and obstransform were previously introduced to avoid type ambiguities when overloading the existing methods fit, predict and transform to accept the output of obs. The issue is now resolved by requiring primary implementations of fit, predict and transform to accept a single data argument. For example, an implementation now implements fit(algorithm, (X, y)) instead of fit(algorithm, X, y). However, the latter method is still provided as a convenience method, which means the previous user interface is preserved.
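
Roughly, for a hypothetical Ridge algorithm, the pattern looks like this (names and bodies below are placeholders, not code from this PR; note also that, per the later comment, the data-slurping form ends up being supplied by a fallback):

```julia
using LearnAPI

# hypothetical algorithm struct with a keyword constructor
struct Ridge
    lambda::Float64
end
Ridge(; lambda=0.1) = Ridge(lambda)

# primary implementation: `fit` takes a single `data` argument, here a tuple
function LearnAPI.fit(algorithm::Ridge, data; verbosity=1)
    X, y = data
    # ... compute the learned parameters and return a fitted `model` object
end

# convenience form preserving the old user interface, `fit(algorithm, X, y)`
LearnAPI.fit(algorithm::Ridge, X, y; kwargs...) =
    LearnAPI.fit(algorithm, (X, y); kwargs...)
```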

Other changes include:

  • The obs signatures have been simplified: instead of obs(::typeof(fit), algorithm, data), etc., we have obs(algorithm, data) (for observations passed to fit) and obs(model, data) (for observations passed to predict or transform). This item, and the constructor and target items below, are illustrated in the sketch following this list.

  • Simplification of the requirements of an "algorithm": each algorithm overloads a new method (trait) LearnAPI.constructor(algorithm) returning a keyword constructor, and the algorithm qualifies so long as it can be recovered from that constructor (which also provides a way to make mutated copies), and that is all. See the new docs for details. There are other reasons for adding the constructor trait.

  • The traits position_of_target and position_of_weights are replaced by cleaner and more general methods target(algorithm, data) and weights(algorithm, data) to do the extraction. (The methods target(algorithm) and weights(algorithm), taking values true or false, replace the traits themselves.)

  • Adds some new KindsOfProxy types and re-organises them under three abstract subtypes. In particular, we now have types applying to ordinary distribution-fitting algorithms.

  • Adds the training_predictions accessor function; see my Julia Discourse comment following "...but I think we can handle it using the proposed API if we add one accessor function.", which addresses an issue raised by @jeremiedb.

  • Adds a new trait data_interface indicating the observation access implemented for the data output by obs. The fallback is for the MLUtils.jl interface, but another option is a plain iterate interface (think data loaders).

  • Resolves #29 (Add new KindOfProxy: LabelAmbiguousFuzzy) and #18 (Add fit_transform?).
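
A sketch consolidating the obs, constructor and target changes above, again for a hypothetical Ridge algorithm (illustrative only; the docs have the authoritative signatures):

```julia
using LearnAPI

struct Ridge                       # hypothetical algorithm
    lambda::Float64
end
Ridge(; lambda=0.1) = Ridge(lambda)

# new `constructor` trait: returns a keyword constructor from which the
# algorithm can be recovered (and from which mutated copies can be made)
LearnAPI.constructor(::Ridge) = Ridge

# simplified `obs` signature for fit-time data (here just a pass-through);
# predict/transform-time data dispatches on the fitted `model` instead
LearnAPI.obs(::Ridge, data) = data

# target extraction, replacing the old `position_of_target` trait;
# here `data` is assumed to be a tuple `(X, y)`
LearnAPI.target(::Ridge, data) = last(data)
```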

To do:

  • Consider dropping some of the traits governing the form of method output/input. Perhaps it is enough to stick to scitype (rather than both scitype and typeof) and to only worry about the types of individual observations, not about containers.

  • The only traits dealing with the type of input and output that I've kept are fit_observation_scitype and target_observation_scitype (and kinds_of_proxy, which articulates the supported forms of predict output: point predictions, distributions, etc.).

  • Add update, update_observations and update_features

@ablaom marked this pull request as draft May 19, 2024 05:00
@ablaom mentioned this pull request May 19, 2024
@ablaom marked this pull request as ready for review May 21, 2024 04:37
@jeremiedb commented Jun 14, 2024

Interesting PR @ablaom, the changes seem overall to move in a promising direction!

Quick consideration regarding predict: this PR uses the prediction type (e.g. LiteralTarget) as a second argument. My understanding was that it could be preferable to have a 2-arg function predict(model, data), and a 3-arg method to cover situations where the supported predict basis differs from the default LiteralTarget.


Looking at a rough LearnAPI MWE for EvoTrees led me to the question of how to integrate support for LearnAPI through the ecosystem. Current drawbacks I see in MLJ are:

  • Forces adding a dependency (MLJModelInterface, although a light one)
  • Forces type inheritance on the now-called algorithm struct (e.g. MMI.Deterministic). This limits flexibility in the design of these algo structs, and support for other ML interfaces that would also require such type inheritance.
  • The above leads to many separate packages that act as MLJ wrappers (MLJXGBoostInterface.jl). I see it as desirable for an ML interface not to result in a spin-off package for each algo, as I think it adds to the fragmentation sometimes perceived in parts of the Julia package ecosystem.

Use of a package extension may address the above concerns. It makes usage of a specific ML interface both opt-in and very lightweight for the user. And maybe most importantly, it can make it easy for an algo to be compatible with multiple interfaces (e.g. a time-series-specialized interface).
Here's a very minimal implementation: https://github.com/Evovest/EvoTrees.jl/blob/learnAPI/ext/EvoTreesLearnAPIExt/EvoTreesLearnAPIExt.jl
And an example of usage:
https://github.com/Evovest/EvoTrees.jl/blob/learnAPI/experiments/learnAPI/test.jl

Note that I used the single data argument to fit in a slightly different fashion: instead of passing a tuple (X, y), I just pass a df::DataFrame. This results in the need to specify the feature and target names as kwargs.
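
Roughly, the pattern looks like this (the keyword names below are just for illustration; the actual names are in the linked branch):

```julia
using LearnAPI, DataFrames

struct Booster end   # stand-in for an EvoTrees-style algorithm

# `data` is a single DataFrame; feature and target names arrive as kwargs
function LearnAPI.fit(algorithm::Booster, df::DataFrame;
                      feature_names, target_name=:y, verbosity=1)
    X = df[:, feature_names]    # just the requested feature columns
    y = df[:, target_name]
    # ... train on X, y and return a fitted model
end
```
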
Benefits I see:

  • To the extent that MLJ's scope is tabular / table-driven modeling, in practice the process producing the fit-ready data was most likely performed on a table/DataFrame in which the target is present. A single table is therefore the natural entity entering the fitting step in a modeling pipeline.
  • Since it's table-based, column names are available, making it easy to fit models on distinct sets of features. It only requires changing the feature names passed as a kwarg, rather than passing multiple copies or views of the same full data source.
  • With other potential ancillary variables such as weight and offset, a length-4 tuple input (e.g. (X, y, weight, offset)) appears like a more convoluted approach than sticking to a single table.

Here I understand that this kwarg approach could still be used, as I did in the EvoTrees test implementation, but without it being the default/recommended pattern, I think there is an increased risk of seeing various usage patterns emerge.
I also noticed that utility functions to extract target and weight were anticipated:

LearnAPI.target(::Ridge, data) = last(data)

However, it appears these "extract" functions assume that a tuple is used as the single data input. Could they be made to work for a single data frame input? I couldn't think of a way to implement LearnAPI.target(::Ridge, data), where data is a table, other than by making an assumption about the target name right in the algo implementation, which is obviously problematic.

An example of the potential proliferation of patterns from using a tuple as the data input is the convenience method where X and y are passed as positional arguments:

LearnAPI.fit(algorithm::Ridge, X, y; kwargs...) = ...

If the fit interface was strict about the 1-arg data input, I see potential for reserving an optional additional arg for eval data. Otherwise, such eval data will be passed as a kwarg and potentially be subject to the same ambiguity about the nature of that data (tuple, single table), while not benefiting from the dispatch that, AFAIK, exists only on the positional args.

Regarding the handling of iterative models, with eval data, early stopping, etc., I remain unclear as to what the story should be, other than the hints above on reserved positional args for train and eval data. The current design in EvoTrees works OK in practice, but it remains debatable whether early-stopping rounds and/or metric functions should be part of the algo's hyper-params struct, or just be kwargs to the fit function, or part of a specialized early-stopping tracking struct.
https://github.com/Evovest/EvoTrees.jl/blob/1058fb895478f6ac123c573d5c83a79a344509be/experiments/learnAPI/test.jl#L56
Also, if there ends up being a formal interface for iterative models with early stopping and monitoring of out-of-sample metrics, I think it will require the exposure of a mutating fit! method, and potentially an eval_predictions in addition to training_predictions.

Not sure that I've fully grasped the obs usage story, notably since at first sight it seems like it could be handled through dispatch, with multiple predict methods for various data input types. I'm also influenced here by the perspective that I typically expect the input type for both fit and predict to be the same (a table, and more commonly a DataFrame).

Device / accelerator: did you have a take on how to specify the device on which to perform computation? My personal experience has been positive with the EvoTrees / NeuroTrees approach where a device kwarg is passed to fit (:cpu / :gpu).

@ablaom (Member, Author) commented Oct 2, 2024

Thanks @jeremiedb for spending time wading through the PR and trying it out with EvoTrees. I've made some tweaks. Note that I've changed LiteralTarget to Point. And the data-slurping signatures of fit, predict, etc., are now provided by a fallback.

I've also detailed tentative contracts for new update methods (see also #13).

My understanding was that it could be preferable to have a 2-arg function predict(model, data), and a 3-arg method to cover situations where the supported predict basis differs from the default LiteralTarget.

Yes, we have that now: predict(algorithm, data) is now bound to predict(algorithm, kind, data), where kind is the first element of LearnAPI.kinds_of_proxy(algorithm) (a trait new algorithms will need to implement).
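
For example (illustrative only, not the exact fallback source):

```julia
using LearnAPI

struct Ridge end   # hypothetical algorithm

# a new algorithm declares the kinds of target proxy its `predict` supports;
# the first element is what the 2-arg `predict(model, data)` falls back to
LearnAPI.kinds_of_proxy(::Ridge) = (LearnAPI.Point(), LearnAPI.Distribution())

# so, given `model = fit(Ridge(), data)`, these calls are equivalent:
#   predict(model, LearnAPI.Point(), Xnew)
#   predict(model, Xnew)
```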

However, it appears these "extract" functions assume that a tuple is used as the single data input. Could they be made to work for a single data frame input? I couldn't think of a way to implement LearnAPI.target(::Ridge, data), where data is a table, other than by making an assumption about the target name right in the algo implementation, which is obviously problematic.

data need not be a tuple. It could be a single table, including the target variable. In that case the implementation will need to include :target (the name of the target column) as a hyperparameter (or kwarg) and overload target(algorithm, data) to return just the target, and features(algorithm, data) to return just the features (something that can be passed to predict to get the training predictions).
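
For concreteness, something along these lines (a hypothetical algorithm, assuming the table is a DataFrame):

```julia
using LearnAPI, DataFrames

# hypothetical algorithm whose `data` is a single DataFrame containing the
# target; the target column name is a hyperparameter
struct TableRidge
    lambda::Float64
    target::Symbol
end
TableRidge(; lambda=0.1, target=:y) = TableRidge(lambda, target)
LearnAPI.constructor(::TableRidge) = TableRidge

# return just the target column ...
LearnAPI.target(algorithm::TableRidge, data::DataFrame) = data[:, algorithm.target]

# ... and just the features (something that can be passed to `predict`)
LearnAPI.features(algorithm::TableRidge, data::DataFrame) =
    select(data, Not(algorithm.target))
```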

It should even be possible to simultaneously support both scikit-learn-style input (X, y) and the R-style input just described, and perhaps provide a utility for the boilerplate code realizing this.

Forces adding a dependency (MLJModelInterface, although a light one)

I don't see how one can completely dispense with a base package. Look at the success of Tables.jl. A base API package is useful. MLJModelInterface is not perfect, but it is lightweight.

LearnAPI.jl has zero dependencies, and basically zero functionality (a few signature fallbacks aside).

Forces type inheritance on the now-called algorithm struct (e.g. MMI.Deterministic).

There is no type hierarchy in LearnAPI.jl and no base algorithm type to subtype.

If the fit interface was strict about the 1-arg data input, I see potential for reserving an optional additional arg for eval data. Otherwise, such eval data will be passed as a kwarg and potentially be subject to the same ambiguity about the nature of that data (tuple, single table), while not benefiting from the dispatch that, AFAIK, exists only on the positional args.

Do you mean "If the fit interface was not strict about the 1 arg data input"?

Although I favour external out-of-sample evaluation of some kind (see below) the current
proposal does not preclude a call like this:

fit(algorithm, X, y, Xtest, ytest)

Since there are multiple arguments, a fallback now calls fit(algorithm, (X, y, Xtest, ytest)), and this is the method that needs to be implemented. However, since (X, y, Xtest, ytest) does not implement the MLUtils.jl interface (because X and Xtest have different numbers of observations), you are required to explicitly overload obs(algorithm, (X, y, Xtest, ytest)) to return something that does implement the interface. To do so, return a thinly wrapped version and implement Base.length and Base.getindex to subsample X and y but leave Xtest and ytest alone (or transform them into the actual form you need, such as matrices). You then overload fit to handle this thinly wrapped version, which will see all the data, and you take care of fit(algorithm, (X, y, Xtest, ytest)) by calling this latter implementation.
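
A minimal sketch of the wrapping just described (the wrapper type and the assumption that observations are matrix rows are mine, not from the PR):

```julia
using LearnAPI

struct Booster end   # hypothetical iterative algorithm

# thin wrapper giving (X, y, Xtest, ytest) an MLUtils.jl-style data interface:
# `length`/`getindex` subsample only the training part; the eval data rides along
struct TrainEvalData{TX,Ty,TXt,Tyt}
    X::TX
    y::Ty
    Xtest::TXt
    ytest::Tyt
end
Base.length(d::TrainEvalData) = length(d.y)
Base.getindex(d::TrainEvalData, I) =
    TrainEvalData(d.X[I, :], d.y[I], d.Xtest, d.ytest)  # assumes rows are observations

# `obs` returns the wrapped form ...
LearnAPI.obs(::Booster, data::Tuple{Any,Any,Any,Any}) = TrainEvalData(data...)

# ... `fit` is implemented for the wrapped form, and the 4-tuple form defers to it:
function LearnAPI.fit(algorithm::Booster, d::TrainEvalData; verbosity=1)
    # ... train on (d.X, d.y), monitoring out-of-sample loss on (d.Xtest, d.ytest)
end
LearnAPI.fit(algorithm::Booster, data::Tuple{Any,Any,Any,Any}; kwargs...) =
    LearnAPI.fit(algorithm, LearnAPI.obs(algorithm, data); kwargs...)
```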

I'm not sure if this has any benefit over specifying Xtest and ytest as kwargs, which
is also allowed.

Returning to the idea of externally (but efficiently) monitoring out-of-sample loss, you say:

I think it will require the exposure of a mutating fit! method, and potentially an eval_predictions in addition to training_predictions.

I am keen to ensure this kind of thing can work. Could you please detail what fit! and eval_predictions are intended to do here? I'm assuming that evaluation of the actual loss (and specification of the metric) is the responsibility of the external meta-algorithm. (Note that I have added an update method for increasing the iteration parameter
via a warm restart).

Not sure that I've fully grasped the obs usage story, notably since at first sight it seems like it could be handled through dispatch, with multiple predict methods for various data input types.

Maybe the cross-validation example in the doc page for obs helps.

Device / accelerator: did you have a take on how to specify the device on which to perform computation? My personal experience has been positive with the EvoTrees / NeuroTrees approach where a device kwarg is passed to fit (:cpu / :gpu).

An expert on this kind of thing, @jpsamaroo, suggested dispatching on types provided by ComputationalResources.jl (CPU1, CPUProcesses, CUDALibs, etc.), so MLJ went with that. I guess this future-proofs against different (non-CUDA) GPU types. Take a look here. Instead of device we use acceleration.
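
For concreteness, a minimal sketch of that dispatch-based approach (algorithm and function names are invented; this is not part of LearnAPI):

```julia
using ComputationalResources

# hypothetical algorithm carrying an `acceleration` field, MLJ-style
struct NeuralRegressor{R}
    acceleration::R          # e.g. CPU1(), CPUProcesses(), CUDALibs()
end
NeuralRegressor(; acceleration=CPU1()) = NeuralRegressor(acceleration)

# internal training routines can then dispatch on the resource type:
train!(::CPU1, state) = nothing        # single-process CPU path (stub)
train!(::CUDALibs, state) = nothing    # CUDA GPU path (stub)
```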

I'm going to merge this PR to make the docs more generally available for feedback. I'm not
planning a new release just yet.

codecov bot commented Oct 2, 2024

Codecov Report

Attention: Patch coverage is 39.28571% with 17 lines in your changes missing coverage. Please review.

Project coverage is 62.65%. Comparing base (86ba3d5) to head (e25e4e7).
Report is 29 commits behind head on dev.

Files with missing lines Patch % Lines
src/traits.jl 30.00% 7 Missing ⚠️
src/fit_update.jl 20.00% 4 Missing ⚠️
src/predict_transform.jl 50.00% 3 Missing ⚠️
src/target_weights_features.jl 25.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev      #30      +/-   ##
==========================================
- Coverage   65.81%   62.65%   -3.17%     
==========================================
  Files           9       10       +1     
  Lines         117       83      -34     
==========================================
- Hits           77       52      -25     
+ Misses         40       31       -9     


@ablaom merged commit f2d4df8 into dev Oct 2, 2024
5 of 7 checks passed