
Major revision to simplify; in particular, to remove obsfit, obspredict, obstransform #30

Merged 27 commits into dev on Oct 2, 2024

Conversation

@ablaom (Member) commented May 19, 2024

This is a substantial revision aimed at simplification; its most important goal is to provide a cleaner opt-out of the obs business.

The methods obsfit, obspredict and obstransform were previously introduced to avoid type ambiguities when overloading the existing methods fit, predict and transform to accept the output of obs. The issue is now resolved by requiring primary implementations of fit, predict and transform to accept a single data argument. For example, an implementation now implements fit(algorithm, (X, y)) instead of fit(algorithm, X, y). However, the latter method is still provided as a convenience method, which means the previous user interface is preserved.
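
Roughly, for a hypothetical Ridge algorithm, the pattern looks like this (names and bodies below are placeholders, not code from this PR; note also that, per the later comment, the data-slurping form ends up being supplied by a fallback):

```julia
using LearnAPI

# hypothetical algorithm struct with a keyword constructor
struct Ridge
    lambda::Float64
end
Ridge(; lambda=0.1) = Ridge(lambda)

# primary implementation: `fit` takes a single `data` argument, here a tuple
function LearnAPI.fit(algorithm::Ridge, data; verbosity=1)
    X, y = data
    # ... compute the learned parameters and return a fitted `model` object
end

# convenience form preserving the old user interface, `fit(algorithm, X, y)`
LearnAPI.fit(algorithm::Ridge, X, y; kwargs...) =
    LearnAPI.fit(algorithm, (X, y); kwargs...)
```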

Other changes include:

  • The obs signatures have been simplified: instead of obs(::typeof(fit), algorithm, data), etc., we have obs(algorithm, data) (for observations passed to fit) and obs(model, data) (for observations passed to predict or transform). This item, and the constructor and target items below, are illustrated in the sketch following this list.

  • Simplification of the requirements of an "algorithm": each algorithm overloads a new method (trait) LearnAPI.constructor(algorithm) returning a keyword constructor, and the algorithm qualifies so long as it can be recovered from that constructor (which also provides a way to make mutated copies), and that is all. See the new docs for details. There are other reasons for adding the constructor trait.

  • The traits position_of_target and position_of_weights are replaced by cleaner and more general methods target(algorithm, data) and weights(algorithm, data) to do the extraction. (The methods target(algorithm) and weights(algorithm), taking values true or false, replace the traits themselves.)

  • Adds some new KindsOfProxy types and re-organises them under three abstract subtypes. In particular, we now have types applying to ordinary distribution-fitting algorithms.

  • Adds the training_predictions accessor function; see my Julia Discourse comment following "...but I think we can handle it using the proposed API if we add one accessor function.", which addresses an issue raised by @jeremiedb.

  • Adds a new trait data_interface indicating the observation access implemented for the data output by obs. The fallback is for the MLUtils.jl interface, but another option is a plain iterate interface (think data loaders).

  • Resolves #29 (Add new KindOfProxy: LabelAmbiguousFuzzy) and #18 (Add fit_transform?).
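
A sketch consolidating the obs, constructor and target changes above, again for a hypothetical Ridge algorithm (illustrative only; the docs have the authoritative signatures):

```julia
using LearnAPI

struct Ridge                       # hypothetical algorithm
    lambda::Float64
end
Ridge(; lambda=0.1) = Ridge(lambda)

# new `constructor` trait: returns a keyword constructor from which the
# algorithm can be recovered (and from which mutated copies can be made)
LearnAPI.constructor(::Ridge) = Ridge

# simplified `obs` signature for fit-time data (here just a pass-through);
# predict/transform-time data dispatches on the fitted `model` instead
LearnAPI.obs(::Ridge, data) = data

# target extraction, replacing the old `position_of_target` trait;
# here `data` is assumed to be a tuple `(X, y)`
LearnAPI.target(::Ridge, data) = last(data)
```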

To do:

  • Consider dropping some of the traits governing the form of method output/input. Perhaps it is enough to stick to scitype (rather than both scitype and typeof) and to only worry about the types of individual observations, not about containers.

  • The only traits dealing with the type of input and output that I've kept are fit_observation_scitype and target_observation_scitype (and kinds_of_proxy, which articulates the supported forms of predict output: point predictions, distributions, etc.).

  • Add update, update_observations and update_features

@ablaom marked this pull request as draft May 19, 2024 05:00
@ablaom mentioned this pull request May 19, 2024
@ablaom marked this pull request as ready for review May 21, 2024 04:37
@jeremiedb commented Jun 14, 2024

Interesting PR @ablaom, the changes seem overall to move in a promising direction!

Quick consideration regarding predict: this PR uses the prediction type (e.g. LiteralTarget) as a second argument. My understanding was that it could be preferable to have a 2-arg function predict(model, data), and a 3-arg method to cover situations where the supported predict basis differs from the default LiteralTarget.


Looking at a rough LearnAPI MWE for EvoTrees led me to the question of how to integrate support for LearnAPI through the ecosystem. Current drawbacks I see in MLJ are:

  • Forces adding a dependency (MLJModelInterface, although a light one)
  • Forces type inheritance on the now-called algorithm struct (e.g. MMI.Deterministic). This limits flexibility in the design of these algo structs, and support for other ML interfaces that would also require such type inheritance.
  • The above leads to many separate packages that act as MLJ wrappers (MLJXGBoostInterface.jl). I see it as desirable for an ML interface not to result in a spin-off package for each algo, as I think it adds to the fragmentation sometimes perceived in parts of the Julia package ecosystem.

Use of a package extension may address the above concerns. It makes usage of a specific ML interface both opt-in and very lightweight for the user. And maybe most importantly, it can make it easy for an algo to be compatible with multiple interfaces (e.g. a time-series-specialized interface).
Here's a very minimal implementation: https://github.com/Evovest/EvoTrees.jl/blob/learnAPI/ext/EvoTreesLearnAPIExt/EvoTreesLearnAPIExt.jl
And an example of usage:
https://github.com/Evovest/EvoTrees.jl/blob/learnAPI/experiments/learnAPI/test.jl

Note that I used the single data argument to fit in a slightly different fashion: instead of passing a tuple (X, y), I just pass a df::DataFrame. This results in the need to specify the feature and target names as kwargs.
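
Roughly, the pattern looks like this (the keyword names below are just for illustration; the actual names are in the linked branch):

```julia
using LearnAPI, DataFrames

struct Booster end   # stand-in for an EvoTrees-style algorithm

# `data` is a single DataFrame; feature and target names arrive as kwargs
function LearnAPI.fit(algorithm::Booster, df::DataFrame;
                      feature_names, target_name=:y, verbosity=1)
    X = df[:, feature_names]    # just the requested feature columns
    y = df[:, target_name]
    # ... train on X, y and return a fitted model
end
```
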
Benefits I see:

  • To the extent that MLJ's scope is tabular / table-driven modeling, in practice the process producing the fit-ready data was most likely performed on a table/DataFrame in which the target is present. A single table is therefore the natural entity entering the fitting step in a modeling pipeline.
  • Since it's table-based, column names are available, making it easy to fit models on distinct sets of features. It only requires changing the feature names passed as a kwarg, rather than passing multiple copies or views of the same full data source.
  • With other potential ancillary variables such as weight and offset, a length-4 tuple input (e.g. (X, y, weight, offset)) appears like a more convoluted approach than sticking to a single table.

Here I understand that this kwarg approach could still be used, as I did in the EvoTrees test implementation, but without it being the default/recommended pattern, I think there is an increased risk of seeing various usage patterns emerge.
I also noticed that utility functions to extract target and weight were anticipated:

LearnAPI.target(::Ridge, data) = last(data)

However, it appears these "extract" functions assume that a tuple is used as the single data input. Could they be made to work for a single data frame input? I couldn't think of a way to implement LearnAPI.target(::Ridge, data), where data is a table, other than by making an assumption about the target name right in the algo implementation, which is obviously problematic.

An example of the potential proliferation of patterns from using a tuple as the data input is the convenience method where X and y are passed as positional arguments:

LearnAPI.fit(algorithm::Ridge, X, y; kwargs...) = ...

If the fit interface was strict about the 1-arg data input, I see potential for reserving an optional additional arg for eval data. Otherwise, such eval data will be passed as a kwarg and potentially be subject to the same ambiguity about the nature of that data (tuple, single table), while not benefiting from the dispatch that, AFAIK, exists only on the positional args.

Regarding the handling of iterative models, with eval data, early stopping, etc., I remain unclear as to what the story should be, other than the hints above on reserved positional args for train and eval data. The current design in EvoTrees works OK in practice, but it remains debatable whether early-stopping rounds and/or metric functions should be part of the algo's hyper-params struct, or just be kwargs to the fit function, or part of a specialized early-stopping tracking struct.
https://github.com/Evovest/EvoTrees.jl/blob/1058fb895478f6ac123c573d5c83a79a344509be/experiments/learnAPI/test.jl#L56
Also, if there ends up being a formal interface for iterative models with early stopping and monitoring of out-of-sample metrics, I think it will require the exposure of a mutating fit! method, and potentially an eval_predictions in addition to training_predictions.

Not sure that I've fully grasped the obs usage story, notably since at first sight it seems like it could be handled through dispatch, with multiple predict methods for various data input types. I'm also influenced here by the perspective that I typically expect the input type for both fit and predict to be the same (a table, and more commonly a DataFrame).

Device / accelerator: did you have a take on how to specify the device on which to perform computation? My personal experience has been positive with the EvoTrees / NeuroTrees approach where a device kwarg is passed to fit (:cpu / :gpu).

@ablaom (Member, Author) commented Oct 2, 2024

Thanks @jeremiedb for spending time wading through the PR and trying it out with EvoTrees. I've made some tweaks. Note that I've changed LiteralTarget to Point. And the data-slurping signatures of fit, predict, etc., are now provided by a fallback.

I've also detailed tentative contracts for new update methods (see also #13).

My understanding was that it could be preferable to have a 2-arg function predict(model, data), and a 3-arg method to cover situations where the supported predict basis differs from the default LiteralTarget.

Yes, we have that now: predict(algorithm, data) is now bound to predict(algorithm, kind, data), where kind is the first element of LearnAPI.kinds_of_proxy(algorithm) (a trait new algorithms will need to implement).
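
For example (illustrative only, not the exact fallback source):

```julia
using LearnAPI

struct Ridge end   # hypothetical algorithm

# a new algorithm declares the kinds of target proxy its `predict` supports;
# the first element is what the 2-arg `predict(model, data)` falls back to
LearnAPI.kinds_of_proxy(::Ridge) = (LearnAPI.Point(), LearnAPI.Distribution())

# so, given `model = fit(Ridge(), data)`, these calls are equivalent:
#   predict(model, LearnAPI.Point(), Xnew)
#   predict(model, Xnew)
```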

However, it appears these "extract" functions assume that a tuple is used as the single data input. Could they be made to work for a single data frame input? I couldn't think of a way to implement LearnAPI.target(::Ridge, data), where data is a table, other than by making an assumption about the target name right in the algo implementation, which is obviously problematic.

data need not be a tuple. It could be a single table, including the target variable. In that case the implementation will need to include :target (the name of the target column) as a hyperparameter (or kwarg) and overload target(algorithm, data) to return just the target, and features(algorithm, data) to return just the features (something that can be passed to predict to get the training predictions).
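
For concreteness, something along these lines (a hypothetical algorithm, assuming the table is a DataFrame):

```julia
using LearnAPI, DataFrames

# hypothetical algorithm whose `data` is a single DataFrame containing the
# target; the target column name is a hyperparameter
struct TableRidge
    lambda::Float64
    target::Symbol
end
TableRidge(; lambda=0.1, target=:y) = TableRidge(lambda, target)
LearnAPI.constructor(::TableRidge) = TableRidge

# return just the target column ...
LearnAPI.target(algorithm::TableRidge, data::DataFrame) = data[:, algorithm.target]

# ... and just the features (something that can be passed to `predict`)
LearnAPI.features(algorithm::TableRidge, data::DataFrame) =
    select(data, Not(algorithm.target))
```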

It should even be possible to simultaneously support both scikit-learn-style input (X, y) and the R-style input just described, and perhaps provide a utility for the boilerplate code realizing this.

Forces adding a dependency (MLJModelInterface, although a light one)

I don't see how one can completely dispense with a base package. Look at the success of Tables.jl. A base API package is useful. MLJModelInterface is not perfect, but it is lightweight.

LearnAPI.jl has zero dependencies, and basically zero functionality (a few signature fallbacks aside).

Forces type inheritance on the now-called algorithm struct (e.g. MMI.Deterministic).

There is no type hierarchy in LearnAPI.jl and no base algorithm type to subtype.

If the fit interface was strict about the 1-arg data input, I see potential for reserving an optional additional arg for eval data. Otherwise, such eval data will be passed as a kwarg and potentially be subject to the same ambiguity about the nature of that data (tuple, single table), while not benefiting from the dispatch that, AFAIK, exists only on the positional args.

Do you mean "If the fit interface was not strict about the 1 arg data input"?

Although I favour external out-of-sample evaluation of some kind (see below) the current
proposal does not preclude a call like this:

fit(algorithm, X, y, Xtest, ytest)

Since there are multiple arguments, a fallback now calls fit(algorithm, (X, y, Xtest, ytest)), and this is the method that needs to be implemented. However, since (X, y, Xtest, ytest) does not implement the MLUtils.jl interface (because X and Xtest have different numbers of observations), you are required to explicitly overload obs(algorithm, (X, y, Xtest, ytest)) to return something that does implement the interface. To do so, return a thinly wrapped version and implement Base.length and Base.getindex to subsample X and y but leave Xtest and ytest alone (or transform them into the actual form you need, such as matrices). You then overload fit to handle this thinly wrapped version, which will see all the data, and you take care of fit(algorithm, (X, y, Xtest, ytest)) by calling this latter implementation.
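
A minimal sketch of the wrapping just described (the wrapper type and the assumption that observations are matrix rows are mine, not from the PR):

```julia
using LearnAPI

struct Booster end   # hypothetical iterative algorithm

# thin wrapper giving (X, y, Xtest, ytest) an MLUtils.jl-style data interface:
# `length`/`getindex` subsample only the training part; the eval data rides along
struct TrainEvalData{TX,Ty,TXt,Tyt}
    X::TX
    y::Ty
    Xtest::TXt
    ytest::Tyt
end
Base.length(d::TrainEvalData) = length(d.y)
Base.getindex(d::TrainEvalData, I) =
    TrainEvalData(d.X[I, :], d.y[I], d.Xtest, d.ytest)  # assumes rows are observations

# `obs` returns the wrapped form ...
LearnAPI.obs(::Booster, data::Tuple{Any,Any,Any,Any}) = TrainEvalData(data...)

# ... `fit` is implemented for the wrapped form, and the 4-tuple form defers to it:
function LearnAPI.fit(algorithm::Booster, d::TrainEvalData; verbosity=1)
    # ... train on (d.X, d.y), monitoring out-of-sample loss on (d.Xtest, d.ytest)
end
LearnAPI.fit(algorithm::Booster, data::Tuple{Any,Any,Any,Any}; kwargs...) =
    LearnAPI.fit(algorithm, LearnAPI.obs(algorithm, data); kwargs...)
```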

I'm not sure if this has any benefit over specifying Xtest and ytest as kwargs, which
is also allowed.

Returning to the idea of externally (but efficiently) monitoring out-of-sample loss, you say:

I think it will require the exposure of a mutating fit! method, and potentially an eval_predictions in addition to training_predictions.

I am keen to ensure this kind of thing can work. Could you please detail what fit! and eval_predictions are intended to do here? I'm assuming that evaluation of the actual loss (and specification of the metric) is the responsibility of the external meta-algorithm. (Note that I have added an update method for increasing the iteration parameter
via a warm restart).

Not sure that I've fully grasped the obs usage story, notably since at first sight it seems like it could be handled through dispatch, with multiple predict methods for various data input types.

Maybe the cross-validation example in the doc page for obs helps.

Device / accelerator: did you have a take on how to specify the device on which to perform computation? My personal experience has been positive with the EvoTrees / NeuroTrees approach where a device kwarg is passed to fit (:cpu / :gpu).

An expert on this kind of thing, @jpsamaroo, suggested dispatching on types provided by ComputationalResources.jl (CPU1, CPUProcesses, CUDALibs, etc.), so MLJ went with that. I guess this future-proofs against different (non-CUDA) GPU types. Take a look here. Instead of device we use acceleration.
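
For concreteness, a minimal sketch of that dispatch-based approach (algorithm and function names are invented; this is not part of LearnAPI):

```julia
using ComputationalResources

# hypothetical algorithm carrying an `acceleration` field, MLJ-style
struct NeuralRegressor{R}
    acceleration::R          # e.g. CPU1(), CPUProcesses(), CUDALibs()
end
NeuralRegressor(; acceleration=CPU1()) = NeuralRegressor(acceleration)

# internal training routines can then dispatch on the resource type:
train!(::CPU1, state) = nothing        # single-process CPU path (stub)
train!(::CUDALibs, state) = nothing    # CUDA GPU path (stub)
```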

I'm going to merge this PR to make the docs more generally available for feedback. I'm not
planning a new release just yet.

codecov bot commented Oct 2, 2024

Codecov Report

Attention: Patch coverage is 39.28571% with 17 lines in your changes missing coverage. Please review.

Project coverage is 62.65%. Comparing base (86ba3d5) to head (e25e4e7).
Report is 29 commits behind head on dev.

Files with missing lines Patch % Lines
src/traits.jl 30.00% 7 Missing ⚠️
src/fit_update.jl 20.00% 4 Missing ⚠️
src/predict_transform.jl 50.00% 3 Missing ⚠️
src/target_weights_features.jl 25.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev      #30      +/-   ##
==========================================
- Coverage   65.81%   62.65%   -3.17%     
==========================================
  Files           9       10       +1     
  Lines         117       83      -34     
==========================================
- Hits           77       52      -25     
+ Misses         40       31       -9     


@ablaom merged commit f2d4df8 into dev Oct 2, 2024
5 of 7 checks passed