API change for the `SyntheticControl` experiment class #460

drbenvincent · 2025-04-21T10:08:28Z

Towards Upgrade synthetic control to model multiple treated units #456
This does not yet enable multiple treated units in synthetic control experiments. But it implements important prep work which will enable it. The changes focus on:
API changes
Bit of a spaghetti situation, but I had to switch from storing dataframes to xarray.DataArrays. This helps with the broadcasting. The model functions get varied input depending on the situation (e.g. experiment), so it was getting complicated. Xarray simplifies broadcasting in functions like PyMCModel.calculate_impact. (A bunch of time was used trying different solutions before it became clear that the xarray approach was the easiest).
Because the API is changing, I elected to get rid of some legacy backward compatibility/ depreciation handling stuff. So when we do the next release, this will include breaking API changes and abandoning backward compatibility with an old API. I'm not fussed about this, we are currently at version 0.x, so people should expect the API to change until we reach 1.x.

Remaining taks:

Check the multi cell geolift notebook is working as expected
Fix failing doctest
~~Resolve error below~~ Using a linear regression model with synthetic control is not a planned use case. If it becomes important then we can deal with that at the time.

Remaining bug, not captured by tests

In the scikit-learn synthetic control notebook, we are getting an error in the second part where we call

result = cp.SyntheticControl(
    df,
    treatment_time,
    control_units=["a", "b", "c", "d", "e", "f", "g"],
    treated_units=["actual"],
    model=LinearRegression(positive=True),
)

There is a broadcasting issue resulting in this:

So I need to think about if doing this (linear model with synthetic control experiment) even make sense. If it is, then I need to add a test to catch this error because it's not covered by tests currently, and fix it.

📚 Documentation preview 📚: https://causalpy--460.org.readthedocs.build/en/460/

review-notebook-app · 2025-04-21T10:08:39Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2025-04-23T13:38:27Z

Codecov Report

Attention: Patch coverage is 97.43590% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.40%. Comparing base (273daa2) to head (ec615bd).
Report is 63 commits behind head on main.

Files with missing lines	Patch %	Lines
causalpy/experiments/synthetic_control.py	95.65%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #460      +/-   ##
==========================================
- Coverage   94.67%   94.40%   -0.27%     
==========================================
  Files          32       29       -3     
  Lines        2196     2075     -121     
==========================================
- Hits         2079     1959     -120     
+ Misses        117      116       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

NathanielF

This was a quick pass but looks broadly sensible. Just trying to understand how extensive you want to make the API change? Asked some questions but mostly clarifying for my own understanding.

I can see how the API change makes it easier to allow for a matrix of weights in the WeightedFitter.

My one questions was whether you wanted to allow multiple "equations" for the different treatment groups i.e. generate different synthetic controls based on a varying set of inputs. Because If so maybe patsy could still work.... but you'd have to more flexible in the PyMC model than in the API ....

causalpy/experiments/diff_in_diff.py

causalpy/experiments/interrupted_time_series.py

causalpy/experiments/synthetic_control.py

causalpy/pymc_models.py

causalpy/experiments/interrupted_time_series.py

drbenvincent · 2025-05-09T10:13:02Z

Thanks for the review.

I can see how the API change makes it easier to allow for a matrix of weights in the WeightedFitter.

My one questions was whether you wanted to allow multiple "equations" for the different treatment groups i.e. generate different synthetic controls based on a varying set of inputs. Because If so maybe patsy could still work.... but you'd have to more flexible in the PyMC model than in the API ....

My original idea was to see if patsy allowed something like treated1, treated 2, treated 3 ~ control1 + ... + controlN, but it doesn't do that.

The rationale for going for a list of treated and a list of control units was to a) allow multiple treated units, b) allow the weights in the pymc model to change from a 1D vector to a 2D array. Crucially, if all treated units are predicted by the same set of control units then this would be a relatively trivial change to the shape of the Dirichlet distributed weights.

I have to admit, I didn't think of your idea of providing a list of formulas. This is interesting:

✅ Retains the same basic formula interface
✅ The user gets extra ability to add domain knowledge into the modelling process. They could have prior knowledge that some control outlets make sense to go into the pool for some treated units but not others. You could just let the data decide, but having that ability (to effectively set some coefficients at zero by not including a control unit as a predictor) to provide domain knowledge into the modelling is appealing.
❌ It would be slightly harder to implement in the WeightedSumFitter class. Probably not that hard though - you could probably just iterate over the treated units, building up the model with the right length 1-dimensional Dirichlet distributions as you go.

NathanielF · 2025-05-09T11:54:35Z

Yes, i think you have the suggestion right. I think it potentially gives you more flexibility. I'm trying to do the same here: pymc-labs/pymc-marketing#1654

But it does kind of move the complexity into the model class as you have to parse and construct the relevant components.

drbenvincent · 2025-05-15T09:10:07Z

Hi @NathanielF. I'm just thinking more about this. I'm not necessarily averse to diving in with the list of formula approach, and there is something to be said for "just deliver the feature" and don't worry if the implementation gets a bit messy. But for whatever reason I seem to be taking the approach of measure twice and cut once before I pull the trigger and implement. The code structure is semi close to elegant (or at least simple) at this point, so I'm just thinking it through.

If we change the API to allow lists of formulas

API

The formula kwarg is now a list of strings (formulas) rather than a single formula string. This allows users control over what control groups are treated as 'available' in the synthetic control donor pool.

`SyntheticControl` class

Still uses patsy to generate $X$ and $y$
Though now we will have pairs of $X, y$ for each treated unit
Because users can provide different numbers and/or identities of control units, we can't just put this into a 3D matrix shape observations, control_units, treatment_units.

`WeightedSumFitter` class

This poses a slight challenge. At first glance you can think to simply iterate over each treated unit, pulling the $X, y$ data for each treated unit.
However, the whole PyMCModel class is build around a scikit-learn style passing around of predictor matrices $X$ and outcomes $y$. So this presents a challenge. Though it might not be that problematic - for example the InstrumentalVariableRegression class has additional inputs to the build_model method, and this seems to be dealt with by simply overriding the fit method. I'm not 100% about this yet, but I think it will likely necessitate overriding more methods of the PyMCModel class.
Potentially the simplest way to do it would be to pass in lists of $X$ and $y$. The method signatures could remain the same, but the methods would have to be specialised (or dispatch) to deal with either ndarray or lists of ndarrays.

If we stick with list of treated and control units

As the PR currently stands we have lists of treated and control units. This assumes that the same set of control units are used in the synthetic control donor pool for all treated units.

❌ Less flexibility for the user
❌ A deviation in the API away from the formula approach
✅ Could allow for $X$ and $y$ to simply gain another dimension (i.e. treated_units) which will hopefully allow for the method signatures to stay the same, not have to override methods of PyMCModel, assuming that xarray can handle all the broadcasting properly.

drbenvincent · 2025-05-15T18:18:52Z

Another potential advantage of having the same set of control units for all target units is that we can have a 2D matrix of weights (the number of controls by the number of treated units). I'm not that experienced with the Dirichlet distribution, but perhaps this allows for partial pooling of weights across treated units. That might be a nice selling point?

drbenvincent · 2025-05-15T18:45:27Z

Just to make sure connections are made - as part of #463 we are talking about potentially swapping patsy for formulaic. As far as I can tell this allows multiple terms on the left hand side of the ~ which could be useful if we go down the formula route here.

drbenvincent · 2025-05-19T18:10:00Z

I'm thinking that we can end up with both API's. Though I propose moving forward with the list of treated and control unit API's just to keep the momentum going, and slightly because the implementation will be easier. Then we can follow up in the future with the list of formulas API. What do you think @NathanielF?

NathanielF · 2025-05-19T20:25:58Z

That makes sense @drbenvincent i definitely didn't want suggest this was blocking. More just food for thought. I agree with the current proposal to move forward

drbenvincent · 2025-05-26T10:02:15Z

In that case @NathanielF I think this is ready for a re-review/approve if you think it's ready. There's nothing obvious I can see that's lacking.

My next PR would be enable the new multiple treated unit modeling which this new API allows :)

NathanielF · 2025-05-26T10:03:59Z

Cool. Will have a look this evening.

drbenvincent · 2025-05-26T11:59:29Z

~~Oh. Might need to double check building the docs locally~~

It was fine. Remote docs being built fine. Locally I had some uncleared cache that was causing a local docs build to break.

NathanielF

Looks great! Congrats.

drbenvincent added 3 commits April 21, 2025 10:16

initial efforts

7bbff4f

remove print statement

7ece785

obs_indx -> obs_ind (see #459)

98127fd

drbenvincent marked this pull request as draft April 21, 2025 10:08

drbenvincent added 25 commits April 21, 2025 11:10

update API in tests for SyntheticControl class

82d041d

Merge branch 'main' into sc-api-change

3bbabee

tidy up + fixes

1f4d17e

get deprecation tests working again

182aac0

bug fixes

876c154

fix bug with SyntheticControl.get_plot_data_bayesian

3d29fef

use new API in scikit-learn integration test

3eb24a9

update the pymc synthetic control notebooks

bd9beaa

remove test_api_stability

ed62f0a

fix bugs

a148ec3

bug fixing

15454b2

remove api backward compatibility and deprecation tests

b77c3f0

more deprecation removal

c89a147

add additional asserts to integration tests to detect shape problems

fad78d8

remove asserts which weren't doing the job I intended

2137091

start embracing xarray to handle broadcasting

45f1b1a

formatting

2726484

store data in xarray objects in more experiments

b920207

attempt to make LinearRegression doctest pass

a28c5da

revert a change which seems no longer required

642f651

fix some failing tests

a17f5c9

all tests now passing 🎉 (one failing doctest)

9941c02

update api calls in multi cell geolift notebook

6db026e

undo plot colour change that I made when debugging

b49ed7e

final doctest now passes 😍

1f753e9

drbenvincent added 3 commits May 7, 2025 09:19

update notebook on scikit-learn synthetic control example

d855557

rerun synthetic control notebook

c92c117

fix minor issue in multi-cell geolift notebook

2fcb3ba

drbenvincent marked this pull request as ready for review May 7, 2025 08:58

drbenvincent requested a review from NathanielF May 7, 2025 19:29

NathanielF reviewed May 7, 2025

View reviewed changes

update geolift notebook to use new API

e60d924

drbenvincent requested a review from NathanielF May 26, 2025 10:02

update UML diagrams

ec615bd

NathanielF approved these changes May 30, 2025

View reviewed changes

drbenvincent merged commit bb7c2bb into main May 31, 2025
9 checks passed

drbenvincent deleted the sc-api-change branch May 31, 2025 05:39

API change for the SyntheticControl experiment class #460

API change for the SyntheticControl experiment class #460

Uh oh!

Conversation

drbenvincent commented Apr 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Remaining taks:

Remaining bug, not captured by tests

Uh oh!

review-notebook-app bot commented Apr 21, 2025

Uh oh!

codecov bot commented Apr 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

NathanielF left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

drbenvincent commented May 9, 2025

Uh oh!

NathanielF commented May 9, 2025

Uh oh!

drbenvincent commented May 15, 2025

If we change the API to allow lists of formulas

API

SyntheticControl class

WeightedSumFitter class

If we stick with list of treated and control units

Uh oh!

drbenvincent commented May 15, 2025

Uh oh!

drbenvincent commented May 15, 2025

Uh oh!

drbenvincent commented May 19, 2025

Uh oh!

NathanielF commented May 19, 2025

Uh oh!

drbenvincent commented May 26, 2025

Uh oh!

NathanielF commented May 26, 2025

Uh oh!

drbenvincent commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NathanielF left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

API change for the `SyntheticControl` experiment class #460

API change for the `SyntheticControl` experiment class #460

drbenvincent commented Apr 21, 2025 •

edited

Loading

codecov bot commented Apr 23, 2025 •

edited

Loading

`SyntheticControl` class

`WeightedSumFitter` class

drbenvincent commented May 26, 2025 •

edited

Loading