Skip to content

API change for the SyntheticControl experiment class #460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 31 commits into
base: main
Choose a base branch
from

Conversation

drbenvincent
Copy link
Collaborator

@drbenvincent drbenvincent commented Apr 21, 2025

  • Towards Upgrade synthetic control to model multiple treated units #456
  • This does not yet enable multiple treated units in synthetic control experiments. But it implements important prep work which will enable it. The changes focus on:
  • API changes
  • Bit of a spaghetti situation, but I had to switch from storing dataframes to xarray.DataArrays. This helps with the broadcasting. The model functions get varied input depending on the situation (e.g. experiment), so it was getting complicated. Xarray simplifies broadcasting in functions like PyMCModel.calculate_impact. (A bunch of time was used trying different solutions before it became clear that the xarray approach was the easiest).
  • Because the API is changing, I elected to get rid of some legacy backward compatibility/ depreciation handling stuff. So when we do the next release, this will include breaking API changes and abandoning backward compatibility with an old API. I'm not fussed about this, we are currently at version 0.x, so people should expect the API to change until we reach 1.x.

Remaining taks:

  • Check the multi cell geolift notebook is working as expected
  • Fix failing doctest
  • Resolve error below Using a linear regression model with synthetic control is not a planned use case. If it becomes important then we can deal with that at the time.

Remaining bug, not captured by tests

In the scikit-learn synthetic control notebook, we are getting an error in the second part where we call

result = cp.SyntheticControl(
    df,
    treatment_time,
    control_units=["a", "b", "c", "d", "e", "f", "g"],
    treated_units=["actual"],
    model=LinearRegression(positive=True),
)

There is a broadcasting issue resulting in this:
Screenshot 2025-04-23 at 10 34 27

So I need to think about if doing this (linear model with synthetic control experiment) even make sense. If it is, then I need to add a test to catch this error because it's not covered by tests currently, and fix it.


📚 Documentation preview 📚: https://causalpy--460.org.readthedocs.build/en/460/

@drbenvincent drbenvincent marked this pull request as draft April 21, 2025 10:08
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link

codecov bot commented Apr 23, 2025

Codecov Report

Attention: Patch coverage is 97.43590% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.40%. Comparing base (273daa2) to head (2fcb3ba).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
causalpy/experiments/synthetic_control.py 95.65% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #460      +/-   ##
==========================================
- Coverage   94.67%   94.40%   -0.27%     
==========================================
  Files          32       29       -3     
  Lines        2196     2075     -121     
==========================================
- Hits         2079     1959     -120     
+ Misses        117      116       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@drbenvincent drbenvincent marked this pull request as ready for review May 7, 2025 08:58
@drbenvincent drbenvincent requested a review from NathanielF May 7, 2025 19:29
Copy link
Contributor

@NathanielF NathanielF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a quick pass but looks broadly sensible. Just trying to understand how extensive you want to make the API change? Asked some questions but mostly clarifying for my own understanding.

I can see how the API change makes it easier to allow for a matrix of weights in the WeightedFitter.

My one questions was whether you wanted to allow multiple "equations" for the different treatment groups i.e. generate different synthetic controls based on a varying set of inputs. Because If so maybe patsy could still work.... but you'd have to more flexible in the PyMC model than in the API ....

# turn into xarray.DataArray's
self.X = xr.DataArray(
self.X,
dims=["obs_ind", "coeffs"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dims of X as coeffs? Should surely be covariates... coeffs is the output, and while they should be 1:1 naming is a little misleading no?

self.model.fit(X=self.pre_X, y=self.pre_y, coords=COORDS)
COORDS = {
# key must stay as "coeffs" unless we can find a way to auto identify
# the predictor dimension name. "coeffs" is assumed by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok fair enough. I guess easier to be consistent across code base

:param control_units:
A list of control units to be used in the experiment
:param treated_units:
A list of treated units to be used in the experiment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sorry. i forgot this was synthetic control. So treated as target makes sense. The idea would be to enable multiple treatment groups? Do you want to enable different "equations" for the different treatment units? Or would it be just the same fixed controls for each potential treatment unit?

@drbenvincent
Copy link
Collaborator Author

Thanks for the review.

I can see how the API change makes it easier to allow for a matrix of weights in the WeightedFitter.

My one questions was whether you wanted to allow multiple "equations" for the different treatment groups i.e. generate different synthetic controls based on a varying set of inputs. Because If so maybe patsy could still work.... but you'd have to more flexible in the PyMC model than in the API ....

My original idea was to see if patsy allowed something like treated1, treated 2, treated 3 ~ control1 + ... + controlN, but it doesn't do that.

The rationale for going for a list of treated and a list of control units was to a) allow multiple treated units, b) allow the weights in the pymc model to change from a 1D vector to a 2D array. Crucially, if all treated units are predicted by the same set of control units then this would be a relatively trivial change to the shape of the Dirichlet distributed weights.

I have to admit, I didn't think of your idea of providing a list of formulas. This is interesting:

  • ✅ Retains the same basic formula interface
  • ✅ The user gets extra ability to add domain knowledge into the modelling process. They could have prior knowledge that some control outlets make sense to go into the pool for some treated units but not others. You could just let the data decide, but having that ability (to effectively set some coefficients at zero by not including a control unit as a predictor) to provide domain knowledge into the modelling is appealing.
  • ❌ It would be slightly harder to implement in the WeightedSumFitter class. Probably not that hard though - you could probably just iterate over the treated units, building up the model with the right length 1-dimensional Dirichlet distributions as you go.

@NathanielF
Copy link
Contributor

Yes, i think you have the suggestion right. I think it potentially gives you more flexibility. I'm trying to do the same here: pymc-labs/pymc-marketing#1654

But it does kind of move the complexity into the model class as you have to parse and construct the relevant components.

@drbenvincent
Copy link
Collaborator Author

Hi @NathanielF. I'm just thinking more about this. I'm not necessarily averse to diving in with the list of formula approach, and there is something to be said for "just deliver the feature" and don't worry if the implementation gets a bit messy. But for whatever reason I seem to be taking the approach of measure twice and cut once before I pull the trigger and implement. The code structure is semi close to elegant (or at least simple) at this point, so I'm just thinking it through.

If we change the API to allow lists of formulas

API

The formula kwarg is now a list of strings (formulas) rather than a single formula string. This allows users control over what control groups are treated as 'available' in the synthetic control donor pool.

SyntheticControl class

  • Still uses patsy to generate $X$ and $y$
  • Though now we will have pairs of $X, y$ for each treated unit
  • Because users can provide different numbers and/or identities of control units, we can't just put this into a 3D matrix shape observations, control_units, treatment_units.

WeightedSumFitter class

  • This poses a slight challenge. At first glance you can think to simply iterate over each treated unit, pulling the $X, y$ data for each treated unit.
  • However, the whole PyMCModel class is build around a scikit-learn style passing around of predictor matrices $X$ and outcomes $y$. So this presents a challenge. Though it might not be that problematic - for example the InstrumentalVariableRegression class has additional inputs to the build_model method, and this seems to be dealt with by simply overriding the fit method. I'm not 100% about this yet, but I think it will likely necessitate overriding more methods of the PyMCModel class.
  • Potentially the simplest way to do it would be to pass in lists of $X$ and $y$. The method signatures could remain the same, but the methods would have to be specialised (or dispatch) to deal with either ndarray or lists of ndarrays.

If we stick with list of treated and control units

As the PR currently stands we have lists of treated and control units. This assumes that the same set of control units are used in the synthetic control donor pool for all treated units.

❌ Less flexibility for the user
❌ A deviation in the API away from the formula approach
✅ Could allow for $X$ and $y$ to simply gain another dimension (i.e. treated_units) which will hopefully allow for the method signatures to stay the same, not have to override methods of PyMCModel, assuming that xarray can handle all the broadcasting properly.

@drbenvincent
Copy link
Collaborator Author

Another potential advantage of having the same set of control units for all target units is that we can have a 2D matrix of weights (the number of controls by the number of treated units). I'm not that experienced with the Dirichlet distribution, but perhaps this allows for partial pooling of weights across treated units. That might be a nice selling point?

@drbenvincent
Copy link
Collaborator Author

Just to make sure connections are made - as part of #463 we are talking about potentially swapping patsy for formulaic. As far as I can tell this allows multiple terms on the left hand side of the ~ which could be useful if we go down the formula route here.

@drbenvincent
Copy link
Collaborator Author

I'm thinking that we can end up with both API's. Though I propose moving forward with the list of treated and control unit API's just to keep the momentum going, and slightly because the implementation will be easier. Then we can follow up in the future with the list of formulas API. What do you think @NathanielF?

@NathanielF
Copy link
Contributor

That makes sense @drbenvincent i definitely didn't want suggest this was blocking. More just food for thought. I agree with the current proposal to move forward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants