Change Log, Checkpoints and Run-Locks #166


Draft · wants to merge 66 commits into main

Conversation

meksor
Contributor

@meksor meksor commented Mar 4, 2025

This PR is meant to contain the three major parts of what we internally often call the "versioning" feature.

Change Log

The change log is an in-database representation of all transactions run on a given ixmp4 platform.
This requires the generation of "version tables" from database models (by adding columns and removing constraints) and a central "transaction" table tracking all transactions committed, referenced by each version table.
This part of the feature is built around database-level concepts, so it might actually cover more functionality than needed (e.g. there is no way to update a region via the API, yet the system still tracks update statements on the region table).
Currently this is handled by sqlalchemy-history, as it has support for alembic migrations; it can be replaced with sqlalchemy-continuum, which would enable database-native versioning for postgres via postgres triggers.
Enabling and disabling these features on a per-platform basis (to avoid performance problems with sqlite) is still a little unclear to me and will need some more time investment.
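For concreteness, marking a model as versioned with SQLAlchemy-Continuum looks roughly like this. This is a minimal sketch, not the actual ixmp4 model code, and the generated table names may differ from the `xy_versions`/`transactions` examples below:

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy_continuum import make_versioned

make_versioned(user_cls=None)  # must be called before the models are defined

Base = declarative_base()

class Region(Base):
    __tablename__ = "region"
    __versioned__ = {}  # opt this model in to version tracking
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String, nullable=False, unique=True)

# Continuum generates the version class/table and the transaction table
# when the mappers are configured.
sa.orm.configure_mappers()
```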

The tricky parts:

  1. This change entails an inevitable performance regression. While it will not change scaling characteristics (think big O), it will significantly worsen all linear operations. What was once a single query statement might become three or more (sequential!) statements:

```sql
INSERT INTO xy (a, b, c) VALUES (...);

-- becomes:

INSERT INTO xy (a, b, c) VALUES (...);
INSERT INTO transactions (issued_at) VALUES (CURRENT_TIMESTAMP);
INSERT INTO xy_versions (a, b, c, transaction__id, operation_type) VALUES (...);
```

Most of the overhead can be alleviated by using database-native table triggers, but these are hard to write and maintain, so I would suggest relying on a library for that as well.
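For illustration, a library-level implementation (as opposed to triggers) amounts to a session hook roughly like the following. This is a sketch only; `Transaction` and `make_version_row` are hypothetical stand-ins, not the actual sqlalchemy-history API:

```python
from datetime import datetime, timezone

from sqlalchemy import event
from sqlalchemy.orm import Session

@event.listens_for(Session, "before_flush")
def record_versions(session, flush_context, instances):
    # One transaction row per flush; every version row references its id.
    transaction = Transaction(issued_at=datetime.now(timezone.utc))  # hypothetical model
    session.add(transaction)
    for obj in list(session.new):
        # Mirror each pending row into its version table
        # (operation_type 0 meaning INSERT here).
        session.add(make_version_row(obj, transaction, operation_type=0))  # hypothetical helper
```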

  2. What to version & handling relationships

Versioning of the data points table puts me in front of a choice. We can either:
a) prevent deletes entirely for all regions, variables and units, hiding the unused ones when SELECTing, or
b) version every single table in the database that is subject to a relation and query the version variant of the related table when rolling back.

I went with option b); see the sketch below.
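A sketch of what option b) implies when rolling back: related objects are resolved against their version table as of a given transaction. `RegionVersion` is a hypothetical version model; the column semantics match the version table shown below:

```python
import sqlalchemy as sa

def region_at(session, region_id: int, transaction_id: int):
    """Return the version of a region as it existed at `transaction_id`."""
    return (
        session.query(RegionVersion)
        .filter(
            RegionVersion.id == region_id,
            # Created at or before the target transaction ...
            RegionVersion.transaction_id <= transaction_id,
            # ... and not yet superseded by a later version at that point.
            sa.or_(
                RegionVersion.end_transaction_id.is_(None),
                RegionVersion.end_transaction_id > transaction_id,
            ),
        )
        .one()
    )
```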

Run Checkpoints

To be able to distinguish different workflow steps in a sea of "transaction" records (one ixmp4 API call might create multiple records), we will need to add another table to ixmp4 which references run__id and transaction__id and stores a user-supplied message.

We now have *four* relevant tables.

The original table xy:

| a | b | c |
|---|---|---|
| A | B | C |

The version table xy_version:

| a | b | c | transaction_id | end_transaction_id | operation_type |
|---|---|---|----------------|--------------------|----------------|
| A | B | C | 1 | NULL | 0 |

The transaction table transactions:

| id | issued_at |
|----|-----------|
| 1 | 13:37 27.08.1999 |

The new checkpoints table (this may also include a timestamp and user information):

| message | transaction_id | run_id |
|---------|----------------|--------|
| "Hello World" | 1 | 1 |

The last table (checkpoints) can be used to query all workflow steps for a specific run and to revert to said steps by looking at which transaction record they point to and replaying or rolling back the change history.
This would probably look kind of like this:

```python
run = ...

with run.acquire_lock():
    run.checkpoints.create("Create run")
    run.iamc.add(my_data)
    run.checkpoints.create("Add data")

run.checkpoints.tabulate()
# msg          | transaction_id
# "Create run" | 3
# "Add data"   | 5

run.checkpoints.list()[0].revert()

run.iamc.tabulate()
# empty ...
```

Run Locks

To make sure that a user does not include unintended changes by a different user, a locking mechanism is needed for each "run" object/row. A user shall be able to acquire a lock, preventing all other users from writing to objects related to the run.
Still unclear to me: if a user tries to acquire the lock on a locked run, should the operation:
a) block until the other user releases the lock, or
b) fail immediately?

Also, I remember the uploads crashing and not releasing the lock being a problem in the past. How do we want to handle this in the future?
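Regardless of the answer to a) vs b), one way to keep a crash from leaving a run permanently locked is to only hand out locks through a context manager that releases in a finally block. A minimal sketch; `backend.runs.lock`/`unlock` are placeholders for whatever actually owns the lock:

```python
from contextlib import contextmanager

@contextmanager
def acquire_lock(run):
    backend.runs.lock(run.id)  # fail-fast variant b): raises if already locked
    try:
        yield run
    finally:
        # Executed even when the block raises, so the lock is always released.
        backend.runs.unlock(run.id)
```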


codecov bot commented Mar 20, 2025

Codecov Report

Attention: Patch coverage is 92.75591% with 46 lines in your changes missing coverage. Please review.

Project coverage is 87.4%. Comparing base (c7ea1f3) to head (f6f112e).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ixmp4/data/db/base.py | 92.5% | 10 Missing ⚠️ |
| ixmp4/data/abstract/checkpoint.py | 78.5% | 6 Missing ⚠️ |
| ixmp4/data/api/checkpoint.py | 80.6% | 6 Missing ⚠️ |
| ixmp4/data/db/checkpoint.py | 87.5% | 6 Missing ⚠️ |
| ixmp4/data/abstract/run.py | 80.0% | 3 Missing ⚠️ |
| ixmp4/data/db/iamc/datapoint/repository.py | 85.7% | 3 Missing ⚠️ |
| ixmp4/data/db/run/repository.py | 96.0% | 3 Missing ⚠️ |
| ixmp4/data/db/meta/repository.py | 87.5% | 2 Missing ⚠️ |
| ixmp4/core/run.py | 97.6% | 1 Missing ⚠️ |
| ixmp4/data/api/run.py | 94.7% | 1 Missing ⚠️ |

... and 5 more
Additional details and impacted files
```
@@           Coverage Diff           @@
##            main    #166     +/-   ##
=======================================
+ Coverage   87.0%   87.4%   +0.4%     
=======================================
  Files        230     235      +5     
  Lines       8170    8709    +539     
=======================================
+ Hits        7112    7617    +505     
- Misses      1058    1092     +34     
```
| Files with missing lines | Coverage Δ |
|---|---|
| ixmp4/core/checkpoints.py | 100.0% <100.0%> (ø) |
| ixmp4/core/exceptions.py | 95.6% <100.0%> (+0.2%) ⬆️ |
| ixmp4/core/iamc/data.py | 91.6% <100.0%> (+0.1%) ⬆️ |
| ixmp4/data/abstract/__init__.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/annotations.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/base.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/iamc/datapoint.py | 92.5% <ø> (ø) |
| ixmp4/data/abstract/iamc/measurand.py | 96.0% <ø> (ø) |
| ixmp4/data/abstract/iamc/timeseries.py | 71.0% <ø> (ø) |
| ixmp4/data/abstract/iamc/variable.py | 83.3% <ø> (ø) |

... and 64 more

@meksor
Contributor Author

meksor commented Mar 20, 2025

> The PR is adding tables with a column 'transaction_id' and also the name (in the example code snippet) 'checkpoint', which I sense are the same thing.

No! Checkpoints connect a particular run (-id) to a transaction; they serve as points to roll back to. But I think this might be irrelevant for the message integration anyway.

> I don't know if that makes it possible to have a different implementation, perhaps one with less performance impact (time, or storage, e.g. by not storing more than ~2 transaction_id associated with a given run_id) or maintenance overhead, but providing that info in case.

The revert logic is always the same: select the relevant objects from the version table at the particular transaction, then delete all current data and insert the old data. For the checkpoint we get run__id and transaction__id from the checkpoint itself; for the run.transact() feature we just have to remember the transaction at which the run was locked (currently done in a new column in the run table).
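In pseudocode, that revert is roughly the following; model, column and helper names are illustrative, not the actual repository code:

```python
import sqlalchemy as sa

def revert_table(session, model, version_model, transaction_id: int):
    # Select the relevant objects from the version table at the transaction ...
    old_rows = (
        session.query(version_model)
        .filter(
            version_model.transaction_id <= transaction_id,
            sa.or_(
                version_model.end_transaction_id.is_(None),
                version_model.end_transaction_id > transaction_id,
            ),
            version_model.operation_type != 2,  # assuming 2 marks DELETE operations
        )
        .all()
    )
    # ... then delete all current data and insert the old data.
    session.query(model).delete()
    session.add_all(model(**as_column_dict(row)) for row in old_rows)  # as_column_dict is hypothetical
```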

> Similar but distinct, we did have frequent issues where users would call ixmp.TimeSeries.check_out() (usually as message_ix.Scenario.check_out(), which is identical). Then their code would error or crash. Because they never explicitly called TimeSeries.commit() or TimeSeries.discard(), the database would record that the object (~ixmp4 Run) was "checked out to" them, and not allow any user (them, or another) to change it.

> In other words, if only "atomic transactions" via with run.acquire_lock() are provided, that would be livable. We can add noisy warnings to ixmp.TimeSeries.check_out() during the shim/migration period that say check_out() will not ever be supported with IXMP4Backend and encourage users to switch to using acquire_lock/transact. The 'check_out'/…/'discard' workflow was I guess designed with an eye to interactive use, e.g. in IPython/Jupyter, but with .transact() we can now teach users "best practices" without the code being more verbose.

Ok good, I think this approach is preferable for me as well.

Example cases: 1 and 2 seem doable with the same logic, but that logic is out of scope for this PR.
I think, theoretically, the chance of the race condition you describe is there, but it is very small.

3.: Okay, the timeout needs to be implemented; I will need to look into that.

> It would be good to consider whether words like 'transact(ion)', 'commit', 'checkpoint', etc. have existing usage and meanings (a) in ixmp, message_ix etc. or (b) in likely users' minds ('commit' e.g. because many people may also be Git users; fewer are RDBMS users); and consider whether the meaning in ixmp4/message_ix is different. This can help us avoid inviting potential confusion, as that creates work to steer users clear or help correct their misunderstanding.

I agree; unfortunately all the names I can think of are taken. PSQL even has "checkpoints" as well, so in fact all these terms ('transact(ion)', 'commit', 'checkpoint') have (sometimes multiple) other meanings.

@meksor
Contributor Author

meksor commented Mar 24, 2025

Hi, the first version of this feature is available on this branch.
There are still a few open issues, which I might try to solve on auxiliary branches and merge into this one.

  • Exchange sqlalchemy-history for sqlalchemy-continuum in order to use its SQL trigger generation functionality. This will (hopefully, presumably) get rid of some of the manual version object generation code (e.g. in bulk_insert_versions) and speed up this PR so the regression is acceptable.
  • Disable versioning for specific platforms, as the SQL trigger generation will only work for postgres. This is tricky, since I assume I need to find a way to run the sqlalchemy mapper in a "reentrant" way. It also means throwing and testing a new exception.
  • Implement versioning and revert logic for optimization data, which is best done by fridolin or at least with fridolin's help.

Further, I need clarification on some issues from @danielhuppmann and @khaeru. Details follow.

Regions and Units

As of now, regions and units are tracked as versioned objects, but not rolled back with a run. This means the following code will throw a RegionNotFound error.

```python
region1 = platform.regions.create("Region 1")
platform.regions.create("Region 2")

run = platform.runs.create(model, scenario)
data_with_region1 = pd.DataFrame(...)  # has IAMC data that references "Region 1"
data_with_region2 = pd.DataFrame(...)  # has IAMC data that references "Region 2"

with run.transact():
    run.iamc.add(data_with_region1)
    run.checkpoints.create("Add data with region 1")

    run.iamc.add(data_with_region2)
    run.checkpoints.create("Add data with region 2")

    run.iamc.remove(data_with_region1)
    run.checkpoints.create("Remove data with region 1")

platform.regions.delete(region1)

with run.transact():
    # Revert to "Add data with region 1"
    run.checkpoints[1].revert()  # Will throw an error!!
```

This seems good to me, because if incorrect regions were deleted, we do not want to implicitly create them again when rolling back a run.
What do you think?

Automatic Rollback

Right now, every time a run exits the transact() context manager, it rolls back to the last checkpoint created. This means that if the user forgets to create a checkpoint after adding data, the run will be empty.
This also adds quite a bit of overhead, especially via the web API.

```python
with run.transact():
    run.iamc.add(test_data_1)
    run.checkpoints.create("Add test data 1")
    run.iamc.add(test_data_2)

# The run will now contain only test_data_1.
assert run.iamc.tabulate().equals(test_data_1)
```

Is that right or should the data stay?

Client-Side Locking Mechanism

The run lock is now implemented in the facade layer using the two backend methods backend.run.lock and backend.run.unlock. The client has to keep track of which locks it owns, as the methods behave as follows:

backend.run.lock(run_id) will check the specified run's lock_transaction column. If the column contains NULL, it will be updated to contain the id of the last transaction in the transactions table.
If the column already contains a transaction.id, the Run.IsLocked exception is thrown.

backend.run.unlock(run_id) will always set the lock_transaction column to NULL, regardless of the current value.
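In other words, roughly (a sketch of the described semantics, not the actual repository code; `get_by_id` and `latest_transaction_id` are hypothetical helpers):

```python
def lock(self, run_id: int) -> Run:
    run = self.get_by_id(run_id)  # hypothetical accessor
    if run.lock_transaction is not None:
        raise Run.IsLocked(run_id)
    # Record the id of the newest row in the transactions table.
    run.lock_transaction = self.latest_transaction_id()  # hypothetical helper
    return run

def unlock(self, run_id: int) -> Run:
    run = self.get_by_id(run_id)
    run.lock_transaction = None  # unconditional release
    return run
```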

Hence, the transact() context manager tries to acquire the lock via backend.run.lock(run_id) and tracks that the operation succeeded in the owns_lock property of the corresponding run facade model.

```python
@contextmanager
def transact(self) -> Generator[None, None, None]:
    # Acquire the lock; raises Run.IsLocked if another client already holds it.
    self._model = self.backend.runs.lock(self._model.id)
    self.owns_lock = True

    yield

    # Revert to the latest checkpoint created inside the block, or to the
    # state at lock acquisition if no checkpoint was created at all.
    checkpoint_df = self.checkpoints.tabulate()
    checkpoint_transaction = int(checkpoint_df["transaction__id"].max())

    assert self._model.lock_transaction is not None

    if checkpoint_transaction > self._model.lock_transaction:
        self.backend.runs.revert(self._model.id, checkpoint_transaction)
    else:
        self.backend.runs.revert(self._model.id, self._model.lock_transaction)

    # Release the lock unconditionally.
    self._model = self.backend.runs.unlock(self._model.id)
    self.owns_lock = False
```

A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

LockRequired Exception

Since the client now keeps track of which locks it owns, we can throw exceptions in the facade layer when an operation is attempted that requires a lock.
Currently this applies to:

  • run.iamc.add()
  • run.iamc.remove()
  • run.checkpoints.create()

A list of operations in the optimization part will have to be worked out as well; can you think of anything else? Did I understand the assignment correctly?
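Conceptually, the facade-layer check is just a guard like this (a sketch; the exact exception signature may differ):

```python
def ensure_lock(run) -> None:
    # owns_lock is tracked client-side when the lock is acquired via transact().
    if not run.owns_lock:
        raise Run.LockRequired(run.id)

# Called at the top of run.iamc.add(), run.iamc.remove() and run.checkpoints.create().
```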

Run Meta Indicators/Data

These are not covered by the rollback logic or the LockRequired exception; should they be?

I'm sure as soon as I post this comment I will think of something else, but for now: please advise!

@khaeru
Member

khaeru commented Mar 24, 2025

> Regions and Units
>
> As of now regions and units are tracked as versioned objects, but not rolled back with a run.
> […]
> What do you think?

As mentioned, message_ix/current ixmp test suites and usage don't need this, so no opinion.

> Automatic Rollback
> […]
> Is that right or should the data stay?

Stay? This is what I meant above about "following semantics of the standard library". See How to use the connection context manager in the docs for the sqlite3 module. This is the behaviour that the current ixmp.TimeSeries.transact() is trying to emulate—see the example there in the docs.
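For reference, a minimal illustration of that behaviour:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE xy (a, b, c)")

# The context manager commits on a clean exit and rolls back if the
# block raises; either way the connection itself stays open.
with con:
    con.execute("INSERT INTO xy VALUES (1, 2, 3)")
```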

> Client-Side Locking Mechanism
> […]
> A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

Yes, that's fine. If something gets inadvertently locked by another user (and it shouldn't if we're encouraging users to use the context manager), it's already a great improvement that backend.run.unlock() will be accessible, instead of having to do something like this.

> LockRequired Exception
> […]
> A list of operations in the optimization part will have to be worked out as well; can you think of anything else? Did I understand the assignment correctly?

This is more for @glatterf42, I think. Just to clarify, one point above was that creating a Run instance at all should be safe against collisions. That is to say, rather than locking an existing Run while certain things are done to it, two simultaneous attempts to create new (not-yet-existing) Runs that want the same or similar identifiers should be safe.

If that's already achieved some other way than the locking mechanism you've written, e.g. by lower-level transaction control in the DB layer, then we're fine.

> Run Meta Indicators/Data
> […]
> Are not covered by the rollback logic or the LockRequired exception, should this be the case?

I'm not aware of people using these through ixmp/message_ix, so no opinion.

@danielhuppmann
Member

Thanks @meksor, this is very nice!

> Regions and Units
>
> As of now regions and units are tracked as versioned objects, but not rolled back with a run.
> […]
> What do you think?

This makes sense.

> Automatic Rollback
> […]
> Is that right or should the data stay?

Would it make sense that the transact()-context ends with setting a checkpoint, and the checkpoint-description is part of the transact-arguments?

> Client-Side Locking Mechanism
> […]
> A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

Agree with @khaeru, that's ok for our use case.

> LockRequired Exception and Run Meta Indicators/Data
> […]
> Are not covered by the rollback logic or the LockRequired exception, should this be the case?

Please add creation and deletion of meta indicators to the operations that are change-logged and require a lock.

Adding the optimization-items to the change-log-and-locks-logic can be left for a follow-up PR (led by @glatterf42), in my opinion.

@meksor
Contributor Author

meksor commented Mar 31, 2025

> > Automatic Rollback
> > […]
> > Is that right or should the data stay?
>
> Would it make sense that the transact()-context ends with setting a checkpoint, and the checkpoint-description is part of the transact-arguments?

Yes, sounds good. The rollback now only happens upon an exception.
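So the context manager now behaves roughly like this (a sketch of the combined behaviour; the actual signature and rollback target may differ):

```python
from contextlib import contextmanager
from typing import Generator

@contextmanager
def transact(self, message: str) -> Generator[None, None, None]:
    self._model = self.backend.runs.lock(self._model.id)
    self.owns_lock = True
    try:
        yield
    except Exception:
        # Roll back only on failure, to the state at lock acquisition.
        self.backend.runs.revert(self._model.id, self._model.lock_transaction)
        raise
    else:
        # A clean exit ends the block by recording a checkpoint.
        self.checkpoints.create(message)
    finally:
        self._model = self.backend.runs.unlock(self._model.id)
        self.owns_lock = False
```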

> > LockRequired Exception and Run Meta Indicators/Data
> > […]
> > Are not covered by the rollback logic or the LockRequired exception, should this be the case?
>
> Please add creation and deletion of meta indicators to the operations that are change-logged and require a lock.

Understood.

@meksor
Contributor Author

meksor commented Apr 9, 2025

Ok, a few updates on the points from above:

> • Exchange sqlalchemy-history for sqlalchemy-continuum in order to use its SQL trigger generation functionality. This will (hopefully, presumably) get rid of some of the manual version object generation code (e.g. in bulk_insert_versions) and speed up this PR so the regression is acceptable.

I did this and it worked like a drop-in replacement; no changes besides adjusting the imports were necessary.
However, this does not really matter, because I could not get the next objective to work:

> • Disable versioning for specific platforms, as the SQL trigger generation will only work for postgres. This is tricky, since I assume I need to find a way to run the sqlalchemy mapper in a "reentrant" way. It also means throwing and testing a new exception.

Unfortunately, the libraries I used here are not made for this, and enabling/disabling versioning is only possible globally with our current code-base. This also means that I cannot use the "native versioning" feature from sqla_continuum, because the trigger setup will fail every time on sqlite databases.

Making all of this work will take a lot more effort and might mean that I have to implement large parts of the library myself (in which case I will try to get rid of the dependency entirely).
More specifically: I need to find a way to run the sqlalchemy mapper when a backend/platform is created, as opposed to how it's done currently (when the classes are imported). This means rebuilding large parts of the Model/Repository code in the data layer. It's probably wise to consider a bigger refactor if I'm getting my hands dirty with something like that already.

What does this mean for this PR?
I suggest finishing up here by making sure the tests cover the desired use cases and living with the performance regression (pretty much exactly 2x on my setup). The big thing missing here is the meta indicators, afaics.
This allows fridolin to continue with his part in the optimization modules.
Next, I'll try to focus on a few bug fixes and the scse-toolkit integration, since that has been on my backlog for ages. I'll have to make do without the test-container registry (which would be nice for this change), as ICT told us that they are "going to ignore all my tickets for 2 months". Hopefully that also fixes any issues that @danielhuppmann has encountered in the meantime.

Then, I will try to reduce the amount of code in this project while keeping the versioning performance improvement in mind. Right now we have some complicated inheritance hierarchies, a lot of boilerplate code and not a lot of internal documentation, all of which can be improved. A big improvement here would be to combine the "abstract" and "db" layers (since the interface is pretty much completely defined now) and to generate the API server and client code with less copy-pasta and boilerplate-o-rama. Getting rid of the filter classes so we can use the TypedDicts introduced in the mypy PR as the sole source of truth for the possible arguments is also on my list.

After that, I'll look at reworking this feature; at that point it should be a lot more straightforward (I might eat my words later, but I'll stay optimistic for now).
This is a few months in the future. OK @danielhuppmann?

Side note: For some reason one single CI test across all the jobs is failing:
py3.10 | backend=sqlite,rest-sqlite | with-pyarrow=true | pgsql=16 | pandas=2.1.0
The test in question is "test_mp_tabulate_big[rest-sqlite]". I've been trying to reproduce this locally for a few days now with no luck (I installed the relevant deps and python versions, and tried gh act pull_request; locally the test always passes). Has anyone ever encountered something like this before? @glatterf42 @pmussak @phackstock

@meksor
Contributor Author

meksor commented Apr 10, 2025

Revert logic and LockRequired exceptions are implemented for the meta indicators now, FYI.

@pmussak
Contributor

pmussak commented Apr 10, 2025

> Side note: For some reason one single CI test across all the jobs is failing: py3.10 | backend=sqlite,rest-sqlite | with-pyarrow=true | pgsql=16 | pandas=2.1.0. The test in question is "test_mp_tabulate_big[rest-sqlite]". I've been trying to reproduce this locally for a few days now with no luck (I installed the relevant deps and python versions, and tried gh act pull_request; locally the test always passes). Has anyone ever encountered something like this before? @glatterf42 @pmussak @phackstock

I have not encountered this yet, but I looked into it.

@pmussak pmussak closed this Apr 10, 2025
@meksor meksor reopened this Apr 10, 2025
@pmussak
Contributor

pmussak commented Apr 10, 2025

Sorry, I accidentally closed this PR.
