Change Log, Checkpoints and Run-Locks #166


Draft · wants to merge 66 commits into main

Conversation

meksor
Contributor

@meksor meksor commented Mar 4, 2025

This PR is meant to contain the three major parts of what we internally often call the "versioning" feature.

Change Log

The change log is an in-database representation of all transactions run on a given ixmp4 platform.
This requires the generation of "version tables" from database models (by adding columns and removing constraints) and a central "transaction" table tracking all transactions committed, referenced by each version table.
This part of the feature is built around database-level concepts, so it might actually cover more functionality than needed (e.g. there is no way to update a region via the API, yet the system still tracks update statements on the region table).
Currently this is handled by sqlalchemy-history, as it has support for alembic migrations; it can be replaced with sqlalchemy-continuum, which would enable database-native versioning for postgres via postgres triggers.
Enabling and disabling these features on a per-platform basis (to avoid performance problems with sqlite) is still a little unclear to me and will need some more time investment.
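For concreteness, marking a model as versioned with SQLAlchemy-Continuum looks roughly like this. This is a minimal sketch, not the actual ixmp4 model code, and the generated table names may differ from the `xy_versions`/`transactions` examples below:

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy_continuum import make_versioned

make_versioned(user_cls=None)  # must be called before the models are defined

Base = declarative_base()

class Region(Base):
    __tablename__ = "region"
    __versioned__ = {}  # opt this model in to version tracking
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String, nullable=False, unique=True)

# Continuum generates the version class/table and the transaction table
# when the mappers are configured.
sa.orm.configure_mappers()
```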

The tricky parts:

  1. This change entails an inevitable performance regression. While it will not change scaling characteristics (think big O), it will significantly worsen all linear operations. What was once a single query statement might become three or more (sequential!) statements:

```sql
INSERT INTO xy (a, b, c) VALUES (...);

-- becomes:

INSERT INTO xy (a, b, c) VALUES (...);
INSERT INTO transactions (issued_at) VALUES (CURRENT_TIMESTAMP);
INSERT INTO xy_versions (a, b, c, transaction__id, operation_type) VALUES (...);
```

Most of the overhead can be alleviated by using database-native table triggers, but these are hard to write and maintain, so I would suggest relying on a library for that as well.
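For illustration, a library-level implementation (as opposed to triggers) amounts to a session hook roughly like the following. This is a sketch only; `Transaction` and `make_version_row` are hypothetical stand-ins, not the actual sqlalchemy-history API:

```python
from datetime import datetime, timezone

from sqlalchemy import event
from sqlalchemy.orm import Session

@event.listens_for(Session, "before_flush")
def record_versions(session, flush_context, instances):
    # One transaction row per flush; every version row references its id.
    transaction = Transaction(issued_at=datetime.now(timezone.utc))  # hypothetical model
    session.add(transaction)
    for obj in list(session.new):
        # Mirror each pending row into its version table
        # (operation_type 0 meaning INSERT here).
        session.add(make_version_row(obj, transaction, operation_type=0))  # hypothetical helper
```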

  2. What to version & handling relationships

Versioning of the data points table puts me in front of a choice. We can either:
a) prevent deletes entirely for all regions, variables and units, hiding the unused ones when SELECTing, or
b) version every single table in the database that is subject to a relation and query the version variant of the related table when rolling back.

I went with option b); see the sketch below.
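A sketch of what option b) implies when rolling back: related objects are resolved against their version table as of a given transaction. `RegionVersion` is a hypothetical version model; the column semantics match the version table shown below:

```python
import sqlalchemy as sa

def region_at(session, region_id: int, transaction_id: int):
    """Return the version of a region as it existed at `transaction_id`."""
    return (
        session.query(RegionVersion)
        .filter(
            RegionVersion.id == region_id,
            # Created at or before the target transaction ...
            RegionVersion.transaction_id <= transaction_id,
            # ... and not yet superseded by a later version at that point.
            sa.or_(
                RegionVersion.end_transaction_id.is_(None),
                RegionVersion.end_transaction_id > transaction_id,
            ),
        )
        .one()
    )
```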

Run Checkpoints

To be able to distinguish different workflow steps in a sea of "transaction" records (one ixmp4 API call might create multiple records), we will need to add another table to ixmp4 which references run__id and transaction__id and stores a user-supplied message.

We now have *four* relevant tables.

The original table xy:

| a | b | c |
|---|---|---|
| A | B | C |

The version table xy_version:

| a | b | c | transaction_id | end_transaction_id | operation_type |
|---|---|---|----------------|--------------------|----------------|
| A | B | C | 1 | NULL | 0 |

The transaction table transactions:

| id | issued_at |
|----|-----------|
| 1 | 13:37 27.08.1999 |

The new checkpoints table (this may also include a timestamp and user information):

| message | transaction_id | run_id |
|---------|----------------|--------|
| "Hello World" | 1 | 1 |

The last table (checkpoints) can be used to query all workflow steps for a specific run and to revert to said steps by looking at which transaction record they point to and replaying or rolling back the change history.
This would probably look kind of like this:

```python
run = ...

with run.acquire_lock():
    run.checkpoints.create("Create run")
    run.iamc.add(my_data)
    run.checkpoints.create("Add data")

run.checkpoints.tabulate()
# msg          | transaction_id
# "Create run" | 3
# "Add data"   | 5

run.checkpoints.list()[0].revert()

run.iamc.tabulate()
# empty ...
```

Run Locks

To make sure that a user does not include unintended changes by a different user, a locking mechanism is needed for each "run" object/row. A user shall be able to acquire a lock, preventing all other users from writing to objects related to the run.
Still unclear to me: if a user tries to acquire the lock on a locked run, should the operation:
a) block until the other user releases the lock, or
b) fail immediately?

Also, I remember the uploads crashing and not releasing the lock being a problem in the past. How do we want to handle this in the future?
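Regardless of the answer to a) vs b), one way to keep a crash from leaving a run permanently locked is to only hand out locks through a context manager that releases in a finally block. A minimal sketch; `backend.runs.lock`/`unlock` are placeholders for whatever actually owns the lock:

```python
from contextlib import contextmanager

@contextmanager
def acquire_lock(run):
    backend.runs.lock(run.id)  # fail-fast variant b): raises if already locked
    try:
        yield run
    finally:
        # Executed even when the block raises, so the lock is always released.
        backend.runs.unlock(run.id)
```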


codecov bot commented Mar 20, 2025

Codecov Report

Attention: Patch coverage is 92.75591% with 46 lines in your changes missing coverage. Please review.

Project coverage is 87.4%. Comparing base (c7ea1f3) to head (f6f112e).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ixmp4/data/db/base.py | 92.5% | 10 Missing ⚠️ |
| ixmp4/data/abstract/checkpoint.py | 78.5% | 6 Missing ⚠️ |
| ixmp4/data/api/checkpoint.py | 80.6% | 6 Missing ⚠️ |
| ixmp4/data/db/checkpoint.py | 87.5% | 6 Missing ⚠️ |
| ixmp4/data/abstract/run.py | 80.0% | 3 Missing ⚠️ |
| ixmp4/data/db/iamc/datapoint/repository.py | 85.7% | 3 Missing ⚠️ |
| ixmp4/data/db/run/repository.py | 96.0% | 3 Missing ⚠️ |
| ixmp4/data/db/meta/repository.py | 87.5% | 2 Missing ⚠️ |
| ixmp4/core/run.py | 97.6% | 1 Missing ⚠️ |
| ixmp4/data/api/run.py | 94.7% | 1 Missing ⚠️ |

... and 5 more
Additional details and impacted files
```
@@           Coverage Diff           @@
##            main    #166     +/-   ##
=======================================
+ Coverage   87.0%   87.4%   +0.4%     
=======================================
  Files        230     235      +5     
  Lines       8170    8709    +539     
=======================================
+ Hits        7112    7617    +505     
- Misses      1058    1092     +34     
```
| Files with missing lines | Coverage Δ |
|---|---|
| ixmp4/core/checkpoints.py | 100.0% <100.0%> (ø) |
| ixmp4/core/exceptions.py | 95.6% <100.0%> (+0.2%) ⬆️ |
| ixmp4/core/iamc/data.py | 91.6% <100.0%> (+0.1%) ⬆️ |
| ixmp4/data/abstract/__init__.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/annotations.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/base.py | 100.0% <100.0%> (ø) |
| ixmp4/data/abstract/iamc/datapoint.py | 92.5% <ø> (ø) |
| ixmp4/data/abstract/iamc/measurand.py | 96.0% <ø> (ø) |
| ixmp4/data/abstract/iamc/timeseries.py | 71.0% <ø> (ø) |
| ixmp4/data/abstract/iamc/variable.py | 83.3% <ø> (ø) |

... and 64 more

@meksor
Contributor Author

meksor commented Mar 20, 2025

> The PR is adding tables with a column 'transaction_id' and also the name (in the example code snippet) 'checkpoint', which I sense are the same thing.

No! Checkpoints connect a particular run (-id) to a transaction; they serve as points to roll back to. But I think this might be irrelevant for the message integration anyway.

> I don't know if that makes it possible to have a different implementation, perhaps one with less performance impact (time, or storage, e.g. by not storing more than ~2 transaction_id associated with a given run_id) or maintenance overhead, but providing that info in case.

The revert logic is always the same: select the relevant objects from the version table at the particular transaction, then delete all current data and insert the old data. For the checkpoint we get run__id and transaction__id from the checkpoint itself; for the run.transact() feature we just have to remember the transaction at which the run was locked (currently done in a new column in the run table).
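In pseudocode, that revert is roughly the following; model, column and helper names are illustrative, not the actual repository code:

```python
import sqlalchemy as sa

def revert_table(session, model, version_model, transaction_id: int):
    # Select the relevant objects from the version table at the transaction ...
    old_rows = (
        session.query(version_model)
        .filter(
            version_model.transaction_id <= transaction_id,
            sa.or_(
                version_model.end_transaction_id.is_(None),
                version_model.end_transaction_id > transaction_id,
            ),
            version_model.operation_type != 2,  # assuming 2 marks DELETE operations
        )
        .all()
    )
    # ... then delete all current data and insert the old data.
    session.query(model).delete()
    session.add_all(model(**as_column_dict(row)) for row in old_rows)  # as_column_dict is hypothetical
```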

> Similar but distinct, we did have frequent issues where users would call ixmp.TimeSeries.check_out() (usually as message_ix.Scenario.check_out(), which is identical). Then their code would error or crash. Because they never explicitly called TimeSeries.commit() or TimeSeries.discard(), the database would record that the object (~ixmp4 Run) was "checked out to" them, and not allow any user (them, or another) to change it.

> In other words, if only "atomic transactions" via with run.acquire_lock() are provided, that would be livable. We can add noisy warnings to ixmp.TimeSeries.check_out() during the shim/migration period that say check_out() will not ever be supported with IXMP4Backend and encourage users to switch to using acquire_lock/transact. The 'check_out'/…/'discard' workflow was I guess designed with an eye to interactive use, e.g. in IPython/Jupyter, but with .transact() we can now teach users "best practices" without the code being more verbose.

Ok good, I think this approach is preferable for me as well.

Example cases: 1 and 2 seem doable with the same logic, but that logic is out of scope for this PR.
I think, theoretically, the chance of the race condition you describe is there, but it is very small.

3.: Okay, the timeout needs to be implemented; I will need to look into that.

> It would be good to consider whether words like 'transact(ion)', 'commit', 'checkpoint', etc. have existing usage and meanings (a) in ixmp, message_ix etc. or (b) in likely users' minds ('commit' e.g. because many people may also be Git users; fewer are RDBMS users); and consider whether the meaning in ixmp4/message_ix is different. This can help us avoid inviting potential confusion, as that creates work to steer users clear or help correct their misunderstanding.

I agree; unfortunately all the names I can think of are taken. PSQL even has "checkpoints" as well, so in fact all these terms ('transact(ion)', 'commit', 'checkpoint') have (sometimes multiple) other meanings.

@meksor
Contributor Author

meksor commented Mar 24, 2025

Hi, the first version of this feature is available on this branch.
There are still a few open issues, which I might try to solve on auxiliary branches and merge into this one.

  • Exchange sqlalchemy-history for sqlalchemy-continuum in order to use its SQL trigger generation functionality. This will (hopefully, presumably) get rid of some of the manual version object generation code (e.g. in bulk_insert_versions) and speed up this PR so the regression is acceptable.
  • Disable versioning for specific platforms, as the SQL trigger generation will only work for postgres. This is tricky, since I assume I need to find a way to run the sqlalchemy mapper in a "reentrant" way. It also means throwing and testing a new exception.
  • Implement versioning and revert logic for optimization data, which is best done by fridolin or at least with fridolin's help.

Further, I need clarification on some issues from @danielhuppmann and @khaeru. Details follow.

Regions and Units

As of now, regions and units are tracked as versioned objects, but not rolled back with a run. This means the following code will throw a RegionNotFound error.

```python
region1 = platform.regions.create("Region 1")
platform.regions.create("Region 2")

run = platform.runs.create(model, scenario)
data_with_region1 = pd.DataFrame(...)  # has IAMC data that references "Region 1"
data_with_region2 = pd.DataFrame(...)  # has IAMC data that references "Region 2"

with run.transact():
    run.iamc.add(data_with_region1)
    run.checkpoints.create("Add data with region 1")

    run.iamc.add(data_with_region2)
    run.checkpoints.create("Add data with region 2")

    run.iamc.remove(data_with_region1)
    run.checkpoints.create("Remove data with region 1")

platform.regions.delete(region1)

with run.transact():
    # Revert to "Add data with region 1"
    run.checkpoints[1].revert()  # Will throw an error!!
```

This seems good to me, because if incorrect regions were deleted, we do not want to implicitly create them again when rolling back a run.
What do you think?

Automatic Rollback

Right now, every time a run exits the transact() context manager, it rolls back to the last checkpoint created. This means that if the user forgets to create a checkpoint after adding data, the run will be empty.
This also adds quite a bit of overhead, especially via the web API.

```python
with run.transact():
    run.iamc.add(test_data_1)
    run.checkpoints.create("Add test data 1")
    run.iamc.add(test_data_2)

# The run will now contain only test_data_1.
assert run.iamc.tabulate().equals(test_data_1)
```

Is that right or should the data stay?

Client-Side Locking Mechanism

The run lock is now implemented in the facade layer using the two backend methods backend.run.lock and backend.run.unlock. The client has to keep track of which locks it owns, as the methods behave as follows:

backend.run.lock(run_id) will check the specified run's lock_transaction column. If the column contains NULL, it will be updated to contain the id of the last transaction in the transactions table.
If the column already contains a transaction.id, the Run.IsLocked exception is thrown.

backend.run.unlock(run_id) will always set the lock_transaction column to NULL, regardless of the current value.
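In other words, roughly (a sketch of the described semantics, not the actual repository code; `get_by_id` and `latest_transaction_id` are hypothetical helpers):

```python
def lock(self, run_id: int) -> Run:
    run = self.get_by_id(run_id)  # hypothetical accessor
    if run.lock_transaction is not None:
        raise Run.IsLocked(run_id)
    # Record the id of the newest row in the transactions table.
    run.lock_transaction = self.latest_transaction_id()  # hypothetical helper
    return run

def unlock(self, run_id: int) -> Run:
    run = self.get_by_id(run_id)
    run.lock_transaction = None  # unconditional release
    return run
```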

Hence, the transact() context manager tries to acquire the lock via backend.run.lock(run_id) and tracks that the operation succeeded in the owns_lock property of the corresponding run facade model.

```python
@contextmanager
def transact(self) -> Generator[None, None, None]:
    # Acquire the lock; raises Run.IsLocked if another client already holds it.
    self._model = self.backend.runs.lock(self._model.id)
    self.owns_lock = True

    yield

    # Revert to the latest checkpoint created inside the block, or to the
    # state at lock acquisition if no checkpoint was created at all.
    checkpoint_df = self.checkpoints.tabulate()
    checkpoint_transaction = int(checkpoint_df["transaction__id"].max())

    assert self._model.lock_transaction is not None

    if checkpoint_transaction > self._model.lock_transaction:
        self.backend.runs.revert(self._model.id, checkpoint_transaction)
    else:
        self.backend.runs.revert(self._model.id, self._model.lock_transaction)

    # Release the lock unconditionally.
    self._model = self.backend.runs.unlock(self._model.id)
    self.owns_lock = False
```

A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

LockRequired Exception

Since the client now keeps track of which locks it owns, we can throw exceptions in the facade layer when an operation is attempted that requires a lock.
Currently this applies to:

  • run.iamc.add()
  • run.iamc.remove()
  • run.checkpoints.create()

A list of operations in the optimization part will have to be worked out as well; can you think of anything else? Did I understand the assignment correctly?
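Conceptually, the facade-layer check is just a guard like this (a sketch; the exact exception signature may differ):

```python
def ensure_lock(run) -> None:
    # owns_lock is tracked client-side when the lock is acquired via transact().
    if not run.owns_lock:
        raise Run.LockRequired(run.id)

# Called at the top of run.iamc.add(), run.iamc.remove() and run.checkpoints.create().
```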

Run Meta Indicators/Data

These are not covered by the rollback logic or the LockRequired exception; should they be?

I'm sure as soon as I post this comment I will think of something else, but for now: please advise!

@khaeru
Member

khaeru commented Mar 24, 2025

> Regions and Units
>
> As of now regions and units are tracked as versioned objects, but not rolled back with a run.
> […]
> What do you think?

As mentioned, message_ix/current ixmp test suites and usage don't need this, so no opinion.

> Automatic Rollback
> […]
> Is that right or should the data stay?

Stay? This is what I meant above about "following semantics of the standard library". See How to use the connection context manager in the docs for the sqlite3 module. This is the behaviour that the current ixmp.TimeSeries.transact() is trying to emulate—see the example there in the docs.
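For reference, a minimal illustration of that behaviour:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE xy (a, b, c)")

# The context manager commits on a clean exit and rolls back if the
# block raises; either way the connection itself stays open.
with con:
    con.execute("INSERT INTO xy VALUES (1, 2, 3)")
```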

> Client-Side Locking Mechanism
> […]
> A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

Yes, that's fine. If something gets inadvertently locked by another user (and it shouldn't if we're encouraging users to use the context manager), it's already a great improvement that backend.run.unlock() will be accessible, instead of having to do something like this.

> LockRequired Exception
> […]
> A list of operations in the optimization part will have to be worked out as well; can you think of anything else? Did I understand the assignment correctly?

This is more for @glatterf42, I think. Just to clarify, one point above was that creating a Run instance at all should be safe against collisions. That is to say, rather than locking an existing Run while certain things are done to it, two simultaneous attempts to create new (not-yet-existing) Runs that want the same or similar identifiers should be safe.

If that's already achieved some other way than the locking mechanism you've written, e.g. by lower-level transaction control in the DB layer, then we're fine.

> Run Meta Indicators/Data
> […]
> Are not covered by the rollback logic or the LockRequired exception, should this be the case?

I'm not aware of people using these through ixmp/message_ix, so no opinion.

@danielhuppmann
Member

Thanks @meksor, this is very nice!

> Regions and Units
>
> As of now regions and units are tracked as versioned objects, but not rolled back with a run.
> […]
> What do you think?

This makes sense.

> Automatic Rollback
> […]
> Is that right or should the data stay?

Would it make sense that the transact()-context ends with setting a checkpoint, and the checkpoint-description is part of the transact-arguments?

> Client-Side Locking Mechanism
> […]
> A more sophisticated and secure locking mechanism is a lot more work, as it requires me to keep track of the users that have locked runs. Postponed for an indefinite time, alright?

Agree with @khaeru, that's ok for our use case.

> LockRequired Exception and Run Meta Indicators/Data
> […]
> Are not covered by the rollback logic or the LockRequired exception, should this be the case?

Please add creation and deletion of meta indicators to the operations that are change-logged and require a lock.

Adding the optimization-items to the change-log-and-locks-logic can be left for a follow-up PR (led by @glatterf42), in my opinion.

@meksor
Contributor Author

meksor commented Mar 31, 2025

> > Automatic Rollback
> > […]
> > Is that right or should the data stay?
>
> Would it make sense that the transact()-context ends with setting a checkpoint, and the checkpoint-description is part of the transact-arguments?

Yes, sounds good. The rollback now only happens upon an exception.
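So the context manager now behaves roughly like this (a sketch of the combined behaviour; the actual signature and rollback target may differ):

```python
from contextlib import contextmanager
from typing import Generator

@contextmanager
def transact(self, message: str) -> Generator[None, None, None]:
    self._model = self.backend.runs.lock(self._model.id)
    self.owns_lock = True
    try:
        yield
    except Exception:
        # Roll back only on failure, to the state at lock acquisition.
        self.backend.runs.revert(self._model.id, self._model.lock_transaction)
        raise
    else:
        # A clean exit ends the block by recording a checkpoint.
        self.checkpoints.create(message)
    finally:
        self._model = self.backend.runs.unlock(self._model.id)
        self.owns_lock = False
```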

> > LockRequired Exception and Run Meta Indicators/Data
> > […]
> > Are not covered by the rollback logic or the LockRequired exception, should this be the case?
>
> Please add creation and deletion of meta indicators to the operations that are change-logged and require a lock.

Understood.

@meksor
Contributor Author

meksor commented Apr 9, 2025

Ok, a few updates on the points from above:

> • Exchange sqlalchemy-history for sqlalchemy-continuum in order to use its SQL trigger generation functionality. This will (hopefully, presumably) get rid of some of the manual version object generation code (e.g. in bulk_insert_versions) and speed up this PR so the regression is acceptable.

I did this and it worked like a drop-in replacement; no changes besides adjusting the imports were necessary.
However, this does not really matter, because I could not get the next objective to work:

> • Disable versioning for specific platforms, as the SQL trigger generation will only work for postgres. This is tricky, since I assume I need to find a way to run the sqlalchemy mapper in a "reentrant" way. It also means throwing and testing a new exception.

Unfortunately, the libraries I used here are not made for this, and enabling/disabling versioning is only possible globally with our current code-base. This also means that I cannot use the "native versioning" feature from sqla_continuum, because the trigger setup will fail every time on sqlite databases.

Making all of this work will take a lot more effort and might mean that I have to implement large parts of the library myself (in which case I will try to get rid of the dependency entirely).
More specifically: I need to find a way to run the sqlalchemy mapper when a backend/platform is created, as opposed to how it's done currently (when the classes are imported). This means rebuilding large parts of the Model/Repository code in the data layer. It's probably wise to consider a bigger refactor if I'm getting my hands dirty with something like that already.

What does this mean for this PR?
I suggest finishing up here by making sure the tests cover the desired use cases and living with the performance regression (pretty much exactly 2x on my setup). The big thing missing here is the meta indicators, afaics.
This allows fridolin to continue with his part in the optimization modules.
Next, I'll try to focus on a few bug fixes and the scse-toolkit integration, since that has been on my backlog for ages. I'll have to make do without the test-container registry (which would be nice for this change), as ICT told us that they are "going to ignore all my tickets for 2 months". Hopefully that also fixes any issues that @danielhuppmann has encountered in the meantime.

Then, I will try to reduce the amount of code in this project while keeping the versioning performance improvement in mind. Right now we have some complicated inheritance hierarchies, a lot of boilerplate code and not a lot of internal documentation, all of which can be improved. A big improvement here would be to combine the "abstract" and "db" layers (since the interface is pretty much completely defined now) and to generate the API server and client code with less copy-pasta and boilerplate-o-rama. Getting rid of the filter classes so we can use the TypedDicts introduced in the mypy PR as the sole source of truth for the possible arguments is also on my list.

After that, I'll look at reworking this feature; at that point it should be a lot more straightforward (I might eat my words later, but I'll stay optimistic for now).
This is a few months in the future. OK @danielhuppmann?

Side note: For some reason one single CI test across all the jobs is failing:
py3.10 | backend=sqlite,rest-sqlite | with-pyarrow=true | pgsql=16 | pandas=2.1.0
The test in question is "test_mp_tabulate_big[rest-sqlite]". I've been trying to reproduce this locally for a few days now with no luck (I installed the relevant deps and python versions, and tried gh act pull_request; locally the test always passes). Has anyone ever encountered something like this before? @glatterf42 @pmussak @phackstock

@meksor
Contributor Author

meksor commented Apr 10, 2025

Revert logic and LockRequired exceptions are implemented for the meta indicators now, FYI.

@pmussak
Contributor

pmussak commented Apr 10, 2025

> Side note: For some reason one single CI test across all the jobs is failing: py3.10 | backend=sqlite,rest-sqlite | with-pyarrow=true | pgsql=16 | pandas=2.1.0. The test in question is "test_mp_tabulate_big[rest-sqlite]". I've been trying to reproduce this locally for a few days now with no luck (I installed the relevant deps and python versions, and tried gh act pull_request; locally the test always passes). Has anyone ever encountered something like this before? @glatterf42 @pmussak @phackstock

I have not encountered this yet, but I looked into it.

@pmussak pmussak closed this Apr 10, 2025
@meksor meksor reopened this Apr 10, 2025
@pmussak
Contributor

pmussak commented Apr 10, 2025

Sorry, I accidentally closed this PR.
