Skip to content

Remove Column class entirely #157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 6, 2025
Merged

Conversation

glatterf42
Copy link
Member

@glatterf42 glatterf42 commented Jan 31, 2025

As discovered in #151, using sqlalchemy's associationproxies is both (a bit) more performant and (considerably) less confusing than having a DB table called Column. Thus, this PR completes the removal of this construct.

TODO

On top of this, the PR includes a few minor fixes:

  • Make type hints of .data in the core layer more accurate
  • Fix some type: ignores by casting appropriately
  • Fix some docstrings wrt the errors they raise
  • Make Variable.indexsets None when appropriate
  • Fix missing .set_creation_info() calls
  • Skip data validation iif appropriate (we didn't skip checks for e.g. variable.data with only "levels" and "marginals" after Introduce run.optimization.remove_solution() #139 and never skipped for tables, even when {} was added)

@glatterf42 glatterf42 added the enhancement New feature or request label Jan 31, 2025
@glatterf42 glatterf42 requested a review from meksor January 31, 2025 14:51
@glatterf42 glatterf42 self-assigned this Jan 31, 2025
@@ -25,7 +23,7 @@ class VariableCreateInput(BaseModel):


class DataInput(BaseModel):
data: dict[str, Any]
data: dict[str, list[float] | list[int] | list[str]]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to reflect the other changes, then run into the thing that the tests on the REST API platforms always converted data to dict[str, list[float]] for some reason. I don't understand why, but it doesn't seem to be this change because the same happens for equations. Still, to show that, I didn't also change the other type hints in this layer, though I suppose I still could.

@glatterf42 glatterf42 force-pushed the remove-column-class-entirely branch from aec6d3a to 9c36f7a Compare January 31, 2025 14:54
Copy link

codecov bot commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 98.87640% with 2 lines in your changes missing coverage. Please review.

Project coverage is 89.0%. Comparing base (0249e7a) to head (8335b93).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
ixmp4/data/db/optimization/parameter/model.py 91.6% 1 Missing ⚠️
ixmp4/data/db/optimization/table/model.py 95.2% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #157     +/-   ##
=======================================
- Coverage   89.0%   89.0%   -0.1%     
=======================================
  Files        236     228      -8     
  Lines       8723    8646     -77     
=======================================
- Hits        7770    7698     -72     
+ Misses       953     948      -5     
Files with missing lines Coverage Δ
ixmp4/core/optimization/equation.py 93.3% <100.0%> (-0.3%) ⬇️
ixmp4/core/optimization/parameter.py 93.0% <100.0%> (-0.2%) ⬇️
ixmp4/core/optimization/table.py 92.4% <100.0%> (-0.2%) ⬇️
ixmp4/core/optimization/variable.py 93.4% <100.0%> (-0.3%) ⬇️
ixmp4/data/abstract/optimization/__init__.py 100.0% <ø> (ø)
ixmp4/data/abstract/optimization/equation.py 97.0% <100.0%> (+<0.1%) ⬆️
ixmp4/data/abstract/optimization/table.py 96.9% <100.0%> (+<0.1%) ⬆️
ixmp4/data/abstract/optimization/variable.py 97.0% <100.0%> (+<0.1%) ⬆️
ixmp4/data/api/optimization/equation.py 94.2% <100.0%> (ø)
ixmp4/data/api/optimization/table.py 94.0% <100.0%> (ø)
... and 18 more
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 63 to 68
@property
def indexsets(self) -> list[str]:
return [indexset.name for indexset in self._indexsets]

@property
def column_names(self) -> list[str] | None:
return cast(list[str], self._column_names) if any(self._column_names) else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind the one list being None if empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list(self._column_names.all_()) could work for the list conversion as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think ._column_names has a .all_() attribute. There is an all_() function in sqlalchemy, but that seems to produce an SQL ALL expression, so I'm not sure how that's useful here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.sqlalchemy.org/en/20/orm/extensions/associationproxy.html#sqlalchemy.ext.associationproxy.AssociationProxyInstance.all_

It should, but looking at this again i have to retract my statement since it's kind of unnecessary to use this.
I'm just wondering why we are casting this to a list instead of converting it. I'm not sure how AssociationProxyInstance works exactly. Is this a generator that uses a cursor to yield items from the database?
In any case it doesn't have an append function, nor will it work with len() (educated guess) so essentially we are misleading anyone who tries to use these functions on this so called "list" ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I have to reiterate my concern about the possibility of this being None, it seems spurious to me that one has to constantly check if X is None so the type checker doesn't complain instead of always treating it like list...

Copy link
Contributor

@meksor meksor May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case it doesn't have an append function, nor will it work with len() (educated guess) so essentially we are misleading anyone who tries to use these functions on this so called "list" ;)

Sorry, just for clarification, this is wrong. I think I am getting mixed up between the functionalities of the django version of this type of mapping and misreading the sqla docs.

However, being confused by the new version I have to admit that my feedback was uneducated, since if I now understand correctly:
column_names is optionally provided when making a new equation/parameter/... object. If column_names is supplied it must be of the same length as constrained_to_indexsets and will act as a "key" for the columns instead of the "indexset name". From this perspective returning the value supplied to the creation function kind of makes sense, however pulling this info back out of the association table like this was confusing me and I wasnt able to make the connection.
I am wondering if it is necessary to be able to distinguish if an equation has been supplied with overridden "column_names" later. Depending on if thats the case it might be beneficial to /always/ store something in column_name on the association table. For one, this eliminates the need for two properties, and it will also address another potential problem:
It is unclear to me how the association_proxy orders list items. By default postgres will return items in a practically random ordering, which /could mean/ that when cloning via:

self.backend.optimization.equations.create(
      run_id=run.id,
      name=equation.name,
      constrained_to_indexsets=equation.indexset_names,
      column_names=equation.column_names,
  )

...the indexsets and column names are suddenly switched! I can unfortunately not find out if this is a problem in practice since there does not seem to be a test case that clones a run with an equation like that. Trying to adjust test_run_clone to cover the case reveals that adding data to an equation with column_names throws an error and in turn that there is no test covering that either:

ixmp4/data/db/optimization/equation/repository.py:148: in add_data
    data.set_index(index_list).combine_first(existing_data).reset_index()
TypeError: The parameter "keys" may be a column key, one-dimensional array, or a list containing only valid column keys and one-dimensional arrays.. Received column of type <class 'sqlalchemy.ext.associationproxy._AssociationList'>

Which brings me back to a complaint from earlier in this thread: This cast() call is not enough for any run-time type checks and pandas seems to not like that very much.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for the thorough feedback :)

column_names is optionally provided when making a new equation/parameter/... object. If column_names is supplied it must be of the same length as constrained_to_indexsets and will act as a "key" for the columns instead of the "indexset name".

This understanding is correct :)

I agree that the cast() call is not ideal, so I adapted that now to read

    @property
    def column_names(self) -> list[str]:
        return (
            [] if self._column_names[0] is None else list(map(str, self._column_names))
        )

instead. This always returns a true list object :)

This change also already enables a test like this:

    def test_order_column_names(self, platform: ixmp4.Platform) -> None:
        run = platform.backend.runs.create("Model", "Scenario")

        # Test normal creation
        indexsets = create_indexsets_for_run(platform=platform, run_id=run.id, amount=5)
        indexset_names = [iset.name for iset in indexsets]
        column_names = [f"Column {i}" for i in range(len(indexset_names))]
        equation = platform.backend.optimization.equations.create(
            run_id=run.id,
            name="Equation",
            constrained_to_indexsets=indexset_names,
            column_names=column_names,
        )

        assert equation.column_names == column_names
        assert equation.indexset_names == indexset_names

        equation_2 = platform.backend.optimization.equations.create(
            run_id=run.id,
            name="Equation 2",
            constrained_to_indexsets=equation.indexset_names,
            column_names=equation.column_names,
        )

        assert equation_2.column_names == column_names
        assert equation_2.indexset_names == indexset_names

(Note that cloning with run_id=run.id, name=equation.name could not work because of our UniqueConstraint on the run.id/equation.id columns.)
This test passes and I don't think this is a coincidence. Unfortunately, I can't find an explicit statement in sqlalchemy's docs for this, but here's what I gather:

  • association_proxy() is applied to the User class to produce a “view” of the kw relationship, which exposes the string value of .keyword associated with each Keyword object.

    from here

  • Details such as if the locally proxied attribute is a collection (as is typical) or a scalar reference, as well as if the collection acts like a set, list, or dictionary is taken into account, so that the proxy should act just like the underlying collection or attribute does.

    from the same page

  • the type of collection used for the relationship() is derived from the collection type passed to the Mapped container type

    from here

  • and finally, there's this page on collection details

As I said, no explicit statement, but the there are two default collection styles for relationships: set and list. This alone indicates to me that one of that is ordered, while the other one isn't.
For all our (optimization) relationships, we use list, so I expect the order to remain the same.
I'm not sure if you think the above test is sufficient to confirm this, but it is passing. I'm happy to write the same test for the other data-layer optimization items, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for the thorough feedback :)

column_names is optionally provided when making a new equation/parameter/... object. If column_names is supplied it must be of the same length as constrained_to_indexsets and will act as a "key" for the columns instead of the "indexset name".

This understanding is correct :)

Good, it would be helpful to get more context about stuff like this during reviews even if I'm questioning something you (in this case correctly) think is obvious.

I agree that the cast() call is not ideal, so I adapted that now to read

    @property
    def column_names(self) -> list[str]:
        return (
            [] if self._column_names[0] is None else list(map(str, self._column_names))
        )

instead. This always returns a true list object :)

Ok, so now when creating an equation like this:

equation = platform.backend.optimization.equations.create(
    run_id=run.id,
    name="Equation",
    constrained_to_indexsets=[],
)

(Which I did not expect to work, but it does) equation.column_names throws an IndexError

This change also already enables a test like this:

    def test_order_column_names(self, platform: ixmp4.Platform) -> None:
        run = platform.backend.runs.create("Model", "Scenario")

        # Test normal creation
        indexsets = create_indexsets_for_run(platform=platform, run_id=run.id, amount=5)
        indexset_names = [iset.name for iset in indexsets]
        column_names = [f"Column {i}" for i in range(len(indexset_names))]
        equation = platform.backend.optimization.equations.create(
            run_id=run.id,
            name="Equation",
            constrained_to_indexsets=indexset_names,
            column_names=column_names,
        )

        assert equation.column_names == column_names
        assert equation.indexset_names == indexset_names

        equation_2 = platform.backend.optimization.equations.create(
            run_id=run.id,
            name="Equation 2",
            constrained_to_indexsets=equation.indexset_names,
            column_names=equation.column_names,
        )

        assert equation_2.column_names == column_names
        assert equation_2.indexset_names == indexset_names

(Note that cloning with run_id=run.id, name=equation.name could not work because of our UniqueConstraint on the run.id/equation.id columns.)

run in this case refers to a new, cloned run, so it should work. The RunRepository.clone() method does exactly that.

This test passes and I don't think this is a coincidence. Unfortunately, I can't find an explicit statement in sqlalchemy's docs for this, but here's what I gather:

* > [association_proxy()](https://docs.sqlalchemy.org/en/20/orm/extensions/associationproxy.html#sqlalchemy.ext.associationproxy.association_proxy) is applied to the User class to produce a “view” of the kw relationship, which exposes the string value of .keyword associated with each Keyword object.
  
  
  from [here](https://docs.sqlalchemy.org/en/20/orm/extensions/associationproxy.html#simplifying-scalar-collections)

* > Details such as if the locally proxied attribute is a collection (as is typical) or a scalar reference, as well as if the collection acts like a set, list, or dictionary is taken into account, so that the proxy should act just like the underlying collection or attribute does.
  
  
  from the same page

* > the type of collection used for the [relationship()](https://docs.sqlalchemy.org/en/20/orm/relationship_api.html#sqlalchemy.orm.relationship) is derived from the collection type passed to the [Mapped](https://docs.sqlalchemy.org/en/20/orm/internals.html#sqlalchemy.orm.Mapped) container type
  
  
  from [here](https://docs.sqlalchemy.org/en/20/orm/basic_relationships.html#relationship-patterns-o2m-collection)

* and finally, there's [this page on collection details](https://docs.sqlalchemy.org/en/20/orm/collection_api.html#custom-collections)

As I said, no explicit statement, but the there are two default collection styles for relationships: set and list. This alone indicates to me that one of that is ordered, while the other one isn't. For all our (optimization) relationships, we use list, so I expect the order to remain the same. I'm not sure if you think the above test is sufficient to confirm this, but it is passing. I'm happy to write the same test for the other data-layer optimization items, too.

This sounds to me more like statements referring to the fact that the _AssociationList class (which is /actually/ used for Equation._column_names, not AssociationProxy(Instance) or whatever) is supposed to behave exactly like a list would. I'm pretty sure sqlalchemy would not mention the order in which items are returned from the database since this is database-specific unless explicitly counteracted. For this reason we have these lines in the codebase for example:

ixmp4/data/db/base.py:

class Lister(QueryMixin[ModelType], Selecter[ModelType]):
    def list(self, *args: Any, **kwargs: Any) -> list[ModelType]:
        _exc = self.select(*args, **kwargs)
        _exc = _exc.order_by(self.model_class.id.asc()) # << !!!
        return self.list_query(_exc)

class Tabulator(QueryMixin[ModelType], Selecter[ModelType]):
    def tabulate(self, *args: Any, _raw: bool = False, **kwargs: Any) -> pd.DataFrame:
        _exc = self.select(*args, **kwargs)
        _exc = _exc.order_by(self.model_class.id.asc()) # << !!!

        return self.tabulate_query(_exc)

Additionally, sqlalchemy provides an ordering_list which can be used to go even further and save an index/key column to have finer control over the positioning of items. It can also be used in conjunction with association_proxy to get a list of values ordered by another value on the association table:

Reading through the docs I suspect that the association proxies always use the data loaded by the relationship field. Maybe the relationship field orders items in a consistent way and it is not documented, maybe it depends on the loading technique (join, select, etc. (more info)[https://docs.sqlalchemy.org/en/20/orm/relationship_api.html#sqlalchemy.orm.relationship.params.lazy]), maybe something else is going on. I wouldn't rely on this behavior though and there seems to be an easy way to make sure it is consistent explicitly:
The relationship field has an order_by parameter so that's what I would use just to be safe since ordering problems with the postgres cluster happen everywhere across our infrastructure.

Copy link
Contributor

@meksor meksor May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also just noticed that there is a relatively old PR already addressing this problem! #162

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've opened a new issue about the test case you reported, and I'll push a commit that adds order_by to the *_indexset_associations relationship. Please check to see if that's all we need for now :)

Not sure why the tests are not running for 7512142, but the checks for 818b782 and 098f586 are passing (and 7512142 fixes the one code style test that was broken somehow).

@meksor
Copy link
Contributor

meksor commented Apr 17, 2025

Looks alright to me, does this depend on #156 ?

@glatterf42 glatterf42 force-pushed the enh/introduce-has-solution branch from 0733b76 to b35225d Compare April 29, 2025 10:27
Base automatically changed from enh/introduce-has-solution to main April 29, 2025 10:50
@glatterf42 glatterf42 force-pushed the remove-column-class-entirely branch from 9c36f7a to 6ce3ef1 Compare April 29, 2025 11:25
@glatterf42
Copy link
Member Author

glatterf42 commented Apr 29, 2025

It doesn't depend on #156, but it's just as well that that was merged first :)

EDIT: From the last time you reviewed, I essentially only added the migrations and renamed the .indexsets attributes as you suggested for Parameter.

@glatterf42 glatterf42 force-pushed the remove-column-class-entirely branch from 6ce3ef1 to eeb0f81 Compare April 29, 2025 11:30
@glatterf42 glatterf42 requested a review from meksor April 30, 2025 09:56
@glatterf42 glatterf42 force-pushed the remove-column-class-entirely branch from aff692d to 818b782 Compare May 6, 2025 10:08
@glatterf42 glatterf42 merged commit b4e10c9 into main May 6, 2025
13 checks passed
@glatterf42 glatterf42 deleted the remove-column-class-entirely branch May 6, 2025 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants