-
Notifications
You must be signed in to change notification settings - Fork 7
Remove Column
class entirely
#157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -25,7 +23,7 @@ class VariableCreateInput(BaseModel): | |||
|
|||
|
|||
class DataInput(BaseModel): | |||
data: dict[str, Any] | |||
data: dict[str, list[float] | list[int] | list[str]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to reflect the other changes, then run into the thing that the tests on the REST API platforms always converted data to dict[str, list[float]]
for some reason. I don't understand why, but it doesn't seem to be this change because the same happens for equations. Still, to show that, I didn't also change the other type hints in this layer, though I suppose I still could.
aec6d3a
to
9c36f7a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #157 +/- ##
=======================================
- Coverage 89.0% 89.0% -0.1%
=======================================
Files 236 228 -8
Lines 8723 8646 -77
=======================================
- Hits 7770 7698 -72
+ Misses 953 948 -5
🚀 New features to boost your workflow:
|
@property | ||
def indexsets(self) -> list[str]: | ||
return [indexset.name for indexset in self._indexsets] | ||
|
||
@property | ||
def column_names(self) -> list[str] | None: | ||
return cast(list[str], self._column_names) if any(self._column_names) else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale behind the one list being None
if empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list(self._column_names.all_())
could work for the list conversion as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think ._column_names
has a .all_()
attribute. There is an all_()
function in sqlalchemy, but that seems to produce an SQL ALL
expression, so I'm not sure how that's useful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should, but looking at this again i have to retract my statement since it's kind of unnecessary to use this.
I'm just wondering why we are casting this to a list instead of converting it. I'm not sure how AssociationProxyInstance works exactly. Is this a generator that uses a cursor to yield items from the database?
In any case it doesn't have an append
function, nor will it work with len()
(educated guess) so essentially we are misleading anyone who tries to use these functions on this so called "list" ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I have to reiterate my concern about the possibility of this being None, it seems spurious to me that one has to constantly check if X is None
so the type checker doesn't complain instead of always treating it like list...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case it doesn't have an append function, nor will it work with len() (educated guess) so essentially we are misleading anyone who tries to use these functions on this so called "list" ;)
Sorry, just for clarification, this is wrong. I think I am getting mixed up between the functionalities of the django version of this type of mapping and misreading the sqla docs.
However, being confused by the new version I have to admit that my feedback was uneducated, since if I now understand correctly:
column_names
is optionally provided when making a new equation/parameter/... object. If column_names
is supplied it must be of the same length as constrained_to_indexsets
and will act as a "key" for the columns instead of the "indexset name". From this perspective returning the value supplied to the creation function kind of makes sense, however pulling this info back out of the association table like this was confusing me and I wasnt able to make the connection.
I am wondering if it is necessary to be able to distinguish if an equation has been supplied with overridden "column_names" later. Depending on if thats the case it might be beneficial to /always/ store something in column_name
on the association table. For one, this eliminates the need for two properties, and it will also address another potential problem:
It is unclear to me how the association_proxy orders list items. By default postgres will return items in a practically random ordering, which /could mean/ that when cloning via:
self.backend.optimization.equations.create(
run_id=run.id,
name=equation.name,
constrained_to_indexsets=equation.indexset_names,
column_names=equation.column_names,
)
...the indexsets and column names are suddenly switched! I can unfortunately not find out if this is a problem in practice since there does not seem to be a test case that clones a run with an equation like that. Trying to adjust test_run_clone
to cover the case reveals that adding data to an equation with column_names
throws an error and in turn that there is no test covering that either:
ixmp4/data/db/optimization/equation/repository.py:148: in add_data
data.set_index(index_list).combine_first(existing_data).reset_index()
TypeError: The parameter "keys" may be a column key, one-dimensional array, or a list containing only valid column keys and one-dimensional arrays.. Received column of type <class 'sqlalchemy.ext.associationproxy._AssociationList'>
Which brings me back to a complaint from earlier in this thread: This cast()
call is not enough for any run-time type checks and pandas seems to not like that very much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks for the thorough feedback :)
column_names is optionally provided when making a new equation/parameter/... object. If column_names is supplied it must be of the same length as constrained_to_indexsets and will act as a "key" for the columns instead of the "indexset name".
This understanding is correct :)
I agree that the cast()
call is not ideal, so I adapted that now to read
@property
def column_names(self) -> list[str]:
return (
[] if self._column_names[0] is None else list(map(str, self._column_names))
)
instead. This always returns a true list object :)
This change also already enables a test like this:
def test_order_column_names(self, platform: ixmp4.Platform) -> None:
run = platform.backend.runs.create("Model", "Scenario")
# Test normal creation
indexsets = create_indexsets_for_run(platform=platform, run_id=run.id, amount=5)
indexset_names = [iset.name for iset in indexsets]
column_names = [f"Column {i}" for i in range(len(indexset_names))]
equation = platform.backend.optimization.equations.create(
run_id=run.id,
name="Equation",
constrained_to_indexsets=indexset_names,
column_names=column_names,
)
assert equation.column_names == column_names
assert equation.indexset_names == indexset_names
equation_2 = platform.backend.optimization.equations.create(
run_id=run.id,
name="Equation 2",
constrained_to_indexsets=equation.indexset_names,
column_names=equation.column_names,
)
assert equation_2.column_names == column_names
assert equation_2.indexset_names == indexset_names
(Note that cloning with run_id=run.id, name=equation.name
could not work because of our UniqueConstraint on the run.id/equation.id
columns.)
This test passes and I don't think this is a coincidence. Unfortunately, I can't find an explicit statement in sqlalchemy's docs for this, but here's what I gather:
-
association_proxy() is applied to the User class to produce a “view” of the kw relationship, which exposes the string value of .keyword associated with each Keyword object.
from here
-
Details such as if the locally proxied attribute is a collection (as is typical) or a scalar reference, as well as if the collection acts like a set, list, or dictionary is taken into account, so that the proxy should act just like the underlying collection or attribute does.
from the same page
-
the type of collection used for the relationship() is derived from the collection type passed to the Mapped container type
from here
-
and finally, there's this page on collection details
As I said, no explicit statement, but the there are two default collection styles for relationships: set
and list
. This alone indicates to me that one of that is ordered, while the other one isn't.
For all our (optimization) relationships, we use list
, so I expect the order to remain the same.
I'm not sure if you think the above test is sufficient to confirm this, but it is passing. I'm happy to write the same test for the other data
-layer optimization items, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks for the thorough feedback :)
column_names is optionally provided when making a new equation/parameter/... object. If column_names is supplied it must be of the same length as constrained_to_indexsets and will act as a "key" for the columns instead of the "indexset name".
This understanding is correct :)
Good, it would be helpful to get more context about stuff like this during reviews even if I'm questioning something you (in this case correctly) think is obvious.
I agree that the
cast()
call is not ideal, so I adapted that now to read@property def column_names(self) -> list[str]: return ( [] if self._column_names[0] is None else list(map(str, self._column_names)) )instead. This always returns a true list object :)
Ok, so now when creating an equation like this:
equation = platform.backend.optimization.equations.create(
run_id=run.id,
name="Equation",
constrained_to_indexsets=[],
)
(Which I did not expect to work, but it does) equation.column_names
throws an IndexError
This change also already enables a test like this:
def test_order_column_names(self, platform: ixmp4.Platform) -> None: run = platform.backend.runs.create("Model", "Scenario") # Test normal creation indexsets = create_indexsets_for_run(platform=platform, run_id=run.id, amount=5) indexset_names = [iset.name for iset in indexsets] column_names = [f"Column {i}" for i in range(len(indexset_names))] equation = platform.backend.optimization.equations.create( run_id=run.id, name="Equation", constrained_to_indexsets=indexset_names, column_names=column_names, ) assert equation.column_names == column_names assert equation.indexset_names == indexset_names equation_2 = platform.backend.optimization.equations.create( run_id=run.id, name="Equation 2", constrained_to_indexsets=equation.indexset_names, column_names=equation.column_names, ) assert equation_2.column_names == column_names assert equation_2.indexset_names == indexset_names(Note that cloning with
run_id=run.id, name=equation.name
could not work because of our UniqueConstraint on therun.id/equation.id
columns.)
run
in this case refers to a new, cloned run, so it should work. The RunRepository.clone()
method does exactly that.
This test passes and I don't think this is a coincidence. Unfortunately, I can't find an explicit statement in sqlalchemy's docs for this, but here's what I gather:
* > [association_proxy()](https://docs.sqlalchemy.org/en/20/orm/extensions/associationproxy.html#sqlalchemy.ext.associationproxy.association_proxy) is applied to the User class to produce a “view” of the kw relationship, which exposes the string value of .keyword associated with each Keyword object. from [here](https://docs.sqlalchemy.org/en/20/orm/extensions/associationproxy.html#simplifying-scalar-collections) * > Details such as if the locally proxied attribute is a collection (as is typical) or a scalar reference, as well as if the collection acts like a set, list, or dictionary is taken into account, so that the proxy should act just like the underlying collection or attribute does. from the same page * > the type of collection used for the [relationship()](https://docs.sqlalchemy.org/en/20/orm/relationship_api.html#sqlalchemy.orm.relationship) is derived from the collection type passed to the [Mapped](https://docs.sqlalchemy.org/en/20/orm/internals.html#sqlalchemy.orm.Mapped) container type from [here](https://docs.sqlalchemy.org/en/20/orm/basic_relationships.html#relationship-patterns-o2m-collection) * and finally, there's [this page on collection details](https://docs.sqlalchemy.org/en/20/orm/collection_api.html#custom-collections)
As I said, no explicit statement, but the there are two default collection styles for relationships:
set
andlist
. This alone indicates to me that one of that is ordered, while the other one isn't. For all our (optimization) relationships, we uselist
, so I expect the order to remain the same. I'm not sure if you think the above test is sufficient to confirm this, but it is passing. I'm happy to write the same test for the otherdata
-layer optimization items, too.
This sounds to me more like statements referring to the fact that the _AssociationList
class (which is /actually/ used for Equation._column_names
, not AssociationProxy(Instance)
or whatever) is supposed to behave exactly like a list would. I'm pretty sure sqlalchemy would not mention the order in which items are returned from the database since this is database-specific unless explicitly counteracted. For this reason we have these lines in the codebase for example:
ixmp4/data/db/base.py
:
class Lister(QueryMixin[ModelType], Selecter[ModelType]):
def list(self, *args: Any, **kwargs: Any) -> list[ModelType]:
_exc = self.select(*args, **kwargs)
_exc = _exc.order_by(self.model_class.id.asc()) # << !!!
return self.list_query(_exc)
class Tabulator(QueryMixin[ModelType], Selecter[ModelType]):
def tabulate(self, *args: Any, _raw: bool = False, **kwargs: Any) -> pd.DataFrame:
_exc = self.select(*args, **kwargs)
_exc = _exc.order_by(self.model_class.id.asc()) # << !!!
return self.tabulate_query(_exc)
Additionally, sqlalchemy provides an ordering_list
which can be used to go even further and save an index/key column to have finer control over the positioning of items. It can also be used in conjunction with association_proxy
to get a list of values ordered by another value on the association table:
- https://docs.sqlalchemy.org/en/20/orm/extensions/orderinglist.html
- https://stackoverflow.com/questions/21292726/how-to-properly-use-association-proxy-and-ordering-list-together-with-sqlalchemy
Reading through the docs I suspect that the association proxies always use the data loaded by the relationship field. Maybe the relationship field orders items in a consistent way and it is not documented, maybe it depends on the loading technique (join, select, etc. (more info)[https://docs.sqlalchemy.org/en/20/orm/relationship_api.html#sqlalchemy.orm.relationship.params.lazy]), maybe something else is going on. I wouldn't rely on this behavior though and there seems to be an easy way to make sure it is consistent explicitly:
The relationship field has an order_by
parameter so that's what I would use just to be safe since ordering problems with the postgres cluster happen everywhere across our infrastructure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also just noticed that there is a relatively old PR already addressing this problem! #162
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've opened a new issue about the test case you reported, and I'll push a commit that adds order_by
to the *_indexset_associations
relationship. Please check to see if that's all we need for now :)
Not sure why the tests are not running for 7512142, but the checks for 818b782 and 098f586 are passing (and 7512142 fixes the one code style test that was broken somehow).
Looks alright to me, does this depend on #156 ? |
0733b76
to
b35225d
Compare
9c36f7a
to
6ce3ef1
Compare
It doesn't depend on #156, but it's just as well that that was merged first :) EDIT: From the last time you reviewed, I essentially only added the migrations and renamed the |
6ce3ef1
to
eeb0f81
Compare
aff692d
to
818b782
Compare
As discovered in #151, using sqlalchemy's associationproxies is both (a bit) more performant and (considerably) less confusing than having a DB table called
Column
. Thus, this PR completes the removal of this construct.TODO
Column
class forParameter
#156 gets its own migrations firstOn top of this, the PR includes a few minor fixes:
.data
in the core layer more accuratetype: ignores
by casting appropriatelyVariable.indexsets
None
when appropriate.set_creation_info()
callsrun.optimization.remove_solution()
#139 and never skipped for tables, even when{}
was added)