Fix several accidentally quadratic functions #792
base: main
Conversation
Signed-off-by: Jon Johnson <[email protected]>
I think that hashing does not work well with mutable objects. It can cause inconsistencies and misses on lookup.
Also: not all properties are included in the hash, which causes distinct objects to collide.
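For illustration (the `Item` class here is invented, not from this codebase), a content-based hash on a mutable object makes set lookups miss after mutation, because the object was stored in a bucket chosen from its old hash:

```python
class Item:
    """Hypothetical mutable class with a content-based hash."""
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Item) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

item = Item("a")
bucket = {item}
item.name = "b"          # mutate after insertion
print(item in bucket)    # False: lookup probes the bucket for hash("b"),
                         # but the item was filed under hash("a")
```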
```diff
@@ -81,3 +81,6 @@ def __init__(
         relationships = [] if relationships is None else relationships
         extracted_licensing_info = [] if extracted_licensing_info is None else extracted_licensing_info
         check_types_and_set_values(self, locals())
+
+    def __hash__(self):
+        return id(self)
```
Documents are not immutable, and therefore hashing might run into inconsistencies. Most use cases would probably expect a content-dependent hash.
Ah, that's a shame.
I wouldn't expect the document to get mutated during validation. Would it make sense to have some frozen/immutable variants of document and relationship that could be passed around the validators?
Or do you have any ideas for a better approach to take?
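One way to sketch such a frozen variant (the `FrozenRelationship` class below is hypothetical, not the library's actual class) is a frozen dataclass, which generates a field-based `__eq__` and `__hash__` and rejects mutation:

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class FrozenRelationship:
    """Hypothetical immutable snapshot; not a class from this library."""
    spdx_element_id: str
    related_spdx_element_id: str
    relationship_type: str

r = FrozenRelationship("SPDXRef-A", "SPDXRef-B", "DEPENDS_ON")

# eq=True (the default) plus frozen=True gives a consistent content hash
assert r == FrozenRelationship("SPDXRef-A", "SPDXRef-B", "DEPENDS_ON")

try:
    r.relationship_type = "CONTAINS"   # any mutation is rejected
except FrozenInstanceError:
    pass
```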
Agreed that a hash is only safe to implement on a mutable object when it does not use the object's mutable properties, though most "useful" hashes do. This hash is not based on a mutable property, which makes it safe and consistent, although not very useful (the default implementation of `__hash__` is essentially `id(self) >> 4`). Also note that `__eq__` should always be defined whenever `__hash__` is [see here]. Why define `__hash__` here?
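To illustrate that contract with a made-up class (the `Point` example is not from this codebase): objects that compare equal must hash equal, which is why `__eq__` and `__hash__` are defined as a pair.

```python
class Point:
    """Hypothetical example: __eq__ and __hash__ defined together."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # must agree with __eq__: equal objects produce equal hashes
        return hash((self.x, self.y))

assert Point(1, 2) == Point(1, 2)
assert hash(Point(1, 2)) == hash(Point(1, 2))
assert len({Point(1, 2), Point(1, 2)}) == 1   # duplicates collapse in a set
```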
I've used `functools.lru_cache` in the past to speed up slow loops; it keys on the function's parameters and replaces a call with a table lookup when that makes sense. The nice thing is that a mutated argument (with a content-based hash) simply misses the cache and triggers a fresh call, rather than returning stale results. Not sure if that would help in this case; I'll try to take a look this weekend.
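A minimal sketch of `functools.lru_cache` (the `slow_square` function is a made-up stand-in for an expensive computation); note that it requires the arguments to be hashable, which is exactly where mutable documents get awkward:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def slow_square(n: int) -> int:
    # stand-in for an expensive computation
    return n * n

slow_square(4)                    # computed once (a cache miss)
slow_square(4)                    # second call is served from the cache
print(slow_square.cache_info())   # CacheInfo(hits=1, misses=1, ...)
```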
```diff
@@ -73,3 +73,6 @@ def __init__(
         comment: Optional[str] = None,
     ):
         check_types_and_set_values(self, locals())
+
+    def __hash__(self):
+        return hash("{} -> {} ({})".format(self.spdx_element_id, str(self.related_spdx_element_id), str(self.relationship_type)))
```
I think our `Relationship` objects are mutable, and that does not work well with hashing. This can cause inconsistencies when they are used in hash sets.
This class needs a definition of `__eq__` as well.
…s in parsing to remove hash methods. Signed-off-by: paulgibert <[email protected]>
Signed-off-by: Jon Johnson <[email protected]>
I've updated the PR with changes from @paulgibert that seem better than what I came up with :)
The `astuple` approach looks promising.
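For reference, `dataclasses.astuple` snapshots the current field values into a plain (hashable) tuple, so it can serve as a content-based cache key without giving the object itself a hash. The `Rel` class below is a hypothetical stand-in, not the library's `Relationship`:

```python
from dataclasses import astuple, dataclass

@dataclass
class Rel:
    """Hypothetical stand-in for the library's Relationship class."""
    spdx_element_id: str
    related_spdx_element_id: str
    relationship_type: str

r = Rel("SPDXRef-A", "SPDXRef-B", "DEPENDS_ON")

key = astuple(r)        # snapshot of the current field values
seen = {key}

r.relationship_type = "CONTAINS"   # mutation changes the snapshot...
assert astuple(r) not in seen      # ...so a stale lookup misses instead of lying
```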
```python
def cached_function(document: Document):
    key = id(document)
    if key in cache.keys():
```
Does this return an outdated cache entry if the document was modified?
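For illustration, an identity-keyed cache does go stale after mutation; here is a self-contained sketch (using a plain dict as a stand-in for a document, and a made-up `count_ids` function):

```python
cache = {}

def count_ids(doc):
    key = id(doc)                 # identity-based key, as in the snippet above
    if key not in cache:
        cache[key] = len(doc["ids"])
    return cache[key]

doc = {"ids": ["a"]}
assert count_ids(doc) == 1
doc["ids"].append("b")           # mutate the document
assert count_ids(doc) == 1       # stale! id(doc) is unchanged, so it's a hit
```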
Or in other words: does the cache key `id(document)` change if I mutate some stuff inside the document?
Maybe a less global cache would be more appropriate, so that its scope is controlled in a way that no mutation can happen in between.
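A sketch of that narrower scoping (the `validate_document` function and the `"from"`/`"to"` fields are invented for illustration): the cache lives only for the duration of one validation pass, so nothing can mutate between fill and lookup.

```python
def validate_document(relationships):
    """Hypothetical sketch: the cache is scoped to a single validation pass."""
    seen = set()          # created here, discarded when the function returns
    duplicates = []
    for rel in relationships:
        key = (rel["from"], rel["to"])   # content-based key, assumed fields
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates

rels = [{"from": "A", "to": "B"}, {"from": "A", "to": "B"}]
print(validate_document(rels))   # [('A', 'B')]
```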
I don't think so. From the docs:
> `id(object)`
> Return the "identity" of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same [`id()`](https://docs.python.org/3/library/functions.html#id) value.
I think that this is exactly the problem I am afraid of:
```python
>>> a = {1}
>>> id(a)
139842939239360
>>> a.add(2)
>>> a
{1, 2}
>>> id(a)
139842939239360
```
The object keeps the same id even when its contents are mutated, so `id()` is probably not a good cache key.
You can adjust the cache key to be based on whichever fields you want, but once the object is added to the cache, the cached entry will still not reflect later changes. If the goal is a fast search over mutable objects, a more complex approach or data structure may be necessary for representing collections of Documents and Relationships.
I think that the library should allow for the following use case:
- validation of the document
- mutation of that document, or of some of the contained elements
- validation of the new state, without incorrect results due to stale caching
Hey, maybe it would be worth connecting in a video call?
How would you feel about something like this instead? #800 I suspect just changing those function signatures is probably not what we want to do (I assume it's a breaking change), but what if we kept the same signatures and introduced a parallel set of functions that take a set instead of a list (the list variant can just call the set variant)?
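The delegation pattern described above could be sketched like this (both function names and the validation logic are invented for illustration; #800's actual contents are not assumed):

```python
from typing import List, Set

def _validate_ids_set(ids: Set[str]) -> bool:
    """Hypothetical set-based variant: membership tests are O(1)."""
    return "SPDXRef-DOCUMENT" in ids

def validate_ids(ids: List[str]) -> bool:
    """Hypothetical list-based signature kept for compatibility;
    it simply delegates to the set variant."""
    return _validate_ids_set(set(ids))

assert validate_ids(["SPDXRef-DOCUMENT", "SPDXRef-A"])
```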
Fixes #790
I am by no means a Python expert; please let me know if there are more idiomatic ways to solve this.
After this change:
Note that this document previously took almost 9 minutes to validate, so this is ~300x faster for that particular document.