Hash provided block identifier before storing in DB #633
Conversation
512 bytes for a block?
Say we have 15 million entities with about 1.5 million blocks. If we use 512 bytes for every block ID, then we have to insert 6 GB worth of block IDs into the database.
However, the server only cares about matching up identical block IDs; it does not care about their particular format. The minimum number of bits needed to distinguish 1.5 million different blocks is 21. That's only about 30 MB of data.
Why don't we do the following:
We let the users upload whatever they want. Then, when the server writes the block information into the database, it maps the user-provided block IDs to fixed-size hash values and uses those as the internal block IDs. 128 bits should be plenty to all but eliminate the chance of a collision.
On the plus side we would put less load on the database, as there is less data, and plain integers instead of strings. What do you think?
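A minimal sketch of the mapping proposed above, assuming Python's hashlib on the server side (the function name `internal_block_id` is hypothetical, not the actual implementation):

```python
import hashlib

def internal_block_id(provided_block_id: str) -> bytes:
    # 128-bit digest used as the fixed-size internal block identifier;
    # identical whenever two data providers upload the same block ID.
    return hashlib.blake2b(provided_block_id.encode('utf-8'),
                           digest_size=16).digest()
```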
Oh, I like that proposal a lot. So we will keep the DB constraint at 64 bytes, but hash the provided block ID during import. As for implementation: hashed via HMAC with a run-specific key? Key generated by the service on run creation and stored in the DB run table?
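A hedged sketch of the keyed variant suggested in this comment; the names (`new_run_block_key`, `keyed_block_id`) and the idea of persisting the key with the run row are assumptions for illustration only:

```python
import hashlib
import hmac
import secrets

def new_run_block_key() -> bytes:
    # would be generated once at run creation and stored alongside the run
    return secrets.token_bytes(32)

def keyed_block_id(run_key: bytes, provided_block_id: str) -> bytes:
    # 128-bit truncation of an HMAC-SHA256 over the provided block ID
    return hmac.new(run_key, provided_block_id.encode('utf-8'),
                    hashlib.sha256).digest()[:16]
```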
Branch updated from b9e4624 to 61ecd6c.
I don't think a keyed hash function is necessary.
We see the server as a semi-trusted third party. Thus, if the block IDs, as generated by the data providers, leak information about the PII, then the providers shouldn't upload them like that in the first place. The server must not be allowed to see that.
All the privacy protection has to happen on the client side (like with the CLKs).
That means, if we have "leaky" block IDs, then the clients have to do something about it: either encrypt, or use a keyed hash function, and only upload 'safe' block IDs.
On the server side, the only purpose of the proposed hashing is to create fixed-size indices, thus allowing clients to upload arbitrary block IDs. You could even just use Python's built-in hash function.
looks good now
@@ -277,6 +277,7 @@ def compute_filter_similarity(package, project_id, run_id, threshold, encoding_s
    task_span = compute_filter_similarity.span

    def new_child_span(name, parent_scope=None):
        log.debug(name)
that's not a good debug message.
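Something more descriptive would help when reading the logs, e.g. (hypothetical wording):

```python
log.debug("Starting child span %r in compute_filter_similarity", name)
```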
Related to issue #632
Branch updated from df7cd53 to 9ebcbe9.
Adds a test to ensure hashing the block names is efficient
Sorry, I forgot about this PR. Just rebased after adding another test, and combined two loops into one (importing blocks was hashing the names and then counting the sizes in two separate loops). Once the CI passes I'll merge.
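For illustration, a hypothetical single-pass shape of the combined loop described above (function name and data layout are assumptions, not the actual import code):

```python
import hashlib

def import_blocks(blocks):
    """Hash each provided block name and count block sizes in one pass."""
    hashed_blocks = {}
    block_sizes = {}
    for block_name, record_ids in blocks.items():
        block_hash = hashlib.blake2b(block_name.encode('utf-8'),
                                     digest_size=16).hexdigest()
        hashed_blocks[block_hash] = record_ids
        block_sizes[block_hash] = len(record_ids)
    return hashed_blocks, block_sizes
```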
Related to issue #632.
Block names were limited to 64 chars. For P-Sig we may very well want to use block IDs that are based on Bloom filters, and a 64-char limit would impose a maximum Bloom filter size of 512 bits (assuming a perfect encoding, unlike JSON).
This small change increases the column size to 512 chars, although I'm happy to increase it further or remove the character limitation.
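For a rough sense of scale, assuming block IDs are uploaded as hex- or base64-encoded Bloom filters (an assumption for illustration, not something this PR mandates):

```python
import base64
import secrets

def encoded_length(filter_bits: int) -> dict:
    # character count of a random filter of the given size in two encodings
    raw = secrets.token_bytes(filter_bits // 8)
    return {'hex': len(raw.hex()), 'base64': len(base64.b64encode(raw))}

print(encoded_length(1024))  # {'hex': 256, 'base64': 172}
# The new 512-char limit fits up to 2048-bit filters in hex, or roughly
# 3072-bit filters in base64; the old 64-char limit allowed only 256 bits
# in hex (512 bits only with a perfect 8-bits-per-char encoding).
```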