Hash provided block identifier before storing in DB #633
Conversation
512 bytes for a block?
Say we have 15 million entities with about 1.5 million blocks. If we use 512 bytes for every block ID, then we have to insert 6 GB worth of block IDs into the database.
However, the server only cares about matching up identical block IDs; it does not care about their particular format. The minimum number of bits needed to distinguish 1.5 million different blocks is 21. That's only about 30 MB of data.
Why don't we do the following:
We let the users upload whatever they want. Then, when the server writes the block information into the database, it maps the user-provided block IDs to fixed-size hash values and uses those as the internal block IDs. 128 bits should be plenty to all but eliminate the chance of a collision.
On the plus side we would put less load on the database, as there is less data, and plain integers instead of strings. What do you think?
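A minimal sketch of the mapping proposed above, assuming Python's hashlib on the server side (the function name `internal_block_id` is hypothetical, not the actual implementation):

```python
import hashlib

def internal_block_id(provided_block_id: str) -> bytes:
    # 128-bit digest used as the fixed-size internal block identifier;
    # identical whenever two data providers upload the same block ID.
    return hashlib.blake2b(provided_block_id.encode('utf-8'),
                           digest_size=16).digest()
```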
Oh, I like that proposal a lot. So we will keep the DB constraint at 64 bytes, but hash the provided block ID during import. As for implementation: hashed via HMAC with a run-specific key? Key generated by the service on run creation and stored in the DB run table?
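A hedged sketch of the keyed variant suggested in this comment; the names (`new_run_block_key`, `keyed_block_id`) and the idea of persisting the key with the run row are assumptions for illustration only:

```python
import hashlib
import hmac
import secrets

def new_run_block_key() -> bytes:
    # would be generated once at run creation and stored alongside the run
    return secrets.token_bytes(32)

def keyed_block_id(run_key: bytes, provided_block_id: str) -> bytes:
    # 128-bit truncation of an HMAC-SHA256 over the provided block ID
    return hmac.new(run_key, provided_block_id.encode('utf-8'),
                    hashlib.sha256).digest()[:16]
```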
Branch updated from b9e4624 to 61ecd6c.
I don't think a keyed hash function is necessary.
We see the server as a semi-trusted third party. Thus, if the block IDs, as generated by the data providers, leak information about the PII, then the providers shouldn't upload them like that in the first place. The server must not be allowed to see that.
All the privacy protection has to happen on the client side (like with the CLKs).
That means, if we have "leaky" block IDs, then the clients have to do something about it: either encrypt, or use a keyed hash function, and only upload 'safe' block IDs.
On the server side, the only purpose of the proposed hashing is to create fixed-size indices, thus allowing clients to upload arbitrary block IDs. You could even just use Python's built-in hash function.
looks good now
@@ -277,6 +277,7 @@ def compute_filter_similarity(package, project_id, run_id, threshold, encoding_s
    task_span = compute_filter_similarity.span

    def new_child_span(name, parent_scope=None):
        log.debug(name)
that's not a good debug message.
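Something more descriptive would help when reading the logs, e.g. (hypothetical wording):

```python
log.debug("Starting child span %r in compute_filter_similarity", name)
```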
Related to issue #632
Branch updated from df7cd53 to 9ebcbe9.
Adds a test to ensure hashing the block names is efficient
Sorry, I forgot about this PR. Just rebased after adding another test, and combined two loops into one (importing blocks was hashing the names and then counting the sizes in two separate loops). Once the CI passes I'll merge.
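For illustration, a hypothetical single-pass shape of the combined loop described above (function name and data layout are assumptions, not the actual import code):

```python
import hashlib

def import_blocks(blocks):
    """Hash each provided block name and count block sizes in one pass."""
    hashed_blocks = {}
    block_sizes = {}
    for block_name, record_ids in blocks.items():
        block_hash = hashlib.blake2b(block_name.encode('utf-8'),
                                     digest_size=16).hexdigest()
        hashed_blocks[block_hash] = record_ids
        block_sizes[block_hash] = len(record_ids)
    return hashed_blocks, block_sizes
```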
Related to issue #632.
Block names were limited to 64 chars. For P-Sig we may very well want to use block IDs that are based on Bloom filters, and a 64-char limit would impose a maximum Bloom filter size of 512 bits (assuming a perfect encoding, unlike JSON).
This small change increases the column size to 512 chars, although I'm happy to increase it further or remove the character limitation.
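For a rough sense of scale, assuming block IDs are uploaded as hex- or base64-encoded Bloom filters (an assumption for illustration, not something this PR mandates):

```python
import base64
import secrets

def encoded_length(filter_bits: int) -> dict:
    # character count of a random filter of the given size in two encodings
    raw = secrets.token_bytes(filter_bits // 8)
    return {'hex': len(raw.hex()), 'base64': len(base64.b64encode(raw))}

print(encoded_length(1024))  # {'hex': 256, 'base64': 172}
# The new 512-char limit fits up to 2048-bit filters in hex, or roughly
# 3072-bit filters in base64; the old 64-char limit allowed only 256 bits
# in hex (512 bits only with a perfect 8-bits-per-char encoding).
```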