
batched-parallel Updates for Occurrence Consumer #82044

Open
roggenkemper opened this issue Dec 12, 2024 · 0 comments
Comments

@roggenkemper (Member)
batched-parallel mode has performance issues because it uses a threadpool; even with rate limiting, it still has performance limitations in its current form.

There are some ideas for how to fix this:

Add a prestep that partitions messages by fingerprint and passes the resulting groups to run_task_with_multiprocessing:

  • Similar to the existing pattern

    ```python
    batch_processor = RunTask(
        function=batch_write_to_redis,
        next_step=commit_step,
    )
    batch_step = BatchStep(
        max_batch_size=self.max_batch_size,
        max_batch_time=self.max_batch_time,
        next_step=batch_processor,
    )
    ```

    we could do something along the lines of https://gist.github.com/roggenkemper/a782981eed3739d9ee1f4b36160365a4 (see the sketch after this list).
  • BatchStep processes a batch of messages and produces a list of sublists, where each sublist contains the messages that share a fingerprint.
  • Unbatch then emits each of those sublists individually, so the multiprocessing step receives a batch of messages per fingerprint rather than individual messages.
  • Alternatively: a parallel step to deserialize first, then batch, then process.
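
A rough sketch of how these pieces could fit together, assuming Arroyo's BatchStep, RunTask, UnbatchStep, and RunTaskWithMultiprocessing strategies (constructor details vary by Arroyo version). `get_fingerprint`, `process_occurrence_group`, and the wiring parameters are hypothetical names, and the offset handling is deliberately simplified:

```python
from collections import defaultdict
from typing import Any, Callable

import orjson
from arroyo.processing.strategies import (
    BatchStep,
    ProcessingStrategy,
    RunTask,
    RunTaskWithMultiprocessing,
    UnbatchStep,
)
from arroyo.types import Message, Value, ValuesBatch


def get_fingerprint(payload: bytes) -> str:
    # Hypothetical helper: pull the fingerprint out of the serialized occurrence.
    return str(orjson.loads(payload)["fingerprint"])


def group_by_fingerprint(
    message: Message[ValuesBatch[bytes]],
) -> ValuesBatch[list[bytes]]:
    # Partition one batch into sublists, one per fingerprint, preserving
    # the relative ordering of messages within each fingerprint.
    grouped: dict[str, list[bytes]] = defaultdict(list)
    for value in message.payload:
        grouped[get_fingerprint(value.payload)].append(value.payload)
    # Wrap each sublist as a Value so UnbatchStep can emit it as its own
    # message. Reusing the whole batch's committable for every sublist is a
    # simplification; real code must only commit once the batch is fully done.
    return [
        Value(payload=sublist, committable=dict(message.committable))
        for sublist in grouped.values()
    ]


def build_partitioned_strategy(
    process_occurrence_group: Callable[[Message[list[bytes]]], Any],
    commit_step: ProcessingStrategy[Any],
    pool: Any,  # an arroyo multiprocessing pool; exact type varies by version
    max_batch_size: int,
    max_batch_time: float,
) -> ProcessingStrategy[bytes]:
    # Each message reaching this step is one fingerprint's batch, so a
    # subprocess can process the whole group in order.
    multiprocessing_step = RunTaskWithMultiprocessing(
        function=process_occurrence_group,
        next_step=commit_step,
        max_batch_size=max_batch_size,
        max_batch_time=max_batch_time,
        pool=pool,
    )
    # UnbatchStep emits each fingerprint sublist as an individual message.
    unbatch_step = UnbatchStep(next_step=multiprocessing_step)
    # RunTask applies the partitioning function to each accumulated batch.
    partition_step = RunTask(function=group_by_fingerprint, next_step=unbatch_step)
    # BatchStep accumulates raw messages into batches, as in the Redis example.
    return BatchStep(
        max_batch_size=max_batch_size,
        max_batch_time=max_batch_time,
        next_step=partition_step,
    )
```

One caveat with this ordering: batching by count/time happens before partitioning, so sublist sizes depend on how fingerprints are distributed within each batch. The last bullet's variant (deserialize in a parallel step first, then batch, then process) would additionally move the orjson.loads cost off the main thread before batching.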

Adding a timer to

```python
payload = orjson.loads(item.payload.value)
```

could be useful too, to gain insight into deserialization performance.
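
A minimal sketch, assuming Sentry's sentry.utils.metrics timer helper; the metric key name is made up for illustration:

```python
import orjson

from sentry.utils import metrics

# Time just the deserialization so slow batches can be attributed to
# payload parse cost rather than downstream processing.
with metrics.timer("occurrence_consumer.deserialize_payload"):
    payload = orjson.loads(item.payload.value)
```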
