
feat: support unknown_length for virtual arrays in order to read without any materialization #3475


Open
wants to merge 27 commits into
base: main
Collaborator

@ikrommyd ikrommyd commented Apr 21, 2025

This PR introduces unknown_length support for virtual arrays.

Multiple arrays can share the same shape (for example, when they come from the same offsets), and the awkward codebase needs to know shapes and lengths very often. The most consistent way to add unknown_length support is therefore to introduce a separate shape generator for virtual arrays that returns a shape tuple when called. This lets us generate the shape of an array without generating its data.
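As a rough illustration of the idea (not the code in this PR), the sketch below pairs a data generator with a separate shape generator. `ToyVirtualArray`, `make_loader`, and `jet_shape` are hypothetical names invented for this example; `None` stands in for unknown_length.

```python
from typing import Callable, List, Optional, Tuple


class ToyVirtualArray:
    """Hypothetical stand-in for a virtual array; not awkward's VirtualArray."""

    def __init__(
        self,
        generator: Callable[[], List[float]],
        shape_generator: Callable[[], Tuple[int, ...]],
    ) -> None:
        self._generator = generator
        self._shape_generator = shape_generator
        self._data: Optional[List[float]] = None
        self._shape: Optional[Tuple[int, ...]] = None  # None = not yet known

    @property
    def shape(self) -> Tuple[int, ...]:
        # Asking for the shape only calls the shape generator, never the data one.
        if self._shape is None:
            self._shape = self._shape_generator()
        return self._shape

    @property
    def is_materialized(self) -> bool:
        return self._data is not None

    def materialize(self) -> List[float]:
        if self._data is None:
            self._data = self._generator()
        return self._data


data_loads: List[str] = []  # records which buffers were actually read


def make_loader(name: str, values: List[float]) -> Callable[[], List[float]]:
    def generator() -> List[float]:
        data_loads.append(name)
        return list(values)

    return generator


def jet_shape() -> Tuple[int, ...]:
    # Assumed to come from something cheap and shared, e.g. common offsets.
    return (5,)


# Two arrays share one shape generator because they have the same shape.
pt = ToyVirtualArray(make_loader("pt", [10.0, 20.0, 30.0, 40.0, 50.0]), jet_shape)
eta = ToyVirtualArray(make_loader("eta", [0.1, 0.2, 0.3, 0.4, 0.5]), jet_shape)

shapes = (pt.shape, eta.shape)
```

In this toy version, querying `pt.shape` and `eta.shape` leaves `data_loads` empty; only an explicit `materialize()` call reads a buffer.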

We distinguish between the private ._shape and ._length attributes and the public .shape and .length properties: the public ones materialize the shape, while the private ones don't.
We extend this logic to the layouts. For layouts that define a private self._length as a function of their content, we initialize it to unknown_length in __init__ in the virtual array case and only compute it the first time .length is accessed on that layout, so that layouts can be instantiated without materializing shapes. We also avoid materializing shapes in the layouts' __repr__.
To make this easier, we introduce two helper utilities, maybe_shape_of and maybe_length_of.
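The private/public split can be pictured with the small sketch below. Everything in it is an assumption made for illustration: `ToyBuffer`, `ToyLayout`, the local `unknown_length` sentinel, and the `toy_maybe_shape_of` / `toy_maybe_length_of` helpers are hypothetical and only mimic the behaviour described here; the signatures of the real maybe_shape_of and maybe_length_of are not shown in this description.

```python
class UnknownLength:
    """Toy sentinel playing the role of unknown_length."""

    def __repr__(self) -> str:
        return "unknown_length"


unknown_length = UnknownLength()


class ToyBuffer:
    """Hypothetical virtual buffer with a lazily computed shape."""

    def __init__(self, shape_generator):
        self._shape_generator = shape_generator
        self._shape = (unknown_length,)  # private: never triggers materialization
        self.shape_calls = 0  # counts how often the shape was materialized

    @property
    def shape(self):
        # public: materializes the shape on first access, then caches it
        if self._shape[0] is unknown_length:
            self.shape_calls += 1
            self._shape = self._shape_generator()
        return self._shape


def toy_maybe_shape_of(buffer):
    # Returns whatever is currently known, possibly (unknown_length,).
    return buffer._shape


def toy_maybe_length_of(layout):
    # Returns the cached length, or unknown_length if it was never computed.
    return layout._length


class ToyLayout:
    """Hypothetical layout whose length depends on its content buffer."""

    def __init__(self, buffer):
        self._buffer = buffer
        self._length = unknown_length  # set in __init__ without touching shapes

    @property
    def length(self):
        if self._length is unknown_length:
            self._length = self._buffer.shape[0]
        return self._length

    def __repr__(self) -> str:
        # Uses the non-materializing helper so printing stays free.
        return f"<ToyLayout len={toy_maybe_length_of(self)!r}>"


buf = ToyBuffer(lambda: (4,))
layout = ToyLayout(buf)
repr_before = repr(layout)        # does not materialize the shape
calls_before = buf.shape_calls
length = layout.length            # first access materializes the shape
repr_after = repr(layout)
```

Constructing and printing the toy layout leaves `buf.shape_calls` at zero; the counter only moves when `.length` is first accessed.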

Finally, we make the necessary changes in from_buffers to construct the proper data and shape generators that are passed down to the VirtualArray buffers.
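A minimal sketch of what building such generator pairs could look like for a single list-type node is given below. It is an assumption-laden toy, not the from_buffers implementation: `build_generators`, the `"offsets"` / `"content"` keys, and the container of zero-argument loader callables are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Loader = Callable[[], List[int]]
ShapeGen = Callable[[], Tuple[int, ...]]


def build_generators(
    container: Dict[str, Loader], length: int
) -> Dict[str, Tuple[Loader, ShapeGen]]:
    """Return a (data generator, shape generator) pair per buffer key."""
    cache: Dict[str, List[int]] = {}

    def data_generator(key: str) -> Loader:
        def generator() -> List[int]:
            # Load each buffer at most once, then reuse it.
            if key not in cache:
                cache[key] = container[key]()
            return cache[key]

        return generator

    offsets_data = data_generator("offsets")
    content_data = data_generator("content")

    def offsets_shape() -> Tuple[int, ...]:
        # Known from the requested length alone; no buffer is read.
        return (length + 1,)

    def content_shape() -> Tuple[int, ...]:
        # Needs the offsets buffer, but not the content buffer itself.
        return (offsets_data()[-1],)

    return {
        "offsets": (offsets_data, offsets_shape),
        "content": (content_data, content_shape),
    }


loaded: List[str] = []  # records which buffers were read from the container


def make_loader(name: str, values: List[int]) -> Loader:
    def load() -> List[int]:
        loaded.append(name)
        return list(values)

    return load


container = {
    "offsets": make_loader("offsets", [0, 2, 5]),
    "content": make_loader("content", [1, 2, 3, 4, 5]),
}
generators = build_generators(container, length=2)

offsets_shape_value = generators["offsets"][1]()   # reads nothing
content_shape_value = generators["content"][1]()   # reads only the offsets
loaded_after_shapes = list(loaded)
```

In this toy setup, asking for the content's shape reads only the offsets buffer, and the content buffer is read only when its data generator is called.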

This has also been tested through coffea with the ADL benchmarks, the coffea processors example, and the AGC. We observe no materialization when reading with nanoevents, and materialization of exactly the right buffers when running the analysis snippets.

@ikrommyd ikrommyd marked this pull request as ready for review April 22, 2025 05:22
@ikrommyd ikrommyd requested review from pfackeldey and ianna April 22, 2025 05:22
2 participants