TLDR: HEAD requests are the correct way to check content-length, but object stores (and overly restrictive policies) don't play nice.
The problem:
1. Determine the content length of the target URL.
2. Call `read_metadata_async(signedUrl, contentLength)`: works.
3. Call `read_row_group(signedUrl, /* etc sans contentLength */)`.
4. Receive a 4xx response on the unavoidable HEAD request (because signed URLs are only good for one method at a time).
Obviously the biggest contributor to this is S3 (it's the motivating example), but there are also plenty of servers in the wild configured to accept GET requests but deny HEAD requests (for whatever inane reason).
Since range request support is mandatory for async reads, there's the option of falling back to a GET with Range: bytes=0-0 and reading the total size from the response's Content-Range header (on a 206 response, Content-Length only covers the returned range, here a single byte). The only real question is whether to do this via a reader option, via a try/catch fallback (incurring an additional request), or by restoring direct contentLength as an option on read_row_group.
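For illustration, here is a rough sketch of the try/catch fallback variant. Every name in it (`MinimalFetch`, `parseTotalLength`, `resolveContentLength`) is hypothetical and not part of this library's API. It assumes the server answers a ranged GET with 206 and a `Content-Range` header of the form `bytes 0-0/<total>`; the total is taken from there because the `Content-Length` of a 206 response only describes the returned range.

```typescript
// Hypothetical helper types: just the subset of the fetch API this sketch needs.
interface MinimalResponse {
  ok: boolean;
  status: number;
  headers: { get(name: string): string | null };
  body?: { cancel(): Promise<void> } | null;
}
type MinimalFetch = (
  url: string,
  init: { method: string; headers?: Record<string, string> },
) => Promise<MinimalResponse>;

/** Extract the total size from a Content-Range value such as "bytes 0-0/12345". */
function parseTotalLength(contentRange: string | null): number | null {
  if (contentRange === null) return null;
  const match = /^bytes\s+(?:\d+-\d+|\*)\/(\d+)$/.exec(contentRange.trim());
  return match ? Number(match[1]) : null; // null when the total is "*" or malformed
}

/** Try HEAD first; if it is rejected, fall back to a one-byte ranged GET. */
async function resolveContentLength(url: string, fetchImpl: MinimalFetch): Promise<number> {
  try {
    const head = await fetchImpl(url, { method: "HEAD" });
    const length = head.headers.get("content-length");
    if (head.ok && length !== null) return Number(length);
  } catch {
    // Network-level failure on HEAD: fall through to the GET fallback.
  }

  // A URL signed for GET should accept this request even when HEAD is denied.
  const res = await fetchImpl(url, { method: "GET", headers: { Range: "bytes=0-0" } });
  const total = res.status === 206 ? parseTotalLength(res.headers.get("content-range")) : null;
  // Discard the body so a server that ignores Range does not stream the whole file.
  await res.body?.cancel();
  if (total === null) {
    throw new Error(`Could not determine content length (status ${res.status})`);
  }
  return total;
}
```

Usage would look like `const contentLength = await resolveContentLength(signedUrl, fetch);`. Note that in a browser, a cross-origin server would also need to expose `Content-Range` via `Access-Control-Expose-Headers` for this to work.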
Ideally we wouldn't need to know the content length at all; it should be possible to fetch the last bytes of a Parquet file to get the byte range of the metadata, and from there get the byte ranges of the columns. But arrow2 doesn't support that, and I think arrow2 development has mostly stopped.
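To make the idea concrete: a Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`, so a suffix range request (`Range: bytes=-8`) is enough to locate the metadata without knowing the file size. The sketch below only shows the parsing side; `parseFooterLength` and `footerRangeHeader` are hypothetical names, not arrow2 or parquet-wasm functions, and it assumes the server honors suffix ranges.

```typescript
const PARQUET_MAGIC = 0x50415231; // "PAR1" read as a big-endian uint32
const TAIL_SIZE = 8; // 4-byte footer length + 4-byte magic

/** Given the last bytes of a Parquet file, return the footer metadata length. */
function parseFooterLength(tail: Uint8Array): number {
  if (tail.length < TAIL_SIZE) {
    throw new Error("Need at least the last 8 bytes of the file");
  }
  // Only the final 8 bytes matter, even if a larger tail was fetched.
  const view = new DataView(tail.buffer, tail.byteOffset + tail.length - TAIL_SIZE, TAIL_SIZE);
  if (view.getUint32(4, false) !== PARQUET_MAGIC) {
    throw new Error("Missing PAR1 magic: not a plain Parquet footer");
  }
  return view.getUint32(0, true); // footer length is little-endian
}

/** Suffix range covering the footer metadata plus the 8-byte tail. */
function footerRangeHeader(footerLength: number): string {
  return `bytes=-${footerLength + TAIL_SIZE}`;
}
```

A reader built this way would issue one request for the tail, a second for `footerRangeHeader(length)`, and then ordinary absolute ranges for the column chunks, since the metadata records their offsets.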
Is your suggestion mainly to use a GET request instead of a HEAD request?