HEAD-ache requests #272

H-Plus-Time · 2023-08-01T16:46:16Z

TLDR: HEAD requests are the correct way to check content-length, but object stores (and overly restrictive policies) don't play nice.

The problem:

1. Determine the content length of the target URL.
2. Attempt to call read_metadata_async(signedUrl, contentLength) - works.
3. Call read_row_group(signedUrl, /* etc sans contentLength */).
4. Receive a 4xx response on the unavoidable HEAD request (because signed URLs are only good for one method at a time).

Obviously the biggest contributor to this is S3 (it's the motivating example), but there are also plenty of servers in the wild configured to accept GET requests but deny HEAD requests (for whatever inane reason).

Since range request support is mandatory for async reads, there's the option of falling back to a GET with Range: bytes=0-0 and reading the total size from the Content-Range header (the Content-Length of that partial response is just 1). The only question really is whether to do this via a reader option, via a try/catch fallback (incurring an additional request), or by restoring direct contentLength as an option on read_row_group.
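For illustration, a minimal sketch of the GET fallback, assuming the server answers a one-byte range request with 206 and a Content-Range header of the form "bytes 0-0/TOTAL". The names FetchLike, totalFromContentRange and contentLengthViaGet are hypothetical and not part of this library's API; the fetch function is passed in so the helper can be exercised without a network.

```typescript
// Minimal fetch-like signature (hypothetical), narrow enough to fake in a test.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ status: number; headers: { get(name: string): string | null } }>;

// Parse the total size out of a Content-Range value such as "bytes 0-0/12345".
// Returns null when the header is missing or the total is unknown ("*").
function totalFromContentRange(value: string | null): number | null {
  if (value === null) return null;
  const match = /^bytes\s+\d+-\d+\/(\d+)$/.exec(value.trim());
  return match ? Number(match[1]) : null;
}

// Request only the first byte with GET and read the full object size from
// Content-Range. Content-Length is not used: for this response it is 1.
async function contentLengthViaGet(url: string, fetchFn: FetchLike): Promise<number> {
  const res = await fetchFn(url, { method: "GET", headers: { Range: "bytes=0-0" } });
  if (res.status !== 206) {
    throw new Error(`expected 206 Partial Content, got ${res.status}`);
  }
  const total = totalFromContentRange(res.headers.get("Content-Range"));
  if (total === null) {
    throw new Error("response carried no usable Content-Range header");
  }
  return total;
}
```

In a browser or a recent Node.js the global fetch should be usable as the second argument. Whether this runs up front (reader option) or only after a failed HEAD (try/catch fallback) is the open design question above; the sketch does not settle it.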


kylebarron · 2023-08-01T22:29:52Z

Ideally we wouldn't need to know the content length at all; it should be possible to fetch the last bytes of a Parquet file to get the byte range of the metadata, and from there get the byte ranges of the columns. But arrow2 doesn't support that, and I think arrow2 development has mostly stopped.
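As a rough sketch of that idea (not something arrow2 or this library is confirmed to do): an unencrypted Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes "PAR1", so the final 8 bytes are enough to locate the metadata. With a server that honors suffix ranges, those bytes can be requested without knowing the total size. The helper names below are hypothetical.

```typescript
// Trailer layout assumed here: <4-byte little-endian metadata length> "PAR1".
const FOOTER_SIZE = 8;
const PAR1_MAGIC = 0x50415231; // "PAR1" read as a big-endian u32

// HTTP suffix range asking for the final n bytes; no total size required.
function suffixRange(n: number): string {
  return `bytes=-${n}`;
}

// Given the final 8 bytes of the file, return the length in bytes of the
// metadata block that sits immediately before them.
function metadataLengthFromFooter(tail: Uint8Array): number {
  if (tail.length !== FOOTER_SIZE) {
    throw new Error(`expected ${FOOTER_SIZE} footer bytes, got ${tail.length}`);
  }
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
  if (view.getUint32(4, false) !== PAR1_MAGIC) {
    throw new Error("missing PAR1 magic; not an unencrypted Parquet footer");
  }
  return view.getUint32(0, true);
}

// Suffix range covering the metadata block plus the 8-byte trailer.
function metadataSuffixRange(metadataLength: number): string {
  return suffixRange(metadataLength + FOOTER_SIZE);
}
```

The flow would be two GETs: one with suffixRange(FOOTER_SIZE) to learn the metadata length, then one with metadataSuffixRange(...) to fetch the metadata itself, whose contents give absolute offsets for the column chunks.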

So your suggestion is mainly to use a GET request instead of a HEAD request?

kylebarron · 2023-08-01T22:30:45Z

(arrow-rs might support async reads without knowing the content length; I haven't checked)

kylebarron mentioned this issue Sep 19, 2023

Perform HEAD request for HttpStore::head apache/arrow-rs#4837

Merged

kylebarron mentioned this issue Nov 20, 2023

WIP: Improved async api for arrow1 #393

Closed

kylebarron mentioned this issue Nov 30, 2023

Request batching #392

Closed
