Added buffer interoperability with arrow-rs #1437
}

pub(crate) type Bytes<T> = foreign_vec::ForeignVec<BytesAllocator, T>;
This is the real meat of the conversion
@@ -14,6 +15,7 @@ pub trait NativeType:
+ Send
+ Sync
+ Sized
+ RefUnwindSafe
This is necessary to ensure that arrow2::Bytes<T> is RefUnwindSafe, which is important to arrow::Buffer.
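To illustrate why the extra supertrait matters, here is a minimal std-only sketch. `Native` and `Bytes` below are simplified, hypothetical stand-ins (the real `Bytes<T>` is an alias over `foreign_vec::ForeignVec`, not a plain struct); the point is only that a generic container can be proven `RefUnwindSafe` for every `T` once the element trait requires it.

```rust
use std::panic::{catch_unwind, RefUnwindSafe};

/// Hypothetical, cut-down stand-in for `NativeType`.
/// `RefUnwindSafe` is listed as a supertrait, as in the diff above.
pub trait Native: Send + Sync + Sized + RefUnwindSafe + 'static {}

impl Native for i32 {}
impl Native for f64 {}

/// Hypothetical stand-in for `Bytes<T>`: just a wrapper around a `Vec<T>`.
pub struct Bytes<T: Native> {
    pub data: Vec<T>,
}

/// Compiles only when `X` is `RefUnwindSafe`.
fn assert_ref_unwind_safe<X: RefUnwindSafe>() {}

/// Generic over every `T: Native`. This type-checks because `T: RefUnwindSafe`
/// is implied by the supertrait, which makes `Vec<T>` (and so `Bytes<T>`)
/// `RefUnwindSafe` as well. Dropping the supertrait makes this fail to compile.
pub fn check_bytes<T: Native>() {
    assert_ref_unwind_safe::<Bytes<T>>();
}

/// A shared reference is `UnwindSafe` only if its target is `RefUnwindSafe`,
/// so borrowing the buffer inside `catch_unwind` relies on the same bound.
pub fn len_behind_catch_unwind<T: Native>(bytes: &Bytes<T>) -> Option<usize> {
    catch_unwind(|| bytes.data.len()).ok()
}
```

For example, `check_bytes::<i32>()` and `len_behind_catch_unwind(&Bytes { data: vec![1i32, 2, 3] })` both compile only because of the supertrait bound.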
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #1437 +/- ##
==========================================
+ Coverage 83.76% 83.78% +0.02%
==========================================
Files 375 376 +1
Lines 41024 41074 +50
==========================================
+ Hits 34364 34415 +51
+ Misses 6660 6659 -1
... and 4 files with indirect coverage changes
Is there anything about this API that would preclude an (eventual) unification of the underlying buffer types? If not, it then seems quite reasonable to me to introduce an (optional) migration path and then work on unifying the buffer types to get
No. However, if this approach is given the green light, it is unclear that such a unification would be worth the fairly significant effort; I certainly would not intend to undertake it.
The integration test failure does not appear to be related to this PR.
I do think it is a regression if we cannot get back to
Couldn't we already do this with the Arrow FFI spec? What are the pros and cons of this route compared with FFI? We would still need to compile both libraries if we convert between the two.
The only restriction is that you can't go back to a Vec from an array that was created by the other library and then converted, i.e. the conversion loses the ability to go back to a Vec. Arrow-rs arrays created from a Vec can still be converted back, and the same goes for arrow2.
The conversion is safe and ergonomic; FFI is neither 😅 This approach should also be marginally faster, as it doesn't need to marshal back and forth through the C data layout (which may need to recompute null buffers).
You only need to compile an extremely small part of arrow-rs; it won't register in the compile times at all.
Maybe we can add a test demonstrating going back and forth to a Vec (and when it doesn't work) as a way to document the limitation?
Right, I misunderstood that part. In that case this looks great! 👍
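The Vec limitation discussed above can be illustrated with a small std-only sketch. All names here (`ForeignOwner`, `OtherLibBuffer`, `Storage`, `Buffer`) are hypothetical stand-ins, not the arrow2 or arrow-rs types; the sketch assumes only that a buffer either owns a `Vec` it allocated itself or holds an opaque owner from the other library.

```rust
use std::sync::Arc;

/// Hypothetical trait: memory owned by another library, exposed read-only.
pub trait ForeignOwner<T>: Send + Sync {
    fn as_slice(&self) -> &[T];
}

/// Hypothetical stand-in for the other library's buffer type.
pub struct OtherLibBuffer<T>(pub Vec<T>);

impl<T: Send + Sync> ForeignOwner<T> for OtherLibBuffer<T> {
    fn as_slice(&self) -> &[T] {
        &self.0
    }
}

enum Storage<T: 'static> {
    /// Allocated by this library from a `Vec`; can be handed back as one.
    Native(Vec<T>),
    /// Kept alive by an opaque owner; this library cannot reclaim it as a `Vec`.
    Foreign(Arc<dyn ForeignOwner<T>>),
}

pub struct Buffer<T: 'static> {
    storage: Storage<T>,
}

impl<T: 'static> Buffer<T> {
    pub fn from_vec(v: Vec<T>) -> Self {
        Buffer { storage: Storage::Native(v) }
    }

    /// Wraps the other library's buffer without copying its contents.
    pub fn from_foreign<O: ForeignOwner<T> + 'static>(owner: O) -> Self {
        Buffer { storage: Storage::Foreign(Arc::new(owner)) }
    }

    pub fn as_slice(&self) -> &[T] {
        match &self.storage {
            Storage::Native(v) => v.as_slice(),
            Storage::Foreign(owner) => owner.as_slice(),
        }
    }

    /// Succeeds only for natively allocated buffers; a converted buffer is
    /// returned unchanged in the `Err` variant.
    pub fn into_vec(self) -> Result<Vec<T>, Self> {
        match self.storage {
            Storage::Native(v) => Ok(v),
            foreign => Err(Buffer { storage: foreign }),
        }
    }
}
```

In this sketch, `Buffer::from_vec(v).into_vec()` returns the original `Vec`, while a buffer built with `from_foreign` can still be read through `as_slice` but `into_vec` returns `Err`.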
This PR makes sense to me -- thank you @tustvold
However, I am not super familiar with the arrow2 codebase so I defer to @ritchie46 / @jorgecarleitao / @sundy-li for final approval
I have reviewed and this is very ingenious and well implemented. Thank you @tustvold !
My only minor concern is that, because arrow-buffer bumps its major version every 2 weeks, we need to update this repo every 2 weeks. This is only a procedural issue, as the crate is not changing much.
Thank you again, @tustvold 🙇
We might be able to publish new versions of arrow2 with minor (e.g. |
As part of #1429 we want to provide an interoperability story between arrow2 and arrow-rs.
The original proposal involved porting arrow-rs and arrow2 to have a common base array representation. This was to preserve the original spirit of @jorgecarleitao's proposal in apache/arrow-rs#1176 (comment). However, doing this in an incremental fashion without introducing performance regressions or major breaking changes is complicated and extremely time-consuming.
Taking a step back, all we really want is a reasonably fast way to convert between array representations, to facilitate interoperability and potentially incremental migration of codebases. Whilst perhaps less "pure", simply providing a safe API to convert between ArrayData and Box<dyn arrow2::Array> is likely sufficient.
The major things this would change are:
Vec as they would be opaque allocations
However, it would allow us to provide an interoperability story in a matter of days instead of weeks/months.
In this vein, this PR adds zero-copy conversion between the buffer representations, as this is all that is really necessary to permit this. The rest of the conversion logic is fairly mechanical; I already have it mostly implemented but wanted to get feedback first.
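As a rough, std-only illustration of what "zero-copy" means here: each side can treat the other side's buffer as an opaque owner that keeps the allocation alive, so conversion moves a handle rather than the bytes. `BufferA`, `BufferB`, and `Owner` below are hypothetical stand-ins and do not reflect the actual arrow2 or arrow-rs implementation.

```rust
use std::sync::Arc;

/// Hypothetical "opaque owner": anything that keeps bytes alive and can view them.
type Owner = Arc<dyn AsRef<[u8]> + Send + Sync>;

/// Hypothetical stand-in for one library's buffer.
pub struct BufferA {
    owner: Owner,
}

/// Hypothetical stand-in for the other library's buffer.
pub struct BufferB {
    owner: Owner,
}

impl BufferA {
    pub fn from_vec(v: Vec<u8>) -> Self {
        // Moving the Vec into the Arc does not move its heap allocation.
        BufferA { owner: Arc::new(v) }
    }

    pub fn as_slice(&self) -> &[u8] {
        AsRef::<[u8]>::as_ref(&*self.owner)
    }
}

impl BufferB {
    pub fn as_slice(&self) -> &[u8] {
        AsRef::<[u8]>::as_ref(&*self.owner)
    }
}

impl AsRef<[u8]> for BufferA {
    fn as_ref(&self) -> &[u8] {
        self.as_slice()
    }
}

impl AsRef<[u8]> for BufferB {
    fn as_ref(&self) -> &[u8] {
        self.as_slice()
    }
}

/// Conversion wraps the source buffer as the new owner; no bytes are copied.
/// In this sketch each round trip adds one layer of wrapping.
impl From<BufferA> for BufferB {
    fn from(a: BufferA) -> Self {
        BufferB { owner: Arc::new(a) }
    }
}

impl From<BufferB> for BufferA {
    fn from(b: BufferB) -> Self {
        BufferA { owner: Arc::new(b) }
    }
}
```

Comparing `as_slice().as_ptr()` before and after `BufferB::from(a)` and `BufferA::from(b)` shows that the underlying allocation is the same throughout.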