-
Notifications
You must be signed in to change notification settings - Fork 850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zero-copy Vec conversion (#3516) (#1176) #3756
Conversation
@@ -69,6 +71,21 @@ impl Buffer { | |||
} | |||
} | |||
|
|||
/// Create a [`Buffer`] from the provided `Vec` without copying | |||
#[inline] | |||
pub fn from_vec<T: ArrowNativeType>(vec: Vec<T>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't implement From
as this would conflict with the blanket From<AsRef<[u8]>>
@@ -632,4 +690,83 @@ mod tests { | |||
let buffer = Buffer::from(MutableBuffer::from_len_zeroed(12)); | |||
buffer.slice_with_length(2, usize::MAX); | |||
} | |||
|
|||
#[test] | |||
fn test_vec_interop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding Vec conversion at the Buffer
level does allow for transmuting types provided they have the same layout, this is perhaps not ideal but isn't harmful.
Long-term I hope to deprecate and remove Buffer
and MutableBuffer
, just leaving ScalarBuffer
which statically prevents this sort of type-conversion, but doing it this way allows for a gradual migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is perhaps not ideal but isn't harmful.
Is it a problem for something like converting integers to (invalid) floats 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, there is a good justification for why this is perfectly fine here. The only place where this becomes problematic is where types have bit sequences that illegal, e.g. NonZeroU32 or bool, all ArrowNativeType
are valid for all bit sequences.
arrow-buffer/src/buffer/immutable.rs
Outdated
@@ -269,6 +289,43 @@ impl Buffer { | |||
length, | |||
}) | |||
} | |||
|
|||
/// Returns `Vec` for mutating the buffer if this buffer is not offset and was | |||
/// allocated with the correct layout for `Vec<T>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to explicitly say here an error is returned if the buffer can't be converted to a Vec (and ideally hint how to get a Vec out of it anyways (perhaps by copying)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this code and the tests carefully and it looked good to me. I have a few minor doc and test suggestions but all in all 👍
Nice work @tustvold
arrow-buffer/src/alloc/mod.rs
Outdated
/// See [allocate_aligned] and [free_aligned]. | ||
Arrow(usize), | ||
/// An allocation using [`std::alloc`] | ||
Standard(Layout), | ||
/// An allocation from an external source like the FFI interface or a Rust Vec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the comment for Custom
should be updated to say like "external Rust Vec" or maybe remove the reference to Rust Vec entirely
pub fn from_vec<T: ArrowNativeType>(vec: Vec<T>) -> Self { | ||
// Safety | ||
// Vec::as_ptr guaranteed to not be null and ArrowNativeType are trivially transmutable | ||
let ptr = unsafe { NonNull::new_unchecked(vec.as_ptr() as _) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like into_raw_parts
would be ideal here, but sadly it appears to not yet be stabalized
https://doc.rust-lang.org/beta/std/vec/struct.Vec.html#method.into_raw_parts
I note that the docs say
After calling this function, the caller is responsible for the memory previously managed by the Vec. The only way to do this is to convert the raw pointer, length, and capacity back into a Vec with the from_raw_parts function, allowing the destructor to perform the cleanup.
However, this code uses Deallocation::Standard(layout)
to deallocate memory, which seems ok, I just wanted to point out it doesn't match the docs (though maybe this is fine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs for dealloc
are authoritative here
ptr must denote a block of memory currently allocated via this allocator
layout must be the same layout that was used to allocate that block of memory
The layout of arrays, vecs, etc... is fixed and defined by Layout::array
. As such we are fine provided we obey the same logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we plan to store/create from layouts other than Vec
? Otherwise we can just create from Vec
, forget the owned Vector.
On drop
we recreate the Vec
so that the drop sequence is executed.
This would defer all this logic to the implementation in std
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we plan to store/create from layouts other than Vec? Otherwise we can just create from Vec, forget the owned Vector.
Eventually we may deprecate and remove support for other layouts, but at least for a period we need to support aligned layouts such as those created by MutableBuffer so that we can avoid stop-the-world changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, maybe we could add a comment that once we remove that support we could remove the unsafe layout related code and do something similar to what we do here: https://github.com/DataEngineeringLabs/foreign_vec/blob/0d38968facee8a81748ec380fad78379d806fe1d/src/lib.rs#L25
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsafe layout related code and do something similar to what we do here
What would be the advantage of this? Does polars depend on the foreign vec interface? It doesn't seem to be doing anything materially different?
Edit: In fact I'm fairly sure this line is technically unsound - https://github.com/DataEngineeringLabs/foreign_vec/blob/0d38968facee8a81748ec380fad78379d806fe1d/src/lib.rs#L49, there are a lot of safety constraints on from_raw_parts that appear to be not being checked or documented?
let len = vec.len() * std::mem::size_of::<T>(); | ||
// Safety | ||
// Layout guaranteed to be valid | ||
let layout = unsafe { Layout::array::<T>(vec.capacity()).unwrap_unchecked() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to use unwrap_unchecked here? Maybe it would be better to panic if the layout was messed up somehow?
Looks like it errors on overflow https://doc.rust-lang.org/beta/std/alloc/struct.Layout.html#method.array
Like maybe the size overflows because someone passed in a giant Vec or something?
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to use unwrap_unchecked here
This is a clone of Vec::current_memory
, I will add a link.
The TLDR is that if the Vec is valid, it must have a valid layout (otherwise among other issues Vec
wouldn't be able to deallocate itself).
@@ -632,4 +690,83 @@ mod tests { | |||
let buffer = Buffer::from(MutableBuffer::from_len_zeroed(12)); | |||
buffer.slice_with_length(2, usize::MAX); | |||
} | |||
|
|||
#[test] | |||
fn test_vec_interop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is perhaps not ideal but isn't harmful.
Is it a problem for something like converting integers to (invalid) floats 🤔
let mut b: Vec<i128> = Vec::with_capacity(4); | ||
b.extend_from_slice(&[1, 2, 3]); | ||
let b = Buffer::from_vec(b); | ||
let back = b.into_vec::<i256>().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is also worth a test when the capacity doesn't equally divide equally into the transmuted size -- like maybe change this test to have capacity 5
with i128
and verify the i256
output Vec only has capacity 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A capacity of 5 is tested above and results in an error, as the layout of the underlying allocation is invalid
let slice = b.slice_with_length(0, 34); | ||
drop(b); | ||
|
||
let back = slice.into_vec::<i128>().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is slicing with a non zero offset also covered? Maybe slice(2)
below covers that 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah slice(2)
covers this
@@ -95,12 +102,14 @@ impl MutableBuffer { | |||
|
|||
/// Allocates a new [MutableBuffer] from given `Bytes`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it is worth updating these docs to explain when an Err is returned
@@ -90,6 +90,15 @@ impl<T: ArrowNativeType> From<Buffer> for ScalarBuffer<T> { | |||
} | |||
} | |||
|
|||
impl<T: ArrowNativeType> From<Vec<T>> for ScalarBuffer<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
It might be worth it to have @viirya give this PR a once over before merge. Not sure if @ritchie46 or @jorgecarleitao are interested or want to comment either. |
Integration failure is unrelated - apache/arrow#34367 |
Unless I hear anything further, I plan to merge this tomorrow morning. Please let me know if you need more time to review |
Benchmark runs are scheduled for baseline = 5edc954 and contender = d440c24. d440c24 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #3516
Relates to #1176
Rationale for this change
This starts the process of migrating away from custom mutable buffer abstractions, in favour of
Vec
. This will allow eventually deprecating and removingMutableBuffer
andBufferBuilder
, and eventuallyBuffer
itself.What changes are included in this PR?
Are there any user-facing changes?