-
Notifications
You must be signed in to change notification settings - Fork 850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zero-copy Vec conversion (#3516) (#1176) #3756
Changes from all commits
4cbad4d
d84e499
a4f1435
0abef5f
f07f396
4a882f5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,13 +15,13 @@ | |
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
use std::convert::AsRef; | ||
use std::alloc::Layout; | ||
use std::fmt::Debug; | ||
use std::iter::FromIterator; | ||
use std::ptr::NonNull; | ||
use std::sync::Arc; | ||
|
||
use crate::alloc::{Allocation, Deallocation}; | ||
use crate::alloc::{Allocation, Deallocation, ALIGNMENT}; | ||
use crate::util::bit_chunk_iterator::{BitChunks, UnalignedBitChunk}; | ||
use crate::{bytes::Bytes, native::ArrowNativeType}; | ||
|
||
|
@@ -42,6 +42,8 @@ pub struct Buffer { | |
ptr: *const u8, | ||
|
||
/// Byte length of the buffer. | ||
/// | ||
/// Must be less than or equal to `data.len()` | ||
length: usize, | ||
} | ||
|
||
|
@@ -69,6 +71,22 @@ impl Buffer { | |
} | ||
} | ||
|
||
/// Create a [`Buffer`] from the provided `Vec` without copying | ||
#[inline] | ||
pub fn from_vec<T: ArrowNativeType>(vec: Vec<T>) -> Self { | ||
// Safety | ||
// Vec::as_ptr guaranteed to not be null and ArrowNativeType are trivially transmutable | ||
let ptr = unsafe { NonNull::new_unchecked(vec.as_ptr() as _) }; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems like https://doc.rust-lang.org/beta/std/vec/struct.Vec.html#method.into_raw_parts I note that the docs say
However, this code uses There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The docs for
The layout of arrays, vecs, etc... is fixed and defined by There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we plan to store/create from layouts other than On This would defer all this logic to the implementation in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Eventually we may deprecate and remove support for other layouts, but at least for a period we need to support aligned layouts such as those created by MutableBuffer so that we can avoid stop-the-world changes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, maybe we could add a comment that once we remove that support we could remove the unsafe layout related code and do something similar to what we do here: https://github.com/DataEngineeringLabs/foreign_vec/blob/0d38968facee8a81748ec380fad78379d806fe1d/src/lib.rs#L25 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What would be the advantage of this? Does polars depend on the foreign vec interface? It doesn't seem to be doing anything materially different? Edit: In fact I'm fairly sure this line is technically unsound - https://github.com/DataEngineeringLabs/foreign_vec/blob/0d38968facee8a81748ec380fad78379d806fe1d/src/lib.rs#L49, there are a lot of safety constraints on from_raw_parts that appear to be not being checked or documented? |
||
let len = vec.len() * std::mem::size_of::<T>(); | ||
// Safety | ||
// Vec guaranteed to have a valid layout matching that of `Layout::array` | ||
// This is based on `RawVec::current_memory` | ||
let layout = unsafe { Layout::array::<T>(vec.capacity()).unwrap_unchecked() }; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a reason to use unwrap_unchecked here? Maybe it would be better to panic if the layout was messed up somehow? Looks like it errors on overflow https://doc.rust-lang.org/beta/std/alloc/struct.Layout.html#method.array Like maybe the size overflows because someone passed in a giant Vec or something? 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This is a clone of The TLDR is that if the Vec is valid, it must have a valid layout (otherwise among other issues |
||
std::mem::forget(vec); | ||
let b = unsafe { Bytes::new(ptr, len, Deallocation::Standard(layout)) }; | ||
Self::from_bytes(b) | ||
} | ||
|
||
/// Initializes a [Buffer] from a slice of items. | ||
pub fn from_slice_ref<U: ArrowNativeType, T: AsRef<[U]>>(items: T) -> Self { | ||
let slice = items.as_ref(); | ||
|
@@ -78,7 +96,7 @@ impl Buffer { | |
buffer.into() | ||
} | ||
|
||
/// Creates a buffer from an existing memory region (must already be byte-aligned), this | ||
/// Creates a buffer from an existing aligned memory region (must already be byte-aligned), this | ||
/// `Buffer` will free this piece of memory when dropped. | ||
/// | ||
/// # Arguments | ||
|
@@ -91,9 +109,11 @@ impl Buffer { | |
/// | ||
/// This function is unsafe as there is no guarantee that the given pointer is valid for `len` | ||
/// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed. | ||
#[deprecated(note = "Use From<Vec<T>>")] | ||
pub unsafe fn from_raw_parts(ptr: NonNull<u8>, len: usize, capacity: usize) -> Self { | ||
assert!(len <= capacity); | ||
Buffer::build_with_arguments(ptr, len, Deallocation::Arrow(capacity)) | ||
let layout = Layout::from_size_align(capacity, ALIGNMENT).unwrap(); | ||
Buffer::build_with_arguments(ptr, len, Deallocation::Standard(layout)) | ||
} | ||
|
||
/// Creates a buffer from an existing memory region. Ownership of the memory is tracked via reference counting | ||
|
@@ -253,7 +273,8 @@ impl Buffer { | |
} | ||
|
||
/// Returns `MutableBuffer` for mutating the buffer if this buffer is not shared. | ||
/// Returns `Err` if this is shared or its allocation is from an external source. | ||
/// Returns `Err` if this is shared or its allocation is from an external source or | ||
/// it is not allocated with alignment [`ALIGNMENT`] | ||
pub fn into_mutable(self) -> Result<MutableBuffer, Self> { | ||
let ptr = self.ptr; | ||
let length = self.length; | ||
|
@@ -269,6 +290,45 @@ impl Buffer { | |
length, | ||
}) | ||
} | ||
|
||
/// Returns `Vec` for mutating the buffer | ||
/// | ||
/// Returns `Err(self)` if this buffer does not have the same [`Layout`] as | ||
/// the destination Vec or contains a non-zero offset | ||
pub fn into_vec<T: ArrowNativeType>(self) -> Result<Vec<T>, Self> { | ||
let layout = match self.data.deallocation() { | ||
Deallocation::Standard(l) => l, | ||
_ => return Err(self), // Custom allocation | ||
}; | ||
|
||
if self.ptr != self.data.as_ptr() { | ||
return Err(self); // Data is offset | ||
} | ||
|
||
let v_capacity = layout.size() / std::mem::size_of::<T>(); | ||
match Layout::array::<T>(v_capacity) { | ||
Ok(expected) if layout == &expected => {} | ||
_ => return Err(self), // Incorrect layout | ||
} | ||
|
||
let length = self.length; | ||
let ptr = self.ptr; | ||
let v_len = self.length / std::mem::size_of::<T>(); | ||
|
||
Arc::try_unwrap(self.data) | ||
.map(|bytes| unsafe { | ||
let ptr = bytes.ptr().as_ptr() as _; | ||
std::mem::forget(bytes); | ||
// Safety | ||
// Verified that bytes layout matches that of Vec | ||
Vec::from_raw_parts(ptr, v_len, v_capacity) | ||
}) | ||
.map_err(|bytes| Buffer { | ||
data: bytes, | ||
ptr, | ||
length, | ||
}) | ||
} | ||
} | ||
|
||
/// Creating a `Buffer` instance by copying the memory from a `AsRef<[u8]>` into a newly | ||
|
@@ -378,6 +438,7 @@ impl<T: ArrowNativeType> FromIterator<T> for Buffer { | |
|
||
#[cfg(test)] | ||
mod tests { | ||
use crate::i256; | ||
use std::panic::{RefUnwindSafe, UnwindSafe}; | ||
use std::thread; | ||
|
||
|
@@ -632,4 +693,125 @@ mod tests { | |
let buffer = Buffer::from(MutableBuffer::from_len_zeroed(12)); | ||
buffer.slice_with_length(2, usize::MAX); | ||
} | ||
|
||
#[test] | ||
fn test_vec_interop() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Adding Vec conversion at the Long-term I hope to deprecate and remove There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Is it a problem for something like converting integers to (invalid) floats 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nope, there is a good justification for why this is perfectly fine here. The only place where this becomes problematic is where types have bit sequences that illegal, e.g. NonZeroU32 or bool, all |
||
// Test empty vec | ||
let a: Vec<i128> = Vec::new(); | ||
let b = Buffer::from_vec(a); | ||
b.into_vec::<i128>().unwrap(); | ||
|
||
// Test vec with capacity | ||
let a: Vec<i128> = Vec::with_capacity(20); | ||
let b = Buffer::from_vec(a); | ||
let back = b.into_vec::<i128>().unwrap(); | ||
assert_eq!(back.len(), 0); | ||
assert_eq!(back.capacity(), 20); | ||
|
||
// Test vec with values | ||
let mut a: Vec<i128> = Vec::with_capacity(3); | ||
a.extend_from_slice(&[1, 2, 3]); | ||
let b = Buffer::from_vec(a); | ||
let back = b.into_vec::<i128>().unwrap(); | ||
assert_eq!(back.len(), 3); | ||
assert_eq!(back.capacity(), 3); | ||
|
||
// Test vec with values and spare capacity | ||
let mut a: Vec<i128> = Vec::with_capacity(20); | ||
a.extend_from_slice(&[1, 4, 7, 8, 9, 3, 6]); | ||
let b = Buffer::from_vec(a); | ||
let back = b.into_vec::<i128>().unwrap(); | ||
assert_eq!(back.len(), 7); | ||
assert_eq!(back.capacity(), 20); | ||
|
||
// Test incorrect alignment | ||
let a: Vec<i128> = Vec::new(); | ||
let b = Buffer::from_vec(a); | ||
let b = b.into_vec::<i32>().unwrap_err(); | ||
b.into_vec::<i8>().unwrap_err(); | ||
|
||
// Test convert between types with same alignment | ||
// This is an implementation quirk, but isn't harmful | ||
// as ArrowNativeType are trivially transmutable | ||
let a: Vec<i64> = vec![1, 2, 3, 4]; | ||
let b = Buffer::from_vec(a); | ||
let back = b.into_vec::<u64>().unwrap(); | ||
assert_eq!(back.len(), 4); | ||
assert_eq!(back.capacity(), 4); | ||
|
||
// i256 has the same layout as i128 so this is valid | ||
let mut b: Vec<i128> = Vec::with_capacity(4); | ||
b.extend_from_slice(&[1, 2, 3, 4]); | ||
let b = Buffer::from_vec(b); | ||
let back = b.into_vec::<i256>().unwrap(); | ||
assert_eq!(back.len(), 2); | ||
assert_eq!(back.capacity(), 2); | ||
|
||
// Invalid layout | ||
let b: Vec<i128> = vec![1, 2, 3]; | ||
let b = Buffer::from_vec(b); | ||
b.into_vec::<i256>().unwrap_err(); | ||
|
||
// Invalid layout | ||
let mut b: Vec<i128> = Vec::with_capacity(5); | ||
b.extend_from_slice(&[1, 2, 3, 4]); | ||
let b = Buffer::from_vec(b); | ||
b.into_vec::<i256>().unwrap_err(); | ||
|
||
// Truncates length | ||
// This is an implementation quirk, but isn't harmful | ||
let mut b: Vec<i128> = Vec::with_capacity(4); | ||
b.extend_from_slice(&[1, 2, 3]); | ||
let b = Buffer::from_vec(b); | ||
let back = b.into_vec::<i256>().unwrap(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it is also worth a test when the capacity doesn't equally divide equally into the transmuted size -- like maybe change this test to have capacity There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A capacity of 5 is tested above and results in an error, as the layout of the underlying allocation is invalid |
||
assert_eq!(back.len(), 1); | ||
assert_eq!(back.capacity(), 2); | ||
|
||
// Cannot use aligned allocation | ||
let b = Buffer::from(MutableBuffer::new(10)); | ||
let b = b.into_vec::<u8>().unwrap_err(); | ||
b.into_vec::<u64>().unwrap_err(); | ||
|
||
// Test slicing | ||
let mut a: Vec<i128> = Vec::with_capacity(20); | ||
a.extend_from_slice(&[1, 4, 7, 8, 9, 3, 6]); | ||
let b = Buffer::from_vec(a); | ||
let slice = b.slice_with_length(0, 64); | ||
|
||
// Shared reference fails | ||
let slice = slice.into_vec::<i128>().unwrap_err(); | ||
drop(b); | ||
|
||
// Succeeds as no outstanding shared reference | ||
let back = slice.into_vec::<i128>().unwrap(); | ||
assert_eq!(&back, &[1, 4, 7, 8]); | ||
assert_eq!(back.capacity(), 20); | ||
|
||
// Slicing by non-multiple length truncates | ||
let mut a: Vec<i128> = Vec::with_capacity(8); | ||
a.extend_from_slice(&[1, 4, 7, 3]); | ||
|
||
let b = Buffer::from_vec(a); | ||
let slice = b.slice_with_length(0, 34); | ||
drop(b); | ||
|
||
let back = slice.into_vec::<i128>().unwrap(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is slicing with a non zero offset also covered? Maybe There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah |
||
assert_eq!(&back, &[1, 4]); | ||
assert_eq!(back.capacity(), 8); | ||
|
||
// Offset prevents conversion | ||
let a: Vec<u32> = vec![1, 3, 4, 6]; | ||
let b = Buffer::from_vec(a).slice(2); | ||
b.into_vec::<u32>().unwrap_err(); | ||
|
||
let b = MutableBuffer::new(16).into_buffer(); | ||
let b = b.into_vec::<u8>().unwrap_err(); // Invalid layout | ||
let b = b.into_vec::<u32>().unwrap_err(); // Invalid layout | ||
b.into_mutable().unwrap(); | ||
|
||
let b = Buffer::from_vec(vec![1_u32, 3, 5]); | ||
let b = b.into_mutable().unwrap_err(); // Invalid layout | ||
let b = b.into_vec::<u32>().unwrap(); | ||
assert_eq!(b, &[1, 3, 5]); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't implement
From
as this would conflict with the blanketFrom<AsRef<[u8]>>