Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support casting from integer to binary #5015

Merged
merged 4 commits into from
Nov 3, 2023

Conversation

viirya
Copy link
Member

@viirya viirya commented Nov 1, 2023

Which issue does this PR close?

Closes #5014.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 1, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is perhaps worth noting that ToByteSlice returns the native endian encoding, but frankly arrow-rs probably has all sorts of problems on big endian architectures so I'm inclined to not hold this back on this

) -> Result<ArrayRef, ArrowError> {
let array = array.as_primitive::<FROM>();

let mut builder = GenericBinaryBuilder::<O>::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could possibly initialize the correct capacity here

Comment on lines 2350 to 2356
for i in 0..array.len() {
if array.is_null(i) {
builder.append_null();
} else {
builder.append_value(array.value(i).to_byte_slice());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for i in 0..array.len() {
if array.is_null(i) {
builder.append_null();
} else {
builder.append_value(array.value(i).to_byte_slice());
}
}
for v in array {
builder.append_option(v);
}

Perhaps

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, v is not AsRef<[u8]> though.

2352 |         builder.append_option(v);
     |                 ------------- ^ the trait `AsRef<[u8]>` is not implemented for `<FROM as arrow_array::ArrowPrimitiveType>::Native`
     |                 |
     |                 required by a bound introduced by this call

But I also cannot do builder.append_option(v.map(|v| v.to_byte_slice())); too as it complains returns a value referencing data owned by the current function.

I keep it unchanged so.

@viirya
Copy link
Member Author

viirya commented Nov 1, 2023

It is perhaps worth noting that ToByteSlice returns the native endian encoding, but frankly arrow-rs probably has all sorts of problems on big endian architectures so I'm inclined to not hold this back on this

Yea, I noticed native endian encoding is used when looking into ToByteSlice. Just that I'm not sure what "not hold this back on this" means?

let array = array.as_primitive::<FROM>();

let mut builder =
GenericBinaryBuilder::<O>::with_capacity(array.len(), array.values().inner().capacity());
Copy link
Contributor

@tustvold tustvold Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually made me wonder if we could not do something even simpler like, i.e. just reuse the buffer

let array = array.as_primitive::<FROM>();
let size = std::mem::size_of::<FROM::Native>().usize_as();
let offsets = OffsetBuffer::from_lengths(std::iter::repeat(size).take(array.len());
Ok(GenericBinaryArray::<O>::new(offsets, array.values().inner().clone(), array.nulls().cloned()))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, my concern is that other type casting doesn't reuse buffer (or do we?), users expect that cast kernel returns an array with new buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We reuse the buffer when converting between StringArray and BinaryArray, and possibly some other transformations. I'm not sure why you wouldn't want to do so if you're able to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we already did it, it is okay. Let me update this.

@@ -2317,6 +2341,19 @@ fn value_to_string<O: OffsetSizeTrait>(
Ok(Arc::new(builder.finish()))
}

fn cast_numeric_to_binary<FROM: ArrowPrimitiveType, O: OffsetSizeTrait>(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, actually I also want to support non native endian encoding configured from cast options. But I will put it into follow ups.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be better as a custom kernel then? Perhaps kernels mirroring https://doc.rust-lang.org/std/primitive.i32.html#method.to_be and https://doc.rust-lang.org/std/primitive.i32.html#method.to_le ?

I think it would be good to avoid adding data type specific options to CastOptions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean to have a kernel doing to_be/to_le? It sounds good.

@viirya viirya merged commit 7705aca into apache:master Nov 3, 2023
25 checks passed
@viirya
Copy link
Member Author

viirya commented Nov 3, 2023

Thank you @tustvold

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support casting from integer to binary
2 participants