Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support datatype struct reorder #4962

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 33 additions & 0 deletions arrow-schema/src/datatype.rs
Original file line number Diff line number Diff line change
Expand Up @@ -875,4 +875,37 @@ mod tests {
UnionMode::Dense,
);
}

#[test]
fn test_struct_reorder() {
let mut datafields = DataType::Struct(Fields::from(vec![
Field::new("f1", DataType::Int32, false),
Field::new("f2", DataType::Utf8, false),
]));
let reverse_datafields = DataType::Struct(Fields::from(vec![
Field::new("f2", DataType::Utf8, false),
Field::new("f1", DataType::Int32, false),
]));
if let DataType::Struct(ref mut fields) = datafields {
fields.reverse();
}
assert_eq!(datafields, reverse_datafields);
}

#[test]
fn test_struct_push() {
let mut datafields = DataType::Struct(Fields::from(vec![
Field::new("f1", DataType::Int32, false),
Field::new("f2", DataType::Utf8, false),
]));
if let DataType::Struct(ref mut fields) = datafields {
fields.push(Field::new("f3", DataType::Boolean, false));
}
let expected_datafields = DataType::Struct(Fields::from(vec![
Field::new("f1", DataType::Int32, false),
Field::new("f2", DataType::Utf8, false),
Field::new("f3", DataType::Boolean, false),
]));
assert_eq!(datafields, expected_datafields);
}
}
24 changes: 19 additions & 5 deletions arrow-schema/src/fields.rs
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ use std::sync::Arc;
#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[cfg_attr(feature = "serde", serde(transparent))]
pub struct Fields(Arc<[FieldRef]>);
pub struct Fields(Arc<Vec<FieldRef>>);
Copy link
Contributor

@Folyd Folyd Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think change [FieldRef] to Vec<FieldRef] is necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think change [FieldRef] to Vec<FieldRef] is necessary

Rust [] is a fixed size vector in compile time, if we want to push a field, should use ‘Vec’?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary, as a result of unsized coercion - https://doc.rust-lang.org/std/ops/trait.CoerceUnsized.html

https://docs.rs/arrow-schema/latest/arrow_schema/struct.SchemaBuilder.html provides an easy to use interface on top of this, but Fields also implements FromIterator and so can be collected into directly


impl std::fmt::Debug for Fields {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
Expand All @@ -50,9 +50,13 @@ impl std::fmt::Debug for Fields {
}

impl Fields {
pub fn new(fields: Arc<Vec<FieldRef>>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We provide From implementations that would supercede this

Self(fields)
}

/// Returns a new empty [`Fields`]
pub fn empty() -> Self {
Self(Arc::new([]))
Self(Arc::new(vec![]))
}

/// Return size of this instance in bytes.
Expand Down Expand Up @@ -83,6 +87,16 @@ impl Fields {
.zip(other.iter())
.all(|(a, b)| Arc::ptr_eq(a, b) || a.contains(b))
}

pub fn reverse(&mut self) {
let new_fields: Vec<FieldRef> = self.iter().rev().map(|f| f.clone() as FieldRef).collect();
self.0 = Arc::new(new_fields);
}
Comment on lines +91 to +94
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here only support reverse, but should we provide more function like this?


pub fn push(&mut self, field: Field) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be extremely inefficient, performing multiple allocations each time, I would recommend using https://docs.rs/arrow-schema/latest/arrow_schema/struct.SchemaBuilder.html instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But SchemaBuilder also use Vec<FieldRef>, And Vector is not expanding too often.

Copy link
Contributor

@tustvold tustvold Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal in this PR will add an additional pointer indirection to field access, whilst still requiring atomics for every call to push.

I think the better question is, why not use SchemaBuilder, it will be faster, less complicated, and avoid having to make changes to Fields?

TLDR is I don't think any of the changes in this PR should be necessary for #4908

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the better question is, why not use SchemaBuilder, it will be faster, less complicated, and avoid having to make changes to Fields?

The problem is that the transformation of Fields into SchemaBuilder appears to be performance intensive.
Here it look like we can only use

SchemaBuilder::from(self.clone());

The result could be something like this:

    pub fn push(&mut self, field: Field) {
        let mut new_fields = SchemaBuilder::from(self.clone());
        new_fields.push(field);
        *self = new_fields.finish().fields;
    }

I am not sure why ShemaBuilder is better than old?

Copy link
Contributor

@tustvold tustvold Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting not adding these methods to Schema and instead using SchemaBuilder instead where you were intending to use them.

As you've discovered Schema, like the arrays, is not designed to be mutated in place, especially given it is normally wrapped in an Arc itself as a SchemaRef

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting not adding these methods to Schema and instead using SchemaBuilder instead where you were intending to use them.

Really thanks for you. ❤️ So we need an function like

pub fn push(&mut self, builder: &mut SchemaBuilder, field: Field) { }

When the user want to push a new field, should take the builder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they'd just use https://docs.rs/arrow-schema/latest/arrow_schema/struct.SchemaBuilder.html#method.push

I'm suggesting not trying to make Schema support in place mutation, it isn't designed for it, SchemaBuilder is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SchemaBuilder has support push, so what else do i need to do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing, I am suggesting we close this PR, and you work on implementing #4908 using SchemaBuilder

let fields = Arc::make_mut(&mut self.0);
fields.push(Arc::new(field));
}
}

impl Default for Fields {
Expand All @@ -99,7 +113,7 @@ impl FromIterator<Field> for Fields {

impl FromIterator<FieldRef> for Fields {
fn from_iter<T: IntoIterator<Item = FieldRef>>(iter: T) -> Self {
Self(iter.into_iter().collect())
Self(Arc::new(iter.into_iter().map(|f| f as FieldRef).collect()))
}
}

Expand All @@ -117,13 +131,13 @@ impl From<Vec<FieldRef>> for Fields {

impl From<&[FieldRef]> for Fields {
fn from(value: &[FieldRef]) -> Self {
Self(value.into())
Self(Arc::new(value.to_vec()))
}
}

impl<const N: usize> From<[FieldRef; N]> for Fields {
fn from(value: [FieldRef; N]) -> Self {
Self(Arc::new(value))
Self(Arc::new(value.to_vec()))
}
}

Expand Down
Loading