Fix/type preservation empty dataframes #301

Open: dan-corneanu wants to merge 1 commit into main

dan-corneanu

It looks like manipulating a column in an empty data frame defaults the result to a type of :string.
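
As a rough illustration of why this can happen, here is a toy model in plain Ruby. It assumes the type is inferred from the values once a column has been converted to a Ruby Array, and that `:string` is the fallback when there is nothing to inspect. `infer_type` and `TypedColumn` are hypothetical names for this sketch, not Red Arrow or RedAmber APIs.

```ruby
# Toy model only: these helpers are hypothetical and are not Red Arrow APIs.
# Infer a column type by looking at the values, as a builder must when it is
# handed a plain Ruby Array with no type attached.
def infer_type(values)
  sample = values.find { |v| !v.nil? }
  case sample
  when Integer then :int64
  when Float then :double
  else :string # nothing to inspect, so fall back to a default
  end
end

# A typed column that remembers its type independently of its values.
TypedColumn = Struct.new(:type, :items)

empty = TypedColumn.new(:int64, [])

# Round-tripping through a Ruby Array discards the declared type.
lossy_type = infer_type(empty.items.to_a)
# Carrying the declared type forward keeps it.
preserved_type = empty.type
```

With an empty column there is no sample value, so the inferred type falls back to the default while the declared type would have been `:int64`.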

@kou kou left a comment

Could you also add a fix to this PR?

diff --git a/lib/red_amber/data_frame_variable_operation.rb b/lib/red_amber/data_frame_variable_operation.rb
index 7a5179e..62b0706 100755
--- a/lib/red_amber/data_frame_variable_operation.rb
+++ b/lib/red_amber/data_frame_variable_operation.rb
@@ -675,9 +675,18 @@ module RedAmber
           raise DataFrameArgumentError, "Data size mismatch (#{data.size} != #{size})"
         end
 
-        a = Arrow::Array.new(data.is_a?(Vector) ? data.to_a : data)
+        if data.respond_to?(:to_arrow_chunked_array)
+          chunked_array = data.to_arrow_chunked_array
+        else
+          if data.respond_to?(:to_arrow_array)
+            a = data.to_arrow_array
+          else
+            a = Arrow::Array.new(data)
+          end
+          chunked_array = Arrow::ChunkedArray.new([a])
+        end
         fields[i] = Arrow::Field.new(key, a.value_data_type)
-        arrays[i] = Arrow::ChunkedArray.new([a])
+        arrays[i] = chunked_array
       end
       [fields, arrays]
     end
diff --git a/lib/red_amber/vector.rb b/lib/red_amber/vector.rb
index 7237807..5267eb6 100644
--- a/lib/red_amber/vector.rb
+++ b/lib/red_amber/vector.rb
@@ -198,6 +198,22 @@ module RedAmber
     alias_method :values, :to_ary
     alias_method :entries, :to_ary
 
+    # Convert to an Arrow::Array.
+    #
+    # @return [Arrow::Array]
+    #   Apache Arrow array representation.
+    def to_arrow_array
+      @data.to_arrow_array
+    end
+
+    # Convert to an Arrow::ChunkedArray.
+    #
+    # @return [Arrow::ChunkedArray]
+    #   Apache Arrow chunked array representation.
+    def to_arrow_chunked_array
+      @data.to_arrow_chunked_array
+    end
+
     # Indeces from 0 to size-1 by Array.
     #
     # @return [Array]

@dan-corneanu dan-corneanu force-pushed the fix/type-preservation-empty-dataframes branch from 9bece65 to 7ce24d6 Compare January 19, 2025 02:18
@dan-corneanu

diff --git a/lib/red_amber/data_frame_variable_operation.rb b/lib/red_amber/data_frame_variable_operation.rb
index 7a5179e..62b0706 100755
--- a/lib/red_amber/data_frame_variable_operation.rb
+++ b/lib/red_amber/data_frame_variable_operation.rb
@@ -675,9 +675,18 @@ module RedAmber
           raise DataFrameArgumentError, "Data size mismatch (#{data.size} != #{size})"
         end
 
-        a = Arrow::Array.new(data.is_a?(Vector) ? data.to_a : data)
+        if data.respond_to?(:to_arrow_chunked_array)
+          chunked_array = data.to_arrow_chunked_array
+        else
+          if data.respond_to?(:to_arrow_array)
+            a = data.to_arrow_array
+          else
+            a = Arrow::Array.new(data)
+          end
+          chunked_array = Arrow::ChunkedArray.new([a])
+        end
         fields[i] = Arrow::Field.new(key, a.value_data_type)
-        arrays[i] = Arrow::ChunkedArray.new([a])
+        arrays[i] = chunked_array
       end
       [fields, arrays]
     end

@kou if data.respond_to?(:to_arrow_chunked_array) is true, then a will be nil in this line: fields[i] = Arrow::Field.new(key, a.value_data_type).

I don't completely understand how chunked arrays are manipulated, but could we replace that line with fields[i] = Arrow::Field.new(key, chunked_array.value_data_type)?

kou commented Jan 19, 2025

Ah, sorry. We can use chunked_array.value_data_type instead of a.value_data_type.
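
A minimal sketch of the corrected branch, assuming the only change is to read the type from chunked_array instead of a. So that it runs without the red-arrow gem, `FakeArray`, `FakeChunkedArray`, `FakeVector` and `column_for` are hypothetical stand-ins, not the project's classes, and the `:inferred` type is a placeholder for whatever `Arrow::Array.new(data)` would infer.

```ruby
# Hypothetical stand-ins so the sketch runs without the red-arrow gem.
FakeArray = Struct.new(:value_data_type, :items)
FakeChunkedArray = Struct.new(:chunks) do
  # Report the type of the first chunk (enough for this sketch).
  def value_data_type
    chunks.first.value_data_type
  end
end

# Stand-in for a Vector exposing the conversion method proposed in the diff.
FakeVector = Struct.new(:chunked) do
  def to_arrow_chunked_array
    chunked
  end
end

# Returns [type, chunked_array] for one column; mirrors the branch in the diff.
def column_for(data)
  if data.respond_to?(:to_arrow_chunked_array)
    chunked_array = data.to_arrow_chunked_array
  else
    array = if data.respond_to?(:to_arrow_array)
              data.to_arrow_array
            else
              FakeArray.new(:inferred, data) # stands in for Arrow::Array.new(data)
            end
    chunked_array = FakeChunkedArray.new([array])
  end
  # Read the type from chunked_array, which is assigned on every path.
  [chunked_array.value_data_type, chunked_array]
end

# An empty but typed vector keeps its declared type.
vector = FakeVector.new(FakeChunkedArray.new([FakeArray.new(:uint8, [])]))
type_from_vector, = column_for(vector)
# A plain Ruby Array still goes through the inference path.
type_from_array, = column_for([1, 2, 3])
```

Because chunked_array is assigned in both branches, the nil dereference on a cannot occur, and an empty typed vector reports its own type rather than an inferred one.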

@dan-corneanu dan-corneanu force-pushed the fix/type-preservation-empty-dataframes branch from 1b2364b to a232f14 Compare January 19, 2025 06:16
@@ -250,7 +250,7 @@ class GroupTest < Test::Unit::TestCase
 Vectors : 3 numeric
 # key type level data_preview
 0 :i uint8 4 [0, 1, 2, nil], 1 nil
-1 :count uint8 3 [2, 1, 2, 0]
+1 :count int64 3 [2, 1, 2, 0]

@kou FYI I had to update this test after the change

kou commented Jan 19, 2025

I'll fix build failures on Linux in upstream. Please wait for a while...

3 participants