_repr_ and _html_repr_ show '... and additional rows' message #1041

Spaarsh · 2025-03-03T14:33:18Z

…ncated outputs

Which issue does this PR close?

Rationale for this change

Current implementation of _repr_ and _html_repr_ has no indication for whether the output was truncated or not.

What changes are included in this PR?

The changes override the default output for the _repr_ and _html_repr_ functions and, instead, check if the table has more than 10 rows. If it does, then it prints the first 10 rows along with a ... and additional rows message.

A thing to note is that these changes flatten the batches of the dataframe in the _repr_ function (as already done for the _repr_html_ function). This was done primarily since using .collect twice was causing major performance degradation. Batch flattening does this at 1/10 th of the time.

Are there any user-facing changes?

The users shall now see the ... and additional rows message along with the first 10 rows if the table has more than 10 rows.

…ncated outputs

Spaarsh · 2025-03-03T14:47:28Z

This is the new output:

For `_repr_`

+---------+---------+
| letters | numbers |
+---------+---------+
| A       | 1       |
| B       | 2       |
| C       | 3       |
| D       | 4       |
| E       | 5       |
| F       | 6       |
| G       | 7       |
| H       | 8       |
| I       | 9       |
| J       | 10      |
+---------+---------+
and more...

For `_repr_html_`

<table border='1'>
<tr><th>letters</th><th>numbers</th></tr>
<tr><td>A</td><td>1</td></tr>
<tr><td>B</td><td>2</td></tr>
<tr><td>C</td><td>3</td></tr>
<tr><td>D</td><td>4</td></tr>
<tr><td>E</td><td>5</td></tr>
<tr><td>F</td><td>6</td></tr>
<tr><td>G</td><td>7</td></tr>
<tr><td>H</td><td>8</td></tr>
<tr><td>I</td><td>9</td></tr>
<tr><td>J</td><td>10</td></tr>
<tr><td colspan="100%">and more...</td></tr>
</table>

Performance Changes

Implementation	`_repr_`	`_html_repr_`
Current	3.16ms	1.35ms
New	5.5ms	4.5ms

Note: I have manually ran the command multiple times and observed this. If required, I will run a script and produce the average for both cases

kosiew · 2025-03-04T06:55:18Z

src/dataframe.rs

+        // Get 11 rows to check if there are more than 10
+        let df = self.df.as_ref().clone().limit(0, Some(11))?;
        let batches = wait_for_future(py, df.collect())?;
-        let batches_as_string = pretty::pretty_format_batches(&batches);
+        let num_rows = batches.iter().map(|batch| batch.num_rows()).sum::<usize>();
+
+        // Flatten batches into a single batch for the first 10 rows
+        let mut all_rows = Vec::new();
+        let mut total_rows = 0;
+
+        for batch in &batches {
+            let num_rows_to_take = if total_rows + batch.num_rows() > 10 {
+                10 - total_rows
+            } else {
+                batch.num_rows()
+            };
+
+            if num_rows_to_take > 0 {
+                let sliced_batch = batch.slice(0, num_rows_to_take);
+                all_rows.push(sliced_batch);
+                total_rows += num_rows_to_take;
+            }
+
+            if total_rows >= 10 {
+                break;
+            }
+        }
+
+        let batches_as_string = pretty::pretty_format_batches(&all_rows);
+


You can simplify batches_as_string to:

// First get just the first 10 rows let preview_df = self.df.as_ref().clone().limit(0, Some(10))?; let preview_batches = wait_for_future(py, preview_df.collect())?; // Check if there are more rows by trying to get the 11th row let has_more_rows = { let check_df = self.df.as_ref().clone().limit(10, Some(1))?; let check_batch = wait_for_future(py, check_df.collect())?; !check_batch.is_empty() }; let batches_as_string = pretty::pretty_format_batches(&preview_batches);

This directly retrieves just the first 10 rows, eliminating the need for manual row tracking and slicing.

I did try this initiatially but calling collect twice led to a severe performance degradation. It used to take 50ms. With the manual slicing, it dropped to 5ms.

You can check my initial suggestion for the same here

calling collect twice led to a severe performance degradation

I ran this test to compare the performance:

import pyarrow as pa from datafusion import ( SessionContext, ) import time def run_dataframe_repr_long() -> None: ctx = SessionContext() # Create a DataFrame with more than 10 rows batch = pa.RecordBatch.from_arrays( [ pa.array(list(range(15))), pa.array([x * 2 for x in range(15)]), pa.array([x * 3 for x in range(15)]), ], names=["a", "b", "c"], ) df = ctx.create_dataframe([[batch]]) output = repr(df) def average_runtime(func, runs=100): total_time = 0 for _ in range(runs): start_time = time.time() func() end_time = time.time() total_time += end_time - start_time return total_time / runs average_time = average_runtime(run_dataframe_repr_long) print(f"Average runtime over {100} runs: {average_time:.6f} seconds")

and found no significant difference:

pr_1041 - is the branch with one collect
amended_pr_1041 - is the branch with two collect

That's weird. Maybe some artifact of my system settings? If there is no performance issues than I'll use your approach. But then why was the _repr_html_ using batch manipulation at the first place? I took the idea from that function!

kosiew · 2025-03-04T06:56:26Z

src/dataframe.rs

        match batches_as_string {
-            Ok(batch) => Ok(format!("DataFrame()\n{batch}")),
+            Ok(batch) => {
+                if num_rows > 10 {


using has_more_rows from above

+ if has_more_rows {

kosiew · 2025-03-04T08:19:31Z

src/dataframe.rs

+            total_rows += batch.num_rows();
            let formatters = batch
                .columns()
                .iter()
                .map(|c| ArrayFormatter::try_new(c.as_ref(), &FormatOptions::default()))
-                .map(|c| {
-                    c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string())))
-                })
+                .map(|c| c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string()))))
                .collect::<Result<Vec<_>, _>>()?;
-
-            for row in 0..batch.num_rows() {
+
+            let num_rows_to_render = if total_rows > 10 { 10 } else { batch.num_rows() };
+
+            for row in 0..num_rows_to_render {
                let mut cells = Vec::new();
                for formatter in &formatters {
                    cells.push(format!("<td>{}</td>", formatter.value(row)));
                }
                let row_str = cells.join("");
                html_str.push_str(&format!("<tr>{}</tr>\n", row_str));
            }
-        }

+            if total_rows >= 10 {
+                break;
+            }


How about simplifying to:

let rows_remaining = 10 - total_rows; let rows_in_batch = batch.num_rows().min(rows_remaining); for row in 0..rows_in_batch { html_str.push_str("<tr>"); for col in batch.columns() { let formatter = ArrayFormatter::try_new(col.as_ref(), &FormatOptions::default())?; html_str.push_str("<td>"); html_str.push_str(&formatter.value(row).to_string()); html_str.push_str("</td>"); } html_str.push_str("</tr>\n"); } total_rows += rows_in_batch;

Reasons:

More Accurate Row Limiting:

Before: total_rows was updated before checking the row limit, which could result in processing extra rows unnecessarily.
After: rows_remaining = 10 - total_rows ensures that we never exceed the row limit.

Avoids Redundant Vec Allocation:

Before: Each row was constructed using a Vec, and format!() was used for each cell.
After: Directly appends elements to html_str, eliminating unnecessary heap allocations.

Simplified and More Efficient Row Processing:

Before:
Used .map() and .collect() to create a list of ArrayFormatters before processing rows.
After:
Retrieves and formats values inside the loop, reducing redundant processing.

Avoids Unnecessary break Condition:

Before: Explicit if total_rows >= 10 { break; } was used to stop processing.
After: The min(rows_remaining, batch.num_rows()) logic naturally prevents extra iterations

kevinjqliu · 2025-03-04T16:15:07Z

btw #1036 also changes _repr_html_

_repr_ and _html_repr_ show '... and additional rows' message for tru…

c285a09

…ncated outputs

kosiew reviewed Mar 4, 2025

View reviewed changes

kosiew suggested changes Mar 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

_repr_ and _html_repr_ show '... and additional rows' message #1041

_repr_ and _html_repr_ show '... and additional rows' message #1041

Spaarsh commented Mar 3, 2025

Spaarsh commented Mar 3, 2025

kosiew Mar 4, 2025 •

edited

Loading

Spaarsh Mar 4, 2025

kosiew Mar 5, 2025

Spaarsh Mar 5, 2025

kosiew Mar 4, 2025

kosiew Mar 4, 2025

kevinjqliu commented Mar 4, 2025

_repr_ and _html_repr_ show '... and additional rows' message #1041

Are you sure you want to change the base?

_repr_ and _html_repr_ show '... and additional rows' message #1041

Conversation

Spaarsh commented Mar 3, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Spaarsh commented Mar 3, 2025

For _repr_

For _repr_html_

Performance Changes

kosiew Mar 4, 2025 • edited Loading

Choose a reason for hiding this comment

Spaarsh Mar 4, 2025

Choose a reason for hiding this comment

kosiew Mar 5, 2025

Choose a reason for hiding this comment

Spaarsh Mar 5, 2025

Choose a reason for hiding this comment

kosiew Mar 4, 2025

Choose a reason for hiding this comment

kosiew Mar 4, 2025

Choose a reason for hiding this comment

kevinjqliu commented Mar 4, 2025

For `_repr_`

For `_repr_html_`

kosiew Mar 4, 2025 •

edited

Loading