_repr_ and _html_repr_ show '... and additional rows' message #1041

Spaarsh · 2025-03-03T14:33:18Z

…ncated outputs

Which issue does this PR close?

Rationale for this change

The current implementation of _repr_ and _html_repr_ gives no indication of whether the output was truncated.

What changes are included in this PR?

The changes override the default output of the _repr_ and _html_repr_ functions and instead check whether the table has more than 10 rows. If it does, only the first 10 rows are printed, followed by a "... and additional rows" message.

Note that these changes flatten the dataframe's batches in the _repr_ function (as is already done in the _repr_html_ function). This was done primarily because calling .collect twice caused a major performance degradation; batch flattening takes roughly one tenth of the time.
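The flattening described above can be sketched in Python. This is only an illustration, assuming the query was already limited to 11 rows and modelling each batch as a plain list of rows; the name `take_preview` is hypothetical and is not part of the PR's Rust code.

```python
from typing import List, Sequence, Tuple

Row = Tuple[str, int]


def take_preview(
    batches: Sequence[List[Row]], max_rows: int = 10
) -> Tuple[List[Row], bool]:
    """Return at most max_rows rows plus a flag saying whether rows were cut."""
    # Count every row that came back from the single (limit 11) collect.
    total_rows = sum(len(batch) for batch in batches)

    preview: List[Row] = []
    for batch in batches:
        remaining = max_rows - len(preview)
        if remaining <= 0:
            break
        # Slice the batch so the preview never exceeds max_rows.
        preview.extend(batch[:remaining])

    return preview, total_rows > max_rows


# Eleven rows spread over three batches, as a limit of 11 might return them.
letters = "ABCDEFGHIJK"
rows = [(letter, i + 1) for i, letter in enumerate(letters)]
example_batches = [rows[:4], rows[4:8], rows[8:]]

preview, truncated = take_preview(example_batches)
for letter, number in preview:
    print(letter, number)
if truncated:
    print("and more...")
```

Fetching one row more than the display limit is what lets a single collect answer both questions: which rows to show, and whether anything was left out.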

Are there any user-facing changes?

Users will now see the "... and additional rows" message along with the first 10 rows if the table has more than 10 rows.

Spaarsh · 2025-03-03T14:47:28Z

This is the new output:

For `_repr_`

+---------+---------+
| letters | numbers |
+---------+---------+
| A       | 1       |
| B       | 2       |
| C       | 3       |
| D       | 4       |
| E       | 5       |
| F       | 6       |
| G       | 7       |
| H       | 8       |
| I       | 9       |
| J       | 10      |
+---------+---------+
and more...

For `_repr_html_`

<table border='1'>
<tr><th>letters</th><th>numbers</th></tr>
<tr><td>A</td><td>1</td></tr>
<tr><td>B</td><td>2</td></tr>
<tr><td>C</td><td>3</td></tr>
<tr><td>D</td><td>4</td></tr>
<tr><td>E</td><td>5</td></tr>
<tr><td>F</td><td>6</td></tr>
<tr><td>G</td><td>7</td></tr>
<tr><td>H</td><td>8</td></tr>
<tr><td>I</td><td>9</td></tr>
<tr><td>J</td><td>10</td></tr>
<tr><td colspan="100%">and more...</td></tr>
</table>

Performance Changes

| Implementation | `_repr_` | `_html_repr_` |
|----------------|----------|---------------|
| Current        | 3.16 ms  | 1.35 ms       |
| New            | 5.5 ms   | 4.5 ms        |

Note: I ran the command manually several times and observed these timings. If required, I will run a script and report the average for both cases.

kosiew · 2025-03-04T06:55:18Z

src/dataframe.rs

+        // Get 11 rows to check if there are more than 10
+        let df = self.df.as_ref().clone().limit(0, Some(11))?;
        let batches = wait_for_future(py, df.collect())?;
-        let batches_as_string = pretty::pretty_format_batches(&batches);
+        let num_rows = batches.iter().map(|batch| batch.num_rows()).sum::<usize>();
+
+        // Flatten batches into a single batch for the first 10 rows
+        let mut all_rows = Vec::new();
+        let mut total_rows = 0;
+
+        for batch in &batches {
+            let num_rows_to_take = if total_rows + batch.num_rows() > 10 {
+                10 - total_rows
+            } else {
+                batch.num_rows()
+            };
+
+            if num_rows_to_take > 0 {
+                let sliced_batch = batch.slice(0, num_rows_to_take);
+                all_rows.push(sliced_batch);
+                total_rows += num_rows_to_take;
+            }
+
+            if total_rows >= 10 {
+                break;
+            }
+        }
+
+        let batches_as_string = pretty::pretty_format_batches(&all_rows);
+


You can simplify batches_as_string to:

// First get just the first 10 rows
let preview_df = self.df.as_ref().clone().limit(0, Some(10))?;
let preview_batches = wait_for_future(py, preview_df.collect())?;

// Check if there are more rows by trying to get the 11th row
let has_more_rows = {
    let check_df = self.df.as_ref().clone().limit(10, Some(1))?;
    let check_batch = wait_for_future(py, check_df.collect())?;
    !check_batch.is_empty()
};

let batches_as_string = pretty::pretty_format_batches(&preview_batches);

This directly retrieves just the first 10 rows, eliminating the need for manual row tracking and slicing.

I did try this initially, but calling collect twice led to a severe performance degradation. It took about 50ms. With the manual slicing, it dropped to 5ms.

You can check my initial suggestion for the same here

calling collect twice led to a severe performance degradation

I ran this test to compare the performance:

import pyarrow as pa
from datafusion import (
    SessionContext,
)
import time


def run_dataframe_repr_long() -> None:
    ctx = SessionContext()

    # Create a DataFrame with more than 10 rows
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array(list(range(15))),
            pa.array([x * 2 for x in range(15)]),
            pa.array([x * 3 for x in range(15)]),
        ],
        names=["a", "b", "c"],
    )
    df = ctx.create_dataframe([[batch]])

    output = repr(df)


def average_runtime(func, runs=100):
    total_time = 0
    for _ in range(runs):
        start_time = time.time()
        func()
        end_time = time.time()
        total_time += end_time - start_time
    return total_time / runs


average_time = average_runtime(run_dataframe_repr_long)
print(f"Average runtime over {100} runs: {average_time:.6f} seconds")

and found no significant difference:

pr_1041 is the branch with one collect
amended_pr_1041 is the branch with two collects

That's weird. Maybe some artifact of my system settings? If there are no performance issues then I'll use your approach. But then why was _repr_html_ using batch manipulation in the first place? I took the idea from that function!

hi @Spaarsh

Sorry, in my previous test I forgot to run maturin develop to rebuild the Rust changes.

In my retests, two collects do take about 53% (1935/1265) longer.

Oh no issues. Thanks for corroborating my findings btw!
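For reference, the 53% figure follows directly from the two timings quoted above. A small sketch of the arithmetic (the helper name `percent_slower` is made up for illustration, and the thread does not state the unit of the two timings):

```python
def percent_slower(slow: float, fast: float) -> float:
    """Return how much longer `slow` takes than `fast`, as a percentage."""
    return (slow / fast - 1.0) * 100.0


# Timings from the retest: two collects vs. one collect.
two_collects = 1935
one_collect = 1265

print(f"{percent_slower(two_collects, one_collect):.0f}% longer")
```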

kosiew · 2025-03-04T06:56:26Z

src/dataframe.rs

        match batches_as_string {
-            Ok(batch) => Ok(format!("DataFrame()\n{batch}")),
+            Ok(batch) => {
+                if num_rows > 10 {


using has_more_rows from above

+ if has_more_rows {

kosiew · 2025-03-04T08:19:31Z

src/dataframe.rs

+            total_rows += batch.num_rows();
            let formatters = batch
                .columns()
                .iter()
                .map(|c| ArrayFormatter::try_new(c.as_ref(), &FormatOptions::default()))
-                .map(|c| {
-                    c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string())))
-                })
+                .map(|c| c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string()))))
                .collect::<Result<Vec<_>, _>>()?;
-
-            for row in 0..batch.num_rows() {
+
+            let num_rows_to_render = if total_rows > 10 { 10 } else { batch.num_rows() };
+
+            for row in 0..num_rows_to_render {
                let mut cells = Vec::new();
                for formatter in &formatters {
                    cells.push(format!("<td>{}</td>", formatter.value(row)));
                }
                let row_str = cells.join("");
                html_str.push_str(&format!("<tr>{}</tr>\n", row_str));
            }
-        }

+            if total_rows >= 10 {
+                break;
+            }


How about simplifying to:

let rows_remaining = 10 - total_rows;
let rows_in_batch = batch.num_rows().min(rows_remaining);

for row in 0..rows_in_batch {
    html_str.push_str("<tr>");
    for col in batch.columns() {
        let formatter = ArrayFormatter::try_new(col.as_ref(), &FormatOptions::default())?;
        html_str.push_str("<td>");
        html_str.push_str(&formatter.value(row).to_string());
        html_str.push_str("</td>");
    }
    html_str.push_str("</tr>\n");
}

total_rows += rows_in_batch;

Reasons:

More Accurate Row Limiting:

Before: total_rows was updated before checking the row limit, which could result in processing extra rows unnecessarily.
After: rows_remaining = 10 - total_rows ensures that we never exceed the row limit.

Avoids Redundant Vec Allocation:

Before: Each row was constructed using a Vec, and format!() was used for each cell.
After: Directly appends elements to html_str, eliminating unnecessary heap allocations.

Simplified and More Efficient Row Processing:

Before:
Used .map() and .collect() to create a list of ArrayFormatters before processing rows.
After:
Retrieves and formats values inside the loop, reducing redundant processing.

Avoids Unnecessary break Condition:

Before: Explicit if total_rows >= 10 { break; } was used to stop processing.
After: The min(rows_remaining, batch.num_rows()) logic naturally prevents extra iterations.
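The row-limiting idea in this suggestion can be illustrated with a small Python sketch. It assumes batches are plain lists of row tuples; the function name `render_html_preview` and the use of `html.escape` are illustrative choices and do not appear in the PR.

```python
from html import escape
from typing import List, Sequence, Tuple


def render_html_preview(
    columns: Sequence[str],
    batches: Sequence[List[Tuple]],
    max_rows: int = 10,
) -> str:
    """Render at most max_rows rows as an HTML table, noting any truncation."""
    parts = ["<table border='1'>\n"]
    header = "".join(f"<th>{escape(name)}</th>" for name in columns)
    parts.append(f"<tr>{header}</tr>\n")

    total_rows = 0
    has_more_rows = False
    for batch in batches:
        # Never render more than the rows still allowed by the limit.
        rows_remaining = max_rows - total_rows
        rows_in_batch = min(len(batch), rows_remaining)
        for row in batch[:rows_in_batch]:
            cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
            parts.append(f"<tr>{cells}</tr>\n")
        total_rows += rows_in_batch
        # Any row that was skipped means the output is truncated.
        if rows_in_batch < len(batch):
            has_more_rows = True

    if has_more_rows:
        parts.append('<tr><td colspan="100%">and more...</td></tr>\n')
    parts.append("</table>\n")
    return "".join(parts)


letters = "ABCDEFGHIJKL"
rows = [(letter, i + 1) for i, letter in enumerate(letters)]
print(render_html_preview(["letters", "numbers"], [rows[:6], rows[6:]]))
```

Because `rows_in_batch` drops to zero once the limit is reached, later batches contribute no rows even without an explicit `break`; in this sketch they are still visited so that the truncation flag can be set.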

kevinjqliu · 2025-03-04T16:15:07Z

btw #1036 also changes _repr_html_

timsaucer · 2025-03-05T23:44:47Z

We have 3 PRs that are all impacting the __repr__ and _repr_html_. We have:

This one which does the additional data checking with a collect()
refactor: collect dataframe as stream in __repr__ #1015 which collects until we get to 10 rows
Improve collection during repr and repr_html #1036 which collects 2MB or 20 rows but just for the html rendering

I suggest we consolidate. My proposal is:

we merge in refactor: collect dataframe as stream in __repr__ #1015 as it is
I update Improve collection during repr and repr_html #1036 to combine the collecting operations to be either by minimum number of rows or data size
We close _repr_ and _html_repr_ show '... and additional rows' message #1041 in favor of the truncation message from 1036 (I'll add it to __repr__ also).

Does this sound reasonable?

Also, it's incredible to have so many people pitching in at the same time. I will try to spend some time this weekend organizing some of the open issues to make it easier to avoid duplicating effort.
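To make the "collects until we get to 10 rows" idea concrete, here is a rough Python sketch. It assumes that the phrase means pulling batches lazily from a stream and stopping as soon as at least 10 rows are in hand; the code of #1015 is not shown in this thread, so `collect_until` and `numbered_batches` are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, List


def collect_until(stream: Iterable[List[int]], min_rows: int = 10) -> List[List[int]]:
    """Pull batches from the stream until at least min_rows rows are collected."""
    collected: List[List[int]] = []
    rows = 0
    for batch in stream:
        collected.append(batch)
        rows += len(batch)
        if rows >= min_rows:
            # Stop early: the remaining batches are never produced.
            break
    return collected


def numbered_batches(
    num_batches: int, batch_size: int, pulled: List[int]
) -> Iterator[List[int]]:
    """Yield batches lazily and record which ones were actually requested."""
    for i in range(num_batches):
        pulled.append(i)
        yield list(range(i * batch_size, (i + 1) * batch_size))


pulled: List[int] = []
batches = collect_until(numbered_batches(100, 4, pulled))
print(f"batches pulled: {len(pulled)}, rows collected: {sum(len(b) for b in batches)}")
```

The point of the streaming shape is that the cost depends on the preview size rather than on the size of the full result.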

Spaarsh · 2025-03-05T23:55:36Z

It sounds reasonable, except for one problem. As suggested by @kosiew in this comment, there may be a need to change how repr_html limits the rows to be printed (this one got resolved). There is also a good change suggested in this comment. Either we merge the notebook PR first and discuss the code change here, or you could implement these suggestions in that PR itself.

Spaarsh · 2025-03-06T00:16:56Z

On second thought, since this change will have a performance impact (refer to the last part of this comment), would it be better to keep it separate? Just for the sake of documentation and clarity down the line?

Spaarsh · 2025-03-14T12:28:08Z

@timsaucer Closing this since the changes were incorporated into #1036.

_repr_ and _html_repr_ show '... and additional rows' message for tru…

c285a09

…ncated outputs

kosiew reviewed Mar 4, 2025

kosiew requested changes Mar 4, 2025

timsaucer mentioned this pull request Mar 12, 2025

Improve collection during repr and repr_html #1036

Merged

Spaarsh closed this Mar 14, 2025
