Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_repr_ and _html_repr_ show '... and additional rows' message #1041

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Spaarsh
Copy link

@Spaarsh Spaarsh commented Mar 3, 2025

…ncated outputs

Which issue does this PR close?

Closes #1026

Rationale for this change

Current implementation of _repr_ and _html_repr_ has no indication for whether the output was truncated or not.

What changes are included in this PR?

The changes override the default output for the _repr_ and _html_repr_ functions and, instead, check if the table has more than 10 rows. If it does, then it prints the first 10 rows along with a ... and additional rows message.

A thing to note is that these changes flatten the batches of the dataframe in the _repr_ function (as already done for the _repr_html_ function). This was done primarily since using .collect twice was causing major performance degradation. Batch flattening does this at 1/10 th of the time.

Are there any user-facing changes?

The users shall now see the ... and additional rows message along with the first 10 rows if the table has more than 10 rows.

@Spaarsh
Copy link
Author

Spaarsh commented Mar 3, 2025

This is the new output:

For _repr_

+---------+---------+
| letters | numbers |
+---------+---------+
| A       | 1       |
| B       | 2       |
| C       | 3       |
| D       | 4       |
| E       | 5       |
| F       | 6       |
| G       | 7       |
| H       | 8       |
| I       | 9       |
| J       | 10      |
+---------+---------+
and more...

For _repr_html_

<table border='1'>
<tr><th>letters</th><th>numbers</th></tr>
<tr><td>A</td><td>1</td></tr>
<tr><td>B</td><td>2</td></tr>
<tr><td>C</td><td>3</td></tr>
<tr><td>D</td><td>4</td></tr>
<tr><td>E</td><td>5</td></tr>
<tr><td>F</td><td>6</td></tr>
<tr><td>G</td><td>7</td></tr>
<tr><td>H</td><td>8</td></tr>
<tr><td>I</td><td>9</td></tr>
<tr><td>J</td><td>10</td></tr>
<tr><td colspan="100%">and more...</td></tr>
</table>

Performance Changes

Implementation _repr_ _html_repr_
Current 3.16ms 1.35ms
New 5.5ms 4.5ms

Note: I have manually ran the command multiple times and observed this. If required, I will run a script and produce the average for both cases

Comment on lines +93 to +121
// Get 11 rows to check if there are more than 10
let df = self.df.as_ref().clone().limit(0, Some(11))?;
let batches = wait_for_future(py, df.collect())?;
let batches_as_string = pretty::pretty_format_batches(&batches);
let num_rows = batches.iter().map(|batch| batch.num_rows()).sum::<usize>();

// Flatten batches into a single batch for the first 10 rows
let mut all_rows = Vec::new();
let mut total_rows = 0;

for batch in &batches {
let num_rows_to_take = if total_rows + batch.num_rows() > 10 {
10 - total_rows
} else {
batch.num_rows()
};

if num_rows_to_take > 0 {
let sliced_batch = batch.slice(0, num_rows_to_take);
all_rows.push(sliced_batch);
total_rows += num_rows_to_take;
}

if total_rows >= 10 {
break;
}
}

let batches_as_string = pretty::pretty_format_batches(&all_rows);

Copy link
Contributor

@kosiew kosiew Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify batches_as_string to:

        // First get just the first 10 rows
        let preview_df = self.df.as_ref().clone().limit(0, Some(10))?;
        let preview_batches = wait_for_future(py, preview_df.collect())?;

        // Check if there are more rows by trying to get the 11th row
        let has_more_rows = {
            let check_df = self.df.as_ref().clone().limit(10, Some(1))?;
            let check_batch = wait_for_future(py, check_df.collect())?;
            !check_batch.is_empty()
        };

        let batches_as_string = pretty::pretty_format_batches(&preview_batches);

This directly retrieves just the first 10 rows, eliminating the need for manual row tracking and slicing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try this initiatially but calling collect twice led to a severe performance degradation. It used to take 50ms. With the manual slicing, it dropped to 5ms.

You can check my initial suggestion for the same here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling collect twice led to a severe performance degradation

I ran this test to compare the performance:

import pyarrow as pa
from datafusion import (
    SessionContext,
)
import time


def run_dataframe_repr_long() -> None:
    ctx = SessionContext()
    # Create a DataFrame with more than 10 rows
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array(list(range(15))),
            pa.array([x * 2 for x in range(15)]),
            pa.array([x * 3 for x in range(15)]),
        ],
        names=["a", "b", "c"],
    )
    df = ctx.create_dataframe([[batch]])

    output = repr(df)


def average_runtime(func, runs=100):
    total_time = 0
    for _ in range(runs):
        start_time = time.time()
        func()
        end_time = time.time()
        total_time += end_time - start_time
    return total_time / runs


average_time = average_runtime(run_dataframe_repr_long)
print(f"Average runtime over {100} runs: {average_time:.6f} seconds")

and found no significant difference:

image

pr_1041 - is the branch with one collect
amended_pr_1041 - is the branch with two collect

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's weird. Maybe some artifact of my system settings? If there is no performance issues than I'll use your approach. But then why was the _repr_html_ using batch manipulation at the first place? I took the idea from that function!

match batches_as_string {
Ok(batch) => Ok(format!("DataFrame()\n{batch}")),
Ok(batch) => {
if num_rows > 10 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using has_more_rows from above

+                if has_more_rows {

Comment on lines +162 to +183
total_rows += batch.num_rows();
let formatters = batch
.columns()
.iter()
.map(|c| ArrayFormatter::try_new(c.as_ref(), &FormatOptions::default()))
.map(|c| {
c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string())))
})
.map(|c| c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string()))))
.collect::<Result<Vec<_>, _>>()?;

for row in 0..batch.num_rows() {

let num_rows_to_render = if total_rows > 10 { 10 } else { batch.num_rows() };

for row in 0..num_rows_to_render {
let mut cells = Vec::new();
for formatter in &formatters {
cells.push(format!("<td>{}</td>", formatter.value(row)));
}
let row_str = cells.join("");
html_str.push_str(&format!("<tr>{}</tr>\n", row_str));
}
}

if total_rows >= 10 {
break;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about simplifying to:

            let rows_remaining = 10 - total_rows;
            let rows_in_batch = batch.num_rows().min(rows_remaining);

            for row in 0..rows_in_batch {
                html_str.push_str("<tr>");
                for col in batch.columns() {
                    let formatter =
                        ArrayFormatter::try_new(col.as_ref(), &FormatOptions::default())?;
                    html_str.push_str("<td>");
                    html_str.push_str(&formatter.value(row).to_string());
                    html_str.push_str("</td>");
                }
                html_str.push_str("</tr>\n");
            }

            total_rows += rows_in_batch;

Reasons:

  1. More Accurate Row Limiting:

Before: total_rows was updated before checking the row limit, which could result in processing extra rows unnecessarily.
After: rows_remaining = 10 - total_rows ensures that we never exceed the row limit.

  1. Avoids Redundant Vec Allocation:

Before: Each row was constructed using a Vec, and format!() was used for each cell.
After: Directly appends elements to html_str, eliminating unnecessary heap allocations.

  1. Simplified and More Efficient Row Processing:

Before:
Used .map() and .collect() to create a list of ArrayFormatters before processing rows.
After:
Retrieves and formats values inside the loop, reducing redundant processing.

  1. Avoids Unnecessary break Condition:

Before: Explicit if total_rows >= 10 { break; } was used to stop processing.
After: The min(rows_remaining, batch.num_rows()) logic naturally prevents extra iterations

@kevinjqliu
Copy link
Contributor

btw #1036 also changes _repr_html_

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance __repr__ and _repr_html_ with a note for additional rows
3 participants