-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
_repr_ and _html_repr_ show '... and additional rows' message #1041
base: main
Are you sure you want to change the base?
_repr_ and _html_repr_ show '... and additional rows' message #1041
Conversation
This is the new output: For
|
Implementation | _repr_ |
_html_repr_ |
---|---|---|
Current | 3.16ms | 1.35ms |
New | 5.5ms | 4.5ms |
Note: I have manually ran the command multiple times and observed this. If required, I will run a script and produce the average for both cases
// Get 11 rows to check if there are more than 10 | ||
let df = self.df.as_ref().clone().limit(0, Some(11))?; | ||
let batches = wait_for_future(py, df.collect())?; | ||
let batches_as_string = pretty::pretty_format_batches(&batches); | ||
let num_rows = batches.iter().map(|batch| batch.num_rows()).sum::<usize>(); | ||
|
||
// Flatten batches into a single batch for the first 10 rows | ||
let mut all_rows = Vec::new(); | ||
let mut total_rows = 0; | ||
|
||
for batch in &batches { | ||
let num_rows_to_take = if total_rows + batch.num_rows() > 10 { | ||
10 - total_rows | ||
} else { | ||
batch.num_rows() | ||
}; | ||
|
||
if num_rows_to_take > 0 { | ||
let sliced_batch = batch.slice(0, num_rows_to_take); | ||
all_rows.push(sliced_batch); | ||
total_rows += num_rows_to_take; | ||
} | ||
|
||
if total_rows >= 10 { | ||
break; | ||
} | ||
} | ||
|
||
let batches_as_string = pretty::pretty_format_batches(&all_rows); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify batches_as_string
to:
// First get just the first 10 rows
let preview_df = self.df.as_ref().clone().limit(0, Some(10))?;
let preview_batches = wait_for_future(py, preview_df.collect())?;
// Check if there are more rows by trying to get the 11th row
let has_more_rows = {
let check_df = self.df.as_ref().clone().limit(10, Some(1))?;
let check_batch = wait_for_future(py, check_df.collect())?;
!check_batch.is_empty()
};
let batches_as_string = pretty::pretty_format_batches(&preview_batches);
This directly retrieves just the first 10 rows, eliminating the need for manual row tracking and slicing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try this initiatially but calling collect
twice led to a severe performance degradation. It used to take 50ms
. With the manual slicing, it dropped to 5ms
.
You can check my initial suggestion for the same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calling collect twice led to a severe performance degradation
I ran this test to compare the performance:
import pyarrow as pa
from datafusion import (
SessionContext,
)
import time
def run_dataframe_repr_long() -> None:
ctx = SessionContext()
# Create a DataFrame with more than 10 rows
batch = pa.RecordBatch.from_arrays(
[
pa.array(list(range(15))),
pa.array([x * 2 for x in range(15)]),
pa.array([x * 3 for x in range(15)]),
],
names=["a", "b", "c"],
)
df = ctx.create_dataframe([[batch]])
output = repr(df)
def average_runtime(func, runs=100):
total_time = 0
for _ in range(runs):
start_time = time.time()
func()
end_time = time.time()
total_time += end_time - start_time
return total_time / runs
average_time = average_runtime(run_dataframe_repr_long)
print(f"Average runtime over {100} runs: {average_time:.6f} seconds")
and found no significant difference:

pr_1041 - is the branch with one collect
amended_pr_1041 - is the branch with two collect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's weird. Maybe some artifact of my system settings? If there is no performance issues than I'll use your approach. But then why was the _repr_html_
using batch manipulation at the first place? I took the idea from that function!
match batches_as_string { | ||
Ok(batch) => Ok(format!("DataFrame()\n{batch}")), | ||
Ok(batch) => { | ||
if num_rows > 10 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using has_more_rows
from above
+ if has_more_rows {
total_rows += batch.num_rows(); | ||
let formatters = batch | ||
.columns() | ||
.iter() | ||
.map(|c| ArrayFormatter::try_new(c.as_ref(), &FormatOptions::default())) | ||
.map(|c| { | ||
c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string()))) | ||
}) | ||
.map(|c| c.map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string())))) | ||
.collect::<Result<Vec<_>, _>>()?; | ||
|
||
for row in 0..batch.num_rows() { | ||
|
||
let num_rows_to_render = if total_rows > 10 { 10 } else { batch.num_rows() }; | ||
|
||
for row in 0..num_rows_to_render { | ||
let mut cells = Vec::new(); | ||
for formatter in &formatters { | ||
cells.push(format!("<td>{}</td>", formatter.value(row))); | ||
} | ||
let row_str = cells.join(""); | ||
html_str.push_str(&format!("<tr>{}</tr>\n", row_str)); | ||
} | ||
} | ||
|
||
if total_rows >= 10 { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about simplifying to:
let rows_remaining = 10 - total_rows;
let rows_in_batch = batch.num_rows().min(rows_remaining);
for row in 0..rows_in_batch {
html_str.push_str("<tr>");
for col in batch.columns() {
let formatter =
ArrayFormatter::try_new(col.as_ref(), &FormatOptions::default())?;
html_str.push_str("<td>");
html_str.push_str(&formatter.value(row).to_string());
html_str.push_str("</td>");
}
html_str.push_str("</tr>\n");
}
total_rows += rows_in_batch;
Reasons:
- More Accurate Row Limiting:
Before: total_rows was updated before checking the row limit, which could result in processing extra rows unnecessarily.
After: rows_remaining = 10 - total_rows ensures that we never exceed the row limit.
- Avoids Redundant Vec Allocation:
Before: Each row was constructed using a Vec, and format!() was used for each cell.
After: Directly appends elements to html_str, eliminating unnecessary heap allocations.
- Simplified and More Efficient Row Processing:
Before:
Used .map() and .collect() to create a list of ArrayFormatters before processing rows.
After:
Retrieves and formats values inside the loop, reducing redundant processing.
- Avoids Unnecessary break Condition:
Before: Explicit if total_rows >= 10 { break; } was used to stop processing.
After: The min(rows_remaining, batch.num_rows()) logic naturally prevents extra iterations
btw #1036 also changes |
…ncated outputs
Which issue does this PR close?
Closes #1026
Rationale for this change
Current implementation of
_repr_
and_html_repr_
has no indication for whether the output was truncated or not.What changes are included in this PR?
The changes override the default output for the
_repr_
and_html_repr_
functions and, instead, check if the table has more than 10 rows. If it does, then it prints the first 10 rows along with a... and additional rows
message.A thing to note is that these changes flatten the batches of the dataframe in the
_repr_ function
(as already done for the_repr_html_ function
). This was done primarily since using.collect
twice was causing major performance degradation. Batch flattening does this at 1/10 th of the time.Are there any user-facing changes?
The users shall now see the
... and additional rows
message along with the first 10 rows if the table has more than 10 rows.