PERF: Look into reducing copies for native execution #7435

sfc-gh-mvashishtha · 2025-02-08T01:37:22Z

Native execution uses a pandas dataframe to represent the data. For operations that act on just one dataframe, the sequence for defaulting to pandas is:

copy the pandas dataframe
call a pandas method on it (with some wrapping to ensure semantic correctness)
construct a new query compiler out of the result, which requires another copy

I have observed that the copies can take a very long time (on the order of seconds for GBs of data) whether or not pandas copy-on-write is enabled on a 2GB numerical dataset.

We do the copies to ensure correct semantics-- in the compiler constructor we do the copy so that mutating the original pandas dataframe doesn't affect the new dataframe, and we copy before doing the operation because NativeQueryCompiler.to_pandas() has to make a copy for the same reason. However, if we're careful not to ever return a mutated pandas dataframe or mutate an input pandas dataframe, we can skip the copies. OTOH, maybe we should be relying more on pandas copy on write.

Here is the little benchmark I was trying out on my mac (8 physical cores, 64 GB memory, MacBook Pro 16-inch 2023, macOS Sequoia 15.3, Apple M2 CPU)

import modin.pandas as pd
import numpy as np
from modin import set_execution

set_execution("Native", "Native")

# make a df of about 8 GB 
df = pd.DataFrame(np.random.randint(0,100,size=(2**22,2**8)))

%time repr(df.sort_values(0))

The modin sort takes about 15.4 seconds whereas pandas takes about 5.21 seconds to do the equivalent task. Copying the pandas frame takes about 6 seconds, whether or not I have copy on write enabled.

output of `modin.pandas.show_versions()`


INSTALLED VERSIONS
------------------
commit                : 35275e1c2d8a420a0fd16e3fca6ae5383fbdbc55
python                : 3.9.21
python-bits           : 64
OS                    : Darwin
OS-release            : 24.3.0
Version               : Darwin Kernel Version 24.3.0: Thu Jan  2 20:24:23 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6020
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

Modin dependencies
------------------
modin                 : 0.32.0+11.g35275e1c.dirty
ray                   : 2.40.0
dask                  : 2024.8.0
distributed           : 2024.8.0

pandas dependencies
-------------------
pandas                : 2.2.3
numpy                 : 2.0.2
pytz                  : 2024.2
dateutil              : 2.8.2
pip                   : 24.2
Cython                : None
sphinx                : 7.4.7
IPython               : 8.18.1
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : 2024.11.0
fsspec                : 2024.12.0
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : 3.1.5
lxml.etree            : 5.3.0
matplotlib            : 3.9.4
numba                 : None
numexpr               : 2.10.2
odfpy                 : None
openpyxl              : 3.1.5
pandas_gbq            : 0.26.1
psycopg2              : 2.9.10
pymysql               : None
pyarrow               : 19.0.0
pyreadstat            : None
pytest                : 8.3.4
python-calamine       : None
pyxlsb                : None
s3fs                  : 2024.12.0
scipy                 : 1.13.1
sqlalchemy            : 2.0.37
tables                : N/A
tabulate              : 0.9.0
xarray                : 2024.7.0
xlrd                  : 2.0.1
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

The text was updated successfully, but these errors were encountered:

sfc-gh-joshi · 2025-02-11T20:30:02Z

whether or not pandas copy-on-write is enabled

Do you know why this might be the case? Is there some metadata manipulation that's causing a lot of overhead?

…ution. (#7436) # User-facing changes Prior to this commit, users had to switch the config variable `NativeDataFrameMode` from the default of `"Default"` to `"Pandas"` to use native execution. Now native execution is another modin execution mode with `StorageFormat` of `"Native"` and `Engine` of `"Native.`" # Integration tests and CI Prior to this commit, we ran 1) a [set of tests](https://github.com/modin-project/modin/tree/8a832de870243294c407dee6300d993647205ff3/modin/tests/pandas/native_df_mode) checking that native Modin dataframes could interoperate with non-native dataframes 2) a [subset](https://github.com/modin-project/modin/blob/8a832de870243294c407dee6300d993647205ff3/.github/workflows/ci.yml#L710-L719) of tests in native dataframe mode. Now, we run the interoperability test suite, but also run the entire rest of the test suite (except for some partitioned-execution-only tests) in native execution mode via the `test-all` job matrix. This commit also renames the interoperability test suite from modin/tests/pandas/native_df_mode to modin/tests/pandas/native_df_interoperability/. # Deleting most of the NativeQueryCompiler implementation NativeQueryCompiler had a long implementation which was mostly the same as the BaseQueryCompiler implementation. However, there were some bugs in NativeQueryCompiler, including some correctness bugs related to copying the underlying pandas dataframe (see #7435). This commit deletes most of the NativeQueryCompiler implementation, so that the native query compiler mostly works just like the BaseQueryCompiler. The main difference is that while `BaseQueryCompiler` uses a partitioned pandas dataframe (under the `Python` execution, so all in a single process), the native query compiler does not use partitions. # Warning messages about default to pandas While BaseQueryCompiler and BaseIO warn when they default to pandas, they should not do so when using native execution. We add class-level fields to these classes that tell whether to warn on default to pandas. By default, we treat warnings as errors in our test suite, so in many places we have to look for the default to pandas warning only if we are not native execution mode. For convenience, this PR adds testing utility methods to 1) detect the global native execution mode 2) detect whether a dataframe or series is using native execution 3) conditionally expect a warning about defaulting to pandas. Signed-off-by: sfc-gh-mvashishtha <[email protected]>

sfc-gh-mvashishtha added P1 Important tasks that we should complete soon Performance 🚀 Performance related issues and pull requests. labels Feb 8, 2025

sfc-gh-mvashishtha mentioned this issue Feb 10, 2025

FEAT-#7433: Replace NativeDataFrameMode with a complete "native" execution. #7436

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Look into reducing copies for native execution #7435

PERF: Look into reducing copies for native execution #7435

sfc-gh-mvashishtha commented Feb 8, 2025

sfc-gh-joshi commented Feb 11, 2025

PERF: Look into reducing copies for native execution #7435

PERF: Look into reducing copies for native execution #7435

Comments

sfc-gh-mvashishtha commented Feb 8, 2025

sfc-gh-joshi commented Feb 11, 2025