Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Look into reducing copies for native execution #7435

Open
sfc-gh-mvashishtha opened this issue Feb 8, 2025 · 1 comment
Open

PERF: Look into reducing copies for native execution #7435

sfc-gh-mvashishtha opened this issue Feb 8, 2025 · 1 comment
Labels
P1 Important tasks that we should complete soon Performance 🚀 Performance related issues and pull requests.

Comments

@sfc-gh-mvashishtha
Copy link
Contributor

Native execution uses a pandas dataframe to represent the data. For operations that act on just one dataframe, the sequence for defaulting to pandas is:

  1. copy the pandas dataframe
  2. call a pandas method on it (with some wrapping to ensure semantic correctness)
  3. construct a new query compiler out of the result, which requires another copy

I have observed that the copies can take a very long time (on the order of seconds for GBs of data) whether or not pandas copy-on-write is enabled on a 2GB numerical dataset.

We do the copies to ensure correct semantics-- in the compiler constructor we do the copy so that mutating the original pandas dataframe doesn't affect the new dataframe, and we copy before doing the operation because NativeQueryCompiler.to_pandas() has to make a copy for the same reason. However, if we're careful not to ever return a mutated pandas dataframe or mutate an input pandas dataframe, we can skip the copies. OTOH, maybe we should be relying more on pandas copy on write.

Here is the little benchmark I was trying out on my mac (8 physical cores, 64 GB memory, MacBook Pro 16-inch 2023, macOS Sequoia 15.3, Apple M2 CPU)

import modin.pandas as pd
import numpy as np
from modin import set_execution

set_execution("Native", "Native")

# make a df of about 8 GB 
df = pd.DataFrame(np.random.randint(0,100,size=(2**22,2**8)))

%time repr(df.sort_values(0))

The modin sort takes about 15.4 seconds whereas pandas takes about 5.21 seconds to do the equivalent task. Copying the pandas frame takes about 6 seconds, whether or not I have copy on write enabled.

output of `modin.pandas.show_versions()`

INSTALLED VERSIONS
------------------
commit                : 35275e1c2d8a420a0fd16e3fca6ae5383fbdbc55
python                : 3.9.21
python-bits           : 64
OS                    : Darwin
OS-release            : 24.3.0
Version               : Darwin Kernel Version 24.3.0: Thu Jan  2 20:24:23 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T6020
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

Modin dependencies
------------------
modin                 : 0.32.0+11.g35275e1c.dirty
ray                   : 2.40.0
dask                  : 2024.8.0
distributed           : 2024.8.0

pandas dependencies
-------------------
pandas                : 2.2.3
numpy                 : 2.0.2
pytz                  : 2024.2
dateutil              : 2.8.2
pip                   : 24.2
Cython                : None
sphinx                : 7.4.7
IPython               : 8.18.1
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : 2024.11.0
fsspec                : 2024.12.0
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : 3.1.5
lxml.etree            : 5.3.0
matplotlib            : 3.9.4
numba                 : None
numexpr               : 2.10.2
odfpy                 : None
openpyxl              : 3.1.5
pandas_gbq            : 0.26.1
psycopg2              : 2.9.10
pymysql               : None
pyarrow               : 19.0.0
pyreadstat            : None
pytest                : 8.3.4
python-calamine       : None
pyxlsb                : None
s3fs                  : 2024.12.0
scipy                 : 1.13.1
sqlalchemy            : 2.0.37
tables                : N/A
tabulate              : 0.9.0
xarray                : 2024.7.0
xlrd                  : 2.0.1
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None

@sfc-gh-mvashishtha sfc-gh-mvashishtha added P1 Important tasks that we should complete soon Performance 🚀 Performance related issues and pull requests. labels Feb 8, 2025
@sfc-gh-joshi
Copy link
Contributor

whether or not pandas copy-on-write is enabled

Do you know why this might be the case? Is there some metadata manipulation that's causing a lot of overhead?

sfc-gh-mvashishtha added a commit that referenced this issue Feb 13, 2025
…ution. (#7436)

# User-facing changes

Prior to this commit, users had to switch the config variable `NativeDataFrameMode` from the default of `"Default"` to `"Pandas"` to use native execution. Now native execution is another modin execution mode with `StorageFormat` of `"Native"` and `Engine` of `"Native.`"

# Integration tests and CI

Prior to this commit, we ran 1) a [set of tests](https://github.com/modin-project/modin/tree/8a832de870243294c407dee6300d993647205ff3/modin/tests/pandas/native_df_mode) checking that native Modin dataframes could interoperate with non-native dataframes 2) a [subset](https://github.com/modin-project/modin/blob/8a832de870243294c407dee6300d993647205ff3/.github/workflows/ci.yml#L710-L719) of tests in native dataframe mode.

Now, we run the interoperability test suite, but also run the entire rest of the test suite (except for some partitioned-execution-only tests) in native execution mode via the `test-all` job matrix. This commit also renames the interoperability test suite from modin/tests/pandas/native_df_mode to modin/tests/pandas/native_df_interoperability/.

# Deleting most of the NativeQueryCompiler implementation

NativeQueryCompiler had a long implementation which was mostly the same as the BaseQueryCompiler implementation. However, there were some bugs in NativeQueryCompiler, including some correctness bugs related to copying the underlying pandas dataframe (see #7435). This commit deletes most of the NativeQueryCompiler implementation, so that the native query compiler mostly works just like the BaseQueryCompiler. The main difference is that while `BaseQueryCompiler` uses a partitioned pandas dataframe (under the `Python` execution, so all in a single process), the native query compiler does not use partitions.

# Warning messages about default to pandas

While BaseQueryCompiler and BaseIO warn when they default to pandas, they should not do so when using native execution. We add class-level fields to these classes that tell whether to warn on default to pandas.

By default, we treat warnings as errors in our test suite, so in many places we have to look for the default to pandas warning only if we are not native execution mode. For convenience, this PR adds testing utility methods to 1) detect the global native execution mode 2) detect whether a dataframe or series is using native execution 3) conditionally expect a warning about defaulting to pandas.

Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 Important tasks that we should complete soon Performance 🚀 Performance related issues and pull requests.
Projects
None yet
Development

No branches or pull requests

2 participants