NestedSeries Implementation #331

Open: wants to merge 19 commits into base: main

dougbrn
Collaborator

@dougbrn dougbrn commented Aug 18, 2025

Resolves #304. This PR has grown large enough that it probably should have been split into a few smaller steps; sorry about that. I have yet to write documentation, but I think it makes sense to do that in a follow-up PR while we iron out implementation and API behavior here.

Some design choices I made here, which we can definitely do differently:

  • NestedSeries exposes a set of the nest accessor functions for direct use
  • NestedSeries behaves like a normal pandas Series for non-nested dtypes, but nested-specific properties and methods are tagged with a decorator that raises an exception when they are used with a non-nested dtype
  • When the result is a non-nested series, still try to return a native pandas Series. NestedSeries is not meant to replace pandas Series everywhere, only where it is needed.
  • Masking always returns a NestedSeries, instead of sometimes returning a NestedFrame
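
The decorator approach in the second bullet could look roughly like the sketch below. It is a minimal illustration, not the code in this PR: `NestedDtype` and `SeriesLike` are stand-in classes so the snippet runs without pandas, and the decorator name `nested_only` and the choice of `TypeError` are assumptions.

```python
import functools


class NestedDtype:
    """Hypothetical stand-in for the library's nested dtype."""


def nested_only(method):
    """Raise if the wrapped method is called on a non-nested series."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Guard: nested-specific methods only make sense for a nested dtype.
        if not isinstance(self.dtype, NestedDtype):
            raise TypeError(
                f"'{method.__name__}' requires a nested dtype, "
                f"got {type(self.dtype).__name__}"
            )
        return method(self, *args, **kwargs)

    return wrapper


class SeriesLike:
    """Hypothetical stand-in for NestedSeries; it only carries a dtype."""

    def __init__(self, dtype):
        self.dtype = dtype

    @nested_only
    def to_flat(self):
        # Placeholder body; the real method would return flattened data.
        return "flat view"
```

With this shape, `SeriesLike(NestedDtype()).to_flat()` runs normally, while `SeriesLike("float64").to_flat()` raises before the method body executes, so regular Series behavior is untouched for non-nested dtypes.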

github-actions bot commented Aug 18, 2025

| Before [919fe82] | After [47b8b7c] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 1.24±0.01ms | 1.35±0ms | 1.09 | benchmarks.NestedFrameReduce.time_run |
| 10.9±0.1ms | 11.1±0.2ms | 1.02 | benchmarks.NestedFrameQuery.time_run |
| 11.6±0.4ms | 11.7±0.3ms | 1.01 | benchmarks.NestedFrameAddNested.time_run |
| 177M | 179M | 1.01 | benchmarks.ReadFewColumnsHTTPS.peakmem_run |
| 136M | 136M | 1 | benchmarks.CountNestedBy.peakmem_run |
| 102M | 102M | 1 | benchmarks.NestedFrameAddNested.peakmem_run |
| 107M | 107M | 1 | benchmarks.NestedFrameQuery.peakmem_run |
| 106M | 106M | 1 | benchmarks.NestedFrameReduce.peakmem_run |
| 271M | 270M | 1 | benchmarks.ReassignHalfOfNestedSeries.peakmem_run |
| 250M | 247M | 0.99 | benchmarks.AssignSingleDfToNestedSeries.peakmem_run |

codecov bot commented Aug 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.19%. Comparing base (46acb8e) to head (7e1233c).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #331      +/-   ##
==========================================
+ Coverage   98.11%   98.19%   +0.08%     
==========================================
  Files          18       19       +1     
  Lines        1748     1829      +81     
==========================================
+ Hits         1715     1796      +81     
  Misses         33       33              

@dougbrn dougbrn changed the title [WIP] NestedSeries Implementation NestedSeries Implementation Aug 20, 2025
@dougbrn dougbrn marked this pull request as ready for review August 20, 2025 17:11
@dougbrn dougbrn requested review from gitosaurus and hombit August 20, 2025 17:22
# Allow boolean masking given a Series of booleans
if isinstance(key, pd.Series) and pd.api.types.is_bool_dtype(key.dtype):
    flat_df = self.to_flat()  # Use the flat representation
    if not key.index.equals(flat_df.index):
        raise ValueError("Boolean mask must have the same index as the flattened nested dataframe.")
    # Apply the mask to the series, return a new NestedFrame
    return NestedFrame(index=self._series.index).add_nested(flat_df[key], name=self._series.name)
    # return NestedFrame(index=self._series.index).add_nested(flat_df[key], name=self._series.name)
Contributor

Dead code?

Comment on lines +511 to +513
# if len(key) == 1 and not isinstance(new_array.dtype.field_dtype(key[0]), NestedDtype):
# # If only one field is requested, return it as a pd.Series
# return self._series[key[0]]
Contributor

Dead code or future plan?

if not isinstance(self.dtype, NestedDtype):
    return super().__getitem__(key)

# Return a flatten series for a single field
Contributor

Suggested change
# Return a flatten series for a single field
# Return a flattened series for a single field

Comment on lines +61 to +63
# Handle boolean masking
if isinstance(key, pd.Series) and pd.api.types.is_bool_dtype(key.dtype):
    return self.nest[key]
Contributor

Cool!

Comment on lines +19 to +27
class NestedSeries(pd.Series):
    """
    A Series that can contain nested data structures, such as lists or dictionaries.
    This class extends the functionality of a standard pandas Series to handle nested data.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

Contributor

What happens if a user does binary operations on a NestedSeries with a Series? I suspect you may want to follow the same procedure for extending a pandas class that _SeriesFromNest does.

I had been wondering whether _SeriesFromNest and NestedSeries could be dovetailed, but on reflection I do think they are serving different purposes: the former tracks a series (field) extracted from a nest, and the latter represents the nest as a first-class object. Do you agree?

I wonder if this means that this PR resolves (or helps resolve) #284.
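
For context on the binary-operation question, one documented pandas mechanism for keeping a subclass through operations is overriding `_constructor`. The sketch below uses a hypothetical `SubclassedSeries`, not the PR's `NestedSeries`, and whether this matches what `_SeriesFromNest` does is an assumption that would need checking against the code.

```python
import pandas as pd


class SubclassedSeries(pd.Series):
    """Hypothetical pd.Series subclass used only for illustration."""

    @property
    def _constructor(self):
        # pandas uses this property when building the result of most
        # operations, so returning the subclass keeps the type.
        return SubclassedSeries


left = SubclassedSeries([1, 2, 3])
right = pd.Series([10, 20, 30])

# Subclass on the left-hand side of a binary operation with a plain Series.
result = left + right
```

Here `result` is built through `left._constructor`, so it comes back as a `SubclassedSeries`. The reverse order (`right + left`) is worth testing separately, since the plain Series is then the left operand.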

@@ -585,17 +606,18 @@ def to_flatten_inner(self, field: str) -> pd.Series:
>>> from nested_pandas import NestedFrame
>>> from nested_pandas.datasets import generate_data
>>> nf = generate_data(5, 2, seed=1).rename(columns={"nested": "inner"})
>>> nf["b"] = "b" # Shorten width of example output
Contributor

Funny! 🙏 for the comment. Is that because 'black' formatting interferes with doctests?

Successfully merging this pull request may close these issues.

Implement NestedSeries