Consider allowing 1 dimensional [ to drop the geometry column #2131

Open
@DavisVaughan

In dplyr and in many other places in the tidyverse, we program with 1 dimensional calls to [, such as df["x"], where we expect the result to have exactly 1 column, named x.

In ?dplyr::dplyr_extending, we discuss how this is one of the invariants that is required for compatibility with dplyr.
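As a rough illustration of that invariant, here is a minimal base R sketch. The helper name has_exact_columns() is hypothetical and is not part of dplyr; it only checks that a 1 dimensional [ call returns exactly the requested columns, in the requested order.

```r
# Hypothetical helper (not a dplyr function): checks that `df[cols]`
# returns exactly the columns named in `cols`, in that order.
has_exact_columns <- function(df, cols) {
  out <- df[cols]
  identical(names(out), cols)
}

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

has_exact_columns(df, "x")          # TRUE for a plain data frame
has_exact_columns(df, c("y", "x"))  # TRUE: order follows `cols`
```

A class whose [ method adds columns that were not requested would fail this check.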

But sf doesn't do this, and instead retains the geometry column as a "sticky" column:

library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.0, PROJ 9.1.1; sf_use_s2() is TRUE

nrows <- 10
geometry <- st_sfc(lapply(1:nrows, function(x) st_geometrycollection()))
df <- st_sf(id = 1:nrows, geometry = geometry)

df
#> Simple feature collection with 10 features and 1 field (with 10 geometries empty)
#> Geometry type: GEOMETRYCOLLECTION
#> Dimension:     XY
#> Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS:           NA
#>    id                 geometry
#> 1   1 GEOMETRYCOLLECTION EMPTY
#> 2   2 GEOMETRYCOLLECTION EMPTY
#> 3   3 GEOMETRYCOLLECTION EMPTY
#> 4   4 GEOMETRYCOLLECTION EMPTY
#> 5   5 GEOMETRYCOLLECTION EMPTY
#> 6   6 GEOMETRYCOLLECTION EMPTY
#> 7   7 GEOMETRYCOLLECTION EMPTY
#> 8   8 GEOMETRYCOLLECTION EMPTY
#> 9   9 GEOMETRYCOLLECTION EMPTY
#> 10 10 GEOMETRYCOLLECTION EMPTY

# `geometry` sticks around
df["id"]
#> Simple feature collection with 10 features and 1 field (with 10 geometries empty)
#> Geometry type: GEOMETRYCOLLECTION
#> Dimension:     XY
#> Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS:           NA
#>    id                 geometry
#> 1   1 GEOMETRYCOLLECTION EMPTY
#> 2   2 GEOMETRYCOLLECTION EMPTY
#> 3   3 GEOMETRYCOLLECTION EMPTY
#> 4   4 GEOMETRYCOLLECTION EMPTY
#> 5   5 GEOMETRYCOLLECTION EMPTY
#> 6   6 GEOMETRYCOLLECTION EMPTY
#> 7   7 GEOMETRYCOLLECTION EMPTY
#> 8   8 GEOMETRYCOLLECTION EMPTY
#> 9   9 GEOMETRYCOLLECTION EMPTY
#> 10 10 GEOMETRYCOLLECTION EMPTY

This has caused quite a bit of pain in dplyr over the years, and it recently bit me again in hardhat, where I also use df[cols] to select columns (tidymodels/hardhat#228).

In dplyr, algorithms underlying functions like arrange() and distinct() use df[i] to first select the columns to order by or compute the unique values of, so retaining extra columns here can be particularly problematic. I know sf has methods for these two verbs to work around this, but I think those could be avoided entirely if the geometry column wasn't sticky here.
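To make the failure mode concrete, here is a self-contained sketch that does not use sf or dplyr. The sticky_df class, new_sticky(), and which_distinct() are all hypothetical names invented for illustration; they only mimic the idea of a column that [ always retains, and a distinct()-like algorithm that trusts df[cols] to return just the key columns.

```r
# Hypothetical "sticky" class for illustration only; this is not sf's code.
new_sticky <- function(df, sticky) {
  structure(df, sticky = sticky, class = c("sticky_df", "data.frame"))
}

# 1 dimensional `[` that always keeps the sticky column
`[.sticky_df` <- function(x, i) {
  sticky <- attr(x, "sticky")
  cols <- union(i, sticky)
  new_sticky(as.data.frame(unclass(x)[cols]), sticky)
}

# Simplified distinct()-like algorithm (an assumption, not dplyr's
# implementation): select the key columns with `df[cols]`, then keep the
# first row for each unique combination of keys.
which_distinct <- function(df, cols) {
  keys <- df[cols]
  key <- do.call(paste, c(unname(as.list(keys)), sep = "\r"))
  which(!duplicated(key))
}

plain <- data.frame(id = c(1, 1, 2), geom = c("a", "b", "c"))
sticky <- new_sticky(plain, "geom")

which_distinct(plain, "id")   # 1 3: keys are just `id`
which_distinct(sticky, "id")  # 1 2 3: `geom` leaked into the keys
```

The algorithm is identical in both calls; only the behavior of [ differs, which is why the result silently changes.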

I think:

  • Having dplyr::select() retain sticky columns makes for a great user experience
  • Having df[i] retain sticky columns makes for a painful programming experience

This ^ is our general advice regarding sticky columns, and is how dplyr's grouped data frames work:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(g = 1, x = 2)
df <- group_by(df, g)

select(df, x)
#> Adding missing grouping variables: `g`
#> # A tibble: 1 × 2
#> # Groups:   g [1]
#>       g     x
#>   <dbl> <dbl>
#> 1     1     2

# Returns a bare tibble, not a grouped data frame
df["x"]
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1     2

It is also how tsibble works with its index column.

Would you ever consider allowing df[i] to only return exactly the columns selected by i? If the geometry column isn't selected, then the appropriate behavior would be to return a bare data frame or bare tibble depending on the underlying data structure.
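One way the requested behavior could look, sketched on a toy class rather than on sf itself: the geo_df class, its geom_col attribute, and new_geo() are hypothetical, and a real implementation in sf would need to handle much more (2 dimensional [, integer and logical i, tibbles, and so on).

```r
# Hypothetical toy class; names and structure are assumptions, not sf's API.
new_geo <- function(df, geom_col) {
  structure(df, geom_col = geom_col, class = c("geo_df", "data.frame"))
}

# 1 dimensional `[` that returns exactly the selected columns. The special
# class is kept only if the geometry column is among them.
`[.geo_df` <- function(x, i) {
  geom_col <- attr(x, "geom_col")
  out <- as.data.frame(unclass(x)[i])
  if (geom_col %in% names(out)) new_geo(out, geom_col) else out
}

df <- new_geo(data.frame(id = 1:3, geom = c("a", "b", "c")), "geom")

only_id <- df["id"]
class(only_id)   # "data.frame": geometry not selected, so a bare data frame
names(only_id)   # "id"

with_geom <- df[c("id", "geom")]
inherits(with_geom, "geo_df")  # TRUE: geometry selected, class retained
```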
