Description
In dplyr and in many other places in the tidyverse, we program with 1-dimensional calls to [, such as df["x"], where we expect that the result has exactly 1 column, and that the column is named x.
In ?dplyr::dplyr_extending, we discuss how this is one of the invariants that is required for compatibility with dplyr.
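As a rough sketch of that invariant in base R (the helper name check_1d_subset is made up for illustration and is not part of dplyr):

```r
# Hypothetical helper: TRUE if df[col] returns exactly one column named `col`.
check_1d_subset <- function(df, col) {
  out <- df[col]
  ncol(out) == 1L && identical(names(out), col)
}

df <- data.frame(id = 1:3, x = c("a", "b", "c"))
check_1d_subset(df, "x")
```

A plain data frame or tibble satisfies this check for any single existing column name.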
But sf doesn't do this, and instead retains the geometry column as a "sticky" column:
library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.0, PROJ 9.1.1; sf_use_s2() is TRUE
nrows <- 10
geometry <- st_sfc(lapply(1:nrows, function(x) st_geometrycollection()))
df <- st_sf(id = 1:nrows, geometry = geometry)
df
#> Simple feature collection with 10 features and 1 field (with 10 geometries empty)
#> Geometry type: GEOMETRYCOLLECTION
#> Dimension: XY
#> Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS: NA
#> id geometry
#> 1 1 GEOMETRYCOLLECTION EMPTY
#> 2 2 GEOMETRYCOLLECTION EMPTY
#> 3 3 GEOMETRYCOLLECTION EMPTY
#> 4 4 GEOMETRYCOLLECTION EMPTY
#> 5 5 GEOMETRYCOLLECTION EMPTY
#> 6 6 GEOMETRYCOLLECTION EMPTY
#> 7 7 GEOMETRYCOLLECTION EMPTY
#> 8 8 GEOMETRYCOLLECTION EMPTY
#> 9 9 GEOMETRYCOLLECTION EMPTY
#> 10 10 GEOMETRYCOLLECTION EMPTY
# `geometry` sticks around
df["id"]
#> Simple feature collection with 10 features and 1 field (with 10 geometries empty)
#> Geometry type: GEOMETRYCOLLECTION
#> Dimension: XY
#> Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS: NA
#> id geometry
#> 1 1 GEOMETRYCOLLECTION EMPTY
#> 2 2 GEOMETRYCOLLECTION EMPTY
#> 3 3 GEOMETRYCOLLECTION EMPTY
#> 4 4 GEOMETRYCOLLECTION EMPTY
#> 5 5 GEOMETRYCOLLECTION EMPTY
#> 6 6 GEOMETRYCOLLECTION EMPTY
#> 7 7 GEOMETRYCOLLECTION EMPTY
#> 8 8 GEOMETRYCOLLECTION EMPTY
#> 9 9 GEOMETRYCOLLECTION EMPTY
#> 10 10 GEOMETRYCOLLECTION EMPTY
This has caused quite a bit of pain in dplyr over the years, and has recently bitten me again in hardhat, where I also use df[cols] as a way to select columns (tidymodels/hardhat#228).
In dplyr, algorithms underlying functions like arrange() and distinct() use df[i] to first select the columns to order by or compute the unique values of, so retaining extra columns here can be particularly problematic. I know sf has methods for these two verbs to work around this, but I think those could be avoided entirely if the geometry column wasn't sticky here.
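To make the failure mode concrete, here is a minimal base R sketch. It assumes a toy class, sticky_df, whose [ method always keeps a geometry column. The class, its constructor, and n_unique_keys() are all invented for illustration. This is neither sf's implementation nor dplyr's actual algorithm, only the general shape of the problem:

```r
# Toy "sticky" class (hypothetical; not sf's implementation).
new_sticky <- function(df) {
  class(df) <- c("sticky_df", "data.frame")
  df
}

# 1d column subsetting only: always adds `geometry` to the selection.
`[.sticky_df` <- function(x, i) {
  cols <- union(i, "geometry")
  new_sticky(as.data.frame(x)[cols])
}

# Simplified stand-in for a distinct()-like step: select the key columns
# with df[cols], then count the unique rows among them.
n_unique_keys <- function(df, cols) {
  keys <- as.data.frame(df[cols])  # expected to hold only `cols`
  sum(!duplicated(keys))
}

plain  <- data.frame(id = c(1, 1, 2), geometry = c("a", "b", "c"))
sticky <- new_sticky(plain)

n_unique_keys(plain, "id")   # 2: keys contain only `id`
n_unique_keys(sticky, "id")  # 3: `geometry` leaked into the keys
```

The generic code is the same in both calls. Only the behavior of df[cols] differs, and that is enough to change the answer.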
I think:
- Having dplyr::select() retain sticky columns makes for a great user experience
- Having df[i] retain sticky columns makes for a painful programming experience
This ^ is our general advice regarding sticky columns, and is how dplyr's grouped data frames work:
library(dplyr, warn.conflicts = FALSE)
df <- tibble(g = 1, x = 2)
df <- group_by(df, g)
select(df, x)
#> Adding missing grouping variables: `g`
#> # A tibble: 1 × 2
#> # Groups: g [1]
#> g x
#> <dbl> <dbl>
#> 1 1 2
# Returns a bare tibble, not a grouped data frame
df["x"]
#> # A tibble: 1 × 1
#> x
#> <dbl>
#> 1 2
It is also how tsibble works with its index column.
Would you ever consider allowing df[i] to return exactly the columns selected by i? If the geometry column isn't selected, then the appropriate behavior would be to return a bare data frame or bare tibble depending on the underlying data structure.
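For illustration only, the requested behavior could look roughly like this on a toy class. Here geo_df and new_geo() are hypothetical names, the method handles only 1d df[i] calls, and none of this is meant as how sf would actually implement it:

```r
# Toy class standing in for a geometry-aware data frame (hypothetical).
new_geo <- function(df) {
  class(df) <- c("geo_df", "data.frame")
  df
}

# 1d column subsetting only: return exactly the columns selected by `i`.
`[.geo_df` <- function(x, i) {
  out <- as.data.frame(x)[i]
  if ("geometry" %in% names(out)) {
    out <- new_geo(out)  # keep the class only when geometry was selected
  }
  out
}

df <- new_geo(data.frame(id = 1:3, geometry = c("a", "b", "c")))

class(df["id"])                  # "data.frame"
class(df[c("id", "geometry")])   # "geo_df" "data.frame"
```

This mirrors the grouped data frame example above: dropping the special column falls back to the bare underlying structure.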