-
Notifications
You must be signed in to change notification settings - Fork 60
/
README.Rmd
233 lines (184 loc) · 9 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
---
output:
github_document:
html_preview: false
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
options(tibble.print_min = 3)
```
# 🏎💨vroom <a href='https:/vroom.r-lib.org'><img src='man/figures/logo.png' align="right" height="135" /></a>
<!-- badges: start -->
[![R-CMD-check](https://github.com/tidyverse/vroom/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/vroom/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/tidyverse/vroom/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/vroom?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/vroom)](https://cran.r-project.org/package=vroom)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
<!-- badges: end -->
```{r echo = FALSE, message = FALSE}
tm <- vroom::vroom(system.file("bench", "taxi.tsv", package = "vroom"))
versions <- vroom::vroom(system.file("bench", "session_info.tsv", package = "vroom"))
# Use the base version number for read.delim
versions$package[versions$package == "base"] <- "read.delim"
library(dplyr)
tbl <- tm %>% filter(type == "real", op == "read", reading_package %in% c("data.table", "readr", "read.delim") | manip_package == "base") %>%
rename(package = reading_package) %>%
left_join(versions) %>%
transmute(
package = package,
version = ondiskversion,
"time (sec)" = time,
speedup = max(time) / time,
"throughput" = paste0(prettyunits::pretty_bytes(size / time), "/sec")
) %>%
arrange(desc(speedup))
```
The fastest delimited reader for R, **`r filter(tbl, package == "vroom") %>% pull("throughput") %>% trimws()`**.
<img src="https://raw.githubusercontent.com/tidyverse/vroom/main/img/taylor.gif" align="right" />
But that's impossible! How can it be [so fast](https://vroom.r-lib.org/articles/benchmarks.html)?
vroom doesn't stop to actually _read_ all of your data, it simply indexes where each record is located so it can be read later.
The vectors returned use the [Altrep framework](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) to lazily load the data on-demand when it is accessed, so you only pay for what you use.
This lazy access is done automatically, so no changes to your R data-manipulation code are needed.
vroom also uses multiple threads for indexing, materializing non-character columns, and when writing to further improve performance.
```{r, echo = FALSE}
knitr::kable(tbl, digits = 2, align = "lrrrr")
```
## Features
vroom has nearly all of the parsing features of
[readr](https://readr.tidyverse.org) for delimited and fixed width files, including
- delimiter guessing\*
- custom delimiters (including multi-byte\* and Unicode\* delimiters)
- specification of column types (including type guessing)
- numeric types (double, integer, big integer\*, number)
- logical types
- datetime types (datetime, date, time)
- categorical types (characters, factors)
- column selection, like `dplyr::select()`\*
- skipping headers, comments and blank lines
- quoted fields
- double and backslashed escapes
- whitespace trimming
- windows newlines
- [reading from multiple files or connections\*](#reading-multiple-files)
- embedded newlines in headers and fields\*\*
- writing delimited files with as-needed quoting.
- robust to invalid inputs (vroom has been extensively tested with the
[afl](https://lcamtuf.coredump.cx/afl/) fuzz tester)\*.
\* *these are additional features not in readr.*
\*\* *requires `num_threads = 1`.*
## Installation
Install vroom from CRAN with:
```r
install.packages("vroom")
```
Alternatively, if you need the development version from
[GitHub](https://github.com/) install it with:
``` r
# install.packages("pak")
pak::pak("tidyverse/vroom")
```
## Usage
See [getting started](https://vroom.r-lib.org/articles/vroom.html)
to jump start your use of vroom!
vroom uses the same interface as readr to specify column types.
```{r, include = FALSE}
tibble::rownames_to_column(mtcars, "model") %>%
vroom::vroom_write("mtcars.tsv", delim = "\t")
```
```{r example}
vroom::vroom("mtcars.tsv",
col_types = list(cyl = "i", gear = "f",hp = "i", disp = "_",
drat = "_", vs = "l", am = "l", carb = "i")
)
```
```{r, include = FALSE}
unlink("mtcars.tsv")
```
## Reading multiple files
vroom natively supports reading from multiple files (or even multiple
connections!).
First we generate some files to read by splitting the nycflights dataset by
airline.
For the sake of the example, we'll just take the first 2 lines of each file.
```{r}
library(nycflights13)
purrr::iwalk(
split(flights, flights$carrier),
~ { .x$carrier[[1]]; vroom::vroom_write(head(.x, 2), glue::glue("flights_{.y}.tsv"), delim = "\t") }
)
```
Then we can efficiently read them into one tibble by passing the filenames directly to vroom.
The `id` argument can be used to request a column that reveals the filename that each row originated from.
```{r}
files <- fs::dir_ls(glob = "flights*tsv")
files
vroom::vroom(files, id = "source")
```
```{r, include = FALSE}
fs::file_delete(files)
```
## Learning more
- [Getting started with vroom](https://vroom.r-lib.org/articles/vroom.html)
- [📽 vroom: Because Life is too short to read slow](https://www.youtube.com/watch?v=RA9AjqZXxMU&t=10s) - Presentation at UseR!2019 ([slides](https://speakerdeck.com/jimhester/vroom))
- [📹 vroom: Read and write rectangular data quickly](https://www.youtube.com/watch?v=ZP_y5eaAc60) - a video tour of the vroom features.
## Benchmarks
The speed quoted above is from a real `r format(fs::fs_bytes(tm$size[[1]]))` dataset with `r format(tm$rows[[1]], big.mark = ",")` rows and `r tm$cols[[1]]` columns,
see the [benchmark article](https://vroom.r-lib.org/articles/benchmarks.html)
for full details of the dataset and
[bench/](https://github.com/tidyverse/vroom/tree/main/inst/bench) for the code
used to retrieve the data and perform the benchmarks.
# Environment variables
In addition to the arguments to the `vroom()` function, you can control the
behavior of vroom with a few environment variables. Generally these will not
need to be set by most users.
- `VROOM_TEMP_PATH` - Path to the directory used to store temporary files when
reading from a R connection. If unset defaults to the R session's temporary
directory (`tempdir()`).
- `VROOM_THREADS` - The number of processor threads to use when indexing and
parsing. If unset defaults to `parallel::detectCores()`.
- `VROOM_SHOW_PROGRESS` - Whether to show the progress bar when indexing.
Regardless of this setting the progress bar is disabled in non-interactive
settings, R notebooks, when running tests with testthat and when knitting
documents.
- `VROOM_CONNECTION_SIZE` - The size (in bytes) of the connection buffer when
reading from connections (default is 128 KiB).
- `VROOM_WRITE_BUFFER_LINES` - The number of lines to use for each buffer when
writing files (default: 1000).
There are also a family of variables to control use of the Altrep framework.
For versions of R where the Altrep framework is unavailable (R < 3.5.0) they
are automatically turned off and the variables have no effect. The variables
can take one of `true`, `false`, `TRUE`, `FALSE`, `1`, or `0`.
- `VROOM_USE_ALTREP_NUMERICS` - If set use Altrep for _all_ numeric types
(default `false`).
There are also individual variables for each type. Currently only
`VROOM_USE_ALTREP_CHR` defaults to `true`.
- `VROOM_USE_ALTREP_CHR`
- `VROOM_USE_ALTREP_FCT`
- `VROOM_USE_ALTREP_INT`
- `VROOM_USE_ALTREP_BIG_INT`
- `VROOM_USE_ALTREP_DBL`
- `VROOM_USE_ALTREP_NUM`
- `VROOM_USE_ALTREP_LGL`
- `VROOM_USE_ALTREP_DTTM`
- `VROOM_USE_ALTREP_DATE`
- `VROOM_USE_ALTREP_TIME`
## RStudio caveats
RStudio's environment pane calls `object.size()` when it refreshes the pane, which
for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes
([RStudio#4210](https://github.com/rstudio/rstudio/pull/4210),
[RStudio#4292](https://github.com/rstudio/rstudio/pull/4292)) for this issue,
so it is recommended you use at least that version.
## Thanks
- [Gabe Becker](https://github.com/gmbecker), [Luke
Tierney](https://homepage.divms.uiowa.edu/~luke/) and [Tomas Kalibera](https://github.com/kalibera) for
conceiving, Implementing and maintaining the [Altrep
framework](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html)
- [Romain François](https://github.com/romainfrancois), whose
[Altrepisode](https://web.archive.org/web/20200315075838/https://purrple.cat/blog/2018/10/14/altrep-and-cpp/) package
and [related blog-posts](https://web.archive.org/web/20200315075838/https://purrple.cat/blog/2018/10/14/altrep-and-cpp/) were a great guide for creating new Altrep objects in C++.
- [Matt Dowle](https://github.com/mattdowle) and the rest of the [Rdatatable](https://github.com/Rdatatable) team, `data.table::fread()` is blazing fast and great motivation to see how fast we could go faster!