One or more parsing issues, call `problems()` on your data frame for details
Sure enough there are problems:
> readr::problems(X)
# A tibble: 3,853,670 × 5
      row   col expected   actual            file
    <int> <int> <chr>      <chr>             <chr>
 1 433594     4 an integer ensembl           ""
 2 433594     5 a double   ncRNA_gene        ""
 3 433595     4 an integer ensembl           ""
 4 433595     5 a double   miRNA             ""
 5 433596     4 an integer ensembl           ""
 6 433596     5 a double   exon              ""
 7 433597     4 an integer cpg               ""
 8 433597     5 a double   biological_region ""
 9 433598     4 an integer Eponine           ""
10 433598     5 a double   biological_region ""
# ℹ 3,853,660 more rows
# ℹ Use `print(n = ...)` to see more rows
The parsed line 433594 looks like this:
> X[433594,]
# A tibble: 1 × 9
  seqid source type  start   end    score strand   phase attributes
  <chr> <chr>  <chr> <int> <dbl>    <dbl> <chr>    <int> <chr>
1 #     "#\n"  3        NA    NA 60677110 60677223    NA -
However if I unzip the file first then the problem goes away:
> X[433594,]
# A tibble: 1 × 9
seqid source type start end score strand phase attributes
<chr> <chr> <chr> <int> <dbl> <dbl> <chr> <int> <chr>
1 13 ensembl ncRNA_gene 60677110 60677223 NA - NA ID=gene:ENSMUSG…
(I can re-gzip the file to restore the problem)
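As a possible stopgap, the decompress-first workaround described above can be scripted so the uncompressed copy only lives in a temporary file. This is a minimal base-R sketch, not a fix for the underlying problem: `gunzip_to_temp()` is a hypothetical helper name, and the path and `read_tsv()` arguments in the usage comment are placeholders rather than the call used in this report.

```r
# Hypothetical helper: stream-decompress a .gz file into a temporary file
# and return the path of the uncompressed copy.
gunzip_to_temp <- function(gz_path, chunk_size = 1024 * 1024) {
  out_path <- tempfile(fileext = ".tsv")

  src <- gzfile(gz_path, open = "rb")   # decompressing read connection
  on.exit(close(src), add = TRUE)
  dst <- file(out_path, open = "wb")    # plain binary write connection
  on.exit(close(dst), add = TRUE)

  # Copy in fixed-size chunks so the whole file is never held in memory.
  repeat {
    chunk <- readBin(src, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break
    writeBin(chunk, dst)
  }
  out_path
}

# Usage (placeholder path and arguments; adjust to the real call):
# plain_path <- gunzip_to_temp("path/to/file.gff3.gz")
# X <- readr::read_tsv(plain_path, comment = "#", col_names = FALSE)
# unlink(plain_path)
```

The temporary copy still needs as much disk space as the uncompressed file, so this only helps when the goal is to avoid keeping that copy around permanently.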
One thing that may be relevant is that the file seems to be sprinkled with comment lines (they are '###\n') including one just around this problem line (but lots of others before this as well):
I have encountered exactly the same thing. I am very sure that my file is OK: it contains only tab-separated count numbers. If I unzip the gz file, I can read all 451163 lines (and 13857 columns), although I only read a selection of them. With the gz file, I only get 309927 lines.
The gz file size is 5857388 × 1024 bytes. Interestingly, this means that the code manages to read roughly 309927 / 451163 × 5857388 × 1024 = 4120309944 bytes, while 2^32 is 4294967296. Is there a 32-bit integer limit somewhere in the code where there should be a 64-bit integer? This looks very suspicious to me. I'm pretty convinced this is a bug, and that it likely involves a 32-bit variable of some kind. Could anyone look into this? It is pretty annoying, since these files are very large when uncompressed.
I'm running R on a Windows 10 64-bit machine, but it seems gavinband was on a Mac.
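For anyone who wants to check the estimate in the comment above, the arithmetic can be reproduced in a few lines of base R. This only restates the commenter's back-of-the-envelope calculation (lines read as a fraction of total lines, scaled by the gz file size); the variable names are illustrative, and the result is not evidence of where any limit actually sits in the code.

```r
# Figures quoted in the comment above.
gz_size_bytes <- 5857388 * 1024   # reported size of the .gz file
lines_total   <- 451163           # lines read from the uncompressed file
lines_read    <- 309927           # lines read from the .gz file

# Estimated bytes consumed before reading stopped, assuming lines are
# spread evenly through the file (the commenter's implicit assumption).
estimated_bytes <- lines_read / lines_total * gz_size_bytes

# Compare with the 32-bit unsigned limit mentioned in the comment.
limit_32bit <- 2^32
fraction_of_limit <- estimated_bytes / limit_32bit

cat("Estimated bytes read:", format(estimated_bytes, big.mark = ","), "\n")
cat("2^32:                ", format(limit_32bit, big.mark = ","), "\n")
cat("Fraction of 2^32:    ", round(fraction_of_limit, 3), "\n")
```

The estimate lands a few percent below 2^32, which is why the comment suspects a 32-bit quantity; since line lengths need not be uniform, the comparison is suggestive rather than exact.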
Thanks for making readr and tidyverse!
I am using read_tsv() (readr 2.1.4) to parse this largeish file from a public repository:
My code is:
However, this reports:
With correct results on that line:
Session info:
Many thanks for any help with this issue.