Converting retrieve_data() results to a data frame (tibble) #100
Comments
If that Is your grouping |
I think the The grouping was selected based on the names as defined in the description of various example search outputs (e.g. https://bacdive.dsmz.de/api/bacdive/bacdive_id/1/) that I checked. I also tried providing extra columns for I have only done fuzzy taxon name searches though (e.g. search term "Fusobacterium"), I'm not familiar with the rest of the database so I don't know if any other metadata can appear. But in terms of votes, I personally always prefer easily accessible 'tidy' data ;). Edit: the only issue with converting to a tibble with the above code is that it can sometimes take a while if you have many bacdive IDs. I don't know whether speed optimisation is important for this package, but one would maybe have to switch away from |
Thanks for the additional info :-) Speed is indeed a consideration, but in all my measurements so far, BacDive's server was the bottleneck. Until they speed it up, I wouldn't be worried about something like your above Looking into these
This causes a "left-/up-ward shift/creep" of the Do you mean this with "converting to a table (based on a condition of the object in the cell before un-nesting)"? |
Indeed - the server is for an average search still the slowest thing, taking longer than the 'table-isation' itself. Yes, screenshot 2 is exactly what I mean. I realise now I shouldn't have used the term 'unnesting' as that isn't what I actually meant. I actually meant that the
could be conditional e.g. if the second field of the unlisted string This would at least match the description here: https://bacdive.dsmz.de/api/bacdive/bacdive_id/2654/. |
I just realised the 'key' column is leftover from testing (before I renamed the columns to the bacdive categories). Only lines 15-17 are the issue. Thus this should have the correct columns and also have the condition for correcting the references lines:
library(dplyr)  # %>%, bind_rows(), filter(), mutate(), if_else()
library(tidyr)  # gather(), separate()
## get some search results
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")
## original pipe for converting list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>%
  unlist() %>%
  bind_rows() %>%
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field"))
## shows faulty reference column incorrectly putting field in subsection
data_bacdive_tib %>% filter(is.na(field))
#> # A tibble: 144 x 5
#> bacdive_id section subsection field value
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2654 references ID_referenc… NA 626
#> 2 2654 references ID_referenc… NA 20215
#> 3 2654 references ID_referenc… NA 20218
#> 4 2654 references reference1 NA Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295
#> 5 2654 references reference2 NA "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea, va…
#> 6 2654 references reference3 NA Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganisms.…
#> 7 5758 references ID_referenc… NA 9019
#> 8 5758 references ID_referenc… NA 20215
#> 9 5758 references ID_referenc… NA 20218
#>10 5758 references reference1 NA Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699
#> # ... with 134 more rows
## now fix the references field
data_bacdive_tib_fixed <- data_bacdive_tib %>%
  mutate(field = if_else(section == "references", subsection, field),
         subsection = if_else(section == "references", NA_character_, subsection))
## to show ID_references now correctly not in subsection
data_bacdive_tib_fixed %>% filter(is.na(field))
#> # A tibble: 0 x 5
#> # ... with 5 variables: bacdive_id <chr>, section <chr>, subsection <chr>, field <chr>, value <chr>
## shows ID_references now correctly in field
data_bacdive_tib_fixed %>% filter(is.na(subsection))
#> # A tibble: 144 x 5
#> bacdive_id section subsection field value
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2654 references NA ID_refere… 626
#> 2 2654 references NA ID_refere… 20215
#> 3 2654 references NA ID_refere… 20218
#> 4 2654 references NA reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295
#> 5 2654 references NA reference2 "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea,…
#> 6 2654 references NA reference3 Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganis…
#> 7 5758 references NA ID_refere… 9019
#> 8 5758 references NA ID_refere… 20215
#> 9 5758 references NA ID_refere… 20218
#>10 5758 references NA reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699
#> # ... with 134 more rows
Apologies for the confusion. I should've put in my original message the caveat: written after dealing with teething baby all day, may not make 100% sense |
Note to self: https://github.com/ropensci/roadoi#whats-returned may be a useful example to check, also their list-column use. |
First I want to say thank you for this package, I'm working on some metagenomic data with lots of 'unusual' taxa, and trying to find a good (accessible) database to get a quick summary of characteristics of these has been surprisingly difficult.
This package saved me a lot of headaches trying to 'manually' parse the API search results myself.
I have neither a bug nor feature request, rather just some info which might be useful for others.
You can use a sequence of tidyverse tools to convert the results from the BacDiveR::retrieve_data() function to a clean(ish) table format using the following code:
As far as I can see with the table from the search above, the only issue is that the references field is not correctly formatted (it is placed in the subsection rather than the field column, hence the 'NA' values), because in the original results it is a data frame rather than a list itself.
This worked for me using BacDiveR_0.7.0
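For anyone who wants to try the reshaping steps discussed in this thread without querying the BacDive server, here is a minimal sketch on a small made-up nested list. The object `mock_raw` and its contents are purely hypothetical stand-ins for what retrieve_data() returns, and tibble::enframe() is swapped in for the bind_rows()/gather() step, so treat this as an illustration of the idea rather than the exact code used above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for retrieve_data() output (NOT real BacDive data):
# one entry per BacDive ID, with "references" nested one level less deeply
# than the other sections.
mock_raw <- list(
  "2654" = list(
    taxonomy_name = list(strains = list(genus = "Fusobacterium")),
    references = list(ID_reference = "626", reference = "DSM 1295")
  )
)

tidy <- mock_raw %>%
  unlist() %>%                                   # named character vector, names joined by "."
  enframe(name = "grouped_category", value = "value") %>%
  separate(grouped_category, sep = "\\.",
           into = c("bacdive_id", "section", "subsection", "field"),
           fill = "right") %>%                   # short names get NA in the last column
  # references have only three name parts, so move them from subsection to field
  mutate(field = if_else(section == "references", subsection, field),
         subsection = if_else(section == "references", NA_character_, subsection))

tidy
```

The `fill = "right"` argument only tells separate() to pad short names with NA without warning; the conditional mutate() is the same correction as in the comments above.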