Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't build new SQL string without dataset in connection #434

Closed
abalter opened this issue Feb 23, 2021 · 8 comments
Closed

Can't build new SQL string without dataset in connection #434

abalter opened this issue Feb 23, 2021 · 8 comments
Labels
reprex needs a minimal reproducible example

Comments

@abalter
Copy link

abalter commented Feb 23, 2021

I want to write custom functions to generate queries. In general usage, I have had no problems setting up connections with only the project and then accessing a table with the dataset name:

conn = dbConnect(bigquery(), "projname")

bq_table = tbl(conn, "dataset.tablename")

I've never had any problems with this in general use, even with cross-source tables.

However, when building up queries from strings, the dataset needs to be in the connection. Reprex below.

library(magrittr)
library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(bigrquery)
library(DBI)
library(RPostgres)
library(RPostgreSQL)

bq_deauth()
bq_auth(email="[email protected]")


getCount = function(bq_table)
{
  ### extract connection
  conn = bq_table$src$con
  print(conn)

  ### build sql string
  sql_string = paste0(
    "SELECT COUNT(*) FROM (",
    dbplyr::sql_render(bq_table),
    ")"
  ) %>%
    str_replace("\n", "")
  print(sql_string)

  ### build query object
  sql_query = dbplyr::sql(sql_string)
  print(sql_query)

  new_bq_table = tbl(conn, sql_query)
}


conn = dbConnect(
  bigquery(),
  project = "proj1-305622"
)

# df = tibble(
#   A = sample(1:10, 5),
#   B = letters[1:5]
# )
#
# dbWriteTable(
#   conn,
#   name = "dataset.df",
#   value = df1,
#   overwrite = T
# )

df_bq = tbl(conn, "dataset.df")

df_bq %>% getCount()
#> <BigQueryConnection>
#>   Billing: proj1-305622
#> [1] "SELECT COUNT(*) FROM (SELECT *FROM `dataset.df`)"
#> <SQL> SELECT COUNT(*) FROM (SELECT *FROM `dataset.df`)
#> Error: dataset is not a string (a length one character vector).

conn = dbConnect(
  bigquery(),
  project = "proj1-305622",
  dataset = "dataset"
)

df_bq = tbl(conn, "dataset.df")


df_bq %>% getCount()
#> <BigQueryConnection>
#>   Dataset: proj1-305622.dataset
#>   Billing: proj1-305622
#> [1] "SELECT COUNT(*) FROM (SELECT *FROM `dataset.df`)"
#> <SQL> SELECT COUNT(*) FROM (SELECT *FROM `dataset.df`)

Created on 2021-02-23 by the reprex package (v1.0.0)

@abalter
Copy link
Author

abalter commented Feb 25, 2021

But it does seem like an inconsistency as in #433

Oh, and the SQL assembled by dbplyr runs fine, but not a user-constructed SQL string.

@meztez
Copy link
Contributor

meztez commented Feb 26, 2021

Digging through the call stack when calling dbplyr_query_fields it will end up using

db_query_fields.BigQueryConnection <- function(con, sql) {

Which you can see is trying to access the connection dataset. In your case it is NULL.
So the call to bq_dataset(con@project, con@dataset) will fail in

function (con, sql) 
{
  if (dbplyr::is.sql(sql)) {
    ds <- bq_dataset(con@project, con@dataset)
    fields <- bq_query_fields(sql, con@billing, default_dataset = ds)
  }
  else {
    tb <- as_bq_table(con, sql)
    fields <- bq_table_fields(tb)
  }
  vapply(fields, "[[", "name", FUN.VALUE = character(1))
}

@abalter
Copy link
Author

abalter commented Feb 26, 2021

EXACTLY. But the dataset is specified when I access the table mytable = tbl(conn, dataset.table).

I really should have included something in the original reprex. Notice the portion below that I've marked out. The query as generated by dplyr "works" but not the hand-coded query.

library(magrittr)
library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(bigrquery)
library(DBI)
library(RPostgres)
library(RPostgreSQL)

bq_deauth()
bq_auth(email="[email protected]")


getCount = function(bq_table)
{
  ### extract connection
  conn = bq_table$src$con
  print(conn)

  ### build sql string
  sql_string = paste0(
    "SELECT COUNT(*) FROM (",
    dbplyr::sql_render(bq_table),
    ")"
  ) %>%
    str_replace("\n", "")
  print(sql_string)

  ### build query object
  sql_query = dbplyr::sql(sql_string)
  print(sql_query)

  new_bq_table = tbl(conn, sql_query)
}


conn = dbConnect(
  bigquery(),
  project = "proj1-305622"
)

# df = tibble(
#   A = sample(1:10, 5),
#   B = letters[1:5]
# )
#
# dbWriteTable(
#   conn,
#   name = "dataset.df",
#   value = df1,
#   overwrite = T
# )

df_bq = tbl(conn, "dataset.df")

#########     works here    ################

df_bq %>% count() %>% show_query()
#> <SQL>
#> SELECT count(*) AS `n`
#> FROM `dataset.df`
df_bq %>% count()
#> # Source:   lazy query [?? x 1]
#> # Database: BigQueryConnection
#>       n
#>   <int>
#> 1     5

############################################

df_bq %>% getCount()
#> <BigQueryConnection>
#>   Billing: proj1-305622
#> [1] "SELECT COUNT(*) FROM (SELECT *FROM `dataset.df`)"
#> <SQL> SELECT COUNT(*) FROM (SELECT *FROM `dataset.df`)
#> Error: dataset is not a string (a length one character vector).

conn = dbConnect(
  bigquery(),
  project = "proj1-305622",
  dataset = "dataset"
)

df_bq = tbl(conn, "dataset.df")


df_bq %>% getCount()
#> <BigQueryConnection>
#>   Dataset: proj1-305622.dataset
#>   Billing: proj1-305622
#> [1] "SELECT COUNT(*) FROM (SELECT *FROM `dataset.df`)"
#> <SQL> SELECT COUNT(*) FROM (SELECT *FROM `dataset.df`)

Created on 2021-02-26 by the reprex package (v1.0.0)

@abalter
Copy link
Author

abalter commented Mar 1, 2021

Here is putting it another way--querying with the connection and a string works using DBI directly, but not in bigrquery. Same connection, same query string:

> df_bq %>% count() %>% sql_render()
<SQL> SELECT count(*) AS `n`
FROM `dataset.df`
> df_bq %>% count()
Complete
Billed: 0 B
Downloading 1 rows in 1 pages.
# Source:   lazy query [?? x 1]
# Database: BigQueryConnection
      n
  <int>
1     5
> 
> sql_string = df_bq %>% count %>% sql_render()
> print(sql_string)
<SQL> SELECT count(*) AS `n`
FROM `dataset.df`
> 
> dbGetQuery(conn, sql_string)
Complete
Billed: 0 B
Downloading 1 rows in 1 pages.
# A tibble: 1 x 1
      n
  <int>
1     5
> tbl_bq = tbl(conn, sql_string)
Error: dataset is not a string (a length one character vector).

@shukryzablah
Copy link

Is there a workaround? I'm running into the same issue. I didn't add the dataset parameter because I wanted to join across projects. Upon trying to use a SQL expression in tbl, I get the same error about the dataset not being a string (because it wasn't provided in the connection).

@hadley
Copy link
Member

hadley commented Nov 3, 2023

I think this is a simpler reprex, but it hits #540, so that will need to be resolved first.

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(bigrquery)
options(gargle_oauth_email = TRUE)

ds <- bq_test_dataset()
#> ℹ The bigrquery package is using a cached token for '[email protected]'.

con <- DBI::dbConnect(
  bigquery(),
  project = bq_test_project(),
  dataset = ds$dataset
)

df <- tibble(A = sample(1:10, 5), B = letters[1:5])
DBI::dbWriteTable(con, "df", df)

getCount <- function(bq_table) {
  sql <- paste0("SELECT COUNT(*) FROM (", dbplyr::sql_render(bq_table), ")")
  tbl(bq_table$src$con, dbplyr::sql(sql))
}
getCount(tbl(con, "df"))
#> Warning: <BigQueryConnection> uses an old dbplyr interface
#> ℹ Please install a newer version of the package or contact the maintainer
#> This warning is displayed once every 8 hours.
#> Error in curl::curl_fetch_memory(url, handle = handle): URL rejected: Malformed input to a URL function

Created on 2023-11-03 with reprex v2.0.2

@hadley hadley added the reprex needs a minimal reproducible example label Nov 3, 2023
@hadley
Copy link
Member

hadley commented Nov 7, 2023

This now seems to work:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(bigrquery)
options(gargle_oauth_email = TRUE)

ds <- bq_test_dataset()
#> ℹ The bigrquery package is using a cached token for '[email protected]'.

con <- DBI::dbConnect(
  bigquery(),
  project = bq_test_project(),
  dataset = ds$dataset
)

df <- tibble(A = sample(1:10, 5), B = letters[1:5])
DBI::dbWriteTable(con, "df", df)

getCount <- function(bq_table) {
  sql <- paste0("SELECT COUNT(*) FROM (", dbplyr::sql_render(bq_table), ")")
  tbl(bq_table$src$con, dbplyr::sql(sql))
}
getCount(tbl(con, "df"))
#> # Source:   SQL [1 x 1]
#> # Database: BigQueryConnection
#>     f0_
#>   <int>
#> 1     5

Created on 2023-11-07 with reprex v2.0.2.9000

@hadley hadley closed this as completed Nov 7, 2023
@abalter
Copy link
Author

abalter commented Nov 7, 2023

Cool! I'll check it out. I believe I came to this problem trying to write a nest/unnest translation for BigQuery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reprex needs a minimal reproducible example
Projects
None yet
Development

No branches or pull requests

4 participants