A Singer target that writes data to Google BigQuery.
target-bigquery
works together with any other Singer Tap to move data from sources like Braintree, Freshdesk and Hubspot to Google BigQuery.
Email: [email protected]
- GCP web console -> API & Services -> Library
- Search for BigQuery API -> click Enable
It is recommended to use target-bigquery
with a service account.
Create a service account credential:
- API & Services -> Credentials -> Create Credentials -> Service account
- Under Service account details, enter Service account name. Click Create
- Under Grant this service account access to the project, select BigQuery Data Editor and BigQuery Job User as the minimal set of permissions. Click Done
- BigQuery Data Editor permission allows the service account to access (and change) the data.
- BigQuery Job User permission allows the service account to actually run a load or select job.
- On the API & Services Credentials screen, select the service account you just created.
- Click ADD KEY -> Create new key -> JSON key. Download the service account credential JSON file.
-
Name the file client_secrets.json. You can place the file where
target-bigquery
will be executed or provide a path to the service account json file. -
Set a GOOGLE_APPLICATION_CREDENTIALS environment variable on the machine, where the value is the fully qualified path to client_secrets.json file:
- Creating an environment variable on a Windows 10 machine
- Creating an environment variable on a Mac machine
Create a file called target-config.json in your working directory, following this sample target-config.json file (or see the example below).
- Required parameters are the project name
project_id
anddataset_id
. - Optional parameters are
table_suffix
,validate records
,add_metadata_columns
andlocation
. - Default data location is "US" (if your location is not the US, you can indicate a different location in your target-config.json file).
- The data will be written to the dataset specified in your target-config.json.
- If you do not have the dataset with this name yet, it will be created.
- The table will be created.
Sample target-config.json file:
{
"project_id": "{your_GCP_project_id}",
"dataset_id": "{your_dataset_id}",
"table_suffix": "_sample_table_suffix",
"validate_records": true,
"add_metadata_columns": true,
"location": "EU"
}
This is a little bit outside of the scope of this documentation, but let's quickly take a look at sample tap config files as well, to see how tap and target work together.
Sample tap-config.json file configures the data source:
{ "base": "USD",
"start_date": "2021-01-01"
}
- Sample state.json file is now just a empty JSON file
{}
, and it will be written or updated when the tap runs. - This is an optional file.
- The tap will write the date into state.json file, indicating when the data loading stopped at.
- Next time you run the tap, it'll continue from this date in the state file. If state.json file is provided, then it takes presedence over the "start_date" in the tap config file.
Learn more: https://github.com/singer-io/getting-started
-
First, make sure Python 3 is installed on your system or follow these installation instructions for Mac or Ubuntu.
-
target-bigquery
can be run with any Singer Tap, but we'll use tap-exchangeratesapi - which pulls currency exchange rate data from a public data set - as an example. (Learn more about Exchangeratesapi.io) -
In the `sample_config/target-config. file, enter the id of your GCP (Google Cloud Platform Project) - you can find it on the Home page of your GCP web console.
Sample target-config.json file:
{
"project_id": "{your project id}",
"dataset_id": "exchangeratesapi"
}
- These commands will install
tap-exchangeratesapi
andtarget-bigquery
with pip and then run them together, piping the output oftap-exchangeratesapi
totarget-bigquery
:
> cd "{your project root directory}"
› pip install tap-exchangeratesapi git+git://github.com/adswerve/target-bigquery
› tap-exchangeratesapi --config sample_config/tap-exchange-rate-api.json | ^
target-bigquery --config sample_config/target-config.json > sample_config/state.json
-
"^" on a Windows machine indicates a new line. On a Mac, use "\".
-
If you're using a different Tap, substitute
tap-exchangeratesapi
in the final command above to the command used to run your Tap.
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can:
- improve query performance,
- control costs by reducing the number of bytes read by a query.
You can partition BigQuery tables by:
-
Ingestion time: Tables are partitioned based on the data's ingestion (load) time or arrival time.
-
Date/timestamp/datetime: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column.
-
Integer range: Tables are partitioned based on an integer column.
- When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema.
- The columns you specify are used to colocate related data.
- When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
- Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data.
- You can cluster up to 4 columns in a table
Learn more about BigQuery partitioned and clustered tables:
https://cloud.google.com/bigquery/docs/partitioned-tables
https://cloud.google.com/bigquery/docs/clustered-tables
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
https://medium.com/analytics-vidhya/bigquery-partitioning-clustering-9f84fc201e61
Example 1: tap-recharge data
This is not a follow-along example. Additional tap configuration would be required to run it. This example is just for illustration purposes.
If we were to load tap-recharge charges table into BigQuery, we could partition it by date.
For clustering, we can selected:
- foreign keys and
- columns likely to appear in
WHERE
andGROUP BY
statements
To configure partitioning and clustering in BigQuery destination tables, we create target-tables-config.json:
{
"streams": {
"charges": {
"partition_field": "updated_at",
"cluster_fields": ["type", "status", "customer_id", "transaction_id"]
}
}
}
We can verify in BigQuery web UI that partitioning and clustering worked:
Example 2: tap-exchangeratesapi data
You can follow along and try this example on your own. We will continue where we left off in Step 4: Install and Run above.
- Take a look at our tap-exchangeratesapi data. We have:
- dates
- datetimes
- floats which show exchange rates
In our tap-exchangeratesapi example, no columns are good candidates for clustering.
You can only set up partitioning.
- Create your target-tables-config.json with partitioning configuration. Leave cluster fields blank:
{
"streams": {
"exchange_rate": {
"partition_field": "date",
"cluster_fields": []
}
}}
-
Clear you state.json, so it's an empty JSON
{}
, because we want to load all data again. Skip this step, if you didn't previously load this data in Step 4 above. -
Delete your BigQuery destination table exchangeratesapi, because we want to re-load it again from scratch. Skip this step, if you didn't previously load this data in Step 4 above.
-
Load data data into BigQuery, while also configuring target tables:
› tap-exchangeratesapi --config sample_config/tap-exchange-rate-api.json | ^
target-bigquery --config sample_config/target-config.json ^
-t sample_config/target-tables-config.json > sample_config/state.json
- "^" on a Windows machine indicates a new line. On a Mac, use "\".
- Verify in BigQuery web UI that partitioning worked: