Commit 78fe330 (initial commit, 0 parents)

Showing 18 changed files with 1,626 additions and 0 deletions.
**.gitignore**

```
# IDE
.vscode
.idea/*

# Python
__pycache__/
*.py[cod]
*$py.class
.virtualenvs
*.egg-info/
*__pycache__/
*~
dist/

# Singer JSON files
properties.json
config.json
state.json

*.db
.DS_Store
venv
env
blog_old.md
node_modules
*.pyc
tmp

# Docs
docs/_build/
docs/_templates/
```
**LICENSE**

MIT License

Copyright (c) 2019 TransferWise Ltd. (https://transferwise.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
**README.md**

# pipelinewise-target-snowflake

[![PyPI version](https://badge.fury.io/py/pipelinewise-target-snowflake.svg)](https://badge.fury.io/py/pipelinewise-target-snowflake)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pipelinewise-target-snowflake.svg)](https://pypi.org/project/pipelinewise-target-snowflake/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[Singer](https://www.singer.io/) target that loads data into Snowflake following the [Singer spec](https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md).

This is a [PipelineWise](https://transferwise.github.io/pipelinewise) compatible target connector.

## How to use it

The recommended method of running this target is to use it from [PipelineWise](https://transferwise.github.io/pipelinewise). When running it from PipelineWise you don't need to configure this target with JSON files, and most things are automated. Please check the related documentation at [Target Snowflake](https://transferwise.github.io/pipelinewise/connectors/targets/snowflake.html).

If you want to run this [Singer Target](https://singer.io) independently, please read further.

## Install

First, make sure Python 3 is installed on your system, or follow these
installation instructions for [Mac](http://docs.python-guide.org/en/latest/starting/install3/osx/) or
[Ubuntu](https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-ubuntu-16-04).

It's recommended to use a virtualenv:
```bash
python3 -m venv venv
. venv/bin/activate
pip install pipelinewise-target-snowflake
```

or

```bash
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install .
```

## Flow diagram

![Flow Diagram](flow-diagram.jpg)

### To run

Like any other target that follows the Singer specification:

`some-singer-tap | target-snowflake --config [config.json]`

It reads incoming messages from STDIN and uses the properties in `config.json` to upload data into Snowflake.

**Note**: To avoid version conflicts, run the tap and the target in separate virtual environments.
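
As a concrete illustration, here is a sketch of two common ways to wire the pieces together. The `messages.json` file and the virtualenv paths are hypothetical placeholders, not part of this repository:

```bash
# Replay previously captured Singer messages (one JSON object per line)
# into Snowflake; messages.json is a hypothetical example file:
cat messages.json | target-snowflake --config config.json > state.json

# Or pipe a tap that lives in its own virtualenv into the target, e.g. a
# hypothetical tap-mysql environment, to avoid dependency conflicts:
~/venvs/tap-mysql/bin/tap-mysql --config tap_config.json \
  | ~/venvs/target-snowflake/bin/target-snowflake --config config.json \
  > state.json
```

Redirecting STDOUT to `state.json` captures the STATE messages the target emits, so the next run can resume from where the previous one finished.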

### Pre-requirements

You need to create two objects in a Snowflake schema before you start using this target.

1. A named external stage object on S3. This will be used to upload the CSV files to S3 and to MERGE data into Snowflake tables.

```
CREATE STAGE {schema}.{stage_name}
url='s3://{s3_bucket}'
credentials=(AWS_KEY_ID='{aws_key_id}' AWS_SECRET_KEY='{aws_secret_key}')
encryption=(MASTER_KEY='{client_side_encryption_master_key}');
```

The `encryption` option is optional and is used for client-side encryption. If you want client-side encryption enabled, you'll need
to define the same master key in the target `config.json`. Further details are below in the Configuration settings section.

2. A named file format. This will be used by the MERGE/COPY commands to parse the CSV files correctly from S3. One way to create both objects from the shell is sketched after this list.

`CREATE file format IF NOT EXISTS {schema}.{file_format_name} type = 'CSV' escape='\\' field_optionally_enclosed_by='"';`
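
If you use Snowflake's SnowSQL CLI, one way to create both objects is to save the two statements above (with your values substituted) into a file and execute it. The account, user, and file names below are illustrative assumptions:

```bash
# Run the pre-requirement CREATE statements from a SQL file via SnowSQL.
# create_objects.sql is a hypothetical file containing the two statements
# above with the {placeholders} filled in:
snowsql -a rtXXXXX.eu-central-1 -u my_user -f create_objects.sql
```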

### Configuration settings

Running the target connector requires a `config.json` file. Example with the minimal settings:

```json
{
  "account": "rtXXXXX.eu-central-1",
  "dbname": "my_database",
  "user": "my_user",
  "password": "secret",
  "warehouse": "my_virtual_warehouse",
  "aws_access_key_id": "secret",
  "aws_secret_access_key": "secret",
  "s3_bucket": "bucket_name",
  "stage": "snowflake_external_stage_object_name",
  "file_format": "snowflake_file_format_object_name",
  "default_target_schema": "my_target_schema"
}
```
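
Since a malformed config file is a common first stumbling block, a quick sanity check using only the Python standard library can be worthwhile:

```bash
# Verify that config.json parses as valid JSON before starting the target:
python3 -m json.tool config.json
```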

Full list of options in `config.json`:

| Property | Type | Required? | Description |
|-------------------------------------|---------|------------|---------------------------------------------------------------|
| account | String | Yes | Snowflake account name (i.e. rtXXXXX.eu-central-1) |
| dbname | String | Yes | Snowflake Database name |
| user | String | Yes | Snowflake User |
| password | String | Yes | Snowflake Password |
| warehouse | String | Yes | Snowflake virtual warehouse name |
| aws_access_key_id | String | Yes | S3 Access Key Id |
| aws_secret_access_key | String | Yes | S3 Secret Access Key |
| s3_bucket | String | Yes | S3 Bucket name |
| s3_key_prefix | String | | (Default: None) A static prefix before the generated S3 key names. Using prefixes you can upload files into specific directories in the S3 bucket. |
| stage | String | Yes | Named external stage name created in the pre-requirements section. Has to be a fully qualified name including the schema name. |
| file_format | String | Yes | Named file format name created in the pre-requirements section. Has to be a fully qualified name including the schema name. |
| batch_size | Integer | | (Default: 100000) Maximum number of rows in each batch. At the end of each batch, the rows in the batch are loaded into Snowflake. |
| default_target_schema | String | | Name of the schema where the tables will be created. If `schema_mapping` is not defined, then every stream sent by the tap is loaded into this schema. |
| default_target_schema_select_permission | String | | Grant USAGE privilege on newly created schemas and grant SELECT privilege on newly created tables to a specific role or a list of roles. If `schema_mapping` is not defined, then every stream sent by the tap is granted accordingly. |
| schema_mapping | Object | | Useful if you want to load multiple streams from one tap into multiple Snowflake schemas.<br><br>If the tap sends the `stream_id` in `<schema_name>-<table_name>` format, then this option overwrites the `default_target_schema` value. Note that using `schema_mapping` you can overwrite the `default_target_schema_select_permission` value to grant SELECT permissions to different groups per schema, or optionally you can create indices automatically for the replicated tables.<br><br>**Note**: This is an experimental feature, and it is recommended to use it via PipelineWise YAML files, which generate the object mapping in the right JSON format. For further info check a [PipelineWise YAML Example] |
| disable_table_cache | Boolean | | (Default: False) By default the connector caches the available table structures in Snowflake at startup. This way it doesn't need to run additional queries while ingesting data to check whether altering the target tables is required. With the `disable_table_cache` option you can turn off this caching. You will always see the most recent table structures, at the cost of extra queries at runtime. |
| client_side_encryption_master_key | String | | (Default: None) When this is defined, client-side encryption is enabled. The data in S3 will be encrypted, and no third parties, including Amazon AWS and any ISPs, can see the data in the clear. The Snowflake COPY command will decrypt the data once it's in Snowflake. The master key must be 256 bits long and encoded as a base64 string. |
| client_side_encryption_stage_object | String | | (Default: None) Required when `client_side_encryption_master_key` is defined. The name of the encrypted stage object in Snowflake that was created separately, using the same encryption master key. |
| add_metadata_columns | Boolean | | (Default: False) Metadata columns add extra row-level information about data ingestion (i.e. when the row was read from the source, when it was inserted or deleted in Snowflake, etc.). Metadata columns are created automatically by adding extra columns to the tables, prefixed with `_SDC_`. The column names follow the Stitch naming conventions documented at https://www.stitchdata.com/docs/data-structure/integration-schemas#sdc-columns. Enabling metadata columns will flag deleted rows by setting the `_SDC_DELETED_AT` metadata column. Without the `add_metadata_columns` option, rows deleted by Singer taps will not be recognisable in Snowflake. |
| hard_delete | Boolean | | (Default: False) When the `hard_delete` option is true, DELETE SQL commands will be performed in Snowflake to delete rows in tables. This is achieved by continuously checking the `_SDC_DELETED_AT` metadata column sent by the Singer tap. Because deleting rows requires metadata columns, the `hard_delete` option automatically enables the `add_metadata_columns` option as well. |
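
To illustrate how the optional settings combine with the required ones, here is a sketch of a fuller `config.json` written from the shell; every value is an illustrative assumption, not a default:

```bash
# Write an example config; replace all values below with your own.
cat > config.json <<'EOF'
{
  "account": "rtXXXXX.eu-central-1",
  "dbname": "my_database",
  "user": "my_user",
  "password": "secret",
  "warehouse": "my_virtual_warehouse",
  "aws_access_key_id": "secret",
  "aws_secret_access_key": "secret",
  "s3_bucket": "bucket_name",
  "s3_key_prefix": "snowflake-imports/",
  "stage": "my_schema.my_external_stage",
  "file_format": "my_schema.my_csv_format",
  "default_target_schema": "my_target_schema",
  "batch_size": 100000,
  "add_metadata_columns": true,
  "hard_delete": true
}
EOF
```

With `hard_delete` set to true, `add_metadata_columns` is enabled implicitly, so the explicit flag above is only for readability.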

### To run tests:

1. Define the environment variables required to run the tests:
```
export TARGET_SNOWFLAKE_ACCOUNT=<snowflake-account-name>
export TARGET_SNOWFLAKE_DBNAME=<snowflake-database-name>
export TARGET_SNOWFLAKE_USER=<snowflake-user>
export TARGET_SNOWFLAKE_PASSWORD=<snowflake-password>
export TARGET_SNOWFLAKE_WAREHOUSE=<snowflake-warehouse>
export TARGET_SNOWFLAKE_SCHEMA=<snowflake-schema>
export TARGET_SNOWFLAKE_AWS_ACCESS_KEY=<aws-access-key-id>
export TARGET_SNOWFLAKE_AWS_SECRET_ACCESS_KEY=<aws-secret-access-key>
export TARGET_SNOWFLAKE_S3_BUCKET=<s3-external-bucket>
export TARGET_SNOWFLAKE_S3_KEY_PREFIX=<bucket-directory>
export TARGET_SNOWFLAKE_STAGE=<stage-object-with-schema-name>
export TARGET_SNOWFLAKE_FILE_FORMAT=<file-format-object-with-schema-name>
export CLIENT_SIDE_ENCRYPTION_MASTER_KEY=<client_side_encryption_master_key>
export CLIENT_SIDE_ENCRYPTION_STAGE_OBJECT=<client_side_encryption_stage_object>
```

2. Install python dependencies in a virtual env and run nose unit and integration tests:
```
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install .
pip install nose
```

3. To run unit tests:
```
nosetests --where=tests/unit
```

4. To run integration tests:
```
nosetests --where=tests/integration
```

### To run pylint:

1. Install python dependencies and run the python linter:
```
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install .
pip install pylint
pylint target_snowflake -d C,W,unexpected-keyword-arg,duplicate-code
```
**requirements.txt**

```
idna==2.7
singer-python==5.1.1
snowflake-connector-python==1.7.4
boto3==1.9.33
inflection==0.3.1
joblib==0.13.2
```
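
These pins are also declared in setup.py, so `pip install .` pulls them in automatically; for ad-hoc experimentation they can be installed directly from this file:

```bash
# Install the pinned dependencies without installing the package itself:
pip install -r requirements.txt
```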
**setup.py**

```python
#!/usr/bin/env python

from setuptools import setup

setup(name="pipelinewise-target-snowflake",
      version="1.0.0",
      description="Singer.io target for loading data to Snowflake - PipelineWise compatible",
      author="TransferWise",
      url='https://github.com/transferwise/pipelinewise-target-snowflake',
      classifiers=["Programming Language :: Python :: 3 :: Only"],
      py_modules=["target_snowflake"],
      install_requires=[
          'idna==2.7',
          'singer-python==5.1.1',
          'snowflake-connector-python==1.7.4',
          'boto3==1.9.33',
          'inflection==0.3.1',
          'joblib==0.13.2'
      ],
      entry_points="""
          [console_scripts]
          target-snowflake=target_snowflake:main
      """,
      packages=["target_snowflake"],
      package_data={},
      include_package_data=True,
      )
```
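
For local development, an editable install is one common way to exercise this setup.py and the `target-snowflake` console script it registers; a minimal sketch:

```bash
# Editable (development) install: code changes in target_snowflake take
# effect without reinstalling the package.
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install -e .
```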