This is a command-line tool for uploading data into your Datateer data lake.
The upload agent pushes files into an AWS S3 bucket, where they are picked up for ingestion and further processing.
Ensure you have Python and pip installed, then follow these steps (a full example session follows the list):

- Install with `pip install datateer-upload-agent`
- Do one-time agent configuration with `datateer config upload-agent`
- Do one-time feed configuration with `datateer config feed`
- Upload data with `datateer upload <feed_key> <path>`
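Put together, a first-time session might look roughly like the sketch below; the feed key `orders_feed` and the file path are placeholders for your own values:

```bash
# Install the agent (assumes Python and pip are already available)
pip install datateer-upload-agent

# One-time setup: configure the agent, then configure a feed
datateer config upload-agent
datateer config feed

# Upload a point-in-time extract for the feed (key and path are placeholders)
datateer upload orders_feed ./my_exported_data/orders.csv
```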
All data in the data lake has the following metadata (illustrated after the list):

- A provider is an organization that provides data. This could be your organization if you are pushing data from an internal database or application.
- A source is the system or application that provides the data. A provider can provide data from one or more systems.
- A feed is an independent data feed. A source can provide one or more feeds. For example, if the source is a database, each feed could represent a single table or view. If the source is an API, each feed could represent a single entity.
- A file is a data file, such as a CSV file. It is a point-in-time extraction of a feed, and it is what you upload using the agent.
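For example, using the placeholder values from the configuration examples further down, a provider `xyz` exposing an internal application `internal_app1` as a source with an `orders` feed yields an upload path built from those three pieces:

```
provider/source/feed  ->  xyz/internal_app1/orders/
```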
`datateer upload orders_feed ./my_exported_data/orders.csv` will upload the file at `./my_exported_data/orders.csv` using the feed key `orders_feed`.
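If you export several feeds at once, you simply run the command once per file. A hypothetical export-and-upload pass using the feed keys from the examples below might look like this:

```bash
# Upload one point-in-time extract per feed (keys and paths are placeholders)
datateer upload orders_feed ./my_exported_data/orders.csv
datateer upload customers ./my_exported_data/customers.csv
datateer upload leads ./my_exported_data/leads.csv
```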
`datateer config upload-agent` will ask you a series of questions to configure your agent:

- Datateer client code
- Raw bucket name
- Access key
- Access secret

If you need to reconfigure the agent, just rerun `datateer config upload-agent`.
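For example, an interactive configuration session might look roughly like the sketch below; the client code, bucket name, and credentials are placeholders, not real values:

```
$ datateer config upload-agent
Datateer client code: xyz
Raw bucket name: xyz-pipeline-raw-202012331213123432341213
Access key: ABC***
Access secret: 123***
```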
`datateer config feed` will ask a series of questions to configure a new feed:

Provider: xyz
Data Source: internal_app1
Feed: orders
Feed key [orders]: orders_feed

`datateer config feed --update orders_feed` will rerun the configuration questions for the feed with the key `orders_feed`.
`datateer config upload-agent --show` will show your existing configuration:
client-code: xyz
raw-bucket: xyz-pipeline-raw-202012331213123432341213
access-key: ABC***
access-secret: 123***
feeds: 3
1) Feed "customers" will upload to xyz/internal_app1/customers/
2) Feed "orders_feed" will upload to xyz/internal_app1/orders/
3) Feed "leads" will upload to salesforce/salesforce/leads
In general, a feed will upload to `provider/source/feed`.
- The data lake supports CSV, TSV, and JSONL files
- The first row of the data file must contain the header names (see the sketch after this list)
- Adding new data fields and removing existing data fields are both supported
- Strive to keep your header names consistent over time. The data lake can handle changes, but they will likely confuse anyone using the feeds
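As a sketch, a minimal CSV extract and its upload might look like this; the column names, path, and feed key are hypothetical:

```bash
# Create a small example CSV extract; the first row holds the header names
mkdir -p ./my_exported_data
cat > ./my_exported_data/orders.csv <<'EOF'
order_id,customer_id,order_date,total
1001,42,2021-03-01,19.99
1002,57,2021-03-02,5.00
EOF

# Upload the extract; later extracts should keep the same header names
datateer upload orders_feed ./my_exported_data/orders.csv
```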
Configuration can be handled entirely through the `datateer config` commands. This section explains how configuration works and where it is stored.

Here is where the Datateer upload agent will look for configuration information, in order of preference:

- In a relative directory named `.datateer`, in a file named `config.yml` (see the layout sketch below)
- In the future, we may add global configuration in the user's home directory or in environment variables
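In other words, relative to the directory you run the agent from, the expected layout is:

```
.datateer/
  config.yml
```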
An example configuration file will look like this:
client-code: xyz
upload-agent:
  raw-bucket: xyz-pipeline-raw-202012331213123432341213
  access-key: ABC***
  access-secret: 123***
  feeds:
    customers:
      provider: xyz
      source: internal_app1
      feed: customers
    orders_feed:
      provider: xyz
      source: internal_app1
      feed: orders
    leads:
      provider: salesforce
      source: salesforce
      feed: leads
To develop in this repo (a combined example follows the list):

- Install Poetry and activate a shell with `poetry shell`
- Run `poetry install`
- To test, run `pytest` or `ptw`
- To run locally, install with `pip install -e .`
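Put together, a development setup might look like the sketch below (it assumes Poetry is already installed; `ptw` refers to pytest-watch):

```bash
# Activate the project's virtual environment and install dependencies
poetry shell
poetry install

# Run the test suite once, or keep it running on changes with pytest-watch
pytest
ptw

# Install the package locally in editable mode to exercise the CLI
pip install -e .
```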