Skip to content

ddb-sync is a tool used for syncing data from a set of source tables to a set of destination tables in DynamoDB. It's configurable to perform backfill operations, stream consumptions or both by CLI or config file.

License

Notifications You must be signed in to change notification settings

instructure/ddb-sync

Repository files navigation

DynamoDB Sync - ddb-sync GoDoc

Table of Contents

Introduction

ddb-sync is a tool used for syncing data from a set of source tables to a set of destination tables in DynamoDB. It's configurable to perform the backfill operation, the stream consumption or both by CLI or config file options.

Quickstart

go get github.com/instructure/ddb-sync and ddb-sync --help for usage

Functionality

ddb-sync has two phases of work: streaming and backfilling. Backfill is a scan of the source table and duplication of each record to the destination table. Stream consumes a pre-configured DynamoDB Stream and writes all record changes to the destination table.

If streaming and backfill are configured on a table, stream runs sequentially after backfill.

Note: On actively written tables, the streaming phase is required to get new records synced.

ddb-sync does not verify source and destination tables consistency and an outside methodology should be used to verify the table has been delivered as expected.

Problems we are solving

  • Table Migrations
    • Across regions
    • Across accounts
  • Table synchronizations

Similar tools

Usage

Installation

ddb-sync is a golang binary.

Ensure you have a proper Go environment setup and then:

go get github.com/instructure/ddb-sync

Permission Prerequisites

The IAM permissions required by the roles the application uses to read and put data to the destination tables is listed as follows for each phase.

Table Backfill Permissions Stream Permissions
Source Table dynamodb:DescribeTable
dynamodb:Scan
dynamodb:DescribeStream
dynamodb:DescribeTable
dynamodb:GetRecords (on <src_table_arn>/streams/*)
dynamodb:GetShardIterator
Destination Table dynamodb:BatchWriteItem dynamodb:DeleteItem
dynamodb:PutItem

Example backfill policy statement

In order to backfill a table, a policy resembling the following must be present.

{
  "Action": [
    "dynamodb:DescribeTable",
    "dynamodb:Scan"
  ],
  "Resource": [
    "<input_table_arn>"
  ],
  "Effect": "Allow"
},
{
  "Action": [
    "dynamodb:BatchWriteItem"
  ],
  "Resource": [
    "<output_table_arn>"
  ],
  "Effect": "Allow"
}

Example stream policy statement

In order to stream a table, a policy resembling the following must be present.

{
  "Action": [
    "dynamodb:DescribeStream",
    "dynamodb:DescribeTable",
    "dynamodb:GetShardIterator"
  ],
  "Resource": [
    "<input_table_arn>"
  ],
  "Effect": "Allow"
},
{
  "Action": [
    "dynamodb:GetRecords"
  ],
  "Resource": [
    "<input_table_arn>/stream/*"
  ],
  "Effect": "Allow"
},
{
  "Action": [
    "dynamodb:DeleteItem",
    "dynamodb:PutItem"
  ],
  "Resource": [
    "<output_table_arn>"
  ],
  "Effect": "Allow"
}

Configuration

ddb-sync is probably most useful when used with a config file. You can use a config file to run many concurrent operations together. Alternatively, you can provide a single set operation as flags to the CLI. Either a config file or a combination of input and output options are required. In addition if using the flags the input and output table must differ by either name, region, or have a role arn provided.

The ddb-sync configuration file format is YAML. It describes a plan object which has an array of objects of input and output tables, their table name, region and the role ARN to access it with the appropriate permissions listed above.

An example is below.

---
plan:
  - input:
      table: ddb-sync-source
      region: us-west-2
      role_arn: arn:aws:iam::<account_num>:role/ddb-sync-READ_ONLY_SOURCE
    output:
      table: ddb-sync-dest
      region: us-east-2
      role_arn: arn:aws:iam::<account_num>:role/ddb-sync_WRITE_ONLY_DEST
    stream:
      disabled: true
    backfill:
      disabled: false
      segments: 0,1,2  # This is 0-indexed, 0 through 3 are valid in this case
      total_segments: 4
  - input:
      table: ddb-sync-source-2
      region: us-west-2
      role_arn: arn:aws:iam::<account_num>:role/ddb-sync-READ_ONLY_SOURCE
    output:
      table: ddb-sync-dest-2
      region: us-east-2
      role_arn: arn:aws:iam::<account_num>:role/ddb-sync_WRITE_ONLY_DEST

Tuning

Note: Be sure your source tables have provisioned capacity for reads and writes before using the tool. To optimize performance you should adjust provisioned read and write capacity. Be cautious about table sharding constraints.

During the output of the tool we note consumed write capacity units. This can be used to help provision and scale your table with use of the tool. See the output section for more details.

It is highly recommended that you run this utility on an EC2 instance with the proper permissions attached to the instance profile. This will provide predictable network performance, consistent compute resources, and keep your data off the coffee shop router.

If you have a partitioned source table, you may want to consider specifying backfill-total-segments to optimize your capacity to scan your source table. A typical value for this would be ~N where N is the number of partitions in your source table.

If you are limited by the I/O capacity of an EC2 instance, you can consider splitting up the backfill segments across multiple EC2 instances. This would be done by specifying total_segments count and giving an unique set of backfill-segments to each separate instance. This configuration requires that all backfill segments complete before invoking a stream operation and therefore stream must be disabled and invoked as a unique operation in this case.

Invocation

Invoke the compiled binary and provide options for a run.

ddb-sync <cli-options>

The CLI options are present below:

  --config-file string            Filename for configuration yaml

  --input-region string           The input region
  --input-role-arn string         ARN of the input role
  --input-table string            Name of the input table

  --output-region string          The output region
  --output-role-arn string        ARN of the output role
  --output-table string           Name of the output table

  --backfill-segments ints        [Optional] Specify backfill scan segment(s) to target in this operation, 0-indexed. Example: "0,1,2". Prohibits streaming and "backfill-total-segments" must be specified.
  --backfill-total-segments int   Specify backfill 'Scan' concurrency segments

  --backfill                      Perform the backfill operation (default true)
  --stream                        Perform the streaming operation (default true)

To disable streaming or backfill, pass false.

ddb-sync --stream=false

Stopping

Backfill only operations will exit (0) upon completion of all steps. However, when streaming steps are enabled, the command will not ever exit. When you've ascertained that your streaming operations have completed, you can SIGINT for a clean shutdown.

Output

ddb-sync logs to stdout and displays status messaging to stderr. Either can be redirected to a file for later viewing.

For instance, ddb-sync --config-file <config_file.yml> 1>ddb-sync-log.out would capture the log to a file called ddb-sync-log.out.

Logging

The log will display each operation on a table, relevant status to the operation. The format is below.

timestamp [src-table] => [dest-table]: Log statement

ddb-sync logs (with a timestamp prefix) on starting or completing actions as well as shard transitions in the stream. An example is below.

2018/06/19 11:05:27 [ddb-sync-source] ⇨ [ddb-sync-dest]: Backfill complete. 14301 items written over 5m9s

ddb-sync will add progress updates every 20 minutes to the log for historical convenience, e.g.:

2018/06/19 11:12:26
================= Progress Update ==================
[ddb-sync-source] ⇨ [ddb-sync-dest]: Streaming: 400 items written over 20m1s
[ddb-sync-source-2] ⇨ [ddb-sync-dest-2]: Backfill in progress: 822 items written over 1m1s
====================================================

Status Messaging

The bottom portion of the output given to the user is the status messaging. It describes each table, the operation being worked on, the write capacity unit usage rate for writing to the destination table and the amount of records written from each operation.

An example follows:

------------------------------------------------ Current Status ---------------------------------------------------
TABLE                    DETAILS              BACKFILL        STREAM                         RATES & BUFFER
⇨ ddb-sync-dest         47K items (29MiB)    -COMPLETE-      2019 written (~18h57m latent)  12 items/s ⇨ ◕ ⇨ 32 WCU/s
⇨ ddb-sync-dest-2       ~46K items (~26MiB)  -SKIPPED-       789 written (~46m35s latent)   9  items/s ⇨ ◕ ⇨ 17 WCU/s
⇨ ddb-sync-destination  ~46K items (~26MiB)  267164 written  -PENDING-                      49 items/s ⇨ ◕ ⇨ 50 WCU/s

Development and Testing tools

You can use the golang script in contrib/dynamo_writer.go to help configure some faked items. You'll need a table that conforms to table partition key schema.

For instance, in inseng, you can test with the source table of ddb-sync-source.

You can run the item writer to write items:

vgo run contrib/dynamo_writer.go <table_name>

Contributing

Pull requests welcome for both bug fixes and features.

Our build system includes linting and a test suite so be sure to include further tests as a part of your pull request.


About

ddb-sync is a tool used for syncing data from a set of source tables to a set of destination tables in DynamoDB. It's configurable to perform backfill operations, stream consumptions or both by CLI or config file.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •