This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

feat: block index migration script (#1)
Provides 2 commands:

- `cli.js`: scans the source dynamo table. For each record it transforms it
  to the new format and checks whether it already exists in the destination
  table. If not, it writes the record to disk as ndjson.
- `write-cli.js`: takes an ndjson file of indexes as input and writes them
  in batches to the destination table. Writes any unprocessed operations to
  disk as ndjson for retrying.

_these would be a single command, but I am in the middle of running the
read side of things and didn't want to break it, so the write side lives
in separate files. If we care we can merge them later_

Dynamo table scans let us split up the results into segments. You tell
it how many segments to divide the result set into, and a current
segment, and the scan command will fetch all the results for that
segment only. This is nice, as it makes it easier to divide the work up
across workers.
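The segment-at-a-time scan described above can be sketched as follows. This is a hypothetical illustration, not code from this repo: `Segment`, `TotalSegments`, and `ExclusiveStartKey`/`LastEvaluatedKey` are real DynamoDB `Scan` parameters, but the function names and the injected `send` (standing in for a DynamoDB client's `send` method) are assumptions for the sketch.

```javascript
// Build the Scan input for one worker's share of a parallel table scan.
function buildScanInput (table, segment, totalSegments, startKey) {
  const input = { TableName: table, Segment: segment, TotalSegments: totalSegments }
  if (startKey) input.ExclusiveStartKey = startKey
  return input
}

// Yield every item in the given segment, following LastEvaluatedKey
// page by page until the segment is exhausted.
async function * scanSegment (send, table, segment, totalSegments) {
  let startKey
  do {
    const res = await send(buildScanInput(table, segment, totalSegments, startKey))
    for (const item of res.Items ?? []) yield item
    startKey = res.LastEvaluatedKey
  } while (startKey)
}
```

Because each worker only ever asks for its own segment, no coordination between workers is needed beyond agreeing on `totalSegments` up front.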

License: MIT

---------

Signed-off-by: Oli Evans <[email protected]>
Co-authored-by: Alan Shaw <[email protected]>
olizilla and Alan Shaw authored Aug 31, 2023
1 parent 8b18083 commit 33bc598
Showing 14 changed files with 12,165 additions and 23 deletions.
15 changes: 15 additions & 0 deletions .github/actions/test/action.yml
@@ -0,0 +1,15 @@
name: Test
description: 'Setup and test'

runs:
  using: "composite"
  steps:
    - uses: actions/setup-node@v3
      with:
        registry-url: 'https://registry.npmjs.org'
        node-version: 18
        cache: 'npm'
    - run: npm ci
      shell: bash
    - run: npm test
      shell: bash
13 changes: 13 additions & 0 deletions .github/workflows/test.yaml
@@ -0,0 +1,13 @@
name: Test
on:
  pull_request:
    branches:
      - main

jobs:
  test:
    name: Test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: ./.github/actions/test
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
node_modules
migrate-block-index-*
write-unprocessed-*
52 changes: 29 additions & 23 deletions README.md
@@ -11,26 +11,46 @@ to:
{ blockmultihash: string, carPath: string, length: number, offset: number }
```

Upgrades the `carPath` key to `${carCid}/${carCid}.car` style where the source key is prefixed with `/raw` and the source bucket is `dotstorage-prod-*`.

## Getting started

You will need:

- `node` v18
- aws credentials exported in env vars

There are two phases: find and transform the indexes, then write them to the destination.

run `cli.js` to find and transform the set of indexes that exist in the source table but not in the destination.

This will write `migrate-block-index-<totalSegments>-<currentSegment>.ndjson` files to disk with just the subset of indexes that need to be written to the destination table.

```sh
$ ./cli.js src-table-name dest-table-name 859 0
# writes missing indexes to migrate-block-index-859-000.ndjson
```

run `write-cli.js` to write indexes from a file to the destination table.

```sh
$ ./write-cli.js migrate-block-index dest-table-name 859 0
# migrate-block-index-859-000.ndjson -> dest-table-name
```
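The batched-write-with-retry behaviour that `write-cli.js` is described as having can be sketched like this. It is a hypothetical sketch, not code from this repo: DynamoDB's `BatchWriteItem` genuinely accepts at most 25 items per request and reports failures via `UnprocessedItems`, but the function names and the injected `send` (standing in for a DynamoDB client's `send` method) are assumptions.

```javascript
// Split a list of items into batches of at most 25, the BatchWriteItem limit.
function chunk (items, size = 25) {
  const batches = []
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size))
  return batches
}

// Write items in batches; collect any UnprocessedItems from each response
// so the caller can persist them (e.g. as ndjson) for retrying.
async function batchWrite (send, table, items) {
  const unprocessed = []
  for (const batch of chunk(items)) {
    const res = await send({
      RequestItems: { [table]: batch.map(Item => ({ PutRequest: { Item } })) }
    })
    unprocessed.push(...(res.UnprocessedItems?.[table] ?? []))
  }
  return unprocessed
}
```

Returning the unprocessed operations rather than retrying inline keeps the retry decision (backoff, writing a retry file) with the caller.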

### Concurrency

Each command takes `totalSegments` and `segment` args, which are used to split the initial table scan into `totalSegments` partitions. Use `parallel` to manage concurrency, e.g.

```sh
# runs 859 jobs from 0 to 858, default parallelization = cpu threads
seq 0 858 | parallel ./cli.js blocks prod-ep-v1-blocks-cars-position 859 {1}
```

## Migration details

This will be used to migrate indexes from the `blocks` table to the `blocks-cars-position` table.

1 dynamo write costs more than 5x a read, so try to reduce writes!

### Source table scan cost

`blocks` table stats
@@ -58,24 +78,6 @@ Assuming we have to write 1 new record for every source record
- 286 * $1.25 per million = **$357.5 total write cost**


### Reducing writes

- If we have or can derive a car cid key:
  - If it exists in the dest table: **skip!**
  - Else:
    - If the old key exists: **skip!**
    - Else: **write car cid key**
- Else:
  - If the old key exists: **skip!**
  - Else: **write old style key** (as it's all we have)
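The decision tree above can be expressed as a pure function. The inputs would come from deriving the CAR CID and probing the destination table; the names here are illustrative, not code from this repo.

```javascript
// Decide what to do for one source record, given whether we have a CAR CID,
// whether the new-style key already exists in the destination table, and
// whether the old-style key already exists there.
function writeAction ({ haveCarCid, newKeyExists, oldKeyExists }) {
  if (haveCarCid) {
    // Either key already present in the destination means no write is needed.
    if (newKeyExists || oldKeyExists) return 'skip'
    return 'write car cid key'
  }
  // No CAR CID: the old-style key is all we have.
  return oldKeyExists ? 'skip' : 'write old style key'
}
```

Keeping this as a pure function makes the expensive part (the existence probes) separable from the cheap part (the decision), which matters when a write costs more than 5x a read.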

### References

> Write operation costs $1.25 per million requests.
@@ -86,3 +88,7 @@ Assuming we have to write 1 new record for every source record
>
> Write consumption: Measured in 1KB increments (also rounded up!) for each write operation. This is the size of the item you're writing / updating / deleting during a write operation... if you write a single 7.5KB item in DynamoDB, you will consume 8 WCUs (7.5 / 1 == 7.5, rounded up to 8).
https://www.alexdebrie.com/posts/dynamodb-costs/
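The quoted pricing rules are easy to mis-apply, so here they are as two tiny illustrative helpers (not code from this repo): write capacity is measured in 1KB increments rounded up, and write requests cost $1.25 per million.

```javascript
// WCUs consumed by one write: item size in KB, rounded up to the next 1KB.
// e.g. a 7.5KB item consumes 8 WCUs.
const wcusForWrite = itemSizeKB => Math.ceil(itemSizeKB)

// Dollar cost of a number of write requests at $1.25 per million.
const writeCostUSD = (writes, pricePerMillion = 1.25) => (writes / 1e6) * pricePerMillion
```

This reproduces the cost estimate above: 286 million writes at $1.25 per million is $357.50.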

> If a ConditionExpression evaluates to false during a conditional write, DynamoDB still consumes write capacity from the table. The amount consumed is dependent on the size of the item (whether it’s an existing item in the table or a new one you are attempting to create or update). For example, if an existing item is 300kb and the new item you are trying to create or update is 310kb, the write capacity units consumed will be the 310kb item.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.ConditionalUpdate

23 changes: 23 additions & 0 deletions cli.js
@@ -0,0 +1,23 @@
#!/usr/bin/env node

import ora from 'ora'
import { todo } from './migrate.js'
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'

const src = process.argv.at(2) ?? exit('plz pass source table name')
const dst = process.argv.at(3) ?? exit('plz pass destination table name')
const totalSegments = parseInt(process.argv.at(4) ?? exit('plz pass totalSegments'))
const segment = parseInt(process.argv.at(5) ?? exit('plz pass segment'))
const client = new DynamoDBClient()

const spinner = ora({
  text: `Migrating: ${src} -> ${dst}`
}).start()

// await migrate(src, dst, client, spinner)
await todo(src, dst, segment, totalSegments, client, spinner)

function exit (msg) {
  console.error(msg)
  process.exit(1)
}
21 changes: 21 additions & 0 deletions examples/batch-get-response.json
@@ -0,0 +1,21 @@
{
  "Responses": {
    "prod-ep-v1-blocks-cars-position": [
      {
        "blockmultihash": {
          "S": "z2Drjgb6bW26vwjB44f8gHTeFUKFUofvRVxHerEWp7twXxxxxxx"
        },
        "offset": {
          "N": "3317501"
        },
        "length": {
          "N": "2765"
        },
        "carpath": {
          "S": "region/bucket/raw/QmXLdJm1icGsm8PYtfUPHpXbZdWQRWZcmSQyLixxxxxxxx/3153187342xxxxxxxx/ciqhzffo5jf6ei5spizagdqyegetvxj35rxfkadhwgastaqxxxxxxxx.car"
        }
      }
    ]
  },
  "UnprocessedKeys": {}
}
22 changes: 22 additions & 0 deletions examples/batch-get.json
@@ -0,0 +1,22 @@
{
  "prod-ep-v1-blocks-cars-position": {
    "Keys": [
      {
        "blockmultihash": {
          "S": "z2Drjgb6bW26vwjB44f8gHTeFUKFUofvRVxHerEWp7txxxxxxxx"
        },
        "carpath": {
          "S": "region/bucket/raw/QmXLdJm1icGsm8PYtfUPHpXbZdWQRWZcmSQyLixxxxxxxx/3153187342xxxxxxxx/ciqhzffo5jf6ei5spizagdqyegetvxj35rxfkadhwgastaqxxxxxxxx.car"
        }
      },
      {
        "blockmultihash": {
          "S": "nope"
        },
        "carpath": {
          "S": "region/bucket/raw/QmXLdJm1icGsm8PYtfUPHpXbZdWQRWZcmSQyLixxxxxxxx/3153187342xxxxxxxx/ciqhzffo5jf6ei5spizagdqyegetvxj35rxfkadhwgastaqxxxxxxxx.car"
        }
      }
    ]
  }
}
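A request/response pair like the two example files above can be turned into the set of multihashes that already exist in the destination table. This is a hypothetical sketch, not code from this repo: `RequestItems`, `Keys`, and `Responses` are the real BatchGetItem shapes shown above, but the function name and the injected `send` (standing in for a DynamoDB client's `send` method) are assumptions.

```javascript
// Look up a batch of keys in the destination table and return the set of
// blockmultihash values that were found, so callers can skip those writes.
async function existingMultihashes (send, table, keys) {
  const res = await send({ RequestItems: { [table]: { Keys: keys } } })
  const found = res.Responses?.[table] ?? []
  return new Set(found.map(item => item.blockmultihash.S))
}
```

In the example files, the key with `blockmultihash` `"nope"` is requested but absent from the response, so it would not appear in the returned set.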
