
Merge pull request #282 from authzed/bulk-import-documentation
Bulk import documentation
tstirrat15 authored Nov 20, 2024
2 parents 5eb4e76 + 46889a7 commit e70a12f
Showing 3 changed files with 127 additions and 2 deletions.
1 change: 0 additions & 1 deletion pages/_meta.json
@@ -7,7 +7,6 @@
     "title": "SpiceDB Documentation",
     "type": "page"
   },
-
   "authzed": {
     "title": "AuthZed Product Documentation",
     "type": "page"
3 changes: 2 additions & 1 deletion pages/spicedb/ops/_meta.json
@@ -1,5 +1,6 @@
 {
   "observability": "Observability Tooling",
   "deploying-spicedb-operator": "Deploying the SpiceDB Operator",
-  "deploying-spicedb-on-eks": "Deploying SpiceDB on AWS EKS"
+  "deploying-spicedb-on-eks": "Deploying SpiceDB on AWS EKS",
+  "bulk-operations": "Bulk Importing Relationships"
 }
125 changes: 125 additions & 0 deletions pages/spicedb/ops/bulk-operations.mdx
@@ -0,0 +1,125 @@
import { Tabs } from 'nextra/components'

# Bulk Importing Relationships

## Overview

When setting up a SpiceDB cluster for the first time, there's often a data ingest process required to
populate the initial set of relationships.
This can be done by calling [`WriteRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.WriteRelationships) in a loop, but that approach can only create 1,000 relationships (by default) per call, and each call commits a transaction that creates a new revision, which adds overhead.
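
A minimal sketch of that loop in Python might look like the following, assuming an authzed `client` and an iterable `all_relationships_to_write` of `Relationship` messages already exist (those names are illustrative, not part of the API):

```python
# Hypothetical setup: `client` and `all_relationships_to_write` are assumed to be
# defined elsewhere; the batch size matches the default WriteRelationships limit.
from itertools import batched  # Python 3.12+

from authzed.api.v1 import RelationshipUpdate, WriteRelationshipsRequest

for chunk in batched(all_relationships_to_write, 1_000):
    # Each call commits its own transaction and therefore creates a new revision.
    client.WriteRelationships(
        WriteRelationshipsRequest(
            updates=[
                RelationshipUpdate(
                    operation=RelationshipUpdate.Operation.OPERATION_CREATE,
                    relationship=rel,
                )
                for rel in chunk
            ]
        )
    )
```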

For faster ingest, we provide an [`ImportBulkRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.ImportBulkRelationships) call, which takes advantage of client-side gRPC streaming to accelerate the process and removes the cap on the number of relationships that can be written at once.

## Batching

There are two batch sizes to consider: the number of relationships in each chunk written to the stream, and the total number of relationships sent over the lifetime of the request.
Breaking the request into chunks is a network optimization that makes it faster to push relationships from the client to the cluster.

The total number of relationships in a request should reflect how many rows your datastore can comfortably write in a single transaction.
Note that you probably **don't** want to push all of your relationships through in a single request, as a transaction that large could time out in your datastore.
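
As a rough illustration of how the two sizes relate (the numbers below are illustrative assumptions, not SpiceDB defaults or recommendations):

```python
# Illustrative sizing only; tune these values against your own datastore and network.
TOTAL_RELATIONSHIPS = 1_000_000
RELATIONSHIPS_PER_TRANSACTION = 10_000   # bounded by what the datastore commits comfortably
RELATIONSHIPS_PER_REQUEST_CHUNK = 1_000  # bounded by how large each streamed message should be

import_calls = TOTAL_RELATIONSHIPS // RELATIONSHIPS_PER_TRANSACTION                 # 100 streams
chunks_per_call = RELATIONSHIPS_PER_TRANSACTION // RELATIONSHIPS_PER_REQUEST_CHUNK  # 10 messages each
print(f"{import_calls} ImportBulkRelationships calls, {chunks_per_call} chunks per call")
```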

## Example

We'll use the [authzed-dotnet](https://github.com/authzed/authzed-dotnet) client for this example.
Other client libraries have different syntax and structure around streaming and iteration,
but this example demonstrates the two levels of chunking described above.

<Tabs items={["Dotnet", "Python"]}>
<Tabs.Tab>
```csharp
var TOTAL_RELATIONSHIPS_TO_WRITE = 1000;
var RELATIONSHIPS_PER_TRANSACTION = 100;
var RELATIONSHIPS_PER_REQUEST_CHUNK = 10;

// Start by breaking the full list into a sequence of chunks where each chunk fits easily
// into a datastore transaction.
var transactionChunks = allRelationshipsToWrite.Chunk(RELATIONSHIPS_PER_TRANSACTION);

foreach (var relationshipsForRequest in transactionChunks) {
    // For each of those transaction chunks, break it down further into chunks that
    // optimize for network throughput.
    var requestChunks = relationshipsForRequest.Chunk(RELATIONSHIPS_PER_REQUEST_CHUNK);
    // Open up a client stream to the server for this transaction chunk.
    using var importCall = permissionsService.ImportBulkRelationships();
    foreach (var requestChunk in requestChunks) {
        // For each network chunk, write to the client stream.
        // NOTE: this makes the calls sequentially rather than concurrently; this could be
        // optimized further by using tasks.
        await importCall.RequestStream.WriteAsync(new ImportBulkRelationshipsRequest {
            Relationships = { requestChunk }
        });
    }
    // When we're done with the transaction chunk, complete the call and process the response.
    await importCall.RequestStream.CompleteAsync();
    var importResponse = await importCall;
    Console.WriteLine("request successful");
    Console.WriteLine(importResponse.NumLoaded);
    // Repeat!
}
```
</Tabs.Tab>
<Tabs.Tab>
```python
from itertools import batched

TOTAL_RELATIONSHIPS_TO_WRITE = 1_000

RELATIONSHIPS_PER_TRANSACTION = 100
RELATIONSHIPS_PER_REQUEST_CHUNK = 10

# NOTE: batched takes a larger iterator and makes an iterator of smaller chunks out of it.
# We iterate over chunks of size RELATIONSHIPS_PER_TRANSACTION, and then we break each request into
# chunks of size RELATIONSHIPS_PER_REQUEST_CHUNK.
transaction_chunks = batched(
    all_relationships_to_write, RELATIONSHIPS_PER_TRANSACTION
)
for relationships_for_request in transaction_chunks:
    request_chunks = batched(relationships_for_request, RELATIONSHIPS_PER_REQUEST_CHUNK)
    response = client.ImportBulkRelationships(
        (
            ImportBulkRelationshipsRequest(relationships=relationships_chunk)
            for relationships_chunk in request_chunks
        )
    )
    print("request successful")
    print(response.num_loaded)
```
</Tabs.Tab>
</Tabs>

The code for this example is [available here](https://github.com/authzed/authzed-dotnet/blob/main/examples/bulk-import/BulkImport/Program.cs).

## Retrying and Resuming

`ImportBulkRelationships`'s semantics only allow the creation of relationships:
if an imported relationship already exists in the database, the call fails with an error.
This can be frustrating when populating an instance, because a process that fails with a retryable error (such as one caused by transient
network conditions) may leave some relationships already written, and retrying the import will then fail on those duplicates.
The [authzed-go](https://github.com/authzed/authzed-go) client offers a [`RetryableClient`](https://github.com/authzed/authzed-go/blob/main/v1/retryable_client.go)
with retry logic built into its `ImportBulkRelationships` implementation.
It is used internally by [zed](https://github.com/authzed/zed) and works by
either skipping over the offending batch (the `Skip` strategy) or falling back to `WriteRelationships` with touch
semantics (the `Touch` strategy).
Similar logic can be implemented using the other client libraries.
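
For example, a rough Python sketch of the `Touch`-style fallback might look like the following (the `client` and `transaction_chunks` names, and the assumption that a duplicate surfaces as an `ALREADY_EXISTS` status, are illustrative rather than guaranteed by the API):

```python
# Hypothetical sketch: `client` is an authzed Python client and `transaction_chunks`
# is an iterable of relationship chunks sized to fit within WriteRelationships' limits.
from grpc import RpcError, StatusCode

from authzed.api.v1 import (
    ImportBulkRelationshipsRequest,
    RelationshipUpdate,
    WriteRelationshipsRequest,
)

for chunk in transaction_chunks:
    chunk = list(chunk)
    try:
        # One client stream (and therefore one transaction) per chunk.
        client.ImportBulkRelationships(
            iter([ImportBulkRelationshipsRequest(relationships=chunk)])
        )
    except RpcError as err:
        # Assumption: a duplicate relationship shows up as ALREADY_EXISTS.
        if err.code() != StatusCode.ALREADY_EXISTS:
            raise
        # Fall back to touch semantics, which upsert instead of failing on duplicates.
        client.WriteRelationships(
            WriteRelationshipsRequest(
                updates=[
                    RelationshipUpdate(
                        operation=RelationshipUpdate.Operation.OPERATION_TOUCH,
                        relationship=rel,
                    )
                    for rel in chunk
                ]
            )
        )
```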

## Why does it work this way?

SpiceDB's `ImportBulkRelationships` call uses [gRPC client streaming] as a network optimization.
It **does not** commit those relationships to your datastore as it receives them, but rather opens a database transaction
at the start of the call and then commits that transaction when the client ends the stream.

We take this approach because there isn't a good way to handle server-side errors in a commit-as-you-go model:
if each chunk were committed as it arrived over the network, the semantics of a server-side error would be ambiguous.
For example, you might receive an error that closes the stream, but that doesn't necessarily mean
that the last chunk you sent is the one that caused the error.
The failing chunk could be identified in error context, but error handling and resumption would still be difficult and cumbersome.

A [gRPC bidirectional streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#bidirectional-streaming-rpc) approach could
help address this by ACKing each chunk individually, but that also requires a good amount of bookkeeping on the client to ensure
that every chunk that's written by the client has been acknowledged by the server.
By contrast, issuing multiple client-streaming requests (one per transaction chunk) means that you can use your language's normal error-handling flows
and know exactly what has been written to the server.

[gRPC client streaming]: https://grpc.io/docs/what-is-grpc/core-concepts/#client-streaming-rpc
