Merge pull request #282 from authzed/bulk-import-documentation
Bulk import documentation
Showing 3 changed files with 127 additions and 2 deletions.
```diff
@@ -1,5 +1,6 @@
 {
   "observability": "Observability Tooling",
   "deploying-spicedb-operator": "Deploying the SpiceDB Operator",
-  "deploying-spicedb-on-eks": "Deploying SpiceDB on AWS EKS"
+  "deploying-spicedb-on-eks": "Deploying SpiceDB on AWS EKS",
+  "bulk-operations": "Bulk Importing Relationships"
 }
```
@@ -0,0 +1,125 @@
import { Tabs } from 'nextra/components'

# Bulk Importing Relationships

## Overview

When setting up a SpiceDB cluster for the first time, there's often a data-ingest step required to load the initial set of relationships.
This can be done by calling [`WriteRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.WriteRelationships) in a loop, but that approach can only create 1,000 relationships (by default) per call, and each transaction creates a new revision, which incurs a bit of overhead.

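To make that overhead concrete, here is a minimal sketch of such a `WriteRelationships` ingest loop in Python. It assumes an already-initialized `authzed.api.v1` `Client` named `client` and an iterable of `Relationship` messages named `all_relationships_to_write`; both names are illustrative, and the 1,000-relationship batch size mirrors the default per-call limit.

```python
from itertools import batched  # Python 3.12+

from authzed.api.v1 import RelationshipUpdate, WriteRelationshipsRequest

# Illustrative names: `client` is assumed to be an initialized authzed.api.v1 Client,
# and `all_relationships_to_write` an iterable of Relationship messages.
WRITE_BATCH_SIZE = 1_000  # mirrors the default per-call relationship limit

for chunk in batched(all_relationships_to_write, WRITE_BATCH_SIZE):
    # Each call is its own transaction and produces a new revision.
    client.WriteRelationships(
        WriteRelationshipsRequest(
            updates=[
                RelationshipUpdate(
                    operation=RelationshipUpdate.Operation.OPERATION_CREATE,
                    relationship=rel,
                )
                for rel in chunk
            ]
        )
    )
```
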
For faster ingest, we provide an [`ImportBulkRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.ImportBulkRelationships) call, which takes advantage of client-side gRPC streaming to speed up the process and removes the cap on the number of relationships that can be written at once.

## Batching

There are two batch sizes to consider: the number of relationships in each chunk written to the stream, and the total number of relationships sent over the lifetime of the request.
Breaking the request into chunks is a network optimization that makes it faster to push relationships from the client to the cluster.

The total number of relationships in a request should reflect how many rows your datastore can comfortably write in a single transaction.
Note that you probably **don't** want to push all of your relationships through in a single request, as this could time out in your datastore.
With the values used in the example below (100 relationships per transaction, 10 per request chunk), each `ImportBulkRelationships` request commits 100 relationships, streamed as 10 chunks.

## Example

We'll use the [authzed-dotnet](https://github.com/authzed/authzed-dotnet) client for this example.
Other client libraries have different syntax and structures around streaming and iteration,
but this should demonstrate the two levels of chunking used in the process.

<Tabs items={["Dotnet", "Python"]}>
<Tabs.Tab>
```csharp
// NOTE: assumes `permissionsService` is an initialized PermissionsService client and
// `allRelationshipsToWrite` holds the full collection of relationships to import.
var TOTAL_RELATIONSHIPS_TO_WRITE = 1000;
var RELATIONSHIPS_PER_TRANSACTION = 100;
var RELATIONSHIPS_PER_REQUEST_CHUNK = 10;

// Start by breaking the full list into a sequence of chunks where each chunk fits easily
// into a datastore transaction.
var transactionChunks = allRelationshipsToWrite.Chunk(RELATIONSHIPS_PER_TRANSACTION);

foreach (var relationshipsForRequest in transactionChunks) {
    // For each of those transaction chunks, break it down further into chunks that
    // optimize for network throughput.
    var requestChunks = relationshipsForRequest.Chunk(RELATIONSHIPS_PER_REQUEST_CHUNK);
    // Open a client stream to the server for this transaction chunk.
    using var importCall = permissionsService.ImportBulkRelationships();
    foreach (var requestChunk in requestChunks) {
        // For each network chunk, write to the client stream.
        // NOTE: this makes the calls sequentially rather than concurrently; this could be
        // optimized further by using tasks.
        await importCall.RequestStream.WriteAsync(new ImportBulkRelationshipsRequest{
            Relationships = { requestChunk }
        });
    }
    // When we're done with the transaction chunk, complete the call and process the response.
    await importCall.RequestStream.CompleteAsync();
    var importResponse = await importCall;
    Console.WriteLine("request successful");
    Console.WriteLine(importResponse.NumLoaded);
    // Repeat!
}
```
</Tabs.Tab>
<Tabs.Tab>
```python
from itertools import batched  # NOTE: itertools.batched requires Python 3.12+

# NOTE: assumes `client` is an initialized authzed.api.v1 Client and
# `all_relationships_to_write` holds the full collection of relationships to import.
TOTAL_RELATIONSHIPS_TO_WRITE = 1_000

RELATIONSHIPS_PER_TRANSACTION = 100
RELATIONSHIPS_PER_REQUEST_CHUNK = 10

# NOTE: batched takes a larger iterable and makes an iterator of smaller chunks out of it.
# We iterate over chunks of size RELATIONSHIPS_PER_TRANSACTION, and then we break each request into
# chunks of size RELATIONSHIPS_PER_REQUEST_CHUNK.
transaction_chunks = batched(
    all_relationships_to_write, RELATIONSHIPS_PER_TRANSACTION
)
for relationships_for_request in transaction_chunks:
    request_chunks = batched(relationships_for_request, RELATIONSHIPS_PER_REQUEST_CHUNK)
    response = client.ImportBulkRelationships(
        (
            ImportBulkRelationshipsRequest(relationships=relationships_chunk)
            for relationships_chunk in request_chunks
        )
    )
    print("request successful")
    print(response.num_loaded)
```
</Tabs.Tab>
</Tabs>

The code for this example is [available here](https://github.com/authzed/authzed-dotnet/blob/main/examples/bulk-import/BulkImport/Program.cs).

## Retrying and Resuming

`ImportBulkRelationships`'s semantics only allow the creation of relationships:
if an imported relationship already exists in the database, the call fails with an error.
This can be frustrating when populating an instance, particularly if the process fails with a retryable error such as a transient
network issue.
The [authzed-go](https://github.com/authzed/authzed-go) client offers a [`RetryableClient`](https://github.com/authzed/authzed-go/blob/main/v1/retryable_client.go)
with retry logic built into its `ImportBulkRelationships` implementation.

This client is used internally by [zed](https://github.com/authzed/zed) and is exposed by the `authzed-go` library. It works by
either skipping the offending batch (the `Skip` strategy) or falling back to `WriteRelationships` with touch
semantics (the `Touch` strategy).
Similar logic can be implemented with the other client libraries, as sketched below.
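
Here is a hedged Python sketch of the `Touch`-style fallback idea (not the authzed-go `RetryableClient` itself). The helper name, the `client` object, and the assumption that a duplicate import surfaces as an `ALREADY_EXISTS` gRPC status are illustrative rather than taken from any client library:

```python
from itertools import batched  # Python 3.12+

import grpc
from authzed.api.v1 import (
    ImportBulkRelationshipsRequest,
    RelationshipUpdate,
    WriteRelationshipsRequest,
)

def import_chunk_with_touch_fallback(client, relationship_chunk, request_chunk_size=10):
    """Import one transaction-sized chunk; if it fails because some relationships
    already exist, re-apply the whole chunk idempotently with TOUCH semantics."""
    try:
        client.ImportBulkRelationships(
            ImportBulkRelationshipsRequest(relationships=batch)
            for batch in batched(relationship_chunk, request_chunk_size)
        )
    except grpc.RpcError as err:
        # Assumption: duplicate relationships surface as ALREADY_EXISTS; check the
        # error codes your SpiceDB version returns before relying on this.
        if err.code() != grpc.StatusCode.ALREADY_EXISTS:
            raise
        client.WriteRelationships(
            WriteRelationshipsRequest(
                updates=[
                    RelationshipUpdate(
                        operation=RelationshipUpdate.Operation.OPERATION_TOUCH,
                        relationship=rel,
                    )
                    for rel in relationship_chunk
                ]
            )
        )
```

Falling back with `OPERATION_TOUCH` re-applies the whole chunk idempotently, at the cost of the extra revision that a `WriteRelationships` call creates.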

## Why does it work this way?

SpiceDB's `ImportBulkRelationships` RPC uses [gRPC client streaming] as a network optimization.
It **does not** commit relationships to your datastore as it receives them, but rather opens a database transaction
at the start of the call and commits that transaction when the client closes the stream.

We take this approach because committing each chunk as it arrives over the network would make the semantics
of server-side errors ambiguous, and there isn't a good way to handle them in a commit-as-you-go model.
For example, you might receive an error that closes the stream, but that doesn't necessarily mean
that the last chunk you sent is the one that caused the error.
The error source could be returned as error context, but error handling and resumption would still be difficult and cumbersome.

A [gRPC bidirectional streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#bidirectional-streaming-rpc) approach could
address this by ACKing each chunk individually, but it would also require a good amount of bookkeeping on the client to ensure
that every chunk written by the client has been acknowledged by the server.
Using multiple client-streaming requests instead means that you can use your language's normal error-handling flow
and know exactly what has been written to the server.

[gRPC client streaming]: https://grpc.io/docs/what-is-grpc/core-concepts/#client-streaming-rpc