diff --git a/pages/_meta.json b/pages/_meta.json
index 6d8961b..b5a72cb 100644
--- a/pages/_meta.json
+++ b/pages/_meta.json
@@ -7,7 +7,6 @@
     "title": "SpiceDB Documentation",
     "type": "page"
   },
-
   "authzed": {
     "title": "AuthZed Product Documentation",
     "type": "page"
diff --git a/pages/spicedb/ops/_meta.json b/pages/spicedb/ops/_meta.json
index 52996fb..537c61a 100644
--- a/pages/spicedb/ops/_meta.json
+++ b/pages/spicedb/ops/_meta.json
@@ -1,5 +1,6 @@
 {
   "observability": "Observability Tooling",
   "deploying-spicedb-operator": "Deploying the SpiceDB Operator",
-  "deploying-spicedb-on-eks": "Deploying SpiceDB on AWS EKS"
+  "deploying-spicedb-on-eks": "Deploying SpiceDB on AWS EKS",
+  "bulk-operations": "Bulk Importing Relationships"
 }
diff --git a/pages/spicedb/ops/bulk-operations.mdx b/pages/spicedb/ops/bulk-operations.mdx
new file mode 100644
index 0000000..e8eccf8
--- /dev/null
+++ b/pages/spicedb/ops/bulk-operations.mdx
@@ -0,0 +1,125 @@
+import { Tabs } from 'nextra/components'
+
+# Bulk Importing Relationships
+
+## Overview
+
+When setting up a SpiceDB cluster for the first time, there's often a data ingest process required to
+set up the initial set of relationships.
+This can be done with [`WriteRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.WriteRelationships) running in a loop, but each call can create at most 1,000 relationships by default, and each call commits its own transaction, which creates a new revision and incurs some overhead.
+
+For faster ingest, we provide an [`ImportBulkRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.ImportBulkRelationships) call, which takes advantage of client-side gRPC streaming to accelerate the process and removes the cap on the number of relationships that can be written at once.
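+
+For comparison, a naive ingest loop over `WriteRelationships` might look like the following.
+This is a minimal sketch using the Python client; the endpoint, token, and `all_relationships_to_write`
+are placeholders for your own setup and data, and `itertools.batched` requires Python 3.12+.
+
+```python
+from itertools import batched
+
+from authzed.api.v1 import Client, RelationshipUpdate, WriteRelationshipsRequest
+from grpcutil import bearer_token_credentials
+
+# Placeholder endpoint and token; substitute your own.
+client = Client("grpc.authzed.com:443", bearer_token_credentials("t_your_token"))
+
+# WriteRelationships accepts at most 1,000 updates per call by default, and each
+# call commits its own transaction (and therefore creates a new revision).
+for chunk in batched(all_relationships_to_write, 1_000):
+    client.WriteRelationships(
+        WriteRelationshipsRequest(
+            updates=[
+                RelationshipUpdate(
+                    operation=RelationshipUpdate.Operation.OPERATION_CREATE,
+                    relationship=rel,
+                )
+                for rel in chunk
+            ]
+        )
+    )
+```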
+
+## Batching
+
+There are two batch sizes to consider: the number of relationships in each chunk written to the stream, and the overall number of relationships in the lifetime of the request.
+Breaking the request into chunks is a network optimization that makes it faster to push relationships from the client to the cluster.
+
+The overall number of relationships should reflect how many rows your datastore can comfortably write in a single transaction.
+Note that you probably **don't** want to push all of your relationships through in a single request, as a single very large transaction could time out in your datastore.
+
+## Example
+
+We'll use the [authzed-dotnet](https://github.com/authzed/authzed-dotnet) and [authzed-py](https://github.com/authzed/authzed-py) clients for this example.
+Other client libraries will have different syntax and structures around their streaming and iteration,
+but these examples demonstrate the two levels of chunking performed in the process.
+
+<Tabs items={['C#', 'Python']}>
+  <Tabs.Tab>
+    ```csharp
+    var TOTAL_RELATIONSHIPS_TO_WRITE = 1000;
+    var RELATIONSHIPS_PER_TRANSACTION = 100;
+    var RELATIONSHIPS_PER_REQUEST_CHUNK = 10;
+
+    // NOTE: allRelationshipsToWrite holds the relationships to ingest, and
+    // permissionsService is a PermissionsService client from authzed-dotnet;
+    // see the full example linked below for setup.
+
+    // Start by breaking the full list into a sequence of chunks where each chunk fits easily
+    // into a datastore transaction.
+    var transactionChunks = allRelationshipsToWrite.Chunk(RELATIONSHIPS_PER_TRANSACTION);
+
+    foreach (var relationshipsForRequest in transactionChunks) {
+        // For each of those transaction chunks, break it down further into chunks that
+        // optimize for network throughput.
+        var requestChunks = relationshipsForRequest.Chunk(RELATIONSHIPS_PER_REQUEST_CHUNK);
+        // Open a client stream to the server for this transaction chunk.
+        using var importCall = permissionsService.ImportBulkRelationships();
+        foreach (var requestChunk in requestChunks) {
+            // For each network chunk, write to the client stream.
+            // NOTE: this writes the chunks sequentially rather than concurrently; this could be
+            // optimized further by using tasks.
+            await importCall.RequestStream.WriteAsync(new ImportBulkRelationshipsRequest{
+                Relationships = { requestChunk }
+            });
+        }
+        // When we're done with the transaction chunk, complete the call and process the response.
+        await importCall.RequestStream.CompleteAsync();
+        var importResponse = await importCall;
+        Console.WriteLine("request successful");
+        Console.WriteLine(importResponse.NumLoaded);
+        // Repeat!
+    }
+    ```
+  </Tabs.Tab>
+  <Tabs.Tab>
+    ```python
+    from itertools import batched  # requires Python 3.12+
+
+    from authzed.api.v1 import Client, ImportBulkRelationshipsRequest
+    from grpcutil import bearer_token_credentials
+
+    client = Client("grpc.authzed.com:443", bearer_token_credentials("t_your_token"))
+
+    TOTAL_RELATIONSHIPS_TO_WRITE = 1_000
+
+    RELATIONSHIPS_PER_TRANSACTION = 100
+    RELATIONSHIPS_PER_REQUEST_CHUNK = 10
+
+    # NOTE: batched takes a larger iterable and makes an iterator of smaller chunks out of it.
+    # We iterate over chunks of size RELATIONSHIPS_PER_TRANSACTION, and then we break each request into
+    # chunks of size RELATIONSHIPS_PER_REQUEST_CHUNK.
+    transaction_chunks = batched(
+        all_relationships_to_write, RELATIONSHIPS_PER_TRANSACTION
+    )
+    for relationships_for_request in transaction_chunks:
+        request_chunks = batched(relationships_for_request, RELATIONSHIPS_PER_REQUEST_CHUNK)
+        # ImportBulkRelationships takes an iterable of request chunks and streams
+        # them to the server over a single client stream.
+        response = client.ImportBulkRelationships(
+            (
+                ImportBulkRelationshipsRequest(relationships=relationships_chunk)
+                for relationships_chunk in request_chunks
+            )
+        )
+        print("request successful")
+        print(response.num_loaded)
+    ```
+  </Tabs.Tab>
+</Tabs>
+
+The full code for the C# example is [available here](https://github.com/authzed/authzed-dotnet/blob/main/examples/bulk-import/BulkImport/Program.cs).
+
+## Retrying and Resuming
+
+`ImportBulkRelationships`'s semantics only allow the creation of relationships:
+if an imported relationship already exists in the database, the call fails.
+This can be frustrating when populating an instance, because a process that fails with a retryable error,
+such as one caused by transient network conditions, leaves behind relationships that a naive retry will collide with.
+The [authzed-go](https://github.com/authzed/authzed-go) client offers a [`RetryableClient`](https://github.com/authzed/authzed-go/blob/main/v1/retryable_client.go)
+with retry logic built into its `ImportBulkRelationships` implementation.
+
+This client is used internally by [zed](https://github.com/authzed/zed) and is exposed by the `authzed-go` library.
+When a batch fails, it either skips over the offending batch (the `Skip` strategy) or falls back to
+`WriteRelationships` with touch semantics (the `Touch` strategy).
+Similar logic can be implemented with the other client libraries, as sketched below.
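+
+For illustration, here's a minimal sketch of the `Touch` strategy using the Python client.
+The helper name `import_chunk_with_touch_fallback` is ours, and the sketch assumes the server reports
+duplicate relationships with the `ALREADY_EXISTS` gRPC status code:
+
+```python
+from itertools import batched
+
+import grpc
+from authzed.api.v1 import (
+    ImportBulkRelationshipsRequest,
+    RelationshipUpdate,
+    WriteRelationshipsRequest,
+)
+
+def import_chunk_with_touch_fallback(client, relationships_for_request, request_chunk_size):
+    try:
+        client.ImportBulkRelationships(
+            ImportBulkRelationshipsRequest(relationships=chunk)
+            for chunk in batched(relationships_for_request, request_chunk_size)
+        )
+    except grpc.RpcError as err:
+        if err.code() != grpc.StatusCode.ALREADY_EXISTS:
+            raise
+        # Some of these relationships were already written (for example, by an
+        # earlier attempt that failed partway through). TOUCH is an upsert, so
+        # re-writing the whole transaction chunk is safe and idempotent.
+        client.WriteRelationships(
+            WriteRelationshipsRequest(
+                updates=[
+                    RelationshipUpdate(
+                        operation=RelationshipUpdate.Operation.OPERATION_TOUCH,
+                        relationship=rel,
+                    )
+                    for rel in relationships_for_request
+                ]
+            )
+        )
+```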
+
+## Why does it work this way?
+
+SpiceDB's `ImportBulkRelationships` service uses [gRPC client streaming] as a network optimization.
+It **does not** commit relationships to your datastore as it receives them; instead, it opens a database transaction
+at the start of the call and commits that transaction when the client ends the stream.
+
+SpiceDB takes this approach because there isn't a good way to handle server-side errors in a
+commit-as-you-go model: if each chunk were committed as it was sent over the network, the semantics
+of server-side errors would be ambiguous.
+For example, you might receive an error that closes the stream, but that doesn't necessarily mean
+that the last chunk you sent is where the error happened.
+The error source could be sent as error context, but error handling and resumption would still be difficult and cumbersome.
+
+A [gRPC bidirectional streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#bidirectional-streaming-rpc) approach could
+help address this by ACKing each chunk individually, but that requires a good amount of bookkeeping on the client to ensure
+that every chunk it writes has been acknowledged by the server.
+By contrast, splitting the work across multiple client-streaming requests means that you can use your
+language's normal error-handling flows and know exactly what has been written to the server.
+
+[gRPC client streaming]: https://grpc.io/docs/what-is-grpc/core-concepts/#client-streaming-rpc
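+
+To make that last point concrete, here's a minimal sketch that builds on the hypothetical
+`import_chunk_with_touch_fallback` helper and constants from the sketches above: because each
+transaction chunk is its own client-streaming request, a failure tells you exactly which chunk to resume from.
+
+```python
+chunks = list(batched(all_relationships_to_write, RELATIONSHIPS_PER_TRANSACTION))
+
+committed = 0  # number of transaction chunks durably written so far
+while committed < len(chunks):
+    try:
+        # Each iteration is an independent client-streaming request, so a failure
+        # can only affect the chunk currently being sent.
+        import_chunk_with_touch_fallback(client, chunks[committed], RELATIONSHIPS_PER_REQUEST_CHUNK)
+        committed += 1
+    except grpc.RpcError as err:
+        # Chunks 0..committed-1 are already committed; only chunks[committed] needs
+        # to be retried. A real implementation would also back off and bound retries.
+        if err.code() not in (grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.DEADLINE_EXCEEDED):
+            raise
+```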