
Merge pull request #282 from authzed/bulk-import-documentation
Bulk import documentation
tstirrat15 authored Nov 20, 2024
2 parents 5eb4e76 + 46889a7 commit e70a12f
Showing 3 changed files with 127 additions and 2 deletions.
1 change: 0 additions & 1 deletion pages/_meta.json
@@ -7,7 +7,6 @@
     "title": "SpiceDB Documentation",
     "type": "page"
   },
-
   "authzed": {
     "title": "AuthZed Product Documentation",
     "type": "page"
3 changes: 2 additions & 1 deletion pages/spicedb/ops/_meta.json
@@ -1,5 +1,6 @@
 {
   "observability": "Observability Tooling",
   "deploying-spicedb-operator": "Deploying the SpiceDB Operator",
-  "deploying-spicedb-on-eks": "Deploying SpiceDB on AWS EKS"
+  "deploying-spicedb-on-eks": "Deploying SpiceDB on AWS EKS",
+  "bulk-operations": "Bulk Importing Relationships"
 }
125 changes: 125 additions & 0 deletions pages/spicedb/ops/bulk-operations.mdx
@@ -0,0 +1,125 @@
import { Tabs } from 'nextra/components'

# Bulk Importing Relationships

## Overview

When setting up a SpiceDB cluster for the first time, there's often a data ingest process required to
populate the initial set of relationships.
This can be done by calling [`WriteRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.WriteRelationships) in a loop, but that approach can only create 1,000 relationships (by default) per call, and each call commits a transaction that creates a new revision, which adds overhead.
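
A minimal sketch of that loop in Python might look like the following, assuming an authzed `client` and an iterable `all_relationships_to_write` of `Relationship` messages already exist (those names are illustrative, not part of the API):

```python
# Hypothetical setup: `client` and `all_relationships_to_write` are assumed to be
# defined elsewhere; the batch size matches the default WriteRelationships limit.
from itertools import batched  # Python 3.12+

from authzed.api.v1 import RelationshipUpdate, WriteRelationshipsRequest

for chunk in batched(all_relationships_to_write, 1_000):
    # Each call commits its own transaction and therefore creates a new revision.
    client.WriteRelationships(
        WriteRelationshipsRequest(
            updates=[
                RelationshipUpdate(
                    operation=RelationshipUpdate.Operation.OPERATION_CREATE,
                    relationship=rel,
                )
                for rel in chunk
            ]
        )
    )
```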

For faster ingest, we provide an [`ImportBulkRelationships`](https://buf.build/authzed/api/docs/main:authzed.api.v1#authzed.api.v1.PermissionsService.ImportBulkRelationships) call, which takes advantage of client-side gRPC streaming to accelerate the process and removes the cap on the number of relationships that can be written at once.

## Batching

There are two batch sizes to consider: the number of relationships in each chunk written to the stream, and the total number of relationships sent over the lifetime of the request.
Breaking the request into chunks is a network optimization that makes it faster to push relationships from the client to the cluster.

The total number of relationships in a request should reflect how many rows your datastore can comfortably write in a single transaction.
Note that you probably **don't** want to push all of your relationships through in a single request, as a transaction that large could time out in your datastore.
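
As a rough illustration of how the two sizes relate (the numbers below are illustrative assumptions, not SpiceDB defaults or recommendations):

```python
# Illustrative sizing only; tune these values against your own datastore and network.
TOTAL_RELATIONSHIPS = 1_000_000
RELATIONSHIPS_PER_TRANSACTION = 10_000   # bounded by what the datastore commits comfortably
RELATIONSHIPS_PER_REQUEST_CHUNK = 1_000  # bounded by how large each streamed message should be

import_calls = TOTAL_RELATIONSHIPS // RELATIONSHIPS_PER_TRANSACTION                 # 100 streams
chunks_per_call = RELATIONSHIPS_PER_TRANSACTION // RELATIONSHIPS_PER_REQUEST_CHUNK  # 10 messages each
print(f"{import_calls} ImportBulkRelationships calls, {chunks_per_call} chunks per call")
```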

## Example

We'll use the [authzed-dotnet](https://github.com/authzed/authzed-dotnet) client for this example.
Other client libraries have different syntax and structure around streaming and iteration,
but this example demonstrates the two levels of chunking described above.

<Tabs items={["Dotnet", "Python"]}>
<Tabs.Tab>
```csharp
var TOTAL_RELATIONSHIPS_TO_WRITE = 1000;
var RELATIONSHIPS_PER_TRANSACTION = 100;
var RELATIONSHIPS_PER_REQUEST_CHUNK = 10;

// Start by breaking the full list into a sequence of chunks where each chunk fits easily
// into a datastore transaction.
var transactionChunks = allRelationshipsToWrite.Chunk(RELATIONSHIPS_PER_TRANSACTION);

foreach (var relationshipsForRequest in transactionChunks) {
    // For each of those transaction chunks, break it down further into chunks that
    // optimize for network throughput.
    var requestChunks = relationshipsForRequest.Chunk(RELATIONSHIPS_PER_REQUEST_CHUNK);
    // Open up a client stream to the server for this transaction chunk.
    using var importCall = permissionsService.ImportBulkRelationships();
    foreach (var requestChunk in requestChunks) {
        // For each network chunk, write to the client stream.
        // NOTE: this makes the calls sequentially rather than concurrently; this could be
        // optimized further by using tasks.
        await importCall.RequestStream.WriteAsync(new ImportBulkRelationshipsRequest {
            Relationships = { requestChunk }
        });
    }
    // When we're done with the transaction chunk, complete the call and process the response.
    await importCall.RequestStream.CompleteAsync();
    var importResponse = await importCall;
    Console.WriteLine("request successful");
    Console.WriteLine(importResponse.NumLoaded);
    // Repeat!
}
```
</Tabs.Tab>
<Tabs.Tab>
```python
from itertools import batched

TOTAL_RELATIONSHIPS_TO_WRITE = 1_000

RELATIONSHIPS_PER_TRANSACTION = 100
RELATIONSHIPS_PER_REQUEST_CHUNK = 10

# NOTE: batched takes a larger iterator and makes an iterator of smaller chunks out of it.
# We iterate over chunks of size RELATIONSHIPS_PER_TRANSACTION, and then we break each request into
# chunks of size RELATIONSHIPS_PER_REQUEST_CHUNK.
transaction_chunks = batched(
    all_relationships_to_write, RELATIONSHIPS_PER_TRANSACTION
)
for relationships_for_request in transaction_chunks:
    request_chunks = batched(relationships_for_request, RELATIONSHIPS_PER_REQUEST_CHUNK)
    response = client.ImportBulkRelationships(
        (
            ImportBulkRelationshipsRequest(relationships=relationships_chunk)
            for relationships_chunk in request_chunks
        )
    )
    print("request successful")
    print(response.num_loaded)
```
</Tabs.Tab>
</Tabs>

The code for this example is [available here](https://github.com/authzed/authzed-dotnet/blob/main/examples/bulk-import/BulkImport/Program.cs).

## Retrying and Resuming

`ImportBulkRelationships`'s semantics only allow the creation of relationships:
if an imported relationship already exists in the database, the call fails with an error.
This can be frustrating when populating an instance, because a process that fails with a retryable error (such as one caused by transient
network conditions) may leave some relationships already written, and retrying the import will then fail on those duplicates.
The [authzed-go](https://github.com/authzed/authzed-go) client offers a [`RetryableClient`](https://github.com/authzed/authzed-go/blob/main/v1/retryable_client.go)
with retry logic built into its `ImportBulkRelationships` implementation.
It is used internally by [zed](https://github.com/authzed/zed) and works by
either skipping over the offending batch (the `Skip` strategy) or falling back to `WriteRelationships` with touch
semantics (the `Touch` strategy).
Similar logic can be implemented using the other client libraries.
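
For example, a rough Python sketch of the `Touch`-style fallback might look like the following (the `client` and `transaction_chunks` names, and the assumption that a duplicate surfaces as an `ALREADY_EXISTS` status, are illustrative rather than guaranteed by the API):

```python
# Hypothetical sketch: `client` is an authzed Python client and `transaction_chunks`
# is an iterable of relationship chunks sized to fit within WriteRelationships' limits.
from grpc import RpcError, StatusCode

from authzed.api.v1 import (
    ImportBulkRelationshipsRequest,
    RelationshipUpdate,
    WriteRelationshipsRequest,
)

for chunk in transaction_chunks:
    chunk = list(chunk)
    try:
        # One client stream (and therefore one transaction) per chunk.
        client.ImportBulkRelationships(
            iter([ImportBulkRelationshipsRequest(relationships=chunk)])
        )
    except RpcError as err:
        # Assumption: a duplicate relationship shows up as ALREADY_EXISTS.
        if err.code() != StatusCode.ALREADY_EXISTS:
            raise
        # Fall back to touch semantics, which upsert instead of failing on duplicates.
        client.WriteRelationships(
            WriteRelationshipsRequest(
                updates=[
                    RelationshipUpdate(
                        operation=RelationshipUpdate.Operation.OPERATION_TOUCH,
                        relationship=rel,
                    )
                    for rel in chunk
                ]
            )
        )
```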

## Why does it work this way?

SpiceDB's `ImportBulkRelationships` call uses [gRPC client streaming] as a network optimization.
It **does not** commit those relationships to your datastore as it receives them, but rather opens a database transaction
at the start of the call and then commits that transaction when the client ends the stream.

We take this approach because there isn't a good way to handle server-side errors in a commit-as-you-go model:
if each chunk were committed as it arrived over the network, the semantics of a server-side error would be ambiguous.
For example, you might receive an error that closes the stream, but that doesn't necessarily mean
that the last chunk you sent is the one that caused the error.
The failing chunk could be identified in error context, but error handling and resumption would still be difficult and cumbersome.

A [gRPC bidirectional streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#bidirectional-streaming-rpc) approach could
help address this by ACKing each chunk individually, but that also requires a good amount of bookkeeping on the client to ensure
that every chunk that's written by the client has been acknowledged by the server.
By contrast, issuing multiple client-streaming requests (one per transaction chunk) means that you can use your language's normal error-handling flows
and know exactly what has been written to the server.

[gRPC client streaming]: https://grpc.io/docs/what-is-grpc/core-concepts/#client-streaming-rpc
