Characterization
We need to characterize Ziti performance so that we can compare it against the plain internet, against other technologies, and against itself, so we can tell whether we are improving, maintaining or degrading performance over time.
Characterization scenarios will be done across three axes.
- The model
  - This includes the numbers and interactions of services, identities and policies
- The deployment
  - This includes the number and type of instances and the regions in which they are deployed. It also includes whether we are using tunnelers or native Ziti applications
- The traffic
  - This includes the number of concurrent sessions, the amount of data sent and the number of iterations.
Model | Services | Identities | Edge Routers | Service Policies | Edge Router Policies | Service Edge Router Policies |
---|---|---|---|---|---|---|
Baseline | 1 | 1 | 1 | 1 | 1 | 1 |
Small | 20 | 100 | 10 | 10 | 10 | 10 |
Medium | 100 | 5,000 | 100 | 50 | 50 | 10 |
Large | 200 | 100,000 | 500 | 250 | 250 | 100 |
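To make the model sizes usable by a test harness, the table above could be encoded as data. The following is only a sketch: the `ModelSize` class, its field names and the `MODELS` dictionary are hypothetical names chosen for illustration, and only the numbers come from the table.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelSize:
    """Entity counts for one model size (hypothetical structure)."""
    services: int
    identities: int
    edge_routers: int
    service_policies: int
    edge_router_policies: int
    service_edge_router_policies: int

    def total_entities(self) -> int:
        # Sum of every count, a rough indicator of how much data must be created.
        return sum(getattr(self, f.name) for f in fields(self))


# Values copied from the model table above.
MODELS = {
    "baseline": ModelSize(1, 1, 1, 1, 1, 1),
    "small": ModelSize(20, 100, 10, 10, 10, 10),
    "medium": ModelSize(100, 5_000, 100, 50, 50, 10),
    "large": ModelSize(200, 100_000, 500, 250, 250, 100),
}

if __name__ == "__main__":
    for name, size in MODELS.items():
        print(f"{name}: {size.total_entities()} entities")
```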
For models with multiple edge routers, do we need to set the runtime up so that only one is active, both for consistency in test results and to keep testing costs down?
For each policy from A <-> B, ensure we have at least:
- an A with a policy which has all Bs
- a B with a policy which has all As
- an A with all policies
- a B with all policies
- Ensure that the A and B we test with are worst case: they have access to the maximum number of entities on both sides and are lexically sorted last, to expose slowdowns in scans
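One way to pick the worst-case test entity described above is sketched below. It assumes the access relationships are already available as a mapping from entity name to the set of names it can reach on the other side; the function name and the sample identity and service names are hypothetical. It covers one side of the relationship, so it would be run once for A and once for B.

```python
from typing import AbstractSet, Mapping


def worst_case_entity(access: Mapping[str, AbstractSet[str]]) -> str:
    """Return the entity with the most accessible peers.

    Ties are broken by choosing the lexically last name, so that a
    linear scan over sorted names has to travel as far as possible.
    """
    if not access:
        raise ValueError("access map is empty")
    # Tuples compare element by element: peer count first, then name.
    return max(access, key=lambda name: (len(access[name]), name))


if __name__ == "__main__":
    # Hypothetical identities mapped to the services they can reach.
    identity_access = {
        "identity-001": {"svc-a"},
        "identity-100": {"svc-a", "svc-b"},
        "identity-zzz": {"svc-a", "svc-b"},
    }
    print(worst_case_entity(identity_access))
```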
We can test the model in isolation, outside the context of full deployment/throughput/scale testing, to ensure that the queries we need to do for the SDK will scale well. Ideally permission checks would be O(1), so that the only non-constant cost would be service look-ups (a user with access to more services will naturally take longer to list them).
This testing can be done locally, just exercising the APIs used by the SDK. If we can eliminate poor performance here, that will let us focus on performance in the edge routers for the throughput and connection scale testing.
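A minimal sketch of how such local timings might be collected and summarized as min, max, mean and 95th percentile follows. The helper names are hypothetical, the operation being timed is passed in as a plain callable rather than a real SDK call, and the percentile uses linear interpolation between closest ranks, which is an assumption and not necessarily the method behind the results in this document.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence


def time_operation(op: Callable[[], None], iterations: int) -> List[float]:
    """Run op repeatedly and return each duration in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        op()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(ordered: Sequence[float], pct: float) -> float:
    """Percentile of an already sorted sequence, linearly interpolated."""
    if not ordered:
        raise ValueError("no samples")
    rank = (pct / 100.0) * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize timing samples the way the result tables are laid out."""
    ordered = sorted(samples_ms)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": mean(ordered),
        "95th": percentile(ordered, 95),
    }


if __name__ == "__main__":
    # Stand-in for an API call such as creating a session.
    stats = summarize(time_operation(lambda: time.sleep(0.001), 20))
    for key, value in stats.items():
        print(f"{key:5}: {value:.2f}ms")
```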
Results
baseline | small | medium | large
=====================|=======================|======================|=====================
Create API Session: | Create API Session: | Create API Session: | Create API Session:
Min : 6ms | Min : 8ms | Min : 8ms | Min : 15ms
Max : 46ms | Max : 53ms | Max : 66ms | Max : 58ms
Mean : 23.3ms | Mean : 20.45ms | Mean : 24.4ms | Mean : 28.85ms
95th : 45.9ms | 95th : 52.39ms | 95th : 65.6ms | 95th : 57.24ms
Refresh API Session: | Refresh API Session: | Refresh API Session: | Refresh API Session:
Min : 0ms | Min : 0ms | Min : 0ms | Min : 0ms
Max : 0ms | Max : 0ms | Max : 0ms | Max : 0ms
Mean : 0ms | Mean : 0ms | Mean : 0ms | Mean : 0ms
95th : 0ms | 95th : 0ms | 95th : 0ms | 95th : 0ms
Get Services: | Get Services: | Get Services: | Get Services:
Min : 14ms | Min : 156ms | Min : 785ms | Min : 3521ms
Max : 17ms | Max : 187ms | Max : 848ms | Max : 3705ms
Mean : 16ms | Mean : 169.6ms | Mean : 805.4ms | Mean : 3620.5ms
95th : 17ms | 95th : 187ms | 95th : 848ms | 95th : 3705ms
Create Session: | Create Session: | Create Session: | Create Session:
Min : 6ms | Min : 8ms | Min : 18ms | Min : 2033ms
Max : 36ms | Max : 49ms | Max : 38ms | Max : 4951ms
Mean : 15.75ms | Mean : 20.35ms | Mean : 24.05ms | Mean : 3386.95ms
95th : 35.9ms | 95th : 48.95ms | 95th : 37.9ms | 95th : 4944.65ms
Refresh Session: | Refresh Session: | Refresh Session: | Refresh Session:
Min : 0ms | Min : 0ms | Min : 0ms | Min : 0ms
Max : 0ms | Max : 0ms | Max : 0ms | Max : 0ms
Mean : 0ms | Mean : 0ms | Mean : 0ms | Mean : 0ms
95th : 0ms | 95th : 0ms | 95th : 0ms | 95th : 0ms
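After denormalizing policy data and adding some query optimizations, results are much improved. For the large model, mean Get Services time drops from 3620.5ms to 73.9ms and mean Create Session time drops from 3386.95ms to 29.4375ms.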
baseline | small | medium | large
=====================|=======================|======================|=====================
Create API Session: | Create API Session: | Create API Session: | Create API Session:
Min : 5ms | Min : 6ms | Min : 7ms | Min : 16ms
Max : 29ms | Max : 66ms | Max : 73ms | Max : 80ms
Mean : 17.16ms | Mean : 18.69ms | Mean : 20.52ms | Mean : 29ms
95th : 25ms | 95th : 33ms | 95th : 31.54ms | 95th : 49.85ms
Refresh API Session: | Refresh API Session: | Refresh API Session: | Refresh API Session:
Min : 0ms | Min : 0ms | Min : 0ms | Min : 0ms
Max : 0ms | Max : 0ms | Max : 0ms | Max : 0ms
Mean : 0ms | Mean : 0ms | Mean : 0ms | Mean : 0ms
95th : 0ms | 95th : 0ms | 95th : 0ms | 95th : 0ms
Get Services: | Get Services: | Get Services: | Get Services:
Min : 5ms | Min : 12ms | Min : 10ms | Min : 48ms
Max : 25ms | Max : 37ms | Max : 63ms | Max : 132ms
Mean : 9.28ms | Mean : 23.02ms | Mean : 29.95ms | Mean : 73.9ms
95th : 19ms | 95th : 32.94ms | 95th : 44ms | 95th : 108.84ms
Create Session: | Create Session: | Create Session: | Create Session:
Min : 6ms | Min : 7ms | Min : 8ms | Min : 14ms
Max : 23ms | Max : 35ms | Max : 41ms | Max : 60ms
Mean : 12.42ms | Mean : 12.86ms | Mean : 14.36ms | Mean : 29.4375ms
95th : 22ms | 95th : 25ms | 95th : 28.19ms | 95th : 52.55ms
Refresh Session: | Refresh Session: | Refresh Session: | Refresh Session:
Min : 0ms | Min : 0ms | Min : 0ms | Min : 0ms
Max : 0ms | Max : 0ms | Max : 0ms | Max : 0ms
Mean : 0ms | Mean : 0ms | Mean : 0ms | Mean : 0ms
95th : 0ms | 95th : 0ms | 95th : 0ms | 95th : 0ms
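To put the improvement in relative terms, the before and after means for the two operations that changed most can be turned into speedup ratios. This is a small arithmetic sketch: the dictionary and function names are hypothetical, and the values are the reported means from the two result tables.

```python
# Mean times in milliseconds, copied from the result tables in this document.
BEFORE_MEAN_MS = {
    "Get Services": {"baseline": 16.0, "small": 169.6, "medium": 805.4, "large": 3620.5},
    "Create Session": {"baseline": 15.75, "small": 20.35, "medium": 24.05, "large": 3386.95},
}
AFTER_MEAN_MS = {
    "Get Services": {"baseline": 9.28, "small": 23.02, "medium": 29.95, "large": 73.9},
    "Create Session": {"baseline": 12.42, "small": 12.86, "medium": 14.36, "large": 29.4375},
}


def speedup(operation: str, model: str) -> float:
    """Ratio of the mean time before optimization to the mean time after."""
    return BEFORE_MEAN_MS[operation][model] / AFTER_MEAN_MS[operation][model]


if __name__ == "__main__":
    for operation in BEFORE_MEAN_MS:
        for model in ("baseline", "small", "medium", "large"):
            print(f"{operation:15} {model:9} {speedup(operation, model):7.1f}x")
```

For the large model this works out to roughly 49x for Get Services and roughly 115x for Create Session.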
We should test with a variety of instance types, from t2 on up. Until we start testing, it will be hard to say what is needed. High-bandwidth applications often need bigger instance types for their network capacity, even if the extra CPU and memory aren't required.
The controller should require smaller instances than the router, at least in terms of network use.
We shouldn't need to test every deployment variation, such as tunneler vs. SDK-enabled application, in all scenarios. We can pick one or two scenarios in order to find out if there are noticeable differences.
There are some different traffic types we should test:
- iPerf, for sustained throughput testing. This can be done with various degrees of parallelism.
- Something like a web service or HTTP server, for lots of concurrent, short-lived connections, to get a feel for connection setup/teardown overhead.
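The short-lived connection case could be driven by something like the sketch below, which opens and closes many connections from a thread pool and records how long each one takes. All names here are hypothetical, and the connection step is passed in as a callable so the same driver could wrap a plain TCP connect, an HTTP request, or a connection made through Ziti.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def tcp_connect_once(host: str, port: int, timeout: float = 5.0) -> None:
    """Open one TCP connection and close it immediately."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def timed_ms(connect: Callable[[], None]) -> float:
    """Time a single connect/teardown cycle in milliseconds."""
    start = time.perf_counter()
    connect()
    return (time.perf_counter() - start) * 1000.0


def run_burst(connect: Callable[[], None], connections: int, concurrency: int) -> List[float]:
    """Run `connections` connect/teardown cycles, `concurrency` at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_ms, connect) for _ in range(connections)]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in connect step; swap in tcp_connect_once against a real target.
    timings = run_burst(lambda: time.sleep(0.001), connections=100, concurrency=10)
    print(f"{len(timings)} connections, slowest {max(timings):.2f}ms")
```

Against a real endpoint the call might look like `run_burst(lambda: tcp_connect_once("service.example", 8080), connections=1000, concurrency=50)`, where the host and port are placeholders.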