import { Appear } from "@mdx-deck/components";
What is stream processing?
processing of infinite lists, one element at a time
1. Real-time, low latency
2. Feasible to process data much larger than available memory (see the sketch below)
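A Unix pipeline is a minimal sketch of the idea (assuming a continuously growing access.log): it consumes an unbounded input one line at a time, in constant memory.

> # filter an infinite stream, one line at a time
> tail -f access.log | grep "ERROR"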
It's a Streaming Platform
1. Kafka proper 👉 messaging and persistence
2. Kafka Streams 👉 stateful stream processing
3. Kafka Connect 👉 glue to get data into and out of Kafka
4. Schema Registry 👉 where all your data contracts live
5. Mirror Maker 👉 replicator of topics across clusters
6. REST Proxy 👉 REST API to produce to and consume from Kafka
Fancy way of sequentially writing to and reading from a file (log) over the network.
> echo " foo" >> my_topic
> echo " bar" >> my_topic
The number of messages consumed per consumer
A way of parallelising reads and writes
Create a topic with multiple Partitions
> mkdir my_other_topic
> touch my_other_topic/1
> touch my_other_topic/2
> echo " foo" >> my_other_topic/1
> echo " bar" >> my_other_topic/2
> echo " baz" >> my_other_topic/1
> tail -f my_other_topic/*
> tail -f my_other_topic/1
> tail -f my_other_topic/2
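With the actual CLI, a sketch assuming a broker on localhost:9092 (note that Kafka numbers partitions from 0, not 1):

> kafka-topics.sh --create --topic my_other_topic --partitions 2 \
    --replication-factor 1 --bootstrap-server localhost:9092
> # all partitions, like tail -f my_other_topic/*
> kafka-console-consumer.sh --topic my_other_topic --from-beginning \
    --bootstrap-server localhost:9092
> # a single partition, like tail -f my_other_topic/1
> kafka-console-consumer.sh --topic my_other_topic --partition 0 \
    --offset earliest --bootstrap-server localhost:9092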
In reality, apart from value, each message also has a key.
Producer creates the key however they wish.
Messages with the same key end up on the same partition.
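A sketch with the console tools (parse.key, key.separator, and print.key are standard console producer/consumer properties; the broker address and key are illustrative):

> kafka-console-producer.sh --topic my_other_topic --bootstrap-server localhost:9092 \
    --property parse.key=true --property key.separator=:
user_42:foo
user_42:bar
> # print keys alongside values; both messages come from the same partition
> kafka-console-consumer.sh --topic my_other_topic --from-beginning \
    --bootstrap-server localhost:9092 --property print.key=true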
Notice that we never defined a schema
Each message (and key) is an arbitrary sequence of bytes
Producers and Consumers are in charge of serialising messages into and deserialising them out of bytes (SerDe).
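You can see the "just bytes" nature on disk with kafka-dump-log.sh; a sketch assuming the default data directory /var/lib/kafka/data and the first log segment of my_topic's partition 0:

> # dump the raw records of a log segment file
> kafka-dump-log.sh --print-data-log \
    --files /var/lib/kafka/data/my_topic-0/00000000000000000000.log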
Keeping at least one extra copy of each partition
Primary copy 👉 leader - handles all the writes and reads
Secondary copies 👉 followers - are kept in sync on standby
Number of leader + followers is governed by the replication factor.
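A sketch with the CLI (assuming a 3-broker cluster behind localhost:9092; the output shape is illustrative): --describe shows the leader, the replica set, and the in-sync replicas per partition.

> kafka-topics.sh --create --topic replicated_topic --partitions 2 \
    --replication-factor 3 --bootstrap-server localhost:9092
> kafka-topics.sh --describe --topic replicated_topic --bootstrap-server localhost:9092
# Topic: replicated_topic  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
# Topic: replicated_topic  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1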
ReplicaFetcherThreads are responsible for keeping followers in sync. Conceptually:
> while inotifywait -r -e modify,create,delete /my_other_topic; do \
    rsync -avz /my_other_topic/1 /my_other_topic/1_2; \
    rsync -avz /my_other_topic/2 /my_other_topic/2_2; \
  done
Putting followers on different machines (brokers) than leaders.
If a machine goes down, one of the followers becomes the leader 👉 LeaderElection.
How do we know who's the leader?
Decided by a special broker called the Controller.
Information is stored in ZooKeeper under /brokers/topics/[topic].
How do we know who's the controller then?
The first broker to register itself in ZooKeeper under /controller.
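You can peek at both with the zookeeper-shell.sh tool that ships with Kafka; a sketch assuming ZooKeeper on localhost:2181 (output shapes are illustrative):

> zookeeper-shell.sh localhost:2181 get /controller
# {"version":1,"brokerid":1,"timestamp":"..."}
> zookeeper-shell.sh localhost:2181 get /brokers/topics/my_other_topic
# {"version":2,"partitions":{"0":[1,2],"1":[2,1]}}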
When a broker fails, new leaders have to be elected.
They will be chosen from in-sync replicas.
New followers have to be created to satisfy the replication factor.
When the Controller goes down, the remaining brokers race to register themselves under /controller first.
So we can resiliently process data larger than memory...
What about larger than disk?
If we keep appending to files, we're going to run out of space quickly.
You can't delete arbitrary data
retention.ms allows for deleting messages older than x.
cleanup.policy=compact allows for never losing data by keeping the latest message per key (a database, effectively).
Handled by LogCleanerThreads.
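Both are per-topic configs; a sketch with kafka-configs.sh (the broker address and the 7-day retention value are illustrative):

> # delete messages older than 7 days
> kafka-configs.sh --alter --entity-type topics --entity-name my_topic \
    --add-config retention.ms=604800000 --bootstrap-server localhost:9092
> # or compact instead: keep the latest message per key
> kafka-configs.sh --alter --entity-type topics --entity-name my_other_topic \
    --add-config cleanup.policy=compact --bootstrap-server localhost:9092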
Slides: https://github.com/kujon/kafka-talk