import { Appear } from "@mdx-deck/components";

Deep dive into Kafka


What is stream processing?

processing of infinite lists, one element at a time


Cool consequences

1. Real-time, low latency

2. Feasible to process data much much larger than available memory


What is Kafka?

It's a Streaming Platform

1. Kafka proper 👉 messaging and persistence
2. Kafka Streams 👉 stateful stream processing
3. Kafka Connect 👉 glue to get data into and out of Kafka

4. Schema Registry 👉 where all your data contracts live
5. Mirror Maker 👉 replicator of topics across clusters
6. REST Proxy 👉 REST API to consume from Kafka
7. ... and more ...

Kafka proper

Fancy way of sequentially writing to and reading from a file (log) over the network.


Demo

Creating a file

> touch my_topic

Writing (producing)

> echo "foo" >> my_topic
> echo "bar" >> my_topic

Reading (consuming)

> tail -f my_topic
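
The same flow against a real broker, sketched with the stock CLI tools (assuming a recent Kafka distribution and a broker on localhost:9092):

> kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_topic
> kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my_topic
> kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_topic --from-beginning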

Consumer offsets

How far into the log each consumer has read

> tail -n 1 -f my_topic
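
In real Kafka, offsets are tracked per consumer group; a sketch of inspecting them (the group name my-group is made up, broker assumed on localhost:9092):

> kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group

This prints the current offset, the log-end offset and the lag for each partition the group is reading.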

Offsets


Partitions

A way of parallelising reads and writes


Demo

Create a topic with multiple Partitions

> mkdir my_other_topic
> touch my_other_topic/1
> touch my_other_topic/2
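
Roughly the real-Kafka equivalent (a sketch, assuming a local broker on localhost:9092):

> kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_other_topic --partitions 2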

Producer

> echo "foo" >> my_other_topic/1
> echo "bar" >> my_other_topic/2
> echo "baz" >> my_other_topic/1

Demo part 2

Consumer

Individual consumers

> tail -f my_other_topic/*

Consumer groups

> tail -f my_other_topic/1
> tail -f my_other_topic/2
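
In real Kafka, two consumers started with the same --group split the partitions between themselves (the group name is made up):

> kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_other_topic --group my-group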

Keys

In reality, apart from a value, each message also has a key.

The producer creates the key however it wishes.

Messages with the same key end up on the same partition.
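
A sketch of producing keyed messages with the console producer (the choice of key separator is arbitrary):

> kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my_other_topic \
    --property parse.key=true --property key.separator=:

Typing user42:foo and then user42:bar sends two records with key user42, so both land on the same partition.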


Notice that we never defined a schema

Each message (and key) is an arbitrary sequence of bytes

Producers and Consumers are in charge of serialising messages into and deserialising them out of bytes (SerDe).


Replication

Keeping at least one extra copy of each partition

Primary copy 👉 leader - handles all the writes and reads

Secondary copies 👉 followers - kept in sync on standby

The number of copies (leader + followers) is governed by the replication factor.

ReplicaFetcherThreads are responsible for keeping followers in sync.
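
To see leaders, followers and in-sync replicas (ISR) on a real cluster, a sketch (my_replicated_topic is a made-up name; needs at least two brokers):

> kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_replicated_topic --partitions 2 --replication-factor 2
> kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my_replicated_topic

The --describe output lists Leader, Replicas and Isr per partition.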


Demo

> while inotifywait -r -e modify,create,delete my_other_topic; do \
    rsync -avz my_other_topic/1 my_other_topic/1_2; \
    rsync -avz my_other_topic/2 my_other_topic/2_2; \
  done

Distribution

Putting followers on different machines (brokers) than leaders.

If a machine goes down, one of the followers becomes the leader 👉 LeaderElection.


Distribution


How do we know who's the leader?

Decided by a special broker called the Controller.

Information stored in Zookeeper /brokers/topics/[topic].


How do we know who's the controller then?

The first broker to register itself in Zookeeper under /controller.
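
Both can be inspected with the Zookeeper shell bundled with Kafka (assuming Zookeeper on localhost:2181; on older Kafka versions you may need to type the get commands at the interactive prompt instead):

> zookeeper-shell.sh localhost:2181 get /controller
> zookeeper-shell.sh localhost:2181 get /brokers/topics/my_topic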


Failures

When a broker fails, new leaders have to be elected.

They will be chosen from in-sync replicas.

New followers have to be created to satisfy the replication factor.

When the Controller goes down, remaining brokers race to register themselves under /controller first.


So we can resiliently process data larger than memory...

What about larger than disk?

If we keep appending to files, we're going to run out of space quickly.


Deleting data

You can't delete arbitrary data

retention.ms allows for deleting messages older than x.

cleanup.policy=compact allows for never losing data by keeping at least the latest message per key (turning the topic into a database table of sorts).

Handled by LogCleanerThreads.
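
Both are per-topic configs; a sketch of setting them (the values are just examples, assuming a recent Kafka where kafka-configs.sh accepts --bootstrap-server):

> kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my_topic --add-config retention.ms=604800000
> kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my_topic --add-config cleanup.policy=compact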


Thank you!

Slides: https://github.com/kujon/kafka-talk