import { Appear } from "@mdx-deck/components";

Deep dive into Kafka


What is stream processing?

processing of infinite lists, one element at a time


Cool consequences

1. Real-time, low latency

2. Feasible to process data much much larger than available memory


What is Kafka?

It's a Streaming Platform

1. Kafka proper 👉 messaging and persistence
2. Kafka Streams 👉 stateful stream processing
3. Kafka Connect 👉 glue to get data into and out of Kafka

4. Schema Registry 👉 where all your data contracts live
5. Mirror Maker 👉 replicator of topics across clusters
6. REST Proxy 👉 REST API to consume from Kafka
7. ... and more ...

Kafka proper

Fancy way of sequentially writing to and reading from a file (log) over the network.


Demo

Creating a file

> touch my_topic

Writing (producing)

> echo "foo" >> my_topic
> echo "bar" >> my_topic

Reading (consuming)

> tail -f my_topic
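
The same flow against a real broker, sketched with the stock CLI tools (assuming a recent Kafka distribution and a broker on localhost:9092):

> kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_topic
> kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my_topic
> kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_topic --from-beginning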

Consumer offsets

How far into the log each consumer has read

> tail -n 1 -f my_topic
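
In real Kafka, offsets are tracked per consumer group; a sketch of inspecting them (the group name my-group is made up, broker assumed on localhost:9092):

> kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group

This prints the current offset, the log-end offset and the lag for each partition the group is reading.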

Offsets


Partitions

A way of parallelising reads and writes


Demo

Create a topic with multiple Partitions

> mkdir my_other_topic
> touch my_other_topic/1
> touch my_other_topic/2
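
Roughly the real-Kafka equivalent (a sketch, assuming a local broker on localhost:9092):

> kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_other_topic --partitions 2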

Producer

> echo "foo" >> my_other_topic/1
> echo "bar" >> my_other_topic/2
> echo "baz" >> my_other_topic/1

Demo part 2

Consumer

Individual consumers

> tail -f my_other_topic/*

Consumer groups

> tail -f my_other_topic/1
> tail -f my_other_topic/2
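
In real Kafka, two consumers started with the same --group split the partitions between themselves (the group name is made up):

> kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my_other_topic --group my-group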

Keys

In reality, apart from a value, each message also has a key.

The producer creates the key however it wishes.

Messages with the same key end up on the same partition.
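
A sketch of producing keyed messages with the console producer (the choice of key separator is arbitrary):

> kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my_other_topic \
    --property parse.key=true --property key.separator=:

Typing user42:foo and then user42:bar sends two records with key user42, so both land on the same partition.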


Notice that we never defined a schema

Each message (and key) is an arbitrary sequence of bytes

Producers and Consumers are in charge of serialising messages into and deserialising them out of bytes (SerDe).


Replication

Keeping at least one extra copy of each partition

Primary copy 👉 leader - handles all the writes and reads

Secondary copies 👉 followers - kept in sync on standby

The number of copies (leader + followers) is governed by the replication factor.

ReplicaFetcherThreads are responsible for keeping followers in sync.
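
To see leaders, followers and in-sync replicas (ISR) on a real cluster, a sketch (my_replicated_topic is a made-up name; needs at least two brokers):

> kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my_replicated_topic --partitions 2 --replication-factor 2
> kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic my_replicated_topic

The --describe output lists Leader, Replicas and Isr per partition.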


Demo

> while inotifywait -r -e modify,create,delete my_other_topic; do \
    rsync -avz my_other_topic/1 my_other_topic/1_2; \
    rsync -avz my_other_topic/2 my_other_topic/2_2; \
  done

Distribution

Putting followers on different machines (brokers) than leaders.

If a machine goes down, one of the followers becomes the leader 👉 LeaderElection.


Distribution


How do we know who's the leader?

Decided by a special broker called the Controller.

Information stored in Zookeeper /brokers/topics/[topic].


How do we know who's the controller then?

The first broker to register itself in Zookeeper under /controller.
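
Both can be inspected with the Zookeeper shell bundled with Kafka (assuming Zookeeper on localhost:2181; on older Kafka versions you may need to type the get commands at the interactive prompt instead):

> zookeeper-shell.sh localhost:2181 get /controller
> zookeeper-shell.sh localhost:2181 get /brokers/topics/my_topic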


Failures

When a broker fails, new leaders have to be elected.

They will be chosen from in-sync replicas.

New followers have to be created to satisfy the replication factor.

When the Controller goes down, remaining brokers race to register themselves under /controller first.


So we can resiliently process data larger than memory...

What about larger than disk?

If we keep appending to files, we're going to run out of space quickly.


Deleting data

You can't delete arbitrary data

retention.ms allows for deleting messages older than x.

cleanup.policy=compact allows for never losing data by keeping at least the latest message per key (turning the topic into a database table of sorts).

Handled by LogCleanerThreads.
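
Both are per-topic configs; a sketch of setting them (the values are just examples, assuming a recent Kafka where kafka-configs.sh accepts --bootstrap-server):

> kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my_topic --add-config retention.ms=604800000
> kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my_topic --add-config cleanup.policy=compact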


Thank you!

Slides: https://github.com/kujon/kafka-talk