E2E Data Engineering Project: High-Performance Financial Transaction Processing

🚧 Project Status: Under Active Development 🚧


A production-grade data engineering pipeline designed to handle massive-scale financial transaction processing with high throughput, fault tolerance, and real-time analytics capabilities.

🚀 System Capabilities

  • Processing Volume: 1.2 billion transactions per hour
  • Storage Capacity: Handles up to 1.852 PB of data (5-year projection with 20% YoY growth)
  • Compression Ratio: 5:1 using Snappy compression
  • High Availability: Triple replication for fault tolerance
  • Real-time Processing: Sub-second latency for transaction analytics
  • Scalable Architecture: Horizontally scalable Kafka-Spark infrastructure

📊 Data Specifications

Transaction Schema

case class Transaction (
    transactionId: String,      // 36 bytes
    amount: Double,             // 8 bytes
    userId: String,             // 12 bytes
    transactionTime: BigInt,    // 8 bytes
    merchantId: String,         // 12 bytes
    transactionType: String,    // 8 bytes
    location: String,           // 12 bytes
    paymentMethod: String,      // 15 bytes
    isInternational: Boolean,   // 5 bytes
    currency: String            // 5 bytes
)                               // Total: ~120 bytes per record

Storage Requirements

Time Period    Raw Data      With 3x Replication & 5:1 Compression
Per Hour       144 GB        86.4 GB
Per Day        3.456 TB      2.07 TB
Per Month      103.68 TB     62.1 TB
Per Year       1.244 PB      745.2 TB
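
These figures follow directly from the ~120-byte record size and the 1.2 billion transactions per hour quoted above. The short Python sketch below reproduces them; it assumes decimal units, 30-day months, and a 360-day year, so the last digits differ slightly from the rounded table entries.

# Back-of-envelope check of the sizing table above.
RECORD_BYTES = 120          # approximate serialized transaction size
TX_PER_HOUR = 1.2e9         # stated processing volume
REPLICATION = 3             # triple replication
COMPRESSION = 5             # 5:1 Snappy compression ratio

raw_hour_gb = RECORD_BYTES * TX_PER_HOUR / 1e9   # 144.0 GB per hour
for label, hours in [("hour", 1), ("day", 24), ("month", 24 * 30), ("year", 24 * 360)]:
    raw_gb = raw_hour_gb * hours
    stored_gb = raw_gb * REPLICATION / COMPRESSION
    print(f"Per {label}: raw {raw_gb:,.1f} GB, replicated+compressed {stored_gb:,.1f} GB")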

🛠 Technology Stack

  • Apache Kafka 3.8.1: Distributed streaming platform
    • 3 Controller nodes for cluster management
    • 3 Broker nodes for data distribution
    • Schema Registry for data governance
  • Apache Spark 3.5.0: Real-time data processing
    • 1 Master node
    • 3 Worker nodes
    • Structured Streaming for real-time analytics
  • Additional Components:
    • RedPanda Console for cluster monitoring
    • Docker & Docker Compose for containerization
    • Java 17 for transaction generation
    • Python for Spark processing

🚀 Getting Started

Prerequisites

  • Docker and Docker Compose
  • Java 17 or later
  • Maven
  • Python 3.8 or later

Quick Start

  1. Clone the repository:
git clone https://github.com/yourusername/kafka-spark-data-engineering.git
cd kafka-spark-data-engineering
  2. Start the infrastructure:
docker-compose up -d
  3. Build and run the transaction producer:
mvn clean package
java -jar target/transaction-producer.jar
  4. Start the Spark processor:
python spark_processor.py

Accessing Services

🏗 Architecture

graph LR
    subgraph "Producers"
        java[Java]
        python[Python]
    end

    subgraph "Kafka Ecosystem"
        kc[Kafka Controllers]
        kb[Kafka Brokers]
        sr[Schema Registry]
    end

    subgraph "Processing"
        master[Spark Master]
        w1[Worker]
        w2[Worker]
        w3[Worker]
    end

    subgraph "Analytics Stack"
        es[Elasticsearch]
        logstash[Logstash]
        kibana[Kibana]
    end

    %% Connections
    java -- Streaming --> kb
    python -- Streaming --> kb
    kc -- Coordination --> kb
    kb --> sr
    kb -- Streaming --> master
    master --> w1
    master --> w2
    master --> w3
    w1 --> es
    w2 --> es
    w3 --> es
    logstash --> es
    es --> kibana

🔧 Configuration

Kafka Configuration

  • Bootstrap Servers: localhost:29092,localhost:39092,localhost:49092
  • Topics:
    • financial_transactions: Raw transaction data
    • transaction_aggregates: Processed aggregations
    • transaction_anomalies: Detected anomalies
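
The topics can be created ahead of time from Python; a minimal sketch using kafka-python is shown below. The library choice and the partition count of 6 are assumptions; the topic names, bootstrap servers, and replication factor of 3 come from this configuration.

# Minimal sketch: create the three topics (partition count is an assumption).
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers="localhost:29092,localhost:39092,localhost:49092"
)

topics = [
    NewTopic(name="financial_transactions", num_partitions=6, replication_factor=3),
    NewTopic(name="transaction_aggregates", num_partitions=6, replication_factor=3),
    NewTopic(name="transaction_anomalies", num_partitions=6, replication_factor=3),
]
admin.create_topics(new_topics=topics)
admin.close()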

Spark Configuration

  • Master URL: spark://spark-master:7077
  • Worker Resources:
    • Cores: 2 per worker
    • Memory: 2GB per worker
  • Checkpoint Directory: /mnt/spark-checkpoints
  • State Store: /mnt/spark-state
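
A minimal PySpark sketch of a session wired to these settings is shown below. It is illustrative only: the app name and the spark-sql-kafka package coordinate are assumptions, and how /mnt/spark-state is wired into the state store provider is left to the repo's own configuration.

# Illustrative session setup using the documented master URL, worker resources,
# and checkpoint directory (not necessarily identical to spark_processor.py).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("financial-transactions-processor")   # hypothetical app name
    .master("spark://spark-master:7077")
    .config("spark.executor.cores", "2")            # 2 cores per worker
    .config("spark.executor.memory", "2g")          # 2 GB per worker
    .config("spark.sql.streaming.checkpointLocation", "/mnt/spark-checkpoints")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")  # Kafka source for Spark 3.5.0
    .getOrCreate()
)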

📈 Performance Optimization

Producer Optimizations

  • Batch Size: 64KB
  • Linger Time: 3ms
  • Compression: Snappy
  • Acknowledgments: 1 (leader acknowledgment)
  • Multi-threaded production: 3 concurrent threads
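
The producer in this repository is written in Java; purely as an illustration, the same tuning expressed with the Python kafka-python client looks roughly like this (the library, JSON serialization, and the sample record are assumptions):

# Illustrative only: the project's producer is Java; this mirrors its tuning in Python.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:29092,localhost:39092,localhost:49092",
    batch_size=64 * 1024,        # 64 KB batches
    linger_ms=3,                 # wait up to 3 ms to fill a batch
    compression_type="snappy",   # requires the python-snappy package
    acks=1,                      # leader acknowledgment only
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A single producer instance is thread-safe, so the three producer threads can share it.
# The record below is made up for illustration.
producer.send("financial_transactions", {"transactionId": "example-id", "amount": 42.5})
producer.flush()

acks=1 trades a small durability window for throughput: even with triple replication, records acknowledged only by the leader can be lost if the leader fails before followers copy them.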

Consumer Optimizations

  • Structured Streaming for efficient processing
  • Checkpointing for fault tolerance
  • State store for stateful operations
  • Partition optimization for parallel processing
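
As a concrete illustration of these points, the sketch below reads the raw topic, applies a watermark, and maintains a windowed per-merchant aggregate that is written back to transaction_aggregates. The field names and types mirror the Transaction schema above; the window size, watermark, grouping key, output mode, and the epoch-milliseconds interpretation of transactionTime are assumptions rather than the repo's actual logic.

# Illustrative stateful Structured Streaming job; not the repo's actual spark_processor.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (BooleanType, DoubleType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# Field names/types mirror the Transaction schema documented above.
schema = StructType([
    StructField("transactionId", StringType()),
    StructField("amount", DoubleType()),
    StructField("userId", StringType()),
    StructField("transactionTime", LongType()),
    StructField("merchantId", StringType()),
    StructField("transactionType", StringType()),
    StructField("location", StringType()),
    StructField("paymentMethod", StringType()),
    StructField("isInternational", BooleanType()),
    StructField("currency", StringType()),
])

transactions = (
    spark.readStream.format("kafka")
    # Use the brokers' in-network addresses instead when running inside Docker Compose.
    .option("kafka.bootstrap.servers", "localhost:29092,localhost:39092,localhost:49092")
    .option("subscribe", "financial_transactions")
    .option("startingOffsets", "latest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("tx"))
    .select("tx.*")
    # Assumes transactionTime is epoch milliseconds.
    .withColumn("eventTime", (F.col("transactionTime") / 1000).cast("timestamp"))
)

# Watermark + window makes this a stateful aggregation backed by the state store
# and recoverable from the checkpoint directory.
aggregates = (
    transactions
    .withWatermark("eventTime", "1 minute")
    .groupBy(F.window("eventTime", "1 minute"), F.col("merchantId"))
    .agg(F.sum("amount").alias("totalAmount"), F.count("*").alias("txCount"))
)

query = (
    aggregates
    .select(F.to_json(F.struct("window", "merchantId", "totalAmount", "txCount")).alias("value"))
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:29092,localhost:39092,localhost:49092")
    .option("topic", "transaction_aggregates")
    .option("checkpointLocation", "/mnt/spark-checkpoints/aggregates")
    .outputMode("update")
    .start()
)
query.awaitTermination()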

🔍 Monitoring

  • Transaction throughput logging
  • Kafka cluster health monitoring via RedPanda Console
  • Spark UI for job tracking and performance metrics
  • Producer performance metrics with throughput calculation
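
For the throughput logging mentioned above, one simple approach is to time sends in the producer loop and log a rate at a fixed interval; the helper below is a sketch, and its interval and log format are arbitrary choices rather than what the producer actually does.

# Illustrative throughput logging helper.
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("producer-metrics")


class ThroughputLogger:
    """Accumulates a message count and logs messages/second once per interval."""

    def __init__(self, interval_s: float = 10.0) -> None:
        self.interval_s = interval_s
        self.sent = 0
        self.window_start = time.monotonic()

    def record(self, count: int = 1) -> None:
        self.sent += count
        elapsed = time.monotonic() - self.window_start
        if elapsed >= self.interval_s:
            logger.info("throughput: %.0f msg/s", self.sent / elapsed)
            self.sent, self.window_start = 0, time.monotonic()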

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.