Apache Spark

  • Fast and general-purpose cluster computing system (faster than Hadoop MapReduce)
  • Roughly 10x faster on disk and up to 100x faster in memory
    • Hadoop MapReduce incurs disk I/O each time a request (job stage) is made
    • Spark's core idea is to replace much of this disk I/O with in-memory processing
  • Provides high-level APIs in Java, Scala, Python and R (see the sketch after this list)
  • Integrates with Hadoop and its ecosystem and can read existing data
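
As a quick illustration of these high-level APIs, here is a minimal PySpark word-count sketch, assuming a local installation with the `pyspark` package; the input path `data/sample.txt` is hypothetical:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes the pyspark package is installed)
spark = (SparkSession.builder
         .appName("WordCount")
         .master("local[*]")
         .getOrCreate())

# Read a text file into an RDD (path is hypothetical)
lines = spark.sparkContext.textFile("data/sample.txt")

# Classic word count: split lines, map each word to (word, 1), sum by key
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
spark.stop()
```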

MapReduce vs Spark

[Diagrams comparing Hadoop MapReduce and Spark execution flow]

Spark helps overcome this limitation and optimises processing speed:

  • Hadoop MapReduce writes most of its intermediate results to disk, which is slow
  • Spark achieves its speed-up by minimising disk read/write operations for intermediate results, storing them in memory and performing disk operations only when essential (see the sketch below)
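
A minimal sketch of this idea, assuming the `pyspark` package and a hypothetical `data/events.log` file: `cache()` keeps the parsed intermediate RDD in memory, so the second action reuses it instead of re-reading and re-parsing the file from disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Hypothetical input path; the parsed RDD is an intermediate result
records = sc.textFile("data/events.log").map(lambda line: line.split(","))

# cache() marks the RDD to be kept in memory after it is first computed,
# so the two actions below do not each re-read and re-parse the file
records.cache()

total = records.count()                                      # first action: computes and caches
errors = records.filter(lambda r: r[0] == "ERROR").count()   # second action: reuses cached partitions

print(total, errors)
spark.stop()
```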

Spark supports data analysis (Spark SQL), machine learning (MLlib), graph processing (GraphX), streaming data (Spark Streaming), and more.
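
As a small illustration of the data-analysis side, the sketch below (assuming `pyspark` is installed; the data is made up) builds a DataFrame and queries it with Spark SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLDemo").master("local[*]").getOrCreate()

# Build a small DataFrame in memory (illustrative data only)
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Register a temporary view and run a SQL query against it
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```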

Architecture