- Fast, general-purpose cluster computing system (faster than Hadoop MapReduce)
- Roughly 10x faster on disk and up to 100x faster in memory
- Hadoop's architecture incurs disk I/O every time a request is made
- Spark's core idea is to do this work in memory instead of on disk
- Provides high-level APIs in Java, Scala, Python, and R
- Integrates with Hadoop and its ecosystem and can read existing data (a minimal example follows this list)
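As a rough illustration of the high-level API, the PySpark sketch below creates a session and reads a text file. The path `data/logs.txt` is hypothetical; in a Hadoop deployment it could just as well be an `hdfs://` URI.

```python
# Minimal PySpark sketch: high-level API + reading existing data.
# The path "data/logs.txt" is hypothetical; on a Hadoop cluster it
# could be an HDFS URI such as "hdfs://namenode:9000/logs.txt".
from pyspark.sql import SparkSession
from pyspark.sql.functions import length

# Entry point for the DataFrame API (reuses an existing session if any).
spark = SparkSession.builder.appName("spark-intro").getOrCreate()

# Spark reads plain text here, but also HDFS, Hive, JSON, Parquet, ...
lines = spark.read.text("data/logs.txt")

# One transformation + one action: count the non-empty lines.
print(lines.filter(length("value") > 0).count())

spark.stop()
```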
- Hadoop MapReduce writes most of its intermediate results to disk, which is a slow process
- Spark overcomes this limitation and optimises processing speed by minimising disk read/write operations for intermediate results: it stores them in memory and performs disk operations only when essential (see the caching sketch after this list)
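To make the in-memory point concrete, here is a minimal PySpark sketch with a synthetic dataset and illustrative names: `cache()` marks an intermediate DataFrame for in-memory storage, so later actions reuse it instead of recomputing it, where MapReduce would have written it to disk between stages.

```python
# Sketch of in-memory reuse of an intermediate result via cache().
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(0, 10_000_000)                 # synthetic data
squares = df.selectExpr("id", "id * id AS sq")  # intermediate result

# Mark the intermediate DataFrame for in-memory storage. Without this,
# each action below would recompute it from its lineage; MapReduce
# would instead have written it to disk between stages.
squares.cache()

print(squares.count())                          # computes and caches
print(squares.filter("sq % 2 = 0").count())     # served from memory

spark.stop()
```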
Spark supports data analysis, machine learning, graph processing, streaming data, etc.
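Each of those workloads maps to a bundled Spark library (Spark SQL, MLlib, Structured Streaming, and GraphX for graphs). The sketch below touches the Python entry points for the first two; GraphX is Scala/Java-only, so graphs are omitted here, and all data is made up for illustration.

```python
# Quick tour of the built-in libraries behind two of the workloads.
from pyspark.sql import SparkSession       # Spark SQL / data analysis
from pyspark.ml.feature import Tokenizer   # MLlib (DataFrame-based ML)

spark = SparkSession.builder.appName("workloads").getOrCreate()

# Data analysis: run SQL over a DataFrame registered as a view.
spark.range(5).createOrReplaceTempView("t")
spark.sql("SELECT id, id * 2 AS doubled FROM t").show()

# Machine learning: a simple MLlib transformer splitting text into words.
docs = spark.createDataFrame([("hello spark world",)], ["text"])
Tokenizer(inputCol="text", outputCol="words").transform(docs).show()

# Streaming: spark.readStream exposes Structured Streaming sources
# (e.g. socket, Kafka); omitted here since it needs a live source.

spark.stop()
```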