
Commit

Merge branch 'main' into delta-vs-orc
MrPowers authored Sep 20, 2024
2 parents 114714c + 37ad9a1 commit bc3640d
Showing 3 changed files with 4 additions and 2 deletions.
Binary file added src/blog/delta-lake-vs-data-lake/image5.png
2 changes: 2 additions & 0 deletions src/blog/delta-lake-vs-data-lake/index.mdx
@@ -134,6 +134,8 @@ To read your data from a Parquet data lake, you will first have to list all the

Delta Lake stores the paths to all of the underlying Parquet files in the transaction log. The transaction log is a separate file, so fetching these paths doesn’t require an expensive file listing operation. The more files you have, the faster it will be to read your data with Delta Lake compared to regular Parquet files.

![](image5.png)
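
A rough sketch of the difference, using the `deltalake` Python package; the directory paths below are hypothetical:

```python
import glob

from deltalake import DeltaTable

# Plain Parquet data lake: the engine must list every file under the
# directory before it can plan the read, which is slow on object stores.
parquet_files = glob.glob("/data/my_parquet_lake/**/*.parquet", recursive=True)

# Delta Lake: the file paths are read straight out of the transaction
# log (_delta_log), so no directory listing is needed.
dt = DeltaTable("/data/my_delta_table")
delta_files = dt.files()
```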

### Delta Lake vs Data Lake: Metadata

Regular Parquet files store metadata about column values in the footer of each file. This metadata contains min/max values of the columns per row group. This means that when you want to read the metadata of your data lake, you will have to read the metadata from each individual Parquet file. This requires fetching each file and grabbing the footer metadata, which is slow when you have lots of Parquet files.
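
As an illustration, here is roughly what collecting those footer statistics looks like with `pyarrow`, assuming a hypothetical directory of Parquet files:

```python
import glob

import pyarrow.parquet as pq

# Every Parquet file's footer has to be fetched and parsed individually
# to collect the min/max statistics for its row groups.
for path in glob.glob("/data/my_parquet_lake/*.parquet"):
    metadata = pq.ParquetFile(path).metadata
    for rg in range(metadata.num_row_groups):
        stats = metadata.row_group(rg).column(0).statistics
        if stats is not None and stats.has_min_max:
            print(path, stats.min, stats.max)
```
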
4 changes: 2 additions & 2 deletions src/blog/delta-lake-vs-parquet-comparison/index.mdx
@@ -192,7 +192,7 @@ Delta Lake allows for schema evolution so you can seamlessly add new columns to

Suppose you append a DataFrame to a Parquet table with a mismatched schema. In that case, you must remember to set a specific option every time you read the table to ensure accurate results. Query engines usually take shortcuts when determining the schema of a Parquet table. They look at the schema of one file and just assume that all the other files have the same schema.

- The engine can consults the schema of all the files in a Parquet table when determining the schema of the overall table when you manually set a flag. Checking the schema of all the files is more computationally expensive, so it isn’t set by default. Delta Lake schema evolution is better than what’s offered by Parquet.
+ The engine consults the schema of all the files in a Parquet table when determining the schema of the overall table when you manually set a flag. Checking the schema of all the files is more computationally expensive, so it isn’t set by default. Delta Lake schema evolution is better than what’s offered by Parquet.
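
A hedged sketch of both sides, assuming an active SparkSession named `spark`, a DataFrame `new_df` with extra columns, and hypothetical table paths:

```python
# Plain Parquet: by default the engine infers the schema from a single
# file. To have it reconcile the schemas of every file, you must set a
# flag explicitly, e.g. Spark's mergeSchema read option.
df = (
    spark.read.option("mergeSchema", "true")
    .parquet("/data/my_parquet_lake")
)

# Delta Lake: schema evolution happens at write time, so readers always
# get one consistent schema from the transaction log.
(
    new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/data/my_delta_table")
)
```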

## Delta Lake vs. Parquet: check constraints

@@ -210,7 +210,7 @@ Versioned data also impacts how engines execute certain transactions. For exampl

Parquet tables don’t support versioned data. When you remove data from a Parquet table, you actually delete it from storage, which is referred to as a “physical delete”.

- Logical data operations are better because they are safer and allow for mistakes to be reversed. If you overwrite a Parquet table, it is an irreversible error (unless there is a separate mechanism backing up the data). It’s easy to undo an overwrite tranaction in a Delta table.
+ Logical data operations are better because they are safer and allow for mistakes to be reversed. If you overwrite a Parquet table, it is an irreversible error (unless there is a separate mechanism backing up the data). It’s easy to undo an overwrite transaction in a Delta table.

See this blog post on [Why PySpark append and overwrite operations are safer in Delta Lake than Parquet tables](https://delta.io/blog/2022-11-01-pyspark-save-mode-append-overwrite-error/) to learn more.
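
For illustration, here is roughly how an accidental overwrite can be rolled back with Delta Lake's Python API; the table path and version number are hypothetical:

```python
from delta.tables import DeltaTable

# Assumes an active SparkSession named `spark` and a Delta table at a
# hypothetical path that was just overwritten by mistake.
dt = DeltaTable.forPath(spark, "/data/my_delta_table")

# Roll the table back to the version that existed before the overwrite.
dt.restoreToVersion(1)

# Or read the earlier version directly without modifying the table.
old_df = (
    spark.read.format("delta")
    .option("versionAsOf", 1)
    .load("/data/my_delta_table")
)
```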

