Designing Data-Intensive Applications: 7. Transactions #12

guilleiguaran opened this issue May 31, 2019 · 1 comment

This week in charge: @guilleiguaran
Next week: @duende84


guilleiguaran commented May 31, 2019

Transactions

  • Simplify the programming model for applications
  • Applications can ignore certain potential error scenarios because the DB takes care of them through safety guarantees

ACID

  • A set of safety guarantees: Atomicity, Consistency, Isolation, Durability.
  • BASE ("not ACID"): Basically Available, Soft state, Eventual consistency.
  • Atomicity: the writes grouped into a transaction either all succeed (and are committed) or are all aborted on a fault; partially applied transactions are never allowed.
  • Consistency: invariants on the data must hold. The DB offers basic support (uniqueness constraints, foreign-key references), but maintaining consistency is mostly the application's responsibility.
  • Isolation: a client can safely ignore operations by other clients happening at the same time. Classically formalized as serializability: the DB ensures that committed transactions yield the same result as if they had run serially, even though they may actually have run concurrently.
  • Durability: once a transaction has committed, its data survives hardware faults and DB crashes. A single-node DB ensures the data has been written to nonvolatile storage (even in-memory DBs write a log); a replicated DB ensures the data has been copied to some number of other nodes.
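
As a minimal sketch of atomicity (the accounts table and amounts are hypothetical), the two writes below commit together or not at all:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- if anything fails before this point, ROLLBACK undoes both updates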

Weak Isolation levels

  • "Non-serializable"
  • Serializable isolation carries a high performance cost, which is why many databases don't implement it.
  • Weak isolation levels protect against some race conditions, but not all of them.

Read Committed

  • The most basic isolation level for transactions.
  • When reading from the DB, you only see data that has been committed (no dirty reads).
  • When writing to the DB, you only overwrite data that has been committed (no dirty writes).
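
A sketch of the no-dirty-reads guarantee, using a hypothetical emails table; session B never observes A's uncommitted write:

-- Session A:
BEGIN;
UPDATE emails SET unread = false WHERE id = 42;  -- not yet committed

-- Session B, concurrently:
SELECT unread FROM emails WHERE id = 42;  -- still returns the old committed value (true)

-- Session A:
ROLLBACK;  -- B never saw the aborted write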

Implementation

  • Present in many DBs: Oracle, Postgres, SQL Server, MemSQL, etc.
  • Writes: dirty writes are prevented using row-level locks.
  • Reads: the DB remembers both the old committed value and the new value set by the transaction that holds the write lock; while that transaction is in progress, other reads are served the old value, and they switch to the new value once it commits.

Snapshot isolation and repeatable read

  • Non-repeatable reads (read skews) occur when a transaction reads values that another transaction is updating at the same time, so it sees a mix of old and new state.
  • Read skews are acceptable under read committed, but some situations can't tolerate them, e.g. backups, analytics queries, and integrity checks.
  • Snapshot isolation prevents read skew: each transaction reads from a consistent snapshot and sees only data that had been committed when the transaction started.
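
In Postgres, for example, snapshot isolation is what the REPEATABLE READ level provides; a minimal sketch for a long-running read (accounts table hypothetical):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT sum(balance) FROM accounts;  -- reads from a consistent snapshot
-- ... writes committed by other transactions stay invisible here ...
COMMIT;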

Implementation

  • Uses write locks, like read committed.
  • Readers never block writers, and writers never block readers.
  • Reads are a generalization of read committed: the DB must keep several committed versions of an object, because multiple in-progress transactions may need to see the state of the DB at different points in time. This is also known as multi-version concurrency control (MVCC).
  • Storage engines typically implement both read committed and snapshot isolation with MVCC.
  • Supported by many DBs, e.g. Postgres (at its repeatable read level), MySQL/InnoDB (repeatable read), Oracle (its serializable level), and others.

Lost Updates

  • Read committed and snapshot isolation offer guarantees about what a read-only transaction can see in the presence of concurrent writes, but they are not useful when two transactions write concurrently.
  • The lost update problem occurs when two concurrent transactions each execute a read-modify-write cycle: one of the modifications can be lost, because the second write doesn't include the first modification (the later write clobbers the earlier one). A sketch follows this list.
  • Multiple solutions have been designed to avoid lost updates: atomic operations, explicit locking, automatic detection of lost updates, and compare-and-set.
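
The vulnerable read-modify-write cycle looks like this (a sketch, using the counters table from the next section); if two clients run it concurrently, both may read 42 and both write 43, so one increment is lost:

BEGIN;
SELECT value FROM counters WHERE key = 'key';  -- both clients read 42
-- application code computes 42 + 1 = 43
UPDATE counters SET value = 43 WHERE key = 'key';  -- both clients write 43
COMMIT;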

Atomic operations

  • Many DBs provide atomic write operations.
  • Usually the best solution if supported by the database.
  • Supported in many DBs with the UPDATE statement:
UPDATE counters SET value = value + 1 WHERE key = 'key';
  • Also supported by NoSQL DBs such as Redis and MongoDB.
  • Usually implemented by taking an exclusive lock on the object when it is read, so no other transaction can read it until the update has been applied (also known as cursor stability).
  • An alternative implementation is to execute all atomic operations on a single thread.

Explicit locking

  • Supported in many DBs with the FOR UPDATE clause inside a transaction:
BEGIN TRANSACTION;
SELECT * FROM users WHERE id = '1' FOR UPDATE;
UPDATE users SET username = 'guille' WHERE id = '1';
COMMIT;
  • The FOR UPDATE clause tells the DB to take a lock on all rows returned by the query.
  • This works, but getting it right requires thinking carefully about the application logic; it is easy to forget a lock somewhere.

Automatic detection of lost updates

  • As an alternative to atomic writes and locks, the DB can allow transactions to run in parallel, detect a lost update, and abort the offending transaction so that it retries.
  • DBs can perform this check efficiently in conjunction with snapshot isolation; a sketch of what it looks like in Postgres follows this list.
  • Implemented by Postgres (repeatable read), Oracle (serializable), and SQL Server (snapshot isolation).
  • MySQL/InnoDB does not detect lost updates at its repeatable read level.
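
A sketch of the detection in Postgres at repeatable read: the second committer's conflicting write is rejected, and the application must retry:

-- Session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE counters SET value = value + 1 WHERE key = 'key';

-- Session B:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE counters SET value = value + 1 WHERE key = 'key';  -- blocks on A's row lock

-- When A commits, B's UPDATE fails:
-- ERROR: could not serialize access due to concurrent update
-- B must ROLLBACK and retry the whole transaction.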

Compare-and-set

  • Allows an update to happen only if the value has not changed since it was last read:
UPDATE wiki.pages SET content = 'new content' WHERE id = 123 AND content = 'old content';
  • If the DB allows the WHERE clause to read from an old snapshot, this statement may not prevent lost updates; check whether your DB's compare-and-set is safe before relying on it.

Write skews and phantoms

  • A generalization of the lost update problem: write skew can occur when two transactions read the same objects and then update some of those objects (different transactions may update different objects). A sketch follows this list.
  • Options for dealing with write skew are limited: atomic single-object operations don't help (multiple objects are involved); the automatic lost-update detection of snapshot isolation doesn't help either; most database constraints can't express the invariant (triggers or materialized views can be used, though); explicitly locking the rows the transaction depends on is an option; but the best option is true serializable isolation.
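
The book's on-call doctors example makes the failure concrete (a sketch; the doctors table is hypothetical). Both transactions check the same invariant and then update different rows, so no write-write conflict is ever detected:

-- Alice and Bob each run this concurrently under snapshot isolation:
BEGIN;
SELECT count(*) FROM doctors WHERE on_call = true AND shift_id = 1234;
-- both see 2 doctors on call, so each decides it is safe to leave
UPDATE doctors SET on_call = false WHERE name = 'Alice' AND shift_id = 1234;
-- (Bob's transaction updates his own row instead)
COMMIT;
-- Result: zero doctors on call; the invariant is violated (write skew).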

Phantoms causing write skews

  • The effect where a write in one transaction changes the result of a search query in another transaction is called a phantom.
  • One way of avoiding phantoms is materializing conflicts: creating artificial rows that exist only so locks can be attached to them. Materializing conflicts should be a last resort when no alternative is available; serializable isolation is preferable in most cases. A sketch follows.
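
A sketch of materializing the conflict for the book's meeting-room booking example: one row is pre-created per room and time slot, so every booking has concrete rows to lock (the schema is hypothetical):

BEGIN;
SELECT * FROM room_time_slots
  WHERE room_id = 123
    AND slot BETWEEN '2019-05-31 13:00' AND '2019-05-31 14:00'
  FOR UPDATE;  -- locks the slot rows, so a competing booking blocks here
INSERT INTO bookings (room_id, start_time, end_time)
  VALUES (123, '2019-05-31 13:00', '2019-05-31 14:00');
COMMIT;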

Serializability

  • The strongest isolation level: it guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially.
  • The reason it isn't always used comes down to how it is implemented and how each implementation performs.
  • Implementations: actual serial execution, two-phase locking (2PL), and serializable snapshot isolation (SSI).

Serial execution

  • Every transaction must be small and fast; interactive multi-statement sessions aren't supported, so stored procedures are used instead (see the sketch after this list).
  • Limited to cases where the active dataset fits in memory; rarely accessed data can be moved to disk, but if a transaction needs it, the whole system slows down.
  • Write throughput must be low enough for a single CPU core to handle, or else transactions need to be partitioned without requiring cross-partition coordination.
  • Cross-partition transactions are possible, but there is a hard limit to the extent to which they can be used.
  • Implementations: Redis, Datomic, VoltDB/H-Store.
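
A sketch of the stored-procedure style (Postgres 11+ syntax, used here only for illustration; serial-execution systems such as VoltDB have their own procedure languages): the whole transaction ships to the DB as a single call, so there are no network round-trips mid-transaction:

CREATE PROCEDURE transfer(from_id int, to_id int, amount numeric)
LANGUAGE sql
AS $$
  UPDATE accounts SET balance = balance - amount WHERE id = from_id;
  UPDATE accounts SET balance = balance + amount WHERE id = to_id;
$$;

CALL transfer(1, 2, 100);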

Two-Phase Locking (2PL)

  • Similar to the locks used to prevent dirty writes, but much stronger.
  • Several transactions may read the same object concurrently as long as nobody is writing to it; as soon as anyone wants to write, exclusive access is required:
  • If transaction TX1 has read an object and TX2 wants to write to that object, TX2 must wait for TX1 to finish.
  • If TX1 has written an object and TX2 wants to read it, TX2 must wait for TX1 to finish; reading an old version of the object is not acceptable under 2PL.
  • In 2PL, writers don't just block other writers; they also block readers, and vice versa (see the sketch after this list).
  • 2PL protects against all the race conditions discussed earlier, including lost updates and write skew.
  • Implementations: MySQL/InnoDB (serializable level), SQL Server (serializable level), DB2 (repeatable read level).
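
In SQL terms the two lock modes surface as FOR SHARE (shared/read lock) and FOR UPDATE (exclusive/write lock); a sketch in Postgres syntax, reusing the users table from above:

-- Reader: takes a shared lock; other readers may coexist, writers must wait.
BEGIN;
SELECT * FROM users WHERE id = '1' FOR SHARE;
COMMIT;

-- Writer: needs the exclusive lock; waits until all shared locks are released.
BEGIN;
SELECT * FROM users WHERE id = '1' FOR UPDATE;
UPDATE users SET username = 'guille' WHERE id = '1';
COMMIT;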

Performance of 2PL

  • Performance is significantly worse than at weaker isolation levels, partly because of the overhead of acquiring and releasing locks, but mostly because of reduced concurrency.
  • Deadlocks can also have a significant impact on performance.

Index-range locks

  • 2PL implementations attach locks to indexes, covering ranges of keys rather than individual objects.
  • If another transaction wants to write, it has to update the same part of the index, where it will notice the lock; the write can't proceed until the lock is released.
  • If there is no suitable index for a range lock, the DB can fall back to locking the whole table.

Serializable Snapshot Isolation (SSI)

  • Provides full serializability with only a small performance penalty compared to snapshot isolation.
  • SSI is an optimistic concurrency control technique: instead of blocking when something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out fine. When a transaction wants to commit, the DB checks whether anything bad happened; if so, the transaction is aborted and must be retried.
  • To provide serializable isolation, the DB must detect situations in which a transaction may have acted on an outdated premise and abort it in that case. There are two cases to consider: 1) detecting reads of a stale MVCC object version (an uncommitted write occurred before the read); 2) detecting writes that affect prior reads (the write occurs after the read).
  • Implementations: used by both single-node and distributed DBs, e.g. Postgres 9.1+ (its serializable level) and FoundationDB. A usage sketch follows.
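
A usage sketch in Postgres: run the transaction at the SERIALIZABLE level and be prepared to retry on abort (doctors example reused from the write skew section):

BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors WHERE on_call = true AND shift_id = 1234;
UPDATE doctors SET on_call = false WHERE name = 'Alice' AND shift_id = 1234;
COMMIT;
-- If a concurrent transaction invalidated the premise of the SELECT, the COMMIT fails:
-- ERROR: could not serialize access due to read/write dependencies among transactions
-- The application should catch this error and retry from BEGIN.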
