Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in object_by_id_cache #20450

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

mystenmark
Copy link
Contributor

@mystenmark mystenmark commented Nov 27, 2024

Suppose that a reader thread is trying to cache an object that it just read, while a writer thread is trying to cache an object that it just wrote. The writer thread definitionally has the latest version. The reader thread may be out of date. While we previously took some care to not replace a new version with an old version, this did not take into account evictions, and so the following bug was possible:

  READER                            WRITER
  read object_by_id_cache (miss)
  read dirty set (miss)
                                    write to dirty
  read db (old version)
                                    write to cache (while holding dirty lock)
                                    cache entry is evicted
  write to cache

There is no way for the reader to tell that the value it is caching is out of date, because the up to date entry is already gone from the cache.

We fix this by requiring reader threads to obtain a ticket before they read from the dirty set and/or db. Tickets are expired by writers. Then, the above case looks like this:

  READER                            WRITER
  get ticket
  read cache (miss)
  read dirty (miss)
                                    write dirty
  read db (old version)
                                    expire ticket
                                    write cache (while holding dirty lock)
                                    cache eviction
  no write to cache (ticket expired)

Any interleaving of the above either results in the reader seeing a recent version, or else having an expired ticket.

Copy link

vercel bot commented Nov 27, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
sui-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 27, 2024 5:06pm
3 Skipped Deployments
Name Status Preview Comments Updated (UTC)
multisig-toolkit ⬜️ Ignored (Inspect) Visit Preview Nov 27, 2024 5:06pm
sui-kiosk ⬜️ Ignored (Inspect) Visit Preview Nov 27, 2024 5:06pm
sui-typescript-docs ⬜️ Ignored (Inspect) Visit Preview Nov 27, 2024 5:06pm

@mystenmark mystenmark temporarily deployed to sui-typescript-aws-kms-test-env November 27, 2024 06:46 — with GitHub Actions Inactive
Suppose that a reader thread is trying to cache an object that it just
read, while a writer thread is trying to cache an object that it just
wrote. The writer thread definitionally has the latest version. The
reader thread may be out of date. While we previously took some care
to not replace a new version with an old version, this did not take into
account evictions, and so the following bug was possible:

      READER                            WRITER
      read object_by_id_cache (miss)
      read dirty set (miss)
                                        write to dirty
      read db (old version)
                                        write to cache (while holding dirty lock)
                                        cache entry is evicted
      write to cache

There is no way for the reader to tell that the value is is caching is
out of date, because the update to date entry is already gone from the
cache.

We fix this by requiring reader threads to obtain a ticket before they
read from the dirty set and/or db. Tickets are expired by writers. Then,
the above case looks like this:

      READER                            WRITER
      get ticket
      read cache (miss)
      read dirty (miss)
                                        write dirty
      read db (old version)
                                        expire ticket
                                        write cache (while holding dirty lock)
                                        cache eviction
      no write to cache (ticket expired)

Any interleaving of the above either results in the reader seeing a
recent version, or else having an expired ticket.
@mystenmark mystenmark temporarily deployed to sui-typescript-aws-kms-test-env November 27, 2024 06:47 — with GitHub Actions Inactive
@mystenmark mystenmark changed the title Fix bug in object_by_id_cache. Fix bug in object_by_id_cache Nov 27, 2024
// new ticket after this point, then it will read the new value from
// the dirty set (or db).
if matches!(ticket, Ticket::Write) {
gen.fetch_add(1, std::sync::atomic::Ordering::Release);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this scenario possible?

Key 1 Gen 0 is not in cache

Writer calls insert Key 1 and fetch_add Gen 1
Reader obtains ticket Gen 1 for Key 1
Reader misses cache read and reads old value for Key 1 from DB
Writer writes new value for Key 1 to dirty cache
Cache value for Key 1 gets evicted
Reader caches old value in cache with valid Gen 1 ticket

If so, would swapping the generation fetch_add to after the cache update fix it? Or maybe cache eviction also updates generation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not possible because the writer holds the lock on the dirty cache while it calls fetch_add. reader waits for that lock before checking in the dirty cache. So reader will either complete its read before the fetch_add (in which case it will get an old ticket), or it will wait until after the fetch_add, in which case it is guaranteed to read the newest value from the dirty set (or possibly the db if the reader waits so long that the value has been flushed to the db).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants