Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A race condition which may lead to a dead lock inside ScanRows and causes DB locks which are never closed. #7403

Open
swojtasiak opened this issue Mar 28, 2025 · 1 comment
Assignees
Labels
type:critical critical questions

Comments

@swojtasiak
Copy link

swojtasiak commented Mar 28, 2025

I don't include the playground link because it's a race condition which is really hard to reproduce. In our case it happens once a few days, while we issue thousands of queries per seconds to our databases.

The issue shouldn't be difficult to understand taking into account the fact that the SQL API provided by Go is used in a wrong way in the scanning.

The rows.Err() method called here:

gorm/scan.go

Line 353 in a9d2729

if err := rows.Err(); err != nil && err != db.Error {

Can cause a dead lock. This method should be called only and only if we are sure that the underlying sql.Rows instance is closed. Unfortunately it's not always the case here, as it can be called just after the scan operation is done:

https://github.com/swojtasiak/gorm/blob/a9d27293de2267a36fa6c9f8892977d3159cf8ea/scan.go#L346

As the documentation says, every call to rows.Err() should be protected by a call to rows.Next() and only if it returns false, we can be sure that it's safe to call the rows.Err() method.

It's not safe, because every call to rows.Scan() may leave the rows instance in a read locked state, by acquiring a read lock on an internal RWMutex. The lock is held until rows.Close() or rows.Next() is called.

When rows.Err() is called it tries to acquire the read lock again before accessing the error instance. In most cases it's not an issue, because the lock is being acquired only for a moment and it's allowed to acquire it more than once for read operations. It just increments its internal number of calls.

The problem arises when a call to rows.Scan() leaves the rows instance in a locked state and before rows.Err() is called, another goroutine tries to close the rows instance due to a context cancellation, like here:

goroutine 5258713 [sync.RWMutex.Lock, 2190 minutes]:
sync.runtime_SemacquireRWMutex(0x10?, 0x20?, 0x4011aa9e01?)
	/usr/local/go/src/runtime/sema.go:87 +0x28
sync.(*RWMutex).Lock(0x4011aa9fa8?)
	/usr/local/go/src/sync/rwmutex.go:151 +0xf8
database/sql.(*Rows).close(0x4010285950, {0x268ca00, 0x3a30180})
	/usr/local/go/src/database/sql/sql.go:3399 +0x88
database/sql.(*Rows).awaitDone(0x4010285950, {0x26a8508, 0x4016e1a000}, {0x26a8540, 0x401327f590}, {0x26a8540, 0x4013553d60})
	/usr/local/go/src/database/sql/sql.go:3001 +0x1a4
created by database/sql.(*Rows).initContextClose in goroutine 9383
	/usr/local/go/src/database/sql/sql.go:2977 +0x144

The rows.close() method tries to acquire a WRITE lock on the same RWMutex and ends up being locked on it (Remember that call to rows.Scan() didn't release the read lock to allow the scanning code to scan returned rows in a safe way - it prevents closes). Then the rows.Err() method is called trying to acquire the READ lock again:

goroutine 9383 [sync.RWMutex.RLock, 2190 minutes]:
sync.runtime_SemacquireRWMutexR(0x400ba80418?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/sema.go:82 +0x28
sync.(*RWMutex).RLock(...)
	/usr/local/go/src/sync/rwmutex.go:70
database/sql.(*Rows).Err(0x4010285950)
	/usr/local/go/src/database/sql/sql.go:3128 +0x8c
gorm.io/gorm.Scan({0x26aedd8, 0x4010285950}, 0x4016e1a0c0, 0x1)

and we end up with a dead lock, because the RWMutex is already locked for reading by the waiting close operation which asked for a write lock, so another read lock is not allowed anymore.

The simplest fix would be to wrap the Err() call with another Next() operation. The ScanRows looks like it was supposed to be called only once for a rows instance:

if !rows.Next() {
	if err := rows.Err(); err != nil && err != db.Error {
		db.AddError(err)
	}
}

But for some reason, this is not the case in the tests, where it's called twice on a single Rows instance. While I have no clue what the author had in mind when implementing this operation, I won't provide a PR.

It's a really critical bug for loaded databases, because it causes DB locks which have to be closed manually.

edit:
Ok, I found that it's supposed to be called multiple times for one rows instance. It's a sample from the documentation:

rows, err := db.Model(&User{}).Where("name = ?", "jinzhu").Rows()
defer rows.Close()

for rows.Next() {
  var user User
  // ScanRows scans a row into a struct
  db.ScanRows(rows, &user)

  // Perform operations on each user
}

So it shouldn't call rows.Err() at all.

@github-actions github-actions bot added the type:critical critical questions label Mar 28, 2025
swojtasiak added a commit to swojtasiak/gorm that referenced this issue Mar 31, 2025
@swojtasiak
Copy link
Author

PR #7404

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:critical critical questions
Projects
None yet
Development

No branches or pull requests

2 participants