[#26192] YSQL: flaky test: TestPgRegressMisc.testPgRegressMiscSerial3
Summary:
The test `TestPgRegressMisc.testPgRegressMiscSerial3` is flaky (particularly
under tsan and asan builds). When it fails, we see an error like

```
21:26:55.191 (main) [ERROR - org.yb.BaseYBTest$1$2.logEventDetails(BaseYBTest.java:243)] YB Java test failed: class="org.yb.pgsql.TestPgRegressMisc", method="testPgRegressMiscSerial3"
org.junit.internal.runners.model.MultipleFailureException: There were 2 errors:
  java.lang.AssertionError(pg_regress exited with error code: 1, failed tests: [yb_create_table_like])
  com.yugabyte.util.PSQLException(ERROR: Cannot delete non-empty tablegroup, table 000034cb0000300080000000000040a5 is not deleted)
21:26:55.194 (main) [INFO - org.yb.BaseYBTest$1$2.logEventDetails(BaseYBTest.java:250)] YB Java test class="org.yb.pgsql.TestPgRegressMisc", method="testPgRegressMiscSerial3" took 331.45 seconds
```

After debugging, I found that the table `000034cb0000300080000000000040a5` is
an index on a base table in a tablegroup. When it is deleted, the relevant
code is

```
    auto colocated_tablet = table.table_info_with_write_lock->GetColocatedUserTablet();
    if (colocated_tablet) {
      // TryRemoveFromTablegroup only affects tables that are part of some tablegroup.
      // We directly remove it from tablegroup no matter if it is retained by snapshot schedules.
      RETURN_NOT_OK(TryRemoveFromTablegroup(table.table_info_with_write_lock->id()));
      // Send a RemoveTableFromTablet() request to each
      // colocated parent tablet replica in the table.
```

The code removes the table from its containing tablegroup only if the table
still has a colocated tablet. Because the table is an index of a base table,
deleting the base table also deletes the index table. In a race, another
thread may have already invoked `table->ClearTabletMaps`, so `colocated_tablet`
is nullptr and `TryRemoveFromTablegroup` is never invoked. The table
`000034cb0000300080000000000040a5` is therefore left in the tablegroup's
in-memory data structure, and when we later try to delete the tablegroup we
hit the error
`Cannot delete non-empty tablegroup, table 000034cb0000300080000000000040a5 is not deleted`.

To fix this bug, the condition `IsColocatedUserTable()` is now used, instead of
`colocated_tablet` being non-null, to decide whether to invoke
```
      RETURN_NOT_OK(TryRemoveFromTablegroup(table.table_info_with_write_lock->id()));
```
The table is then removed from the tablegroup even if a racing thread has
already cleared its tablets, which avoids the above error.
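
As a rough illustration of why the condition matters, here is a minimal, single-threaded sketch of the race outcome. All types and function names in it (`TableModel`, `TablegroupModel`, `DeleteTableOld`, `DeleteTableNew`, `TablegroupEmptyAfterRace`) are hypothetical stand-ins and are not taken from the catalog manager; the racing `ClearTabletMaps` call is modeled simply by resetting the tablet pointer before the delete path runs.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

// Hypothetical stand-ins; none of these names or layouts come from the
// YugabyteDB source. They only model the two conditions discussed above.
struct Tablet {};

struct TableModel {
  std::string id;
  bool colocated_user_table;       // property of the table itself
  std::shared_ptr<Tablet> tablet;  // may already be cleared by a racing thread
};

struct TablegroupModel {
  std::set<std::string> table_ids;  // in-memory membership of the tablegroup
  bool CanDelete() const { return table_ids.empty(); }
};

// Old condition: membership cleanup happens only if the tablet is still attached.
void DeleteTableOld(const TableModel& table, TablegroupModel& tablegroup) {
  if (table.tablet) {
    tablegroup.table_ids.erase(table.id);
    // Tablet-specific cleanup would follow here.
  }
}

// New condition: membership cleanup depends on the table being colocated,
// not on whether its tablet pointer is still set.
void DeleteTableNew(const TableModel& table, TablegroupModel& tablegroup) {
  if (table.colocated_user_table) {
    tablegroup.table_ids.erase(table.id);
  }
  if (table.tablet) {
    // Tablet-specific cleanup would follow here.
  }
}

// Returns whether the tablegroup is empty (and so deletable) after deleting
// an index table whose tablet was already cleared by another thread.
bool TablegroupEmptyAfterRace(bool use_new_condition) {
  TablegroupModel tablegroup;
  TableModel index{"index_table", true, std::make_shared<Tablet>()};
  tablegroup.table_ids.insert(index.id);

  index.tablet.reset();  // models the racing ClearTabletMaps call

  if (use_new_condition) {
    DeleteTableNew(index, tablegroup);
  } else {
    DeleteTableOld(index, tablegroup);
  }
  return tablegroup.CanDelete();
}
```

In this model, `TablegroupEmptyAfterRace(false)` returns false because the stale index id stays in the tablegroup, while `TablegroupEmptyAfterRace(true)` returns true.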

Test Plan: ./yb_build.sh tsan --java-test org.yb.pgsql.TestPgRegressMisc#testPgRegressMiscSerial3 -n 100 --tp 1

Reviewers: hsunder, zdrudi

Reviewed By: zdrudi

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D42167
myang2021 committed Mar 3, 2025
1 parent a308602 commit 267eb3f
Showing 1 changed file with 8 additions and 3 deletions.
11 changes: 8 additions & 3 deletions src/yb/master/catalog_manager.cc
@@ -6637,11 +6637,16 @@ Status CatalogManager::DeleteTableInternal(
     // Send a DeleteTablet() request to each tablet replica in the table.
     RETURN_NOT_OK(DeleteOrHideTabletsOfTable(
         *table.table_info_with_write_lock, table.delete_retainer, epoch));
+    // TryRemoveFromTablegroup only affects tables that are part of some tablegroup.
+    // We directly remove it from tablegroup no matter if it is retained by snapshot schedules.
+    // Note that we call TryRemoveFromTablegroup irrespective of colocated_tablet because
+    // it is possible that the tablet is already removed from table by a racing thread and
+    // in that case colocated_tablet will be nullptr.
+    if (table.table_info_with_write_lock.info->IsColocatedUserTable()) {
+      RETURN_NOT_OK(TryRemoveFromTablegroup(table.table_info_with_write_lock->id()));
+    }
     auto colocated_tablet = table.table_info_with_write_lock->GetColocatedUserTablet();
     if (colocated_tablet) {
-      // TryRemoveFromTablegroup only affects tables that are part of some tablegroup.
-      // We directly remove it from tablegroup no matter if it is retained by snapshot schedules.
-      RETURN_NOT_OK(TryRemoveFromTablegroup(table.table_info_with_write_lock->id()));
       // Send a RemoveTableFromTablet() request to each
       // colocated parent tablet replica in the table.
       if (!table.delete_retainer.IsHideOnly()) {