Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_timeline_copy is flaky because of race condition inside moto/s3 #10667

Open
alexanderlaw opened this issue Feb 5, 2025 · 0 comments
Open
Labels
a/test/flaky/investigated a/test/flaky Area: related to flaky tests t/bug Issue Type: Bug

Comments

@alexanderlaw
Copy link
Contributor

A number of test_timeline_copy failures, e. g.:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/13040862411/index.html#/testresult/174be23113d55fb0 # [release-pg15-100000] / ARM64

test_runner/regress/test_wal_acceptor.py:2447: in test_timeline_copy
    sk.http_client().copy_timeline(
test_runner/fixtures/safekeeper/http.py:291: in copy_timeline
    res.raise_for_status()
/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/requests/models.py:1024: in raise_for_status
    raise HTTPError(http_error_msg, response=self)
E   requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:21530/v1/tenant/b2fcb6f628b105d018e72f58a7c6e8d5/timeline/2fb1c61e998b27a863bf82d06866b247/copy

...
2025-01-30 00:01:50.547 INFO [_internal.py:97] 127.0.0.1 - - [30/Jan/2025 00:01:50] "�[35m�[1mPUT /safekeeper-test-timeline-copy-release-pg15-100000-//b2fcb6f628b105d018e72f58a7c6e8d5/d834403b51e253bee4e5d6e79cbeaa2d/000000010000000000000001?x-id=CopyObject HTTP/1.1�[0m" 500 -
2025-01-30 00:01:50.578 ERROR [_internal.py:97] Error on request:
Traceback (most recent call last):
  File ".../site-packages/werkzeug/serving.py", line 370, in run_wsgi
    execute(self.server.app)
...
  File ".../site-packages/moto/s3/responses.py", line 1673, in _key_response_put
    self.backend.copy_object(
  File ".../site-packages/moto/s3/models.py", line 2792, in copy_object
    value=src_key.value,
          ^^^^^^^^^^^^^^
  File ".../site-packages/moto/s3/models.py", line 184, in value
    self._value_buffer.seek(0)
  File "/home/nonroot/.pyenv/versions/3.11.10/lib/python3.11/tempfile.py", line 825, in seek
    return self._file.seek(*args)
           ^^^^^^^^^^^^^^^^^^^^^^
ValueError: seek of closed file

and:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10556/13029076128/index.html#/testresult/71e92c6c93540a42 # [release-pg17-100000] / ARM64
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6560/13026466544/index.html#/testresult/f57c131cbb605e50 # [release-pg14-100000] / ARM64
https://neon-github-public-dev.s3.amazonaws.com/reports/main/12994277281/index.html#/testresult/1954bd3c905604dd # [release-pg16-100000] / X64
https://neon-github-public-dev.s3.amazonaws.com/reports/main/12748032596/index.html#/testresult/23d45078dcdc321e # [release-pg17-100000] / X64

demonstrate a race condition inside moto/s3/utils.py / setlist().

With
time.sleep(.1)
added just before the last line of the function:
https://github.com/getmoto/moto/blob/3012a95d69bf5a0f6e39695e2530bdff12982028/moto/s3/utils.py#L156-L168

I get
BUILD_TYPE=release poetry run pytest test_runner/regress/test_wal_acceptor.py::test_timeline_copy >test.log
failing with the same diagnostics on a first or second run.

Initially observed with moto 5.0.6, but then reproduced with 5.0.28 and on the current master.

With extra debug logging (see attached a patch for the master: moto-debugging.patch.txt ), I'm observing the following events:

...
2025-02-05 11:24:13.913 INFO [_internal.py:97] 127.0.0.1 - - [05/Feb/2025 11:24:13] "PUT /safekeeper-test-timeline-copy-release-pg14-500000-//3efc93439b7dd564025818f7d63d14cb/e43b91266996653cfa36101cb42948f5/000000010000000000000001?x-id=PutObject HTTP/1.1" 200 -

### moto replies to PutObject query for 3efc.../e43b.../00000001000000000000000

...
2025-02-05 11:24:14.14  INFO [models.py:2238] !!!put_object| <Thread(Thread-56 (process_request_thread), started daemon 136614384764608)>: /3efc93439b7dd564025818f7d63d14cb/e43b91266996653cfa36101cb42948f5/000000010000000000000001 False, [<moto.s3.models.FakeKey object at 0x7c400f352510>]
...
2025-02-05 11:24:14.16  INFO [models.py:2240]     !!!put_object| bucket.key: /3efc93439b7dd564025818f7d63d14cb/e43b91266996653cfa36101cb42948f5/000000010000000000000001
2025-02-05 11:24:14.16  INFO [models.py:2242]         !!!put_object| valuse: <moto.s3.models.FakeKey object at 0x7c400f356f00> / <tempfile.SpooledTemporaryFile object at 0x7c400f356080>

### put_object() finds the previous version of the object value (3efc.../e43b.../00000001000000000000000) and disposes it (closes the underlying SpooledTemporaryFile 7c400f356080):

2025-02-05 11:24:14.16  INFO [utils.py:169] !!!setlist| disposing existing_version: <Thread(Thread-56 (process_request_thread), started daemon 136614384764608)> / /3efc93439b7dd564025818f7d63d14cb/e43b91266996653cfa36101cb42948f5/000000010000000000000001, <moto.s3.models.FakeKey object at 0x7c400f356f00> / <tempfile.SpooledTemporaryFile object at 0x7c400f356080>
...
2025-02-05 11:24:14.103 INFO [responses.py:1809] !!!copy_object| <Thread(Thread-62 (process_request_thread), started daemon 136614565119680)> / ParseResult(scheme='', netloc='', path='safekeeper-test-timeline-copy-release-pg14-500000-//3efc93439b7dd564025818f7d63d14cb/e43b91266996653cfa36101cb42948f5/000000010000000000000001', params='', query='', fragment='') / safekeeper-test-timeline-copy-release-pg14-500000- / /3efc93439b7dd564025818f7d63d14cb/e43b91266996653cfa36101cb42948f5/000000010000000000000001 / None / <tempfile.SpooledTemporaryFile object at 0x7c400f356080>
2025-02-05 11:24:14.103 INFO [responses.py:1814] !!!copy_object| + <moto.s3.models.FakeBucket object at 0x7c400f353080> / <moto.s3.models.FakeKey object at 0x7c400f356f00> / <tempfile.SpooledTemporaryFile object at 0x7c400f356080>

### copy_object() attempts to copy object 3efc.../e43b.../000000010000000000000001, which still has underlying file 7c400f356080,
### though it should be replaced by put_object(), and fails:

2025-02-05 11:24:14.103 INFO [_internal.py:97] 127.0.0.1 - - [05/Feb/2025 11:24:14] "^[[35m^[[1mPUT /safekeeper-test-timeline-copy-release-pg14-500000-//3efc93439b7dd564025818f7d63d14cb/2f9f886d9a886a0d21b5299c28c8e5c8/000000010000000000000001?x-id=CopyObject HTTP/1.1^[[0m" 500 -
...
  File ".../site-packages/moto/s3/models.py", line 2977, in copy_object
    value=src_key.value,
          ^^^^^^^^^^^^^^
  File ".../site-packages/moto/s3/models.py", line 194, in value
    self._value_buffer.seek(0)
  File "/usr/lib/python3.12/tempfile.py", line 967, in seek
    return self._file.seek(*args)
           ^^^^^^^^^^^^^^^^^^^^^^
ValueError: seek of closed file

Full log of the failed run

While testing, I also saw errors "ValueError: read of closed file", which doesn't affect the test result, that is the test finishes successfully despite these errors.

@alexanderlaw alexanderlaw added a/test/flaky Area: related to flaky tests a/test/flaky/investigated t/bug Issue Type: Bug labels Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/test/flaky/investigated a/test/flaky Area: related to flaky tests t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

1 participant