Dealing with journal version upgrades #31131

Open
sebastianst opened this issue Feb 5, 2025 · 5 comments · Fixed by ethereum-optimism/op-geth#497
@sebastianst
Contributor

sebastianst commented Feb 5, 2025

Whenever the journal version is upgraded, a new geth release will discard the outdated journal at startup with a message like

unexpected journal version want 3 got 2

This may then lead to the node being in a broken state, especially if it's a full node, e.g. complaining about missing trie nodes like

lvl=error msg="Failed to create sealing context" err="missing trie node 0a22215a54961846c48037c5b4e6ff243a96041f6262b57fe30f37e94b847442 (path ) state 0x0a22215a54961846c48037c5b4e6ff243a96041f6262b57fe30f37e94b847442 is not available"

The node then has to be manually recovered.

The list of journal version upgrades includes:

When geth went from 0 to 1, we introduced a journal version upgrade path in op-geth (ethereum-optimism/op-geth#368). But always adding upgrade paths doesn't seem like a scalable approach, now that the version has gone from 1 to 2 and then to 3.

What is the recommended way to deal with journal version upgrades? I couldn't find any recommendations in the geth release notes.

@rjl493456442
Member

> This may then lead to the node being in a broken state, especially if it's a full node, e.g. complaining about missing trie nodes like

This behavior is not expected in Geth. The discarded journal corresponds to the layers in memory. Geth can recover from losing all memory states, which is equivalent to an unclean shutdown.

Maybe it's something specific to op-geth?

@rjl493456442 rjl493456442 self-assigned this Feb 6, 2025
@sebastianst
Contributor Author

> > This may then lead to the node being in a broken state, especially if it's a full node, e.g. complaining about missing trie nodes like
>
> This behavior is not expected in Geth. The discarded journal corresponds to the layers in memory. Geth can recover from losing all memory states, which is equivalent to an unclean shutdown.
>
> Maybe it's something specific with op-Geth?

Thanks for the reply! Interesting, we'll investigate whether it has something to do with our diff then. It doesn't always happen, just sometimes.

@fjl
Contributor

fjl commented Feb 6, 2025

It could also be a bug of course!

@protolambda
Contributor

I investigated the logs of our node during the restart that removed the journal, and the pre-restart logs, and found: https://gist.github.com/protolambda/7e2002a0de7cf868fbc1617fffa656cd

I believe the journal became too large due to a large write-buffer during shutdown, and once the node removed the journal due to the version change, the gap was so large that the state was unavailable.

I opened a potential fix in op-geth to limit the number of difflayers in the write-buffer: ethereum-optimism/op-geth#497
If you think that's right, I can cherry-pick it and open a PR on upstream geth.
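The idea of capping how many diff layers may sit in the write buffer can be sketched roughly as below. This is not the patch from op-geth#497, only a toy model under stated assumptions: the `buffer` type, its fields, and the `maxLayers` cap are hypothetical names. The point is that a buffer flushed only by byte size can accumulate many small layers, while a layer cap bounds how much state exists only in the journal.

```go
package main

import "fmt"

// buffer is a toy stand-in for a write buffer that aggregates flattened
// diff layers before they are written to disk. All fields are hypothetical.
type buffer struct {
	layers    uint64 // diff layers merged in since the last flush
	size      uint64 // accumulated bytes since the last flush
	sizeLimit uint64 // flush once this many bytes are buffered
	maxLayers uint64 // flush once this many layers are buffered (the cap)
	flushes   int    // number of flushes performed, for demonstration
}

// commit merges one diff layer into the buffer and flushes when either
// the size limit or the layer cap is reached.
func (b *buffer) commit(layerSize uint64) {
	b.layers++
	b.size += layerSize
	if b.size >= b.sizeLimit || b.layers >= b.maxLayers {
		b.flush()
	}
}

// flush pretends to persist the buffered state and resets the counters.
func (b *buffer) flush() {
	b.layers, b.size = 0, 0
	b.flushes++
}

func main() {
	b := &buffer{sizeLimit: 1 << 20, maxLayers: 4}
	for i := 0; i < 10; i++ {
		b.commit(100) // small layers never reach the size limit
	}
	fmt.Println(b.flushes, b.layers) // 2 2
}
```

With the cap in place, at most `maxLayers - 1` layers are ever left unflushed in this model, so discarding the journal loses a bounded amount of state.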

@sebastianst
Contributor Author

reopening, as we have only (hopefully) fixed it in our fork

@sebastianst sebastianst reopened this Feb 10, 2025