Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on long-running Simnet sessions #78

Closed
fariedt opened this issue Jan 1, 2025 · 21 comments
Closed

Crash on long-running Simnet sessions #78

fariedt opened this issue Jan 1, 2025 · 21 comments
Labels
help wanted Extra attention is needed

Comments

@fariedt
Copy link

fariedt commented Jan 1, 2025

This is with node v22.11.0 on macOS 15.2, against commit #a1467ddf5591696ddac70e79860d16be8dbf690b (master branch).

Long-running tests crash after about 1.8 million invariant tests. I ran into this with a small contract I was testing, and verified that it occurs with the example code. Output:

$ /usr/bin/time ./rv example counter invariant --seed=100 --runs=5000000 > counter-invariant.txt
(node:7159) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)

/Users/fn/Repositories/rendezvous/node_modules/@hirosystems/clarinet-sdk-wasm/clarinet_sdk.js:1575
    throw new Error(getStringFromWasm0(arg0, arg1));
          ^

Error: recursive use of an object detected which would lead to unsafe aliasing in rust
    at module.exports.__wbindgen_throw (/Users/fn/Repositories/rendezvous/node_modules/@hirosystems/clarinet-sdk-wasm/clarinet_sdk.js:1575:11)
    at wasm://wasm/01dccebe:wasm-function[5035]:0x4f8024
    at wasm://wasm/01dccebe:wasm-function[5034]:0x4f8019
    at wasm://wasm/01dccebe:wasm-function[4300]:0x472e4c
    at Proxy.getAccounts (/Users/fn/Repositories/rendezvous/node_modules/@hirosystems/clarinet-sdk-wasm/clarinet_sdk.js:542:26)
    at ChainArbitrary.chainer (/Users/fn/Repositories/rendezvous/invariant.js:56:105)
    at ChainArbitrary.valueChainer (/Users/fn/Repositories/rendezvous/node_modules/fast-check/lib/check/arbitrary/definition/Arbitrary.js:59:39)
    at /Users/fn/Repositories/rendezvous/node_modules/fast-check/lib/check/arbitrary/definition/Arbitrary.js:47:38
    at mapHelper (/Users/fn/Repositories/rendezvous/node_modules/fast-check/lib/stream/StreamHelpers.js:24:15)
    at mapHelper.next (<anonymous>)

Node.js v22.11.0
     6305.78 real      8413.73 user       118.20 sys
$ 

I can share my log file, but there doesn't appear to be anything interesting or unusual in it. It's about 4.5 MB, compressed.

@moodmosaic
Copy link
Member

/cc @hugocaillard

@BowTiedRadone
Copy link
Collaborator

@fariedt, thank you for opening this issue! We’re aware of it. This occurs in the clarinet-sdk Simnet session after a large number of transactions. @hugocaillard has already helped mitigate the issue by introducing a fix that extends the number of transactions before it happens, which is included in rv via 3229143. Still, it’s impressive how many tests it managed to handle before reaching this point!

By the way, what’s your use case for Rendezvous? Are you exploring its limits, or testing a specific real-world contract? Sharing your contract and the invariants/property tests you’ve written would greatly help us improve the tool, especially since we’re still in version 0.0.1.

@hassan-truscova
Copy link

Thanks @BowTiedRadone .
@fariedt I ran into the same issue and the fix seems to work so far. I hope it stays this way. In case i get an error later, i will update.

@hassan-truscova
Copy link

@BowTiedRadone @fariedt i think i spoke too soon. It just crashed.

I am on Ubuntu 22.04.5 LTS, node v20.16.0 and commit 3229143324294ee7e2674b272f7d9ba51d4205ba

bash rv example counter invariant --runs=100000000

       wallet_7 counter decrement 
[PASS] deployer counter invariant-counter-gt-zero 
       wallet_1 counter add 1828062306
[FAIL] wallet_5 counter invariant-counter-gt-zero 
/home/tools/rendezvous/node_modules/@hirosystems/clarinet-sdk-wasm/clarinet_sdk.js:1552
    throw new Error(getStringFromWasm0(arg0, arg1));
          ^

Error: recursive use of an object detected which would lead to unsafe aliasing in rust
    at module.exports.__wbindgen_throw (/home/tools/rendezvous/node_modules/@hirosystems/clarinet-sdk-wasm/clarinet_sdk.js:1552:11)
    at wasm://wasm/01e3cfce:wasm-function[4996]:0x4f99bb
    at wasm://wasm/01e3cfce:wasm-function[4995]:0x4f99b0
    at wasm://wasm/01e3cfce:wasm-function[4258]:0x473f5e
    at Proxy.getAccounts (/home/tools/rendezvous/node_modules/@hirosystems/clarinet-sdk-wasm/clarinet_sdk.js:542:26)
    at ChainArbitrary.chainer (/home/tools/rendezvous/invariant.js:56:105)
    at ChainArbitrary.valueChainer (/home/tools/rendezvous/node_modules/fast-check/lib/check/arbitrary/definition/Arbitrary.js:59:39)
    at /home/tools/rendezvous/node_modules/fast-check/lib/check/arbitrary/definition/Arbitrary.js:47:38
    at mapHelper (/home/tools/rendezvous/node_modules/fast-check/lib/stream/StreamHelpers.js:24:15)
    at mapHelper.next (<anonymous>)

Node.js v20.16.0

or maybe i didn't apply the patch correctly.

@BowTiedRadone
Copy link
Collaborator

Thank you for the update, @hassan-truscova! As mentioned earlier, this issue will continue occurring due to the behavior of clarinet-sdk. The commit I mentioned didn’t fully resolve it but only mitigated the impact. The commit @fariedt ran (a1467dd) also includes 3229143. For how long has Rendezvous been running on your machine?

@moodmosaic moodmosaic changed the title crash on long-running tests: recursive use of an object detected which would lead to unsafe aliasing in rust Crash on long-running Simnet sessions Jan 2, 2025
@moodmosaic
Copy link
Member

Thanks everyone for the feedback! These reports are incredibly helpful for understanding usage patterns and rv limits.

We'll monitor the upstream clarinet-sdk issue, but I must admit that the crash may not be our most pressing concern. The run counts being used (1.8M+) are much higher than what's typically practical for stateful/property-based/invariant testing.

We should probably adjust rv to:

  1. Set a more reasonable default of 50-100k runs (higher than current)
  2. Keep the option for longer runs, with stability warnings

IME, here's why such long runs aren't usually worthwhile:

  • Most critical invariant violations tend to show up in earlier runs
  • Coverage tends to plateau past ~100k runs. For example:
    • First 1000 runs: covers 60% of code
    • Next 1000 runs: gets to 75%
    • Next 1000 runs: reaches 78%
    • Next 10000 runs: maybe only gets to 80%
      At this point, you're getting very little new coverage despite running many more tests.
  • Those cycles are usually better spent on:
    • Different seeds
    • More invariant properties
    • Focused properties/invariants around suspicious behaviors

That said, thanks for pushing the limits! It helps us understand where things break. (In fact, #80, and others to come while we'll be monitoring hirosystems/clarinet#1636.)

@moodmosaic moodmosaic added the help wanted Extra attention is needed label Jan 2, 2025
@BowTiedRadone
Copy link
Collaborator

BowTiedRadone commented Jan 3, 2025

After yesterday's research, here's an update on this issue:

What’s Happening:

  • Simnet runs out of memory after a set number of blockchain operations (e.g. mineBlock, callPublicFunction, etc.), though getAccounts is not affected.
  • Once this memory limit is hit, any interaction with the Simnet session triggers a crash.
  • All Simnet interactions following the OOM are now handled within the fast-check infrastructure for both property and invariant testing types. This fix was introduced in Optimize Simnet usage accross the app #80.

Thanks to these changes, generator-level crashes that caused unhandled errors have been resolved. Our part is now complete, and we await a resolution to the upstream issue.

@moodmosaic
Copy link
Member

Now that #80 is merged, you can run as many iterations as clarinet-sdk supports. The limit is essentially determined by clarinet-sdk (tracked in hirosystems/clarinet#1636).

Given the discussion above and hirosystems/clarinet#1636, there are no actions left on our end. I’m closing this.

The issue remains searchable and can be reopened if needed. Thanks for your feedback—looking forward to more use cases! ❤️

@hugocaillard
Copy link

@fariedt @moodmosaic @BowTiedRadone
There's a 4gb memory limit on Wasm programs.
Any long running Simnet sessions (thousands of blocks with or without write transactions will eventually reach this limit (sooner if it contains many or big write transactions, which are very likely to happen in some real world examples).

In order to mitigate this issue, we could try to improve the memory usage of the Simnet (this is not a priority for us right now). A more ideal solutions would be to write the fuzzer in Rust directly.

With the current architecture, rendezvous will always suffer from this problem.
I think you should explore the possibility to have a mechanism that re-initialize the Simnet (loosing the state of the previous session and restart the Simnet from block 0).
It would be less ideal, but would allow to run the fuzzer for days on multiple simnet sessions)

@moodmosaic
Copy link
Member

A more ideal solutions would be to write the fuzzer in Rust directly.

This presents at least three significant challenges, despite my appeal from Haskell/Hedgehog background and the initial exploration with Clarity VM in #19:

  1. Rewriting would require rebuilding Clarinet's deployment planning system or finding a compatible alternative. The current system handles contract references across environments (mainnet, testnet) since it works with Clarinet.toml
  2. While proptest exists in Rust, the current approach using fast-check has proven reliability through actual use in ([pox-4] Pool Stacking stateful property-based tests stacks-core#4550) and validation in
  3. The existing solution supports both invariant-based (inside-out) and PoX-4 style (outside-in) fuzzing approaches (e.g. Add printed events to the context for assertions #59 can be resolved outside-in).

The sensible path forward is maintaining the current implementation. Rust may sound technically superior but such rewrite costs outweigh the benefits, especially given the existing battle-tested tooling. That said, a Rust implementation might be beneficial for other scenarios, such as providing built-in fuzzing support directly from clarinet as discussed in hirosystems/clarinet#398 (comment).

re-initialize the Simnet (loosing the state of the previous session and restart the Simnet from block 0)

Yes 👍 In reference to running 1.8M+ tests, as noted in #78 (comment), it's more effective to split the runs into multiple sessions with different seeds.

@hassan-truscova
Copy link

@moodmosaic @BowTiedRadone @hugocaillard Thanks a lot for actively solving the issue, i really appreciate it.

As for what @moodmosaic said,

Thanks everyone for the feedback! These reports are incredibly helpful for understanding usage patterns and rv limits.

We'll monitor the upstream clarinet-sdk issue, but I must admit that the crash may not be our most pressing concern. The run counts being used (1.8M+) are much higher than what's typically practical for stateful/property-based/invariant testing.

We should probably adjust rv to:

1. Set a more reasonable default of 50-100k runs (higher than current)

2. Keep the option for longer runs, with stability warnings

IME, here's why such long runs aren't usually worthwhile:

* Most critical invariant violations tend to show up in earlier runs

* Coverage tends to plateau past ~100k runs. For example:
  
  * First 1000 runs: covers 60% of code
  * Next 1000 runs: gets to 75%
  * Next 1000 runs: reaches 78%
  * Next 10000 runs: maybe only gets to 80%
    At this point, you're getting very little new coverage despite running many more tests.

* Those cycles are usually better spent on:
  
  * Different seeds
  * More invariant properties
  * Focused properties/invariants around suspicious behaviors

That said, thanks for pushing the limits! It helps us understand where things break. (In fact, #80, and others to come while we'll be monitoring hirosystems/clarinet#1636.)

I agree that the additional cycles are better spent on advanced techniques or exploration rather than trying to get lucky in breaking the invariant. In some cases, the fuzzer might not be able to break it, hence, it doesn't make sense to waste compute. Since you mentioned coverage, i assume Rendezvous will also have coverage guidance at some point, right?

Anyway, i will pull the changes and start the example again. In case anything breaks, i will report. Thanks again :)

@moodmosaic
Copy link
Member

also have coverage guidance at some point

@hassan-truscova, that'd be great. I'm not fully sure if that's a Clarinet feature or something from the Clarinet SDK.

@hugocaillard, do we have coverage support for (Clarity) code in the Clarinet SDK? That link throws a 404.

@hugocaillard
Copy link

@moodmosaic Yes you can get the coverage

@hugocaillard
Copy link

yes it's possible for you to get the the coverage. It's not well documented since it's internal stuff (not meant to be used by end users) but it's definitely doable

@moodmosaic
Copy link
Member

@hugocaillard any example usage?

@hugocaillard
Copy link

@moodmosaic
The reports (code coverage and costs) are collected here

And the lcov file is written when the test is ended, it's happening in the vitest-environement-clarinet package, here

You can get it to work in rendezvous as well with the collectReport function simnet.collectReport(false, "");. Make sure to init the simnet with the options to enable coverage initSimnet(manifestPath, false, { trackCosts: false, trackCoverage: true })

@mali-tintash
Copy link

@moodmosaic @BowTiedRadone @hugocaillard Thanks a lot for actively solving the issue, i really appreciate it.

As for what @moodmosaic said,

Thanks everyone for the feedback! These reports are incredibly helpful for understanding usage patterns and rv limits.
We'll monitor the upstream clarinet-sdk issue, but I must admit that the crash may not be our most pressing concern. The run counts being used (1.8M+) are much higher than what's typically practical for stateful/property-based/invariant testing.
We should probably adjust rv to:

1. Set a more reasonable default of 50-100k runs (higher than current)

2. Keep the option for longer runs, with stability warnings

IME, here's why such long runs aren't usually worthwhile:

* Most critical invariant violations tend to show up in earlier runs

* Coverage tends to plateau past ~100k runs. For example:
  
  * First 1000 runs: covers 60% of code
  * Next 1000 runs: gets to 75%
  * Next 1000 runs: reaches 78%
  * Next 10000 runs: maybe only gets to 80%
    At this point, you're getting very little new coverage despite running many more tests.

* Those cycles are usually better spent on:
  
  * Different seeds
  * More invariant properties
  * Focused properties/invariants around suspicious behaviors

That said, thanks for pushing the limits! It helps us understand where things break. (In fact, #80, and others to come while we'll be monitoring hirosystems/clarinet#1636.)

I agree that the additional cycles are better spent on advanced techniques or exploration rather than trying to get lucky in breaking the invariant. In some cases, the fuzzer might not be able to break it, hence, it doesn't make sense to waste compute. Since you mentioned coverage, i assume Rendezvous will also have coverage guidance at some point, right?

Anyway, i will pull the changes and start the example again. In case anything breaks, i will report. Thanks again :)

I think beyond a certain point you'll end up needing formal verification.

@moodmosaic
Copy link
Member

@mali-tintash: perhaps we could apply formal verification in Clarity through property/invariant checking. Are there existing tools for this? @jcnelson @obycode might find this relevant alongside the static analysis work for stacks-core#5360.

@hassan-truscova
Copy link

hassan-truscova commented Jan 10, 2025

@moodmosaic @hugocaillard apologies for the delayed update. I understand the previous comments but just want to post an update. It seems, it is still crashing. The time it took to crash is also written at the end.

₿   735215 Ӿ  3670011   wallet_2 [PASS] counter invariant-counter-gt-zero 
₿   735215 Ӿ  3670013   wallet_1        counter add 1606970158
₿   735215 Ӿ  3670013   wallet_7 [PASS] counter invariant-counter-gt-zero 
₿   735215 Ӿ  3670015   wallet_2        counter decrement 
₿   735215 Ӿ  3670015   wallet_4 [PASS] counter invariant-counter-gt-zero 

Error: Property failed after 1468880 tests.
Seed : 114669316

Counterexample:
- Contract : counter
- Function : add (public)
- Arguments: [0]
- Caller   : wallet_6
- Outputs  : {"type":{"response":{"ok":"bool","error":"uint128"}}}
- Invariant: invariant-counter-gt-zero (read_only)
- Arguments: []
- Caller   : wallet_6

What happened? Rendezvous went on a rampage and found a weak spot:

The invariant "invariant-counter-gt-zero" returned:

    Error: recursive use of an object detected which would lead to unsafe aliasing in rust
        at module.exports.__wbindgen_throw (/home/tools/rendezvous/node_modules/@hirosystems/clarinet-sdk-wasm/clarinet_sdk.js:1575:11)
        at wasm://wasm/01dccebe:wasm-function[5035]:0x4f8024
        at wasm://wasm/01dccebe:wasm-function[5034]:0x4f8019
        at wasm://wasm/01dccebe:wasm-function[5344]:0x50b9f3
        at wasm://wasm/01dccebe:wasm-function[3013]:0x42d041
        at get burnBlockHeight (/home/tools/rendezvous/node_modules/@hirosystems/clarinet-sdk-wasm/clarinet_sdk.js:466:26)
        at Reflect.get (<anonymous>)
        at Object.get (/home/tools/rendezvous/node_modules/@hirosystems/clarinet-sdk/dist/cjs/node/src/sdkProxy.js:93:28)
        at /home/tools/rendezvous/dist/invariant.js:137:71
        at Property.predicate (/home/tools/rendezvous/node_modules/fast-check/lib/check/property/Property.js:17:86)


real	54m2,787s
user	103m40,820s
sys	0m40,506s

I am on fe90575a6bed3a6441306ca41567eaa1528c2ddb.
It is just an update as i wrote in my comment that i will run it on my side and give an update.

@hassan-truscova
Copy link

hassan-truscova commented Jan 10, 2025

@mali-tintash i don't understand the context. secondly, as pointed out by @moodmosaic here, are there any existing tools for formal verification in Stacks ecosystem or for Clarity? i was checking online recently and also had a short informative discussion with @moodmosaic , it seems apparently there are no formal verification tools for Clarity/Stacks chain. I started a discussion thread recently to ask for formal verification tools for Clarity. so far there are no responses, or maybe soon someone will respond as holidays come to an end. I won't clutter this thread with formal verification discussion. If you know some tools or want to talk about it, please join the discussion thread.

@moodmosaic
Copy link
Member

@hassan-truscova, re: #78 (comment)

Yes, that's correct. If someone wants to run rendezvous 24/7—which shouldn't really—they'll hit the 4 GB cap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

6 participants