SIGSEGV with 1.26.2 and cache_mode=true #4554

Open
odoucet opened this issue Feb 4, 2025 · 4 comments · May be fixed by #4564

odoucet commented Feb 4, 2025

Opening a new issue because #4497 is closed.

That issue does not appear to be fully fixed.

Here is the output from Dragonfly 1.26.2 launched with:
--dbfilename=dump --cache_mode=true --vmodule=streamer=1,dflycmd=1,replica=1

It affects the replica instance.
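
For reference, a minimal sketch of a launch matching these flags (the Docker image name and the --dir and --admin_port values are my assumptions, inferred from the log below, not copied from our manifest):

    # Hypothetical reproduction sketch; image name, --dir and --admin_port
    # are assumptions inferred from the log lines below.
    docker run --rm -p 6379:6379 -p 9999:9999 \
      docker.dragonflydb.io/dragonflydb/dragonfly \
      --dbfilename=dump --cache_mode=true \
      --vmodule=streamer=1,dflycmd=1,replica=1 \
      --dir=/dragonfly/snapshots --admin_port=9999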

I20250204 08:06:42.981666     1 dfly_main.cc:691] Starting dragonfly df-v1.26.2-096fde172300de91850c42dab24aa09ffee254d0
I20250204 08:06:42.981844     1 dfly_main.cc:735] maxmemory has not been specified. Deciding myself....
I20250204 08:06:42.981854     1 dfly_main.cc:744] Found 2.00GiB available memory. Setting maxmemory to 1.60GiB
W20250204 08:06:42.981900     1 dfly_main.cc:368] Weird error 1 switching to epoll
I20250204 08:06:43.059904     1 proactor_pool.cc:147] Running 2 io threads
I20250204 08:06:43.062503     1 dfly_main.cc:272] Listening on admin socket any:9999
I20250204 08:06:43.063760     1 server_family.cc:835] Host OS: Linux 5.14.0-503.22.1.el9_5.x86_64 x86_64 with 2 threads
I20250204 08:06:43.068403     1 snapshot_storage.cc:185] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
I20250204 08:06:43.073527     1 server_family.cc:1114] Loading /dragonfly/snapshots/dump-summary.dfs
I20250204 08:06:43.087914     6 listener_interface.cc:101] sock[8] AcceptServer - listening on port 6379
I20250204 08:06:43.089705     7 listener_interface.cc:101] sock[7] AcceptServer - listening on port 9999
I20250204 08:06:43.089743     7 listener_interface.cc:101] sock[9] AcceptServer - listening on port 11211
I20250204 08:06:44.393589     6 server_family.cc:1154] Load finished, num keys read: 572201
I20250204 08:06:56.683905     7 server_family.cc:2769] Replicating 10.233.137.19:9999
F20250204 08:06:56.685020     7 db_slice.cc:784] Check failed: fetched_items_.empty() 
*** Check failure stack trace: ***
    @     0x558c1fd18923  google::LogMessage::SendToLog()
    @     0x558c1fd110e7  google::LogMessage::Flush()
    @     0x558c1fd12a6f  google::LogMessageFatal::~LogMessageFatal()
    @     0x558c1f550aef  dfly::DbSlice::FlushDbIndexes()
    @     0x558c1f550c82  dfly::DbSlice::FlushDb()
    @     0x558c1f2d02ba  _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12ServerFamily8DrakarysEPNS3_11TransactionEtEUlS6_PNS3_11EngineShardEE_NS5_14RunnableResultEJS6_S8_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x558c1f590bf9  dfly::Transaction::RunCallback()
    @     0x558c1f5939db  dfly::Transaction::RunInShard()
    @     0x558c1f4d1b60  dfly::EngineShard::PollExecution()
    @     0x558c1f58c7e1  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction11DispatchHopEvEUlvE1_E9_M_invokeERKSt9_Any_data
    @     0x558c1fafe395  util::fb2::FiberQueue::Run()
    @     0x558c1f5deb70  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x558c1fb1d69f  make_fcontext
*** SIGABRT received at time=1738656416 on cpu 1 ***
PC: @     0x7ff0adabe9fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7ff0ada50898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1738656416 on cpu 1 ***
PC: @     0x7ff0ada50898  (unknown)  abort

The pod crashes "sometimes". In this case it crashed once and is now running correctly:

I20250204 08:07:18.717423     1 server_family.cc:835] Host OS: Linux 5.14.0-503.22.1.el9_5.x86_64 x86_64 with 2 threads
I20250204 08:07:18.721555     1 snapshot_storage.cc:185] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
I20250204 08:07:18.724478     1 server_family.cc:1114] Loading /dragonfly/snapshots/dump-summary.dfs
I20250204 08:07:18.733318     6 listener_interface.cc:101] sock[8] AcceptServer - listening on port 6379
I20250204 08:07:18.733692     7 listener_interface.cc:101] sock[7] AcceptServer - listening on port 9999
I20250204 08:07:18.733723     7 listener_interface.cc:101] sock[9] AcceptServer - listening on port 11211
I20250204 08:07:20.000798     6 server_family.cc:1154] Load finished, num keys read: 572201
I20250204 08:07:36.684935     7 server_family.cc:2769] Replicating 10.233.137.19:9999
I20250204 08:07:36.743476     7 replica.cc:94] Starting replication
I20250204 08:07:36.743537     7 replica.cc:116] Resolving master DNS
I20250204 08:07:36.743656     7 replica.cc:121] Connecting to master
I20250204 08:07:36.744189     7 replica.cc:126] Greeting
I20250204 08:07:36.744204     7 replica.cc:281] greeting message handling
I20250204 08:07:36.745363     7 replica.cc:360] Master id: 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786, sync id: SYNC1, num journals: 2, version: 3
I20250204 08:07:36.745939     7 replica.cc:193] Main replication fiber started
I20250204 08:07:36.746312     7 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC1 1
I20250204 08:07:36.746456     6 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC1 0
I20250204 08:07:36.747501     7 replica.cc:724] Sending: DFLY SYNC SYNC1
I20250204 08:07:36.750665     7 replica.cc:569] Started full sync with 10.233.137.19:9999
I20250204 08:07:36.750690     7 replica.cc:573] Waiting for all full sync cut confirmations
I20250204 08:07:36.803468     7 server_family.cc:2769] Replicating 10.233.137.19:9999
I20250204 08:07:36.803522     7 replica.cc:148] Stopping replication
I20250204 08:07:36.809708     6 replica.cc:843] FullSyncDflyFb finished after reading 3067928 bytes
I20250204 08:07:36.815779     7 replica.cc:843] FullSyncDflyFb finished after reading 3485050 bytes
W20250204 08:07:36.815847     7 replica.cc:246] Error syncing with 10.233.137.19:9999 generic:125 Operation canceled
I20250204 08:07:36.815882     7 replica.cc:276] Main replication fiber finished
I20250204 08:07:36.825311     7 replica.cc:94] Starting replication
I20250204 08:07:36.825345     7 replica.cc:116] Resolving master DNS
I20250204 08:07:36.825459     7 replica.cc:121] Connecting to master
I20250204 08:07:36.825807     7 replica.cc:126] Greeting
I20250204 08:07:36.825820     7 replica.cc:281] greeting message handling
I20250204 08:07:36.826691     7 replica.cc:360] Master id: 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786, sync id: SYNC2, num journals: 2, version: 3
I20250204 08:07:36.827033     7 replica.cc:193] Main replication fiber started
I20250204 08:07:36.827302     7 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC2 1
I20250204 08:07:36.827404     6 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC2 0
I20250204 08:07:36.828095     7 replica.cc:724] Sending: DFLY SYNC SYNC2
I20250204 08:07:36.830502     7 replica.cc:569] Started full sync with 10.233.137.19:9999
I20250204 08:07:36.830538     7 replica.cc:573] Waiting for all full sync cut confirmations
I20250204 08:07:37.742475     7 replica.cc:724] Sending: DFLY STARTSTABLE SYNC2
I20250204 08:07:37.743093     7 replica.cc:843] FullSyncDflyFb finished after reading 75630512 bytes
I20250204 08:07:37.743252     6 replica.cc:843] FullSyncDflyFb finished after reading 56034881 bytes
I20250204 08:07:37.744261     7 replica.cc:589] full sync finished in 917 ms
I20250204 08:07:37.744330     7 replica.cc:679] Transitioned into stable sync
I20250204 08:07:37.745183     7 replica.cc:921] Sending an ACK with offset=17 forced=0
I20250204 08:07:37.745688     6 replica.cc:921] Sending an ACK with offset=18 forced=0
I20250204 08:07:38.745687     7 replica.cc:921] Sending an ACK with offset=64 forced=0
I20250204 08:07:38.745963     6 replica.cc:921] Sending an ACK with offset=36 forced=0
I20250204 08:07:39.745767     7 replica.cc:921] Sending an ACK with offset=139 forced=0
I20250204 08:07:39.746300     6 replica.cc:921] Sending an ACK with offset=104 forced=0
I20250204 08:07:40.746562     7 replica.cc:921] Sending an ACK with offset=209 forced=0
I20250204 08:07:40.746655     6 replica.cc:921] Sending an ACK with offset=172 forced=0
I20250204 08:07:41.245782     7 server_family.cc:2769] Replicating 10.233.137.19:9999
I20250204 08:07:41.245841     7 replica.cc:148] Stopping replication
I20250204 08:07:41.246042     7 replica.cc:703] Exit stable sync
I20250204 08:07:41.246091     7 replica.cc:276] Main replication fiber finished
I20250204 08:07:41.318176     7 replica.cc:94] Starting replication
I20250204 08:07:41.318238     7 replica.cc:116] Resolving master DNS
I20250204 08:07:41.318380     7 replica.cc:121] Connecting to master
I20250204 08:07:41.318848     7 replica.cc:126] Greeting
I20250204 08:07:41.318864     7 replica.cc:281] greeting message handling
I20250204 08:07:41.320183     7 replica.cc:360] Master id: 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786, sync id: SYNC3, num journals: 2, version: 3
I20250204 08:07:41.321235     7 replica.cc:193] Main replication fiber started
I20250204 08:07:41.321892     7 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC3 1
I20250204 08:07:41.322031     6 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC3 0
I20250204 08:07:41.323082     7 replica.cc:724] Sending: DFLY SYNC SYNC3
I20250204 08:07:41.326880     7 replica.cc:569] Started full sync with 10.233.137.19:9999
I20250204 08:07:41.326922     7 replica.cc:573] Waiting for all full sync cut confirmations
I20250204 08:07:42.240965     7 replica.cc:724] Sending: DFLY STARTSTABLE SYNC3
I20250204 08:07:42.241829     7 replica.cc:843] FullSyncDflyFb finished after reading 75717753 bytes
I20250204 08:07:42.242180     6 replica.cc:843] FullSyncDflyFb finished after reading 55223393 bytes
I20250204 08:07:42.243268     7 replica.cc:589] full sync finished in 921 ms
I20250204 08:07:42.243342     7 replica.cc:679] Transitioned into stable sync
I20250204 08:07:42.244382     7 replica.cc:921] Sending an ACK with offset=256 forced=0
I20250204 08:07:42.245657     6 replica.cc:921] Sending an ACK with offset=203 forced=0
I20250204 08:07:43.245344     7 replica.cc:921] Sending an ACK with offset=281 forced=0
I20250204 08:07:43.246227     6 replica.cc:921] Sending an ACK with offset=222 forced=0
I20250204 08:07:44.245911     7 replica.cc:921] Sending an ACK with offset=383 forced=0
I20250204 08:07:44.246526     6 replica.cc:921] Sending an ACK with offset=299 forced=0
[ last message repeating ]
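
For context, each "Replicating 10.233.137.19:9999" line above corresponds to a REPLICAOF command arriving at the replica; a minimal sketch of issuing one (the use of redis-cli and the host placeholder are assumptions, the master address comes from the log):

    # Re-point the replica at the master; each invocation restarts the sync,
    # matching the repeated "Starting replication" / "Stopping replication"
    # lines above.
    redis-cli -h <replica-host> -p 6379 REPLICAOF 10.233.137.19 9999
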
odoucet added the bug label Feb 4, 2025

chakaz (Collaborator) commented Feb 4, 2025

Hey @odoucet, thanks for reporting this issue. I'm sorry that the previous fix does not cover this case :(

A few follow-up questions to help us focus:

  • How frequently does it crash?
  • Does it only crash during the initial replication process (following REPLICAOF), or can it also crash at arbitrary/random points in time?
  • Can you please confirm that if you run the replica without --cache_mode=true it never crashes?
  • Same question as above, but running with --dbfilename= (i.e. so that it won't try to save/load RDB snapshots before it starts replicating)? See the sketch after this list for both runs.
  • Would you be able to share the RDB file with us, privately if needed, so we can examine it?

Thanks!
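
A sketch of the two isolation runs suggested above (the bare dragonfly invocation is an assumption; keep every other flag from your original launch):

    # Run A: same flags, but cache mode off (third question).
    dragonfly --dbfilename=dump --cache_mode=false \
      --vmodule=streamer=1,dflycmd=1,replica=1

    # Run B: cache mode kept, snapshot load/save disabled (fourth question).
    dragonfly --dbfilename= --cache_mode=true \
      --vmodule=streamer=1,dflycmd=1,replica=1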

odoucet commented Feb 4, 2025

> How frequently does it crash?

Once every few hours, but 100% of the time after the first restart.

> Does it only crash during the initial replication process (following REPLICAOF), or can it also crash at arbitrary/random points in time?

My first answer covers this, I guess: the master has a high uptime, while the replica fails at every restart, plus once in a while.

> Can you please confirm that if you run the replica without --cache_mode=true it never crashes?

Cannot confirm yet, will try later.

> Same question as above, but running with --dbfilename= (i.e. so that it won't try to save/load RDB snapshots before it starts replicating)?

Cannot confirm yet, will try later.

> Would you be able to share the RDB file with us, privately if needed, so we can examine it?

I would need to reproduce this in our lab, but it's definitely doable; I'll get back to you when I know more.

kostasrim self-assigned this Feb 5, 2025

odoucet commented Feb 10, 2025

Sorry for the late reply.
With 1.26.2, using --dbfilename= makes it work correctly: the replica is working and no crash occurs at startup.
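
To confirm the replica stays healthy with this workaround, a quick check (assuming the standard redis-cli client; the host placeholder is mine):

    # The replica should report role:replica and a connected master.
    redis-cli -h <replica-host> -p 6379 INFO replication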

kostasrim (Contributor) commented

@odoucet no worries. We still had a small issue, which I patched. You won't need --dbfilename="" once the fixes are merged :)
