SIGSEGV with 1.26.2 and cache_mode=true #4554

Open
odoucet opened this issue Feb 4, 2025 · 4 comments · May be fixed by #4564

odoucet commented Feb 4, 2025

Opening a new issue because #4497 is closed.

That issue does not appear to be fully fixed.

Here is the output from Dragonfly 1.26.2 launched with:
--dbfilename=dump --cache_mode=true --vmodule=streamer=1,dflycmd=1,replica=1

It affects the replica instance.
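
For reference, a minimal sketch of a launch matching these flags (the Docker image name and the --dir and --admin_port values are my assumptions, inferred from the log below, not copied from our manifest):

    # Hypothetical reproduction sketch; image name, --dir and --admin_port
    # are assumptions inferred from the log lines below.
    docker run --rm -p 6379:6379 -p 9999:9999 \
      docker.dragonflydb.io/dragonflydb/dragonfly \
      --dbfilename=dump --cache_mode=true \
      --vmodule=streamer=1,dflycmd=1,replica=1 \
      --dir=/dragonfly/snapshots --admin_port=9999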

I20250204 08:06:42.981666     1 dfly_main.cc:691] Starting dragonfly df-v1.26.2-096fde172300de91850c42dab24aa09ffee254d0
I20250204 08:06:42.981844     1 dfly_main.cc:735] maxmemory has not been specified. Deciding myself....
I20250204 08:06:42.981854     1 dfly_main.cc:744] Found 2.00GiB available memory. Setting maxmemory to 1.60GiB
W20250204 08:06:42.981900     1 dfly_main.cc:368] Weird error 1 switching to epoll
I20250204 08:06:43.059904     1 proactor_pool.cc:147] Running 2 io threads
I20250204 08:06:43.062503     1 dfly_main.cc:272] Listening on admin socket any:9999
I20250204 08:06:43.063760     1 server_family.cc:835] Host OS: Linux 5.14.0-503.22.1.el9_5.x86_64 x86_64 with 2 threads
I20250204 08:06:43.068403     1 snapshot_storage.cc:185] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
I20250204 08:06:43.073527     1 server_family.cc:1114] Loading /dragonfly/snapshots/dump-summary.dfs
I20250204 08:06:43.087914     6 listener_interface.cc:101] sock[8] AcceptServer - listening on port 6379
I20250204 08:06:43.089705     7 listener_interface.cc:101] sock[7] AcceptServer - listening on port 9999
I20250204 08:06:43.089743     7 listener_interface.cc:101] sock[9] AcceptServer - listening on port 11211
I20250204 08:06:44.393589     6 server_family.cc:1154] Load finished, num keys read: 572201
I20250204 08:06:56.683905     7 server_family.cc:2769] Replicating 10.233.137.19:9999
F20250204 08:06:56.685020     7 db_slice.cc:784] Check failed: fetched_items_.empty() 
*** Check failure stack trace: ***
    @     0x558c1fd18923  google::LogMessage::SendToLog()
    @     0x558c1fd110e7  google::LogMessage::Flush()
    @     0x558c1fd12a6f  google::LogMessageFatal::~LogMessageFatal()
    @     0x558c1f550aef  dfly::DbSlice::FlushDbIndexes()
    @     0x558c1f550c82  dfly::DbSlice::FlushDb()
    @     0x558c1f2d02ba  _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12ServerFamily8DrakarysEPNS3_11TransactionEtEUlS6_PNS3_11EngineShardEE_NS5_14RunnableResultEJS6_S8_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
    @     0x558c1f590bf9  dfly::Transaction::RunCallback()
    @     0x558c1f5939db  dfly::Transaction::RunInShard()
    @     0x558c1f4d1b60  dfly::EngineShard::PollExecution()
    @     0x558c1f58c7e1  _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction11DispatchHopEvEUlvE1_E9_M_invokeERKSt9_Any_data
    @     0x558c1fafe395  util::fb2::FiberQueue::Run()
    @     0x558c1f5deb70  _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
    @     0x558c1fb1d69f  make_fcontext
*** SIGABRT received at time=1738656416 on cpu 1 ***
PC: @     0x7ff0adabe9fc  (unknown)  pthread_kill
[failure_signal_handler.cc : 345] RAW: Signal 11 raised at PC=0x7ff0ada50898 while already in AbslFailureSignalHandler()
*** SIGSEGV received at time=1738656416 on cpu 1 ***
PC: @     0x7ff0ada50898  (unknown)  abort

The pod crashes "sometimes". In this case it crashed once and is now running correctly:

I20250204 08:07:18.717423     1 server_family.cc:835] Host OS: Linux 5.14.0-503.22.1.el9_5.x86_64 x86_64 with 2 threads
I20250204 08:07:18.721555     1 snapshot_storage.cc:185] Load snapshot: Searching for snapshot in directory: "/dragonfly/snapshots"
I20250204 08:07:18.724478     1 server_family.cc:1114] Loading /dragonfly/snapshots/dump-summary.dfs
I20250204 08:07:18.733318     6 listener_interface.cc:101] sock[8] AcceptServer - listening on port 6379
I20250204 08:07:18.733692     7 listener_interface.cc:101] sock[7] AcceptServer - listening on port 9999
I20250204 08:07:18.733723     7 listener_interface.cc:101] sock[9] AcceptServer - listening on port 11211
I20250204 08:07:20.000798     6 server_family.cc:1154] Load finished, num keys read: 572201
I20250204 08:07:36.684935     7 server_family.cc:2769] Replicating 10.233.137.19:9999
I20250204 08:07:36.743476     7 replica.cc:94] Starting replication
I20250204 08:07:36.743537     7 replica.cc:116] Resolving master DNS
I20250204 08:07:36.743656     7 replica.cc:121] Connecting to master
I20250204 08:07:36.744189     7 replica.cc:126] Greeting
I20250204 08:07:36.744204     7 replica.cc:281] greeting message handling
I20250204 08:07:36.745363     7 replica.cc:360] Master id: 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786, sync id: SYNC1, num journals: 2, version: 3
I20250204 08:07:36.745939     7 replica.cc:193] Main replication fiber started
I20250204 08:07:36.746312     7 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC1 1
I20250204 08:07:36.746456     6 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC1 0
I20250204 08:07:36.747501     7 replica.cc:724] Sending: DFLY SYNC SYNC1
I20250204 08:07:36.750665     7 replica.cc:569] Started full sync with 10.233.137.19:9999
I20250204 08:07:36.750690     7 replica.cc:573] Waiting for all full sync cut confirmations
I20250204 08:07:36.803468     7 server_family.cc:2769] Replicating 10.233.137.19:9999
I20250204 08:07:36.803522     7 replica.cc:148] Stopping replication
I20250204 08:07:36.809708     6 replica.cc:843] FullSyncDflyFb finished after reading 3067928 bytes
I20250204 08:07:36.815779     7 replica.cc:843] FullSyncDflyFb finished after reading 3485050 bytes
W20250204 08:07:36.815847     7 replica.cc:246] Error syncing with 10.233.137.19:9999 generic:125 Operation canceled
I20250204 08:07:36.815882     7 replica.cc:276] Main replication fiber finished
I20250204 08:07:36.825311     7 replica.cc:94] Starting replication
I20250204 08:07:36.825345     7 replica.cc:116] Resolving master DNS
I20250204 08:07:36.825459     7 replica.cc:121] Connecting to master
I20250204 08:07:36.825807     7 replica.cc:126] Greeting
I20250204 08:07:36.825820     7 replica.cc:281] greeting message handling
I20250204 08:07:36.826691     7 replica.cc:360] Master id: 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786, sync id: SYNC2, num journals: 2, version: 3
I20250204 08:07:36.827033     7 replica.cc:193] Main replication fiber started
I20250204 08:07:36.827302     7 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC2 1
I20250204 08:07:36.827404     6 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC2 0
I20250204 08:07:36.828095     7 replica.cc:724] Sending: DFLY SYNC SYNC2
I20250204 08:07:36.830502     7 replica.cc:569] Started full sync with 10.233.137.19:9999
I20250204 08:07:36.830538     7 replica.cc:573] Waiting for all full sync cut confirmations
I20250204 08:07:37.742475     7 replica.cc:724] Sending: DFLY STARTSTABLE SYNC2
I20250204 08:07:37.743093     7 replica.cc:843] FullSyncDflyFb finished after reading 75630512 bytes
I20250204 08:07:37.743252     6 replica.cc:843] FullSyncDflyFb finished after reading 56034881 bytes
I20250204 08:07:37.744261     7 replica.cc:589] full sync finished in 917 ms
I20250204 08:07:37.744330     7 replica.cc:679] Transitioned into stable sync
I20250204 08:07:37.745183     7 replica.cc:921] Sending an ACK with offset=17 forced=0
I20250204 08:07:37.745688     6 replica.cc:921] Sending an ACK with offset=18 forced=0
I20250204 08:07:38.745687     7 replica.cc:921] Sending an ACK with offset=64 forced=0
I20250204 08:07:38.745963     6 replica.cc:921] Sending an ACK with offset=36 forced=0
I20250204 08:07:39.745767     7 replica.cc:921] Sending an ACK with offset=139 forced=0
I20250204 08:07:39.746300     6 replica.cc:921] Sending an ACK with offset=104 forced=0
I20250204 08:07:40.746562     7 replica.cc:921] Sending an ACK with offset=209 forced=0
I20250204 08:07:40.746655     6 replica.cc:921] Sending an ACK with offset=172 forced=0
I20250204 08:07:41.245782     7 server_family.cc:2769] Replicating 10.233.137.19:9999
I20250204 08:07:41.245841     7 replica.cc:148] Stopping replication
I20250204 08:07:41.246042     7 replica.cc:703] Exit stable sync
I20250204 08:07:41.246091     7 replica.cc:276] Main replication fiber finished
I20250204 08:07:41.318176     7 replica.cc:94] Starting replication
I20250204 08:07:41.318238     7 replica.cc:116] Resolving master DNS
I20250204 08:07:41.318380     7 replica.cc:121] Connecting to master
I20250204 08:07:41.318848     7 replica.cc:126] Greeting
I20250204 08:07:41.318864     7 replica.cc:281] greeting message handling
I20250204 08:07:41.320183     7 replica.cc:360] Master id: 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786, sync id: SYNC3, num journals: 2, version: 3
I20250204 08:07:41.321235     7 replica.cc:193] Main replication fiber started
I20250204 08:07:41.321892     7 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC3 1
I20250204 08:07:41.322031     6 replica.cc:741] Sending on flow 9d33ab3e04153c86a6b2f8f4825a4a7e03b92786 SYNC3 0
I20250204 08:07:41.323082     7 replica.cc:724] Sending: DFLY SYNC SYNC3
I20250204 08:07:41.326880     7 replica.cc:569] Started full sync with 10.233.137.19:9999
I20250204 08:07:41.326922     7 replica.cc:573] Waiting for all full sync cut confirmations
I20250204 08:07:42.240965     7 replica.cc:724] Sending: DFLY STARTSTABLE SYNC3
I20250204 08:07:42.241829     7 replica.cc:843] FullSyncDflyFb finished after reading 75717753 bytes
I20250204 08:07:42.242180     6 replica.cc:843] FullSyncDflyFb finished after reading 55223393 bytes
I20250204 08:07:42.243268     7 replica.cc:589] full sync finished in 921 ms
I20250204 08:07:42.243342     7 replica.cc:679] Transitioned into stable sync
I20250204 08:07:42.244382     7 replica.cc:921] Sending an ACK with offset=256 forced=0
I20250204 08:07:42.245657     6 replica.cc:921] Sending an ACK with offset=203 forced=0
I20250204 08:07:43.245344     7 replica.cc:921] Sending an ACK with offset=281 forced=0
I20250204 08:07:43.246227     6 replica.cc:921] Sending an ACK with offset=222 forced=0
I20250204 08:07:44.245911     7 replica.cc:921] Sending an ACK with offset=383 forced=0
I20250204 08:07:44.246526     6 replica.cc:921] Sending an ACK with offset=299 forced=0
[ last message repeating ]
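
For context, each "Replicating 10.233.137.19:9999" line above corresponds to a REPLICAOF command arriving at the replica; a minimal sketch of issuing one (the use of redis-cli and the host placeholder are assumptions, the master address comes from the log):

    # Re-point the replica at the master; each invocation restarts the sync,
    # matching the repeated "Starting replication" / "Stopping replication"
    # lines above.
    redis-cli -h <replica-host> -p 6379 REPLICAOF 10.233.137.19 9999
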
odoucet added the bug label Feb 4, 2025

chakaz (Collaborator) commented Feb 4, 2025

Hey @odoucet, thanks for reporting this issue. I'm sorry that the previous fix does not cover this case :(

A few follow-up questions to help us focus:

  • How frequently does it crash?
  • Does it only crash during the initial replication process (following REPLICAOF), or can it also crash at arbitrary/random points in time?
  • Can you please confirm that if you run the replica without --cache_mode=true it never crashes?
  • Same question as above, but running with --dbfilename= (i.e. so that it won't try to save/load RDB snapshots before it starts replicating)? See the sketch after this list for both runs.
  • Would you be able to share the RDB file with us, privately if needed, so we can examine it?

Thanks!
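
A sketch of the two isolation runs suggested above (the bare dragonfly invocation is an assumption; keep every other flag from your original launch):

    # Run A: same flags, but cache mode off (third question).
    dragonfly --dbfilename=dump --cache_mode=false \
      --vmodule=streamer=1,dflycmd=1,replica=1

    # Run B: cache mode kept, snapshot load/save disabled (fourth question).
    dragonfly --dbfilename= --cache_mode=true \
      --vmodule=streamer=1,dflycmd=1,replica=1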

odoucet commented Feb 4, 2025

> How frequently does it crash?

Once every few hours, but 100% of the time after the first restart.

> Does it only crash during the initial replication process (following REPLICAOF), or can it also crash at arbitrary/random points in time?

My first answer covers this, I guess: the master has a high uptime, while the replica fails at every restart, plus once in a while.

> Can you please confirm that if you run the replica without --cache_mode=true it never crashes?

Cannot confirm yet, will try later.

> Same question as above, but running with --dbfilename= (i.e. so that it won't try to save/load RDB snapshots before it starts replicating)?

Cannot confirm yet, will try later.

> Would you be able to share the RDB file with us, privately if needed, so we can examine it?

I would need to reproduce this in our lab, but it's definitely doable; I'll get back to you when I know more.

kostasrim self-assigned this Feb 5, 2025

odoucet commented Feb 10, 2025

Sorry for the late reply.
With 1.26.2, using --dbfilename= makes it work correctly: the replica is working and no crash occurs at startup.
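
To confirm the replica stays healthy with this workaround, a quick check (assuming the standard redis-cli client; the host placeholder is mine):

    # The replica should report role:replica and a connected master.
    redis-cli -h <replica-host> -p 6379 INFO replication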

kostasrim (Contributor) commented

@odoucet no worries. We still had a small issue, which I patched. You won't need --dbfilename="" once the fixes are merged :)
