
Log says 'SYNCED' but wsrep_cluster_state = 1 (JOINING) #143

Open

freddielunchbird opened this issue Sep 24, 2014 · 28 comments
@freddielunchbird

Hi,

I have a two node cluster:
Ubuntu 14.04
PXC 5.6.20-68.0 , galera 3.7(r7f44a18)
xtrabackup 2.2.4

I have a problem adding a third node (same software as on the other two nodes). The node gets stuck in 'Joining', but the error log reports 'SYNCED'.

Joiner error log:

140924 17:31:51 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140924 17:31:51 mysqld_safe Skipping wsrep-recover for empty datadir: /var/lib/mysql
140924 17:31:51 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
2014-09-24 17:31:51 0 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
2014-09-24 17:31:51 15952 [Note] WSREP: Read nil XID from storage engines, skipping position init
2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_load(): Galera 3.7(r7f44a18) by Codership Oy [email protected] loaded successfully.
2014-09-24 17:31:51 15952 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2014-09-24 17:31:51 15952 [Warning] WSREP: Could not open saved state file for reading: /var/lib/mysql//grastate.dat
2014-09-24 17:31:51 15952 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:31:51 15952 [Note] WSREP: Passing config to GCS: base_host = 10.177.194.12; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.w
2014-09-24 17:31:51 15952 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:31:51 15952 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_sst_grab()
2014-09-24 17:31:51 15952 [Note] WSREP: Start replication
2014-09-24 17:31:51 15952 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:31:51 15952 [Note] WSREP: protonet asio version 0
2014-09-24 17:31:51 15952 [Note] WSREP: Using CRC-32C for message checksums.
2014-09-24 17:31:51 15952 [Note] WSREP: backend: asio
2014-09-24 17:31:51 15952 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
2014-09-24 17:31:51 15952 [Note] WSREP: restore pc from disk failed
2014-09-24 17:31:51 15952 [Note] WSREP: GMCast version 0
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2014-09-24 17:31:51 15952 [Note] WSREP: EVS version 0
2014-09-24 17:31:51 15952 [Note] WSREP: PC version 0
2014-09-24 17:31:51 15952 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.177.192.227:4567,10.179.2.66:4567,10.177.194.12:4567'
2014-09-24 17:31:51 15952 [Warning] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' points to own listening address, blacklisting
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping
2014-09-24 17:31:52 15952 [Note] WSREP: declaring 5fea8f01 at tcp://10.177.192.227:4567 stable
2014-09-24 17:31:52 15952 [Note] WSREP: declaring ce5494a7 at tcp://10.179.2.66:4567 stable
2014-09-24 17:31:52 15952 [Note] WSREP: Node 5fea8f01 state prim
2014-09-24 17:31:52 15952 [Note] WSREP: save pc into disk
2014-09-24 17:31:52 15952 [Note] WSREP: gcomm: connected
2014-09-24 17:31:52 15952 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2014-09-24 17:31:52 15952 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2014-09-24 17:31:52 15952 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2014-09-24 17:31:52 15952 [Note] WSREP: Waiting for SST to complete.
2014-09-24 17:31:52 15952 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: sent state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 0 (server03)
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 2 (server06)
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 1 (server01)
2014-09-24 17:31:52 15952 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 159,
members = 2/3 (joined/total),
act_id = 102139,
last_appl. = -1,
protocols = 0/6/3 (gcs/repl/appl),
group UUID = 01ec4764-3fea-11e4-887f-da10e62749e4
2014-09-24 17:31:52 15952 [Note] WSREP: Flow-control interval: [28, 28]
2014-09-24 17:31:52 15952 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 102139)
2014-09-24 17:31:52 15952 [Note] WSREP: State transfer required:
Group state: 01ec4764-3fea-11e4-887f-da10e62749e4:102139
Local state: 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:31:52 15952 [Note] WSREP: New cluster view: global state: 01ec4764-3fea-11e4-887f-da10e62749e4:102139, view# 160: Primary, number of nodes: 3, my index: 1, protocol version 3
2014-09-24 17:31:52 15952 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-09-24 17:31:52 15952 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.177.194.12' --auth 'root:password' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '15952' '' '
WSREP_SST: [INFO] Streaming with xbstream (20140924 17:31:52.963)
WSREP_SST: [INFO] Using socat as streamer (20140924 17:31:52.965)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:31:53.301)
2014-09-24 17:31:53 15952 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.177.194.12:4444/xtrabackup_sst
2014-09-24 17:31:53 15952 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-24 17:31:53 15952 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-09-24 17:31:53 15952 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:31:53 15952 [Note] WSREP: Assign initial position for certification: 102139, protocol version: 3
2014-09-24 17:31:53 15952 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:31:53 15952 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (01ec4764-3fea-11e4-887f-da10e62749e4): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():455. IST will be unavailable.
2014-09-24 17:31:53 15952 [Note] WSREP: Member 1.0 (server01) requested state transfer from 'any'. Selected 0.0 (server03)(SYNCED) as donor.
2014-09-24 17:31:53 15952 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 102139)
2014-09-24 17:31:53 15952 [Note] WSREP: Requesting state transfer: success, donor: 0
WSREP_SST: [INFO] Proceeding with SST (20140924 17:31:54.298)
WSREP_SST: [INFO] Cleaning the existing datadir (20140924 17:31:54.302)
removed ‘/var/lib/mysql/gvwstate.dat’
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:31:54.319)
2014-09-24 17:31:55 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') turning message relay requesting off
2014-09-24 17:33:42 15952 [Note] WSREP: 0.0 (server03): State transfer to 1.0 (server01) complete.
2014-09-24 17:33:42 15952 [Note] WSREP: Member 0.0 (server03) synced with group.
WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql/ (20140924 17:33:42.257)
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20140924 17:33:42.264)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20140924 17:33:50.769)
WSREP_SST: [INFO] Removing the sst_in_progress file (20140924 17:33:50.775)
2014-09-24 17:33:50 15952 [Note] WSREP: SST complete, seqno: 102137
2014-09-24 17:33:50 15952 [Note] Plugin 'FEDERATED' is disabled.
2014-09-24 17:33:50 15952 [Note] InnoDB: Using atomics to ref count buffer pool pages
2014-09-24 17:33:50 15952 [Note] InnoDB: The InnoDB memory heap is disabled
2014-09-24 17:33:50 15952 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2014-09-24 17:33:50 15952 [Note] InnoDB: Memory barrier is not used
2014-09-24 17:33:50 15952 [Note] InnoDB: Compressed tables use zlib 1.2.3.4
2014-09-24 17:33:50 15952 [Note] InnoDB: Using Linux native AIO
2014-09-24 17:33:50 15952 [Note] InnoDB: Not using CPU crc32 instructions
2014-09-24 17:33:50 15952 [Note] InnoDB: Initializing buffer pool, size = 908.0M
2014-09-24 17:33:50 15952 [Note] InnoDB: Completed initialization of buffer pool
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18311 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18320 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18312 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18319 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18318 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18317 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18315 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18314 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18316 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18313 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Note] InnoDB: Highest supported file format is Barracuda.
2014-09-24 17:33:51 15952 [Note] InnoDB: 128 rollback segment(s) are active.
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18325 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:51 15952 [Note] InnoDB: Waiting for purge to start
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18326 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18327 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18328 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:51 15952 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.20-68.0 started; log sequence number 767569480
2014-09-24 17:33:51 15952 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: f306c24e-4410-11e4-a863-bc764e08d443.
2014-09-24 17:33:51 15952 [Note] RSA private key file not found: /var/lib/mysql//private_key.pem. Some authentication plugins will not work.
2014-09-24 17:33:51 15952 [Note] RSA public key file not found: /var/lib/mysql//public_key.pem. Some authentication plugins will not work.
2014-09-24 17:33:51 15952 [Note] Server hostname (bind-address): '*'; port: 3306
2014-09-24 17:33:51 15952 [Note] IPv6 is available.
2014-09-24 17:33:51 15952 [Note] - '::' resolves to '::';
2014-09-24 17:33:51 15952 [Note] Server socket created on IP: '::'.
2014-09-24 17:33:51 15952 [Warning] 'proxies_priv' entry '@ root@server04a' ignored in --skip-name-resolve mode.
2014-09-24 17:33:51 15952 [Note] Event Scheduler: Loaded 0 events
2014-09-24 17:33:51 15952 [Note] WSREP: Signalling provider to continue.
2014-09-24 17:33:51 15952 [Note] WSREP: inited wsrep sidno 1
2014-09-24 17:33:51 15952 [Note] WSREP: SST received: 01ec4764-3fea-11e4-887f-da10e62749e4:102137
2014-09-24 17:33:51 15952 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.20-68.0-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release 25.7, wsrep_25.7.r4126
2014-09-24 17:33:51 15952 [Note] WSREP: 1.0 (server01): State transfer from 0.0 (server03) complete.
2014-09-24 17:33:51 15952 [Note] WSREP: Shifting JOINER -> JOINED (TO: 102139)
2014-09-24 17:33:51 15952 [Note] WSREP: Member 1.0 (server01) synced with group.
2014-09-24 17:33:51 15952 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 102139)

mysql> show global status like '%wsrep%';
+------------------------------+---------------------------------------------------------+
| Variable_name | Value |
+------------------------------+---------------------------------------------------------+
| wsrep_local_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 102137 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 298 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 2 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.333333 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_cached_downto | 18446744073709551615 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 1 |
| wsrep_local_state_comment | Joining |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_incoming_addresses | 10.177.192.227:3306,10.177.194.12:3306,10.179.2.66:3306 |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_cluster_conf_id | 160 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy [email protected] |
| wsrep_provider_version | 3.7(r7f44a18) |
| wsrep_ready | OFF |
+------------------------------+---------------------------------------------------------+
52 rows in set (0.00 sec)

Running any query on the "joining" node returns "Unknown command", so it is not actually synced (and wsrep_ready is OFF).
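For reference, `wsrep_local_state` is a numeric code (1 = Joining, 2 = Donor/Desynced, 3 = Joined, 4 = Synced), and a node should only be treated as usable when the state is 4 *and* `wsrep_ready` is ON. A minimal monitoring sketch of that check (the function name and dict-based input are ours, e.g. parsed from `SHOW GLOBAL STATUS LIKE '%wsrep%'`):

```python
# wsrep_local_state codes per the Galera documentation.
WSREP_STATES = {1: "Joining", 2: "Donor/Desynced", 3: "Joined", 4: "Synced"}

def node_usable(status):
    """Return True only when the node is fully synced and accepting queries.

    `status` is a dict of wsrep status variables as strings, as they come
    back from SHOW GLOBAL STATUS LIKE '%wsrep%'.
    """
    return (status.get("wsrep_local_state") == "4"
            and status.get("wsrep_ready") == "ON")

# The stuck node above reports state 1 (Joining) with wsrep_ready OFF:
joining = {"wsrep_local_state": "1", "wsrep_ready": "OFF"}
synced = {"wsrep_local_state": "4", "wsrep_ready": "ON"}
print(node_usable(joining), node_usable(synced))  # False True
```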

But if there are writes on the cluster while the node joins (writing to the node that is neither the donor nor the joiner), it syncs correctly:

140924 17:43:51 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140924 17:43:51 mysqld_safe Skipping wsrep-recover for empty datadir: /var/lib/mysql
140924 17:43:51 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
2014-09-24 17:43:51 0 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
2014-09-24 17:43:51 27304 [Note] WSREP: Read nil XID from storage engines, skipping position init
2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_load(): Galera 3.7(r7f44a18) by Codership Oy [email protected] loaded successfully.
2014-09-24 17:43:51 27304 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2014-09-24 17:43:51 27304 [Warning] WSREP: Could not open saved state file for reading: /var/lib/mysql//grastate.dat
2014-09-24 17:43:51 27304 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:43:51 27304 [Note] WSREP: Passing config to GCS: base_host = 10.177.194.12; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.w
2014-09-24 17:43:51 27304 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:43:51 27304 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_sst_grab()
2014-09-24 17:43:51 27304 [Note] WSREP: Start replication
2014-09-24 17:43:51 27304 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:43:51 27304 [Note] WSREP: protonet asio version 0
2014-09-24 17:43:51 27304 [Note] WSREP: Using CRC-32C for message checksums.
2014-09-24 17:43:51 27304 [Note] WSREP: backend: asio
2014-09-24 17:43:51 27304 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
2014-09-24 17:43:51 27304 [Note] WSREP: restore pc from disk failed
2014-09-24 17:43:51 27304 [Note] WSREP: GMCast version 0
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2014-09-24 17:43:51 27304 [Note] WSREP: EVS version 0
2014-09-24 17:43:51 27304 [Note] WSREP: PC version 0
2014-09-24 17:43:51 27304 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.177.192.227:4567,10.179.2.66:4567,10.177.194.12:4567'
2014-09-24 17:43:51 27304 [Warning] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' points to own listening address, blacklisting
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping
2014-09-24 17:43:51 27304 [Note] WSREP: declaring 5fea8f01 at tcp://10.177.192.227:4567 stable
2014-09-24 17:43:51 27304 [Note] WSREP: declaring ce5494a7 at tcp://10.179.2.66:4567 stable
2014-09-24 17:43:51 27304 [Note] WSREP: Node 5fea8f01 state prim
2014-09-24 17:43:51 27304 [Note] WSREP: save pc into disk
2014-09-24 17:43:52 27304 [Note] WSREP: gcomm: connected
2014-09-24 17:43:52 27304 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2014-09-24 17:43:52 27304 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2014-09-24 17:43:52 27304 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2014-09-24 17:43:52 27304 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2014-09-24 17:43:52 27304 [Note] WSREP: Waiting for SST to complete.
2014-09-24 17:43:52 27304 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 58f731ba-4412-11e4-970e-0ff58de80f93
2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: sent state msg: 58f731ba-4412-11e4-970e-0ff58de80f93
2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 0 (server01)
2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 1 (server03)
2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 2 (server06)
2014-09-24 17:43:52 27304 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 161,
members = 2/3 (joined/total),
act_id = 104692,
last_appl. = -1,
protocols = 0/6/3 (gcs/repl/appl),
group UUID = 01ec4764-3fea-11e4-887f-da10e62749e4
2014-09-24 17:43:52 27304 [Note] WSREP: Flow-control interval: [28, 28]
2014-09-24 17:43:52 27304 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 104692)
2014-09-24 17:43:52 27304 [Note] WSREP: State transfer required:
Group state: 01ec4764-3fea-11e4-887f-da10e62749e4:104692
Local state: 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:43:52 27304 [Note] WSREP: New cluster view: global state: 01ec4764-3fea-11e4-887f-da10e62749e4:104692, view# 162: Primary, number of nodes: 3, my index: 0, protocol version 3
2014-09-24 17:43:52 27304 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-09-24 17:43:52 27304 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.177.194.12' --auth 'root:password' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '27304' '' '
WSREP_SST: [INFO] Streaming with xbstream (20140924 17:43:52.285)
WSREP_SST: [INFO] Using socat as streamer (20140924 17:43:52.288)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:43:52.648)
2014-09-24 17:43:52 27304 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.177.194.12:4444/xtrabackup_sst
2014-09-24 17:43:52 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-24 17:43:52 27304 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-09-24 17:43:52 27304 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:43:52 27304 [Note] WSREP: Assign initial position for certification: 104692, protocol version: 3
2014-09-24 17:43:52 27304 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:43:52 27304 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (01ec4764-3fea-11e4-887f-da10e62749e4): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():455. IST will be unavailable.
2014-09-24 17:43:52 27304 [Note] WSREP: Member 0.0 (server01) requested state transfer from 'any'. Selected 1.0 (server03)(SYNCED) as donor.
2014-09-24 17:43:52 27304 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 104781)
2014-09-24 17:43:52 27304 [Note] WSREP: Requesting state transfer: success, donor: 1
WSREP_SST: [INFO] Proceeding with SST (20140924 17:43:53.812)
WSREP_SST: [INFO] Cleaning the existing datadir (20140924 17:43:53.818)
removed ‘/var/lib/mysql/gvwstate.dat’
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:43:53.844)
2014-09-24 17:43:55 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') turning message relay requesting off
2014-09-24 17:45:43 27304 [Note] WSREP: 1.0 (server03): State transfer to 0.0 (server01) complete.
2014-09-24 17:45:43 27304 [Note] WSREP: Member 1.0 (server03) synced with group.
WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql/ (20140924 17:45:43.511)
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20140924 17:45:43.520)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20140924 17:45:55.201)
WSREP_SST: [INFO] Removing the sst_in_progress file (20140924 17:45:55.209)
2014-09-24 17:45:55 27304 [Note] WSREP: SST complete, seqno: 113877
2014-09-24 17:45:55 27304 [Note] Plugin 'FEDERATED' is disabled.
2014-09-24 17:45:55 27304 [Note] InnoDB: Using atomics to ref count buffer pool pages
2014-09-24 17:45:55 27304 [Note] InnoDB: The InnoDB memory heap is disabled
2014-09-24 17:45:55 27304 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2014-09-24 17:45:55 27304 [Note] InnoDB: Memory barrier is not used
2014-09-24 17:45:55 27304 [Note] InnoDB: Compressed tables use zlib 1.2.3.4
2014-09-24 17:45:55 27304 [Note] InnoDB: Using Linux native AIO
2014-09-24 17:45:55 27304 [Note] InnoDB: Not using CPU crc32 instructions
2014-09-24 17:45:55 27304 [Note] InnoDB: Initializing buffer pool, size = 908.0M
2014-09-24 17:45:55 27304 [Note] InnoDB: Completed initialization of buffer pool
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29744 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29746 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29752 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29745 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29753 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29751 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29750 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29748 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29747 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29749 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Note] InnoDB: Highest supported file format is Barracuda.
2014-09-24 17:45:55 27304 [Note] InnoDB: 128 rollback segment(s) are active.
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29759 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Note] InnoDB: Waiting for purge to start
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29760 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29761 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29762 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.20-68.0 started; log sequence number 784149207
2014-09-24 17:45:55 27304 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: a2d4c7d5-4412-11e4-a92d-bc764e08d443.
2014-09-24 17:45:55 27304 [Note] RSA private key file not found: /var/lib/mysql//private_key.pem. Some authentication plugins will not work.
2014-09-24 17:45:55 27304 [Note] RSA public key file not found: /var/lib/mysql//public_key.pem. Some authentication plugins will not work.
2014-09-24 17:45:55 27304 [Note] Server hostname (bind-address): '*'; port: 3306
2014-09-24 17:45:55 27304 [Note] IPv6 is available.
2014-09-24 17:45:55 27304 [Note] - '::' resolves to '::';
2014-09-24 17:45:55 27304 [Note] Server socket created on IP: '::'.
2014-09-24 17:45:55 27304 [Warning] 'proxies_priv' entry '@ root@server04a' ignored in --skip-name-resolve mode.
2014-09-24 17:45:56 27304 [Note] Event Scheduler: Loaded 0 events
2014-09-24 17:45:56 27304 [Note] WSREP: Signalling provider to continue.
2014-09-24 17:45:56 27304 [Note] WSREP: inited wsrep sidno 1
2014-09-24 17:45:56 27304 [Note] WSREP: SST received: 01ec4764-3fea-11e4-887f-da10e62749e4:113877
2014-09-24 17:45:56 27304 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.20-68.0-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release 25.7, wsrep_25.7.r4126
2014-09-24 17:45:56 27304 [Note] WSREP: 0.0 (server01): State transfer from 1.0 (server03) complete.
2014-09-24 17:45:56 27304 [Note] WSREP: Shifting JOINER -> JOINED (TO: 115797)
2014-09-24 17:45:57 27304 [Note] WSREP: Member 0.0 (server01) synced with group.
2014-09-24 17:45:57 27304 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 115972)
2014-09-24 17:45:57 27304 [Note] WSREP: Synchronized with group, ready for connections
2014-09-24 17:45:57 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

mysql> show global status like '%wsrep%';
+------------------------------+---------------------------------------------------------+
| Variable_name | Value |
+------------------------------+---------------------------------------------------------+
| wsrep_local_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 117348 |
| wsrep_replicated | 14 |
| wsrep_replicated_bytes | 3814 |
| wsrep_repl_keys | 14 |
| wsrep_repl_keys_bytes | 434 |
| wsrep_repl_data_bytes | 2484 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 12674 |
| wsrep_received_bytes | 16708102 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 2 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.008696 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 11118 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 4917.729604 |
| wsrep_local_cached_downto | 104693 |
| wsrep_flow_control_paused_ns | 541185 |
| wsrep_flow_control_paused | 0.000004 |
| wsrep_flow_control_sent | 1 |
| wsrep_flow_control_recv | 1 |
| wsrep_cert_deps_distance | 1301.121760 |
| wsrep_apply_oooe | 0.651109 |
| wsrep_apply_oool | 0.005186 |
| wsrep_apply_window | 2.845866 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 2.267070 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 143 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.394991 |
| wsrep_incoming_addresses | 10.177.194.12:3306,10.177.192.227:3306,10.179.2.66:3306 |
| wsrep_evs_repl_latency | 0.000593955/0.00157661/0.00837575/0.00108117/116 |
| wsrep_cluster_conf_id | 162 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy [email protected] |
| wsrep_provider_version | 3.7(r7f44a18) |
| wsrep_ready | ON |
+------------------------------+---------------------------------------------------------+
52 rows in set (0.00 sec)

@freddielunchbird
Author

I think it has something to do with xtrabackup 2.2.4; with xtrabackup 2.2.3 I don't experience any problems.
Filed a bug report here:
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1373796

@dirtysalt
Contributor

My intuition is that this may be a Galera bug.

Although the joiner log says it is in the SYNCED state, it actually isn't. There are several layers in Galera (the GCOMM, GCS, and REPLICATOR levels). That message means the node is SYNCED at the GCS level, but it is not SYNCED at the REPLICATOR level, because the status shows

| wsrep_local_state_comment | Joining |
| wsrep_ready | OFF |

And normally, at the tail of the log, there should be

2014-09-24 17:45:57 27304 [Note] WSREP: Synchronized with group, ready for connections
2014-09-24 17:45:57 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

which is the real evidence that the node is SYNCED at the REPLICATOR level. However, those messages cannot be found in the joiner log.

Can you reproduce this bug every time when using xtrabackup 2.2.4?

@castro1688

I'm having a very similar issue with xtrabackup 2.2.5 and 2.2.4. I'm setting wsrep_sst_method to rsync as a workaround. It seems like this bug isn't getting much attention here or in Launchpad. Rolling back xtrabackup to 2.2.3 has also corrected the issue.

@castro1688

I've added this bug to xtrabackup in launchpad: https://bugs.launchpad.net/percona-xtrabackup/+bug/1378138

@freddielunchbird
Author

@castro1688 : Great! It looks like they are starting to work on the bug reports now so let's see what happens.

@dirtysalt : It is reproducible almost every time: if you stop a node, remove that node's data directory, and start the node (JOINER), this is very likely to happen.

@dbadapt

dbadapt commented Oct 16, 2014

More information here: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1373796/comments/10

The problem may be in gcs/src/gcs.cpp: gcs_recv_thread(void *arg)

I believe the wsrep 'synced' callback is not being sent back to the application.

@joffrey92

Hi,

I had the issue with xtrabackup 2.2.6 debian today.
http://pastebin.com/G3hGbcLx

Thanks,
Joffrey

@ronin13

ronin13 commented Nov 26, 2014

@joffrey92 Thanks for the backtrace. If you still have the coredump, can you provide 'thread apply all bt full' from it? (and also install debuginfo packages for mysqld if possible).

@ronin13

ronin13 commented Dec 24, 2014

Is #133 related to this issue? I.e., are the responsible race conditions the same ones?

@mike-sirs

Hello,
I have:
xtrabackup version 2.2.8 based on MySQL server 5.6.22 Linux (x86_64) (revision id: )

The problem is still present:
2015-02-10 20:25:57 8459 [Note] WSREP: 2.0 (pnode1): State transfer from 0.0 (pnode0) complete.
2015-02-10 20:25:57 8459 [Note] WSREP: Shifting JOINER -> JOINED (TO: 124879)
2015-02-10 20:25:57 8459 [Note] WSREP: Member 2.0 (pnode1) synced with group.
2015-02-10 20:25:57 8459 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 124879)
2015-02-10 20:25:57 8459 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.21-70.1-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release rel70.1, Revision 938, WSREP version 25.8, wsrep_25.8.r4150

show status like 'wsrep%';
wsrep_local_state_comment | Joining

@castro1688

Well, this still exists. I can reproduce it with xtrabackup 2.2.10.

@dbadapt

dbadapt commented Apr 22, 2015

Hi Castro, this bug is my 'big one that gets away'. I am thinking of naming it something like "Ol' Oscar" or Goliath... Seriously, I am still unsure whether it's an issue with the Galera engine or with the WSREP interface on the mysqld side. I have seen it and been able to reproduce it, but it quickly disappears when one of the components is slightly changed.

You can always tell this bug by the missing log entry from the WSREP side after SST:

... [Note] WSREP: Synchronized with group, ready for connections

This final message is sent from Galera via the callback to sql/wsrep_mysqld.cc:wsrep_synced_cb, and for some unknown reason it sometimes doesn't happen. Perhaps the thread is being clobbered, or there is a conditional hiccup that we are not seeing (I've looked).

My belief is that it is triggered by a very subtle I/O-bound race condition between components (either fast or slow), and that changing one component changes the dynamic ever so slightly that the bug disappears. I have seen it on VMs and have unsuccessfully tried to reproduce it on bare metal. I have also tried changing VM processor counts and mysqld parameters, without success.
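Since the missing log entry is the one reliable symptom, a quick log check can classify a joiner without a debugger. A minimal sketch (the helper name and log handling are mine, not part of any PXC/Galera tooling):

```shell
# check_wsrep_sync LOGFILE
# Classifies a joiner's sync state from its mysqld error log:
# the GCS level logs "Shifting JOINED -> SYNCED", but only the
# REPLICATOR level logs "Synchronized with group, ready for connections".
check_wsrep_sync() {
    if grep -q 'Synchronized with group, ready for connections' "$1"; then
        echo "replicator-level SYNCED"
    elif grep -q 'Shifting JOINED -> SYNCED' "$1"; then
        echo "stuck: GCS-level SYNCED only"
    else
        echo "not synced at any level"
    fi
}
```

On an affected joiner this prints "stuck: GCS-level SYNCED only"; on a healthy one it prints "replicator-level SYNCED".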

@castro1688

So I've really been digging today. In addition to the missing "Synchronized with group" message, I've also noticed that the full "heading" is missing when you log into mysql.

I'm starting to think the problem could be a result of the donor being a node that was started with bootstrap-pxc and never restarted. I pointed wsrep_sst_donor to my other two "working" nodes: one started with bootstrap-pxc, the other with a normal start. I could reproduce the problem only with the bootstrap-pxc node. Thoughts?

Broken - Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 34
Server version: 5.6.22-72.0-56-log

Working - Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 34
Server version: 5.6.22-72.0-56-log Percona XtraDB Cluster (GPL), Release rel72.0, Revision 978, WSREP version 25.8, wsrep_25.8.r4150

@castro1688

David, one last question: do you have a workaround? I downgraded xtrabackup to 2.2.3 to get past this with PXC 5.6.20. Now I'm trying to upgrade to 5.6.22, and it requires xtrabackup > 2.2.5.

@soluwalana

I'm having a similar issue when attempting to deploy the cluster to SSD. Is there any known workaround for this issue?

@samos123

I'm using 2.1.8-1 on Ubuntu 14.04. Any workaround?

@luksch

luksch commented Sep 21, 2015

Now this bug has hit me too, on a 3-node cluster that was running fine before. After a server crash I can't get the node to join properly; wsrep_local_state_comment stays on "Joined" forever.

@ayurchen
Member

  1. Is the cluster idle or running a load?
  2. Is the comment "Joined" or "Joining"?

@luksch

luksch commented Sep 21, 2015

It is a production node under quite some load (about 50 MB of writes per minute). Most writes are supposed to go to this node, but for now the two remaining, working nodes cover it.

@tomashejatko

Hi, in my setup this is always reproducible: I'm not able to get the third member back into the cluster.
I'm running Galera inside a Docker container on a VM, which is on SSD.

My versions are :

innobackupex version 2.3.3 Linux (x86_64) (revision id: 525ca7d)
Server version: 5.6.26-74.0-56 Percona XtraDB Cluster (GPL), Release rel74.0, Revision 624ef81, WSREP version 25.12, wsrep_25.12

@ayurchen
Member

Whoever is experiencing this problem with PXC or MariaDB, could you try with Codership's binaries: http://galeracluster.com/downloads/

@MariaDBJones

I am also experiencing the very same problem at a client's, using rsync as the SST method, so it does not seem to be xtrabackup-related. I will try to provide you with some logs ASAP.

In my case the cluster is stalled and not responding to write operations, despite node1 stating it is synced. It still serves reads.

Versions are: MariaDB 5.5.41, Galera provider version 3.5 (rXXXX), from RHEL packages on OpenStack.

In the logs the last line states JOINING -> JOINED, but wsrep_local_state_comment is Joining. What we also saw with netstat -an is this line:
tcp 90002 0 local-ip:4567 ist_node_ip ESTABLISHED

as if the MariaDB node were unable to read anything from the socket.
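That growing Recv-Q is easy to watch for. A small sketch (the function name is mine) that scans `netstat -ant` output for group-communication sockets with unread bytes:

```shell
# flag_unread_4567 < netstat-output
# Prints any connection on the Galera group-communication port (4567)
# whose kernel receive queue (column 2 of `netstat -ant`) is non-empty,
# i.e. the local process has stopped draining the socket.
flag_unread_4567() {
    awk '$4 ~ /:4567$/ && $2 > 0 { print "unread bytes on " $4 ": " $2 }'
}
```

Run as `netstat -ant | flag_unread_4567`; any output on a node that is supposedly synced is suspicious.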

@ethaniel

I am done with MariaDB and its cluster problems. I successfully moved to Percona Cluster and haven't regretted it even once. Setup was super easy, and cluster commits work WAY faster. Same Galera engine.

@rasschaert

@ethaniel, the reporter of this very issue was also running Percona XtraDB Cluster, so I'm not sure how you think you're impervious to this bug. You even said yourself it's the same Galera engine, which is where this bug (still) is.

But hey if switching from MariaDB to PXC helps you sleep at night, good for you.

Please post your snark on Twitter instead of in Github issues next time.

janlindstrom pushed a commit to MariaDB/galera that referenced this issue Oct 13, 2018
@fraenki

fraenki commented May 21, 2021

Ugh, this is a really old bug report. I'm seeing the same issue the original reporter described back in 2014, but on mysqld-wsrep 5.7-5.7.33-25.25. Is anyone else still experiencing this, and has anyone found a workaround?

The debug log output is interesting:

2021-05-21T19:00:07.370469+01:00 0 [Note] WSREP: SST complete, seqno: 40917804340
...
2021-05-21T19:00:48.163817+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/replicator_str.cpp:sst_received():38: SST received: ac673ca1-97f8-11e9-a0b3-5b765e435c14:40917804340
2021-05-21T19:00:48.163936+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/saved_state.cpp:set():187: Not writing state: unsafe counter is 1
2021-05-21T19:00:48.165466+01:00 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.33-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server - (GPL), wsrep_25.25
2021-05-21T19:00:48.165469+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/replicator_str.cpp:request_state_transfer():719: Installed new state: ac673ca1-97f8-11e9-a0b3-5b765e435c14:40917804340
2021-05-21T19:00:48.165516+01:00 3 [Note] WSREP: SYSTEM_THREAD_COMPRESS_GTID_TABLE declared as non wsrep_on
2021-05-21T19:00:48.165541+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/saved_state.cpp:set():187: Not writing state: unsafe counter is 1
2021-05-21T19:00:48.167212+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():801: 0.0 (galera6): State transfer from 1.0 (galera5) complete.
2021-05-21T19:00:48.167247+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_handle_state_change():1051: Got 'JOIN' dated 40917804340
2021-05-21T19:00:48.167260+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_fc_cont_end():524: SKIPPED FC_CONT sending: stop_sent = 0
2021-05-21T19:00:48.167271+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_shift_state():635: Shifting JOINER -> JOINED (TO: 40920891251)
2021-05-21T19:00:48.167281+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_become_joined():806: Become joined, FC offset 13917607
2021-05-21T19:00:48.167291+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_send_sync_end():568: SENDING SYNC
2021-05-21T19:00:48.168191+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs_group.cpp:gcs_group_handle_sync_msg():847: Member 0.0 (galera6) synced with group.
2021-05-21T19:00:48.168225+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_handle_state_change():1051: Got 'SYNC' dated 0
2021-05-21T19:00:48.168239+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_shift_state():635: Shifting JOINED -> SYNCED (TO: 40920891253)
2021-05-21T19:00:48.168252+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_become_synced():826: Become synced, FC offset 13917556
2021-05-21T19:01:09.711228+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/certification.cpp:append_trx():1068: seqno gap, position: 40907017112 trx seqno 40917804341
2021-05-21T19:19:01.143739+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcomm/src/evs_proto.cpp:send_user():1514: adjusted seq range to: 1

It looks like immediately after SST is complete, the WSREP provider shifts state from JOINER to JOINED and then to SYNCED. But apparently the node is NOT synced: wsrep_local_state_comment still says "Joining". And wsrep_local_recv_queue is always increasing and never decreasing.

The message "ready for connections" is also not true, because mysqld will still deny any query:
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

So the node is stuck in this state forever.

@temeo Could you possibly have a look? Thank you :)
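For anyone triaging this, the stuck state can be confirmed from SQL alone, since (in my experience) the server still answers SHOW statements even while denying application queries with error 1047. A minimal check:

```sql
-- A stuck joiner shows wsrep_local_state 1/Joining and wsrep_ready OFF
-- even though its log already claimed SYNCED; a healthy node returns
-- 4/Synced/ON and a recv queue that drains back to ~0.
SHOW GLOBAL STATUS WHERE Variable_name IN
    ('wsrep_local_state', 'wsrep_local_state_comment',
     'wsrep_local_recv_queue', 'wsrep_ready');
```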

@DanielWeeber

DanielWeeber commented Nov 24, 2021

Same problem here with 10.6.5 (Galera 4, 26.4.9), except that wsrep_local_recv_queue stays empty.

@fraenki

fraenki commented Nov 25, 2021

So the node is stuck in this state forever.

To follow up on my own report: I was able to fix this issue. In my case it was triggered by some super-aggressive MySQL tuning I had applied to that node. I couldn't nail it down to a specific setting (because that would be too time-consuming), but reverting to a MySQL config closer to the defaults solved the issue for me.

I'm not saying that there is no bug, but if there is one, it's probably only triggered by non-default configurations. So it's probably worth trying again with a different MySQL/MariaDB config.

@tchernomax

In our test environment, the problem was triggered by wsrep_provider_options="gcache.size=2G" (on hosts with 4 GiB of RAM).
Removing the setting (falling back to the 128M default) fixed the problem.
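For anyone checking their own config, the setting in question lives in wsrep_provider_options. A minimal my.cnf sketch (the values are illustrative, not a recommendation from this thread):

```ini
[mysqld]
# Keep gcache.size at or near its 128M default unless you need a larger
# IST window; oversizing it relative to available RAM triggered the hang
# described above on 4 GiB hosts.
wsrep_provider_options = "gcache.size=128M"
```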
