
Log says 'SYNCED' but wsrep_cluster_state = 1 (JOINING) #143

Open

freddielunchbird opened this issue Sep 24, 2014 · 28 comments
@freddielunchbird

Hi,

I have a two node cluster:
Ubuntu 14.04
PXC 5.6.20-68.0 , galera 3.7(r7f44a18)
xtrabackup 2.2.4

I have a problem adding a third node (same software as on the other two nodes). The node gets stuck in 'Joining', but the error log reports 'SYNCED'.

Joiner error log:

140924 17:31:51 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140924 17:31:51 mysqld_safe Skipping wsrep-recover for empty datadir: /var/lib/mysql
140924 17:31:51 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
2014-09-24 17:31:51 0 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
2014-09-24 17:31:51 15952 [Note] WSREP: Read nil XID from storage engines, skipping position init
2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_load(): Galera 3.7(r7f44a18) by Codership Oy [email protected] loaded successfully.
2014-09-24 17:31:51 15952 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2014-09-24 17:31:51 15952 [Warning] WSREP: Could not open saved state file for reading: /var/lib/mysql//grastate.dat
2014-09-24 17:31:51 15952 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:31:51 15952 [Note] WSREP: Passing config to GCS: base_host = 10.177.194.12; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.w
2014-09-24 17:31:51 15952 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:31:51 15952 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2014-09-24 17:31:51 15952 [Note] WSREP: wsrep_sst_grab()
2014-09-24 17:31:51 15952 [Note] WSREP: Start replication
2014-09-24 17:31:51 15952 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:31:51 15952 [Note] WSREP: protonet asio version 0
2014-09-24 17:31:51 15952 [Note] WSREP: Using CRC-32C for message checksums.
2014-09-24 17:31:51 15952 [Note] WSREP: backend: asio
2014-09-24 17:31:51 15952 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
2014-09-24 17:31:51 15952 [Note] WSREP: restore pc from disk failed
2014-09-24 17:31:51 15952 [Note] WSREP: GMCast version 0
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2014-09-24 17:31:51 15952 [Note] WSREP: EVS version 0
2014-09-24 17:31:51 15952 [Note] WSREP: PC version 0
2014-09-24 17:31:51 15952 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.177.192.227:4567,10.179.2.66:4567,10.177.194.12:4567'
2014-09-24 17:31:51 15952 [Warning] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' points to own listening address, blacklisting
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping
2014-09-24 17:31:51 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid ab9c8371 is blacklisted, skipping
2014-09-24 17:31:52 15952 [Note] WSREP: declaring 5fea8f01 at tcp://10.177.192.227:4567 stable
2014-09-24 17:31:52 15952 [Note] WSREP: declaring ce5494a7 at tcp://10.179.2.66:4567 stable
2014-09-24 17:31:52 15952 [Note] WSREP: Node 5fea8f01 state prim
2014-09-24 17:31:52 15952 [Note] WSREP: save pc into disk
2014-09-24 17:31:52 15952 [Note] WSREP: gcomm: connected
2014-09-24 17:31:52 15952 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2014-09-24 17:31:52 15952 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2014-09-24 17:31:52 15952 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2014-09-24 17:31:52 15952 [Note] WSREP: Waiting for SST to complete.
2014-09-24 17:31:52 15952 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: sent state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 0 (server03)
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 2 (server06)
2014-09-24 17:31:52 15952 [Note] WSREP: STATE EXCHANGE: got state msg: b0a379a9-4410-11e4-be8e-76386af2d3c6 from 1 (server01)
2014-09-24 17:31:52 15952 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 159,
members = 2/3 (joined/total),
act_id = 102139,
last_appl. = -1,
protocols = 0/6/3 (gcs/repl/appl),
group UUID = 01ec4764-3fea-11e4-887f-da10e62749e4
2014-09-24 17:31:52 15952 [Note] WSREP: Flow-control interval: [28, 28]
2014-09-24 17:31:52 15952 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 102139)
2014-09-24 17:31:52 15952 [Note] WSREP: State transfer required:
Group state: 01ec4764-3fea-11e4-887f-da10e62749e4:102139
Local state: 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:31:52 15952 [Note] WSREP: New cluster view: global state: 01ec4764-3fea-11e4-887f-da10e62749e4:102139, view# 160: Primary, number of nodes: 3, my index: 1, protocol version 3
2014-09-24 17:31:52 15952 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-09-24 17:31:52 15952 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.177.194.12' --auth 'root:password' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '15952' '' '
WSREP_SST: [INFO] Streaming with xbstream (20140924 17:31:52.963)
WSREP_SST: [INFO] Using socat as streamer (20140924 17:31:52.965)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:31:53.301)
2014-09-24 17:31:53 15952 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.177.194.12:4444/xtrabackup_sst
2014-09-24 17:31:53 15952 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-24 17:31:53 15952 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-09-24 17:31:53 15952 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:31:53 15952 [Note] WSREP: Assign initial position for certification: 102139, protocol version: 3
2014-09-24 17:31:53 15952 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:31:53 15952 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (01ec4764-3fea-11e4-887f-da10e62749e4): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():455. IST will be unavailable.
2014-09-24 17:31:53 15952 [Note] WSREP: Member 1.0 (server01) requested state transfer from 'any'. Selected 0.0 (server03)(SYNCED) as donor.
2014-09-24 17:31:53 15952 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 102139)
2014-09-24 17:31:53 15952 [Note] WSREP: Requesting state transfer: success, donor: 0
WSREP_SST: [INFO] Proceeding with SST (20140924 17:31:54.298)
WSREP_SST: [INFO] Cleaning the existing datadir (20140924 17:31:54.302)
removed ‘/var/lib/mysql/gvwstate.dat’
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:31:54.319)
2014-09-24 17:31:55 15952 [Note] WSREP: (ab9c8371, 'tcp://0.0.0.0:4567') turning message relay requesting off
2014-09-24 17:33:42 15952 [Note] WSREP: 0.0 (server03): State transfer to 1.0 (server01) complete.
2014-09-24 17:33:42 15952 [Note] WSREP: Member 0.0 (server03) synced with group.
WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql/ (20140924 17:33:42.257)
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20140924 17:33:42.264)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20140924 17:33:50.769)
WSREP_SST: [INFO] Removing the sst_in_progress file (20140924 17:33:50.775)
2014-09-24 17:33:50 15952 [Note] WSREP: SST complete, seqno: 102137
2014-09-24 17:33:50 15952 [Note] Plugin 'FEDERATED' is disabled.
2014-09-24 17:33:50 15952 [Note] InnoDB: Using atomics to ref count buffer pool pages
2014-09-24 17:33:50 15952 [Note] InnoDB: The InnoDB memory heap is disabled
2014-09-24 17:33:50 15952 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2014-09-24 17:33:50 15952 [Note] InnoDB: Memory barrier is not used
2014-09-24 17:33:50 15952 [Note] InnoDB: Compressed tables use zlib 1.2.3.4
2014-09-24 17:33:50 15952 [Note] InnoDB: Using Linux native AIO
2014-09-24 17:33:50 15952 [Note] InnoDB: Not using CPU crc32 instructions
2014-09-24 17:33:50 15952 [Note] InnoDB: Initializing buffer pool, size = 908.0M
2014-09-24 17:33:50 15952 [Note] InnoDB: Completed initialization of buffer pool
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18311 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18320 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18312 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18319 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18318 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18317 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18315 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18314 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18316 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Warning] InnoDB: Setting thread 18313 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:50 15952 [Note] InnoDB: Highest supported file format is Barracuda.
2014-09-24 17:33:51 15952 [Note] InnoDB: 128 rollback segment(s) are active.
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18325 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:51 15952 [Note] InnoDB: Waiting for purge to start
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18326 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18327 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:51 15952 [Warning] InnoDB: Setting thread 18328 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:33:51 15952 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.20-68.0 started; log sequence number 767569480
2014-09-24 17:33:51 15952 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: f306c24e-4410-11e4-a863-bc764e08d443.
2014-09-24 17:33:51 15952 [Note] RSA private key file not found: /var/lib/mysql//private_key.pem. Some authentication plugins will not work.
2014-09-24 17:33:51 15952 [Note] RSA public key file not found: /var/lib/mysql//public_key.pem. Some authentication plugins will not work.
2014-09-24 17:33:51 15952 [Note] Server hostname (bind-address): '*'; port: 3306
2014-09-24 17:33:51 15952 [Note] IPv6 is available.
2014-09-24 17:33:51 15952 [Note] - '::' resolves to '::';
2014-09-24 17:33:51 15952 [Note] Server socket created on IP: '::'.
2014-09-24 17:33:51 15952 [Warning] 'proxies_priv' entry '@ root@server04a' ignored in --skip-name-resolve mode.
2014-09-24 17:33:51 15952 [Note] Event Scheduler: Loaded 0 events
2014-09-24 17:33:51 15952 [Note] WSREP: Signalling provider to continue.
2014-09-24 17:33:51 15952 [Note] WSREP: inited wsrep sidno 1
2014-09-24 17:33:51 15952 [Note] WSREP: SST received: 01ec4764-3fea-11e4-887f-da10e62749e4:102137
2014-09-24 17:33:51 15952 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.20-68.0-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release 25.7, wsrep_25.7.r4126
2014-09-24 17:33:51 15952 [Note] WSREP: 1.0 (server01): State transfer from 0.0 (server03) complete.
2014-09-24 17:33:51 15952 [Note] WSREP: Shifting JOINER -> JOINED (TO: 102139)
2014-09-24 17:33:51 15952 [Note] WSREP: Member 1.0 (server01) synced with group.
2014-09-24 17:33:51 15952 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 102139)

mysql> show global status like '%wsrep%';
+------------------------------+---------------------------------------------------------+
| Variable_name | Value |
+------------------------------+---------------------------------------------------------+
| wsrep_local_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 102137 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 298 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 2 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.333333 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_cached_downto | 18446744073709551615 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 1 |
| wsrep_local_state_comment | Joining |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_incoming_addresses | 10.177.192.227:3306,10.177.194.12:3306,10.179.2.66:3306 |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_cluster_conf_id | 160 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy [email protected] |
| wsrep_provider_version | 3.7(r7f44a18) |
| wsrep_ready | OFF |
+------------------------------+---------------------------------------------------------+
52 rows in set (0.00 sec)

Running any query on the "joining" node returns "Unknown command", so it is not actually synced (and wsrep_ready is OFF).
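For reference, `wsrep_local_state` is a numeric code (1 = Joining, 2 = Donor/Desynced, 3 = Joined, 4 = Synced), and a node should only be treated as usable when the state is 4 *and* `wsrep_ready` is ON. A minimal monitoring sketch of that check (the function name and dict-based input are ours, e.g. parsed from `SHOW GLOBAL STATUS LIKE '%wsrep%'`):

```python
# wsrep_local_state codes per the Galera documentation.
WSREP_STATES = {1: "Joining", 2: "Donor/Desynced", 3: "Joined", 4: "Synced"}

def node_usable(status):
    """Return True only when the node is fully synced and accepting queries.

    `status` is a dict of wsrep status variables as strings, as they come
    back from SHOW GLOBAL STATUS LIKE '%wsrep%'.
    """
    return (status.get("wsrep_local_state") == "4"
            and status.get("wsrep_ready") == "ON")

# The stuck node above reports state 1 (Joining) with wsrep_ready OFF:
joining = {"wsrep_local_state": "1", "wsrep_ready": "OFF"}
synced = {"wsrep_local_state": "4", "wsrep_ready": "ON"}
print(node_usable(joining), node_usable(synced))  # False True
```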

But if there are writes on the cluster while the node joins (writing to the node that is neither the donor nor the joiner), it syncs correctly:

140924 17:43:51 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140924 17:43:51 mysqld_safe Skipping wsrep-recover for empty datadir: /var/lib/mysql
140924 17:43:51 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
2014-09-24 17:43:51 0 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
2014-09-24 17:43:51 27304 [Note] WSREP: Read nil XID from storage engines, skipping position init
2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_load(): Galera 3.7(r7f44a18) by Codership Oy [email protected] loaded successfully.
2014-09-24 17:43:51 27304 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2014-09-24 17:43:51 27304 [Warning] WSREP: Could not open saved state file for reading: /var/lib/mysql//grastate.dat
2014-09-24 17:43:51 27304 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:43:51 27304 [Note] WSREP: Passing config to GCS: base_host = 10.177.194.12; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.w
2014-09-24 17:43:51 27304 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:43:51 27304 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2014-09-24 17:43:51 27304 [Note] WSREP: wsrep_sst_grab()
2014-09-24 17:43:51 27304 [Note] WSREP: Start replication
2014-09-24 17:43:51 27304 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:43:51 27304 [Note] WSREP: protonet asio version 0
2014-09-24 17:43:51 27304 [Note] WSREP: Using CRC-32C for message checksums.
2014-09-24 17:43:51 27304 [Note] WSREP: backend: asio
2014-09-24 17:43:51 27304 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
2014-09-24 17:43:51 27304 [Note] WSREP: restore pc from disk failed
2014-09-24 17:43:51 27304 [Note] WSREP: GMCast version 0
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2014-09-24 17:43:51 27304 [Note] WSREP: EVS version 0
2014-09-24 17:43:51 27304 [Note] WSREP: PC version 0
2014-09-24 17:43:51 27304 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.177.192.227:4567,10.179.2.66:4567,10.177.194.12:4567'
2014-09-24 17:43:51 27304 [Warning] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' points to own listening address, blacklisting
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping
2014-09-24 17:43:51 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') address 'tcp://10.177.194.12:4567' pointing to uuid 58aa13d7 is blacklisted, skipping
2014-09-24 17:43:51 27304 [Note] WSREP: declaring 5fea8f01 at tcp://10.177.192.227:4567 stable
2014-09-24 17:43:51 27304 [Note] WSREP: declaring ce5494a7 at tcp://10.179.2.66:4567 stable
2014-09-24 17:43:51 27304 [Note] WSREP: Node 5fea8f01 state prim
2014-09-24 17:43:51 27304 [Note] WSREP: save pc into disk
2014-09-24 17:43:52 27304 [Note] WSREP: gcomm: connected
2014-09-24 17:43:52 27304 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2014-09-24 17:43:52 27304 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2014-09-24 17:43:52 27304 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2014-09-24 17:43:52 27304 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2014-09-24 17:43:52 27304 [Note] WSREP: Waiting for SST to complete.
2014-09-24 17:43:52 27304 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 58f731ba-4412-11e4-970e-0ff58de80f93
2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: sent state msg: 58f731ba-4412-11e4-970e-0ff58de80f93
2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 0 (server01)
2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 1 (server03)
2014-09-24 17:43:52 27304 [Note] WSREP: STATE EXCHANGE: got state msg: 58f731ba-4412-11e4-970e-0ff58de80f93 from 2 (server06)
2014-09-24 17:43:52 27304 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 161,
members = 2/3 (joined/total),
act_id = 104692,
last_appl. = -1,
protocols = 0/6/3 (gcs/repl/appl),
group UUID = 01ec4764-3fea-11e4-887f-da10e62749e4
2014-09-24 17:43:52 27304 [Note] WSREP: Flow-control interval: [28, 28]
2014-09-24 17:43:52 27304 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 104692)
2014-09-24 17:43:52 27304 [Note] WSREP: State transfer required:
Group state: 01ec4764-3fea-11e4-887f-da10e62749e4:104692
Local state: 00000000-0000-0000-0000-000000000000:-1
2014-09-24 17:43:52 27304 [Note] WSREP: New cluster view: global state: 01ec4764-3fea-11e4-887f-da10e62749e4:104692, view# 162: Primary, number of nodes: 3, my index: 0, protocol version 3
2014-09-24 17:43:52 27304 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-09-24 17:43:52 27304 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.177.194.12' --auth 'root:password' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '27304' '' '
WSREP_SST: [INFO] Streaming with xbstream (20140924 17:43:52.285)
WSREP_SST: [INFO] Using socat as streamer (20140924 17:43:52.288)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:43:52.648)
2014-09-24 17:43:52 27304 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.177.194.12:4444/xtrabackup_sst
2014-09-24 17:43:52 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-09-24 17:43:52 27304 [Note] WSREP: REPL Protocols: 6 (3, 2)
2014-09-24 17:43:52 27304 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:43:52 27304 [Note] WSREP: Assign initial position for certification: 104692, protocol version: 3
2014-09-24 17:43:52 27304 [Note] WSREP: Service thread queue flushed.
2014-09-24 17:43:52 27304 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (01ec4764-3fea-11e4-887f-da10e62749e4): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():455. IST will be unavailable.
2014-09-24 17:43:52 27304 [Note] WSREP: Member 0.0 (server01) requested state transfer from 'any'. Selected 1.0 (server03)(SYNCED) as donor.
2014-09-24 17:43:52 27304 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 104781)
2014-09-24 17:43:52 27304 [Note] WSREP: Requesting state transfer: success, donor: 1
WSREP_SST: [INFO] Proceeding with SST (20140924 17:43:53.812)
WSREP_SST: [INFO] Cleaning the existing datadir (20140924 17:43:53.818)
removed ‘/var/lib/mysql/gvwstate.dat’
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140924 17:43:53.844)
2014-09-24 17:43:55 27304 [Note] WSREP: (58aa13d7, 'tcp://0.0.0.0:4567') turning message relay requesting off
2014-09-24 17:45:43 27304 [Note] WSREP: 1.0 (server03): State transfer to 0.0 (server01) complete.
2014-09-24 17:45:43 27304 [Note] WSREP: Member 1.0 (server03) synced with group.
WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql/ (20140924 17:45:43.511)
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20140924 17:45:43.520)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20140924 17:45:55.201)
WSREP_SST: [INFO] Removing the sst_in_progress file (20140924 17:45:55.209)
2014-09-24 17:45:55 27304 [Note] WSREP: SST complete, seqno: 113877
2014-09-24 17:45:55 27304 [Note] Plugin 'FEDERATED' is disabled.
2014-09-24 17:45:55 27304 [Note] InnoDB: Using atomics to ref count buffer pool pages
2014-09-24 17:45:55 27304 [Note] InnoDB: The InnoDB memory heap is disabled
2014-09-24 17:45:55 27304 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2014-09-24 17:45:55 27304 [Note] InnoDB: Memory barrier is not used
2014-09-24 17:45:55 27304 [Note] InnoDB: Compressed tables use zlib 1.2.3.4
2014-09-24 17:45:55 27304 [Note] InnoDB: Using Linux native AIO
2014-09-24 17:45:55 27304 [Note] InnoDB: Not using CPU crc32 instructions
2014-09-24 17:45:55 27304 [Note] InnoDB: Initializing buffer pool, size = 908.0M
2014-09-24 17:45:55 27304 [Note] InnoDB: Completed initialization of buffer pool
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29744 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29746 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29752 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29745 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29753 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29751 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29750 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29748 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29747 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29749 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Note] InnoDB: Highest supported file format is Barracuda.
2014-09-24 17:45:55 27304 [Note] InnoDB: 128 rollback segment(s) are active.
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29759 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Note] InnoDB: Waiting for purge to start
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29760 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29761 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Warning] InnoDB: Setting thread 29762 nice to 0 failed, current nice 10, errno 13
2014-09-24 17:45:55 27304 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.20-68.0 started; log sequence number 784149207
2014-09-24 17:45:55 27304 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: a2d4c7d5-4412-11e4-a92d-bc764e08d443.
2014-09-24 17:45:55 27304 [Note] RSA private key file not found: /var/lib/mysql//private_key.pem. Some authentication plugins will not work.
2014-09-24 17:45:55 27304 [Note] RSA public key file not found: /var/lib/mysql//public_key.pem. Some authentication plugins will not work.
2014-09-24 17:45:55 27304 [Note] Server hostname (bind-address): '*'; port: 3306
2014-09-24 17:45:55 27304 [Note] IPv6 is available.
2014-09-24 17:45:55 27304 [Note] - '::' resolves to '::';
2014-09-24 17:45:55 27304 [Note] Server socket created on IP: '::'.
2014-09-24 17:45:55 27304 [Warning] 'proxies_priv' entry '@ root@server04a' ignored in --skip-name-resolve mode.
2014-09-24 17:45:56 27304 [Note] Event Scheduler: Loaded 0 events
2014-09-24 17:45:56 27304 [Note] WSREP: Signalling provider to continue.
2014-09-24 17:45:56 27304 [Note] WSREP: inited wsrep sidno 1
2014-09-24 17:45:56 27304 [Note] WSREP: SST received: 01ec4764-3fea-11e4-887f-da10e62749e4:113877
2014-09-24 17:45:56 27304 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.20-68.0-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release 25.7, wsrep_25.7.r4126
2014-09-24 17:45:56 27304 [Note] WSREP: 0.0 (server01): State transfer from 1.0 (server03) complete.
2014-09-24 17:45:56 27304 [Note] WSREP: Shifting JOINER -> JOINED (TO: 115797)
2014-09-24 17:45:57 27304 [Note] WSREP: Member 0.0 (server01) synced with group.
2014-09-24 17:45:57 27304 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 115972)
2014-09-24 17:45:57 27304 [Note] WSREP: Synchronized with group, ready for connections
2014-09-24 17:45:57 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

mysql> show global status like '%wsrep%';
+------------------------------+---------------------------------------------------------+
| Variable_name | Value |
+------------------------------+---------------------------------------------------------+
| wsrep_local_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 117348 |
| wsrep_replicated | 14 |
| wsrep_replicated_bytes | 3814 |
| wsrep_repl_keys | 14 |
| wsrep_repl_keys_bytes | 434 |
| wsrep_repl_data_bytes | 2484 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 12674 |
| wsrep_received_bytes | 16708102 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 2 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.008696 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 11118 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 4917.729604 |
| wsrep_local_cached_downto | 104693 |
| wsrep_flow_control_paused_ns | 541185 |
| wsrep_flow_control_paused | 0.000004 |
| wsrep_flow_control_sent | 1 |
| wsrep_flow_control_recv | 1 |
| wsrep_cert_deps_distance | 1301.121760 |
| wsrep_apply_oooe | 0.651109 |
| wsrep_apply_oool | 0.005186 |
| wsrep_apply_window | 2.845866 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 2.267070 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 143 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.394991 |
| wsrep_incoming_addresses | 10.177.194.12:3306,10.177.192.227:3306,10.179.2.66:3306 |
| wsrep_evs_repl_latency | 0.000593955/0.00157661/0.00837575/0.00108117/116 |
| wsrep_cluster_conf_id | 162 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 01ec4764-3fea-11e4-887f-da10e62749e4 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy [email protected] |
| wsrep_provider_version | 3.7(r7f44a18) |
| wsrep_ready | ON |
+------------------------------+---------------------------------------------------------+
52 rows in set (0.00 sec)

@freddielunchbird
Author

I think it has something to do with xtrabackup 2.2.4; with xtrabackup 2.2.3 I don't experience any problems.
Filed a bug report here:
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1373796

@dirtysalt
Contributor

My intuition is that this may be a Galera bug.

Although the joiner log says it is in the SYNCED state, it actually isn't. There are several layers in Galera (the GCOMM, GCS, and REPLICATOR levels). That message means the node is SYNCED at the GCS level, but it is not SYNCED at the REPLICATOR level, because the status shows

| wsrep_local_state_comment | Joining |
| wsrep_ready | OFF |

And normally, at the tail of the log, there should be

2014-09-24 17:45:57 27304 [Note] WSREP: Synchronized with group, ready for connections
2014-09-24 17:45:57 27304 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

which is the real evidence that the node is SYNCED at the REPLICATOR level. However, those messages cannot be found in the joiner log.

Can you reproduce this bug every time when using xtrabackup 2.2.4?

@castro1688

I'm having a very similar issue with xtrabackup 2.2.5 and 2.2.4. I'm setting wsrep_sst_method to rsync as a workaround. It seems like this bug isn't getting much attention here or in Launchpad. Rolling back xtrabackup to 2.2.3 has also corrected the issue.

@castro1688

I've added this bug to xtrabackup in launchpad: https://bugs.launchpad.net/percona-xtrabackup/+bug/1378138

@freddielunchbird
Author

@castro1688 : Great! It looks like they are starting to work on the bug reports now so let's see what happens.

@dirtysalt : It is reproducible almost every time: if you stop a node, remove that node's data directory, and start the node (JOINER), this is very likely to happen.

@dbadapt

dbadapt commented Oct 16, 2014

More information here: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1373796/comments/10

The problem may be in gcs/src/gcs.cpp: gcs_recv_thread(void *arg)

I believe the wsrep 'synced' callback is not being sent back to the application.

@joffrey92

Hi,

I had the issue with xtrabackup 2.2.6 debian today.
http://pastebin.com/G3hGbcLx

Thanks,
Joffrey

@ronin13

ronin13 commented Nov 26, 2014

@joffrey92 Thanks for the backtrace. If you still have the coredump, can you provide 'thread apply all bt full' from it? (and also install debuginfo packages for mysqld if possible).

@ronin13

ronin13 commented Dec 24, 2014

Is #133 related to this issue? I.e., are the responsible race conditions the same ones?

@mike-sirs

Hello,
I have:
xtrabackup version 2.2.8 based on MySQL server 5.6.22 Linux (x86_64) (revision id: )

The problem is still present:
2015-02-10 20:25:57 8459 [Note] WSREP: 2.0 (pnode1): State transfer from 0.0 (pnode0) complete.
2015-02-10 20:25:57 8459 [Note] WSREP: Shifting JOINER -> JOINED (TO: 124879)
2015-02-10 20:25:57 8459 [Note] WSREP: Member 2.0 (pnode1) synced with group.
2015-02-10 20:25:57 8459 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 124879)
2015-02-10 20:25:57 8459 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.21-70.1-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release rel70.1, Revision 938, WSREP version 25.8, wsrep_25.8.r4150

show status like 'wsrep%';
wsrep_local_state_comment | Joining

@castro1688

Well, this still exists. I can reproduce it with xtrabackup 2.2.10.

@dbadapt

dbadapt commented Apr 22, 2015

Hi Castro, this bug is my 'big one that gets away'. I am thinking of naming it something like "Ol' Oscar" or Goliath... Seriously, I am still unsure whether it's an issue with the Galera engine or with the WSREP interface on the mysqld side. I have seen it and been able to reproduce it, but it quickly disappears when one of the components is slightly changed.

You can always tell this bug by the missing log entry from the WSREP side after SST:

... [Note] WSREP: Synchronized with group, ready for connections

This final message is sent from Galera via the callback to sql/wsrep_mysqld.cc:wsrep_synced_cb, and for some unknown reason it sometimes doesn't happen. Perhaps the thread is being clobbered, or there is a conditional hiccup that we are not seeing (I've looked).

My belief is that it is triggered by a very subtle I/O-bound race condition between components (either fast or slow), and that changing one component changes the dynamic ever so slightly that the bug disappears. I have seen it on VMs and have unsuccessfully tried to reproduce it on bare metal. I have also tried changing VM processor counts and mysqld parameters, without success.
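Since the missing log entry is the one reliable symptom, a quick log check can classify a joiner without a debugger. A minimal sketch (the helper name and log handling are mine, not part of any PXC/Galera tooling):

```shell
# check_wsrep_sync LOGFILE
# Classifies a joiner's sync state from its mysqld error log:
# the GCS level logs "Shifting JOINED -> SYNCED", but only the
# REPLICATOR level logs "Synchronized with group, ready for connections".
check_wsrep_sync() {
    if grep -q 'Synchronized with group, ready for connections' "$1"; then
        echo "replicator-level SYNCED"
    elif grep -q 'Shifting JOINED -> SYNCED' "$1"; then
        echo "stuck: GCS-level SYNCED only"
    else
        echo "not synced at any level"
    fi
}
```

On an affected joiner this prints "stuck: GCS-level SYNCED only"; on a healthy one it prints "replicator-level SYNCED".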

@castro1688

So I've really been digging today. In addition to the missing "Synchronized with group" message, I've also noticed that the full "heading" is missing when you log into mysql.

I'm starting to think the problem could be a result of the donor being a node that was started with bootstrap-pxc and never restarted. I pointed wsrep_sst_donor to my other two "working" nodes: one started with bootstrap-pxc, the other with a normal start. I could reproduce the problem only with the bootstrap-pxc node. Thoughts?

Broken - Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 34
Server version: 5.6.22-72.0-56-log

Working - Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 34
Server version: 5.6.22-72.0-56-log Percona XtraDB Cluster (GPL), Release rel72.0, Revision 978, WSREP version 25.8, wsrep_25.8.r4150

@castro1688

David, one last question: do you have a workaround? I downgraded xtrabackup to 2.2.3 to get past this with PXC 5.6.20. Now I'm trying to upgrade to 5.6.22, and it requires xtrabackup > 2.2.5.

@soluwalana

I'm having a similar issue when attempting to deploy the cluster to SSD. Is there any known workaround for this issue?

@samos123

I'm using 2.1.8-1 on Ubuntu 14.04. Any workaround?

@luksch

luksch commented Sep 21, 2015

Now this bug has hit me too, on a 3-node cluster that was running fine before. After a server crash I can't get the node to join properly; wsrep_local_state_comment stays on "Joined" forever.

@ayurchen
Member

  1. Is the cluster idle or running a load?
  2. Is the comment "Joined" or "Joining"?

@luksch

luksch commented Sep 21, 2015

It is a production node under quite some load (about 50 MB of writes per minute). Most writes are supposed to go to this node, but for now the two remaining, working nodes cover it.

@tomashejatko

Hi, in my setup this is always reproducible: I'm not able to get the third member back into the cluster.
I'm running Galera inside a Docker container on a VM, which is on SSD.

My versions are :

innobackupex version 2.3.3 Linux (x86_64) (revision id: 525ca7d)
Server version: 5.6.26-74.0-56 Percona XtraDB Cluster (GPL), Release rel74.0, Revision 624ef81, WSREP version 25.12, wsrep_25.12

@ayurchen
Member

Whoever is experiencing this problem with PXC or MariaDB, could you try with Codership's binaries: http://galeracluster.com/downloads/

@MariaDBJones

I am also experiencing the very same problem at a client's, using rsync as the SST method, so it does not seem to be xtrabackup-related. I will try to provide you with some logs ASAP.

In my case the cluster is stalled and not responding to write operations, despite node1 stating it is synced. It still serves reads.

Versions are: MariaDB 5.5.41, Galera provider version 3.5 (rXXXX), from RHEL packages on OpenStack.

In the logs the last line states JOINING -> JOINED, but wsrep_local_state_comment is Joining. What we also saw with netstat -an is this line:
tcp 90002 0 local-ip:4567 ist_node_ip ESTABLISHED

as if the MariaDB node were unable to read anything from the socket.
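That growing Recv-Q is easy to watch for. A small sketch (the function name is mine) that scans `netstat -ant` output for group-communication sockets with unread bytes:

```shell
# flag_unread_4567 < netstat-output
# Prints any connection on the Galera group-communication port (4567)
# whose kernel receive queue (column 2 of `netstat -ant`) is non-empty,
# i.e. the local process has stopped draining the socket.
flag_unread_4567() {
    awk '$4 ~ /:4567$/ && $2 > 0 { print "unread bytes on " $4 ": " $2 }'
}
```

Run as `netstat -ant | flag_unread_4567`; any output on a node that is supposedly synced is suspicious.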

@ethaniel

I am done with MariaDB and its cluster problems. I successfully moved to Percona Cluster and haven't regretted it even once. Setup was super easy, and cluster commits work WAY faster. Same Galera engine.

@rasschaert

@ethaniel, the reporter of this very issue was also running Percona XtraDB Cluster, so I'm not sure how you think you're impervious to this bug. You even said yourself it's the same Galera engine, which is where this bug (still) is.

But hey if switching from MariaDB to PXC helps you sleep at night, good for you.

Please post your snark on Twitter instead of in Github issues next time.

janlindstrom pushed a commit to MariaDB/galera that referenced this issue Oct 13, 2018
@fraenki

fraenki commented May 21, 2021

Ugh, this is a really old bug report. I'm seeing the same issue the original reporter described back in 2014, but on mysqld-wsrep 5.7-5.7.33-25.25. Is anyone else still experiencing this, and has anyone found a workaround?

The debug log output is interesting:

2021-05-21T19:00:07.370469+01:00 0 [Note] WSREP: SST complete, seqno: 40917804340
...
2021-05-21T19:00:48.163817+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/replicator_str.cpp:sst_received():38: SST received: ac673ca1-97f8-11e9-a0b3-5b765e435c14:40917804340
2021-05-21T19:00:48.163936+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/saved_state.cpp:set():187: Not writing state: unsafe counter is 1
2021-05-21T19:00:48.165466+01:00 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.33-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server - (GPL), wsrep_25.25
2021-05-21T19:00:48.165469+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/replicator_str.cpp:request_state_transfer():719: Installed new state: ac673ca1-97f8-11e9-a0b3-5b765e435c14:40917804340
2021-05-21T19:00:48.165516+01:00 3 [Note] WSREP: SYSTEM_THREAD_COMPRESS_GTID_TABLE declared as non wsrep_on
2021-05-21T19:00:48.165541+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/saved_state.cpp:set():187: Not writing state: unsafe counter is 1
2021-05-21T19:00:48.167212+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():801: 0.0 (galera6): State transfer from 1.0 (galera5) complete.
2021-05-21T19:00:48.167247+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_handle_state_change():1051: Got 'JOIN' dated 40917804340
2021-05-21T19:00:48.167260+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_fc_cont_end():524: SKIPPED FC_CONT sending: stop_sent = 0
2021-05-21T19:00:48.167271+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_shift_state():635: Shifting JOINER -> JOINED (TO: 40920891251)
2021-05-21T19:00:48.167281+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_become_joined():806: Become joined, FC offset 13917607
2021-05-21T19:00:48.167291+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_send_sync_end():568: SENDING SYNC
2021-05-21T19:00:48.168191+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs_group.cpp:gcs_group_handle_sync_msg():847: Member 0.0 (galera6) synced with group.
2021-05-21T19:00:48.168225+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_handle_state_change():1051: Got 'SYNC' dated 0
2021-05-21T19:00:48.168239+01:00 0 [Note] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_shift_state():635: Shifting JOINED -> SYNCED (TO: 40920891253)
2021-05-21T19:00:48.168252+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcs/src/gcs.cpp:gcs_become_synced():826: Become synced, FC offset 13917556
2021-05-21T19:01:09.711228+01:00 1 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/galera/src/certification.cpp:append_trx():1068: seqno gap, position: 40907017112 trx seqno 40917804341
2021-05-21T19:19:01.143739+01:00 0 [Note] [Debug] WSREP: /home/galera/rpmbuild/BUILD/galera-3-25.3.33/gcomm/src/evs_proto.cpp:send_user():1514: adjusted seq range to: 1

It looks like immediately after SST is complete, the WSREP provider shifts state from JOINER to JOINED and then to SYNCED. But apparently the node is NOT synced: wsrep_local_state_comment still says "Joining". And wsrep_local_recv_queue is always increasing and never decreasing.

The message "ready for connections" is also not true, because mysqld will still deny any query:
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

So the node is stuck in this state forever.

@temeo Could you possibly have a look? Thank you :)
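For anyone triaging this, the stuck state can be confirmed from SQL alone, since (in my experience) the server still answers SHOW statements even while denying application queries with error 1047. A minimal check:

```sql
-- A stuck joiner shows wsrep_local_state 1/Joining and wsrep_ready OFF
-- even though its log already claimed SYNCED; a healthy node returns
-- 4/Synced/ON and a recv queue that drains back to ~0.
SHOW GLOBAL STATUS WHERE Variable_name IN
    ('wsrep_local_state', 'wsrep_local_state_comment',
     'wsrep_local_recv_queue', 'wsrep_ready');
```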

@DanielWeeber

DanielWeeber commented Nov 24, 2021

Same problem here with 10.6.5 (Galera 4, 26.4.9), except that wsrep_local_recv_queue stays empty.

@fraenki

fraenki commented Nov 25, 2021

So the node is stuck in this state forever.

To follow up on my own report: I was able to fix this issue. In my case it was triggered by some super-aggressive MySQL tuning I had applied to that node. I couldn't nail it down to a specific setting (because that would be too time-consuming), but reverting to a MySQL config closer to the defaults solved the issue for me.

I'm not saying that there is no bug, but if there is one, it's probably only triggered by non-default configurations. So it's probably worth trying again with a different MySQL/MariaDB config.

@tchernomax

In our test environment, the problem was triggered by wsrep_provider_options="gcache.size=2G" (on hosts with 4 GiB of RAM).
Removing the setting (falling back to the 128M default) fixed the problem.
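For anyone checking their own config, the setting in question lives in wsrep_provider_options. A minimal my.cnf sketch (the values are illustrative, not a recommendation from this thread):

```ini
[mysqld]
# Keep gcache.size at or near its 128M default unless you need a larger
# IST window; oversizing it relative to available RAM triggered the hang
# described above on 4 GiB hosts.
wsrep_provider_options = "gcache.size=128M"
```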
