Description of problem: $ tail -n 80 controller-0/var/log/containers/mysql/mysqld.log 2020-04-15 14:38:03 0 [Note] WSREP: Received NON-PRIMARY. 2020-04-15 14:38:03 2 [Note] WSREP: New cluster view: global state: c72cce62-7f21-11ea-9a47-fe4688d64721:15749, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version 3 2020-04-15 14:38:03 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2020-04-15 14:38:03 2189 [Warning] Aborted connection 2189 to db: 'nova' user: 'nova' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 2188 [Warning] Aborted connection 2188 to db: 'nova' user: 'nova' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 2186 [Warning] Aborted connection 2186 to db: 'nova' user: 'nova' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 1868 [Warning] Aborted connection 1868 to db: 'heat' user: 'heat' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 1864 [Warning] WSREP: Send action {(nil), 656, TORDERED} returned -107 (Transport endpoint is not connected) 2020-04-15 14:38:03 0 [Note] WSREP: Closed send monitor. 2020-04-15 14:38:03 0 [Note] WSREP: gcomm: terminating thread 2020-04-15 14:38:03 0 [Note] WSREP: gcomm: joining thread 2020-04-15 14:38:03 0 [Note] WSREP: gcomm: closing backend 2020-04-15 14:38:03 5784 [Warning] Aborted connection 5784 to db: 'cinder' user: 'cinder' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 1867 [Warning] Aborted connection 1867 to db: 'heat' user: 'heat' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 1849 [Warning] Aborted connection 1849 to db: 'heat' user: 'heat' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 1859 [Warning] Aborted connection 1859 to db: 'heat' user: 'heat' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 1864 [Warning] Aborted connection 1864 to db: 'heat' user: 'heat' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 1846 [Warning] Aborted connection 1846 to db: 'heat' user: 'heat' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 1865 [Warning] Aborted connection 1865 to db: 'heat' user: 'heat' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:03 2187 [Warning] Aborted connection 2187 to db: 'nova' user: 'nova' host: '172.17.1.89' (Got an error reading communication packets) 2020-04-15 14:38:06 0 [Note] WSREP: (cd351c32, 'tcp://172.17.1.89:4567') connection to peer b5f0871c with addr tcp://172.17.1.133:4567 timed out, no messages seen in PT3S 2020-04-15 14:38:07 0 [Note] WSREP: (cd351c32, 'tcp://172.17.1.89:4567') reconnecting to b5f0871c (tcp://172.17.1.133:4567), attempt 0 2020-04-15 14:38:08 0 [Note] WSREP: evs::proto(cd351c32, LEAVING, view_id(REG,b5f0871c,4)) suspecting node: b5f0871c 2020-04-15 14:38:08 0 [Note] WSREP: evs::proto(cd351c32, LEAVING, view_id(REG,b5f0871c,4)) suspected node without join message, declaring inactive 2020-04-15 14:38:08 0 [Note] WSREP: view(view_id(NON_PRIM,b5f0871c,4) memb { cd351c32,0 } joined { } left { } partitioned { b5f0871c,0 c72c8185,0 d4fee301,0 }) 2020-04-15 14:38:08 0 [Note] WSREP: view((empty)) 2020-04-15 14:38:08 0 [Note] WSREP: gcomm: closed 2020-04-15 14:38:08 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1 2020-04-15 14:38:08 0 [Note] WSREP: Flow-control interval: [16, 16] 2020-04-15 14:38:08 0 [Note] WSREP: Trying to continue unpaused monitor 2020-04-15 14:38:08 0 [Note] WSREP: Received NON-PRIMARY. 2020-04-15 14:38:08 0 [Note] WSREP: Received self-leave message. 2020-04-15 14:38:08 0 [Note] WSREP: Flow-control interval: [0, 0] 2020-04-15 14:38:08 0 [Note] WSREP: Trying to continue unpaused monitor 2020-04-15 14:38:08 0 [Note] WSREP: Received SELF-LEAVE. Closing connection. 2020-04-15 14:38:08 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 15749) 2020-04-15 14:38:08 0 [Note] WSREP: RECV thread exiting 0: Success 2020-04-15 14:38:08 2 [Note] WSREP: New cluster view: global state: c72cce62-7f21-11ea-9a47-fe4688d64721:15749, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3 2020-04-15 14:38:08 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2020-04-15 14:38:08 2 [Note] WSREP: New cluster view: global state: c72cce62-7f21-11ea-9a47-fe4688d64721:15749, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3 2020-04-15 14:38:08 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2020-04-15 14:38:08 0 [Note] WSREP: recv_thread() joined. 2020-04-15 14:38:08 0 [Note] WSREP: Closing replication queue. 2020-04-15 14:38:08 0 [Note] WSREP: Closing slave action queue. 2020-04-15 14:38:08 2 [Note] WSREP: applier thread exiting (code:0) 2020-04-15 14:38:10 1 [Note] WSREP: rollbacker thread exiting 2020-04-15 14:38:10 0 [Note] Event Scheduler: Purging the queue. 0 events 2020-04-15 14:38:10 0 [Note] InnoDB: FTS optimize thread exiting. 2020-04-15 14:38:10 0 [Note] WSREP: dtor state: CLOSED 2020-04-15 14:38:10 0 [Note] WSREP: mon: entered 15749 oooe fraction 0 oool fraction 0 2020-04-15 14:38:10 0 [Note] WSREP: mon: entered 15749 oooe fraction 0.0477491 oool fraction 0 2020-04-15 14:38:10 0 [Note] WSREP: mon: entered 16002 oooe fraction 0 oool fraction 6.24922e-05 2020-04-15 14:38:10 0 [Note] WSREP: cert index usage at exit 0 2020-04-15 14:38:10 0 [Note] WSREP: cert trx map usage at exit 36 2020-04-15 14:38:10 0 [Note] WSREP: deps set usage at exit 0 2020-04-15 14:38:10 0 [Note] WSREP: avg deps dist 29.5995 2020-04-15 14:38:10 0 [Note] WSREP: avg cert interval 0.0821639 2020-04-15 14:38:10 0 [Note] WSREP: cert index size 141 2020-04-15 14:38:10 0 [Note] WSREP: Service thread queue flushed. 2020-04-15 14:38:10 0 [Note] WSREP: wsdb trx map usage 0 conn query map usage 0 2020-04-15 14:38:10 0 [Note] WSREP: MemPool(LocalTrxHandle): hit ratio: 0.991443, misses: 135, in use: 0, in pool: 135 2020-04-15 14:38:10 0 [Note] WSREP: MemPool(SlaveTrxHandle): hit ratio: 0, misses: 0, in use: 0, in pool: 0 2020-04-15 14:38:10 0 [Note] WSREP: Shifting CLOSED -> DESTROYED (TO: 15749) 2020-04-15 14:38:10 0 [Note] WSREP: Flushing memory map to disk... 2020-04-15 14:38:10 0 [Note] InnoDB: Starting shutdown... 2020-04-15 14:38:10 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool 2020-04-15 14:38:10 0 [Note] InnoDB: Instance 0, restricted to 2048 pages due to innodb_buf_pool_dump_pct=25 2020-04-15 14:38:10 0 [Note] InnoDB: Buffer pool(s) dump completed at 200415 14:38:10 2020-04-15 14:38:11 0 [Note] InnoDB: Shutdown completed; log sequence number 49493499; transaction id 53623 2020-04-15 14:38:11 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" 2020-04-15 14:38:11 0 [Note] /usr/libexec/mysqld: Shutdown complete turul@redding /tmp/newwtf $ tail -n 80 controller-1/var/log/containers/mysql/mysqld.log 2020-04-15 14:37:47 0 [Note] WSREP: mon: entered 15720 oooe fraction 0 oool fraction 0 2020-04-15 14:37:47 0 [Note] WSREP: mon: entered 15720 oooe fraction 0 oool fraction 0 2020-04-15 14:37:47 0 [Note] WSREP: mon: entered 15998 oooe fraction 0 oool fraction 6.25078e-05 2020-04-15 14:37:47 0 [Note] WSREP: cert index usage at exit 0 2020-04-15 14:37:47 0 [Note] WSREP: cert trx map usage at exit 36 2020-04-15 14:37:47 0 [Note] WSREP: deps set usage at exit 0 2020-04-15 14:37:47 0 [Note] WSREP: avg deps dist 29.6013 2020-04-15 14:37:47 0 [Note] WSREP: avg cert interval 0.0821692 2020-04-15 14:37:47 0 [Note] WSREP: cert index size 141 2020-04-15 14:37:47 0 [Note] WSREP: Service thread queue flushed. 2020-04-15 14:37:47 0 [Note] WSREP: wsdb trx map usage 0 conn query map usage 0 2020-04-15 14:37:47 0 [Note] WSREP: MemPool(LocalTrxHandle): hit ratio: 0, misses: 0, in use: 0, in pool: 0 2020-04-15 14:37:47 0 [Note] WSREP: MemPool(SlaveTrxHandle): hit ratio: 0.991554, misses: 133, in use: 0, in pool: 133 2020-04-15 14:37:47 0 [Note] WSREP: Shifting CLOSED -> DESTROYED (TO: 15749) 2020-04-15 14:37:47 0 [Note] WSREP: Flushing memory map to disk... 2020-04-15 14:37:47 0 [Note] InnoDB: Starting shutdown... 2020-04-15 14:37:47 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool 2020-04-15 14:37:47 0 [Note] InnoDB: Instance 0, restricted to 2048 pages due to innodb_buf_pool_dump_pct=25 2020-04-15 14:37:47 0 [Note] InnoDB: Buffer pool(s) dump completed at 200415 14:37:47 2020-04-15 14:37:49 0 [Note] InnoDB: Shutdown completed; log sequence number 45691765; transaction id 47411 2020-04-15 14:37:49 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" 2020-04-15 14:37:49 0 [Note] /usr/libexec/mysqld: Shutdown complete 2020-04-15 14:38:02 0 [Note] WSREP: Read nil XID from storage engines, skipping position init 2020-04-15 14:38:02 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so' 2020-04-15 14:38:02 0 [Note] WSREP: wsrep_load(): Galera 3.26(rXXXX) by Codership Oy <info> loaded successfully. 2020-04-15 14:38:02 0 [Note] WSREP: CRC-32C: using hardware acceleration. 2020-04-15 14:38:02 0 [Note] WSREP: Found saved state: c72cce62-7f21-11ea-9a47-fe4688d64721:15749, safe_to_bootstrap: 0 2020-04-15 14:38:02 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = controller-1.internalapi.redhat.local; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://172.17.1.133:45 2020-04-15 14:38:02 0 [Note] WSREP: GCache history reset: c72cce62-7f21-11ea-9a47-fe4688d64721:0 -> c72cce62-7f21-11ea-9a47-fe4688d64721:15749 2020-04-15 14:38:02 0 [Note] WSREP: Assign initial position for certification: 15749, protocol version: -1 2020-04-15 14:38:02 0 [Note] WSREP: wsrep_sst_grab() 2020-04-15 14:38:02 0 [Note] WSREP: Start replication 2020-04-15 14:38:02 0 [Note] WSREP: Setting initial position to c72cce62-7f21-11ea-9a47-fe4688d64721:15749 2020-04-15 14:38:02 0 [Note] WSREP: protonet asio version 0 2020-04-15 14:38:02 0 [Note] WSREP: Using CRC-32C for message checksums. 2020-04-15 14:38:02 0 [Note] WSREP: backend: asio 2020-04-15 14:38:02 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 2020-04-15 14:38:02 0 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory) 2020-04-15 14:38:02 0 [Note] WSREP: restore pc from disk failed 2020-04-15 14:38:02 0 [Note] WSREP: GMCast version 0 2020-04-15 14:38:02 0 [Note] WSREP: (b5f0871c, 'tcp://172.17.1.133:4567') listening at tcp://172.17.1.133:4567 2020-04-15 14:38:02 0 [Note] WSREP: (b5f0871c, 'tcp://172.17.1.133:4567') multicast: , ttl: 1 2020-04-15 14:38:02 0 [Note] WSREP: EVS version 0 2020-04-15 14:38:02 0 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer 'controller-0.internalapi.redhat.local:,controller-1.internalapi.redhat.local:,controller-2.internalapi.redhat.local:' 2020-04-15 14:38:02 0 [Note] WSREP: (b5f0871c, 'tcp://172.17.1.133:4567') connection established to cd351c32 tcp://172.17.1.89:4567 2020-04-15 14:38:02 0 [Note] WSREP: (b5f0871c, 'tcp://172.17.1.133:4567') turning message relay requesting on, nonlive peers: 2020-04-15 14:38:03 0 [Note] WSREP: declaring cd351c32 at tcp://172.17.1.89:4567 stable 2020-04-15 14:38:03 0 [Note] WSREP: view(view_id(NON_PRIM,b5f0871c,4) memb { b5f0871c,0 cd351c32,0 } joined { } left { } partitioned { c72c8185,0 d4fee301,0 }) 2020-04-15 14:38:03 0 [Note] WSREP: forgetting cd351c32 (tcp://172.17.1.89:4567) 2020-04-15 14:38:03 0 [Note] WSREP: view(view_id(NON_PRIM,b5f0871c,5) memb { b5f0871c,0 } joined { } left { } partitioned { c72c8185,0 cd351c32,0 d4fee301,0 }) 2020-04-15 14:38:05 0 [Note] WSREP: (b5f0871c, 'tcp://172.17.1.133:4567') turning message relay requesting off 2020-04-15 14:38:07 0 [Note] WSREP: (b5f0871c, 'tcp://172.17.1.133:4567') connection established to cd351c32 tcp://172.17.1.89:4567 2020-04-15 14:38:07 0 [Warning] WSREP: discarding established (time wait) cd351c32 (tcp://172.17.1.89:4567) 2020-04-15 14:38:08 0 [Note] WSREP: cleaning up cd351c32 (tcp://172.17.1.89:4567) 2020-04-15 14:38:33 0 [Note] WSREP: view((empty)) 2020-04-15 14:38:33 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():158 2020-04-15 14:38:33 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out) 2020-04-15 14:38:33 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'galera_cluster' at 'gcomm://controller-0.internalapi.redhat.local,controller-1.internalapi.redhat.local,controller-2.internalapi.redhat.local': -110 (Connection timed out) 2020-04-15 14:38:33 0 [ERROR] WSREP: gcs connect failed: Connection timed out 2020-04-15 14:38:33 0 [ERROR] WSREP: wsrep::connect(gcomm://controller-0.internalapi.redhat.local,controller-1.internalapi.redhat.local,controller-2.internalapi.redhat.local) failed: 7 2020-04-15 14:38:33 0 [ERROR] Aborting turul@redding /tmp/newwtf $ tail -n 80 controller-2/var/log/containers/mysql/mysqld.log 2020-04-15 14:37:40 0 [Note] WSREP: evs::proto(c72c8185, GATHER, view_id(REG,c72c8185,3)) suspecting node: cd351c32 2020-04-15 14:37:40 0 [Note] WSREP: evs::proto(c72c8185, GATHER, view_id(REG,c72c8185,3)) suspected node without join message, declaring inactive 2020-04-15 14:37:41 0 [Note] /usr/libexec/mysqld (initiated by: unknown): Normal shutdown 2020-04-15 14:37:41 0 [Note] WSREP: Stop replication 2020-04-15 14:37:41 0 [Note] WSREP: Closing send monitor... 2020-04-15 14:37:41 0 [Note] WSREP: Closed send monitor. 2020-04-15 14:37:41 0 [Note] WSREP: gcomm: terminating thread 2020-04-15 14:37:41 0 [Note] WSREP: gcomm: joining thread 2020-04-15 14:37:41 0 [Note] WSREP: gcomm: closing backend 2020-04-15 14:37:41 0 [Note] WSREP: view(view_id(NON_PRIM,c72c8185,3) memb { c72c8185,0 } joined { } left { } partitioned { cd351c32,0 d4fee301,0 }) 2020-04-15 14:37:41 0 [Note] WSREP: forgetting d4fee301 (tcp://172.17.1.133:4567) 2020-04-15 14:37:41 0 [Note] WSREP: view(view_id(NON_PRIM,c72c8185,4) memb { c72c8185,0 } joined { } left { } partitioned { cd351c32,0 d4fee301,0 }) 2020-04-15 14:37:41 0 [Note] WSREP: view((empty)) 2020-04-15 14:37:41 0 [Note] WSREP: gcomm: closed 2020-04-15 14:37:41 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1 2020-04-15 14:37:41 0 [Note] WSREP: Flow-control interval: [16, 16] 2020-04-15 14:37:41 0 [Note] WSREP: Trying to continue unpaused monitor 2020-04-15 14:37:41 0 [Note] WSREP: Received NON-PRIMARY. 2020-04-15 14:37:41 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 15749) 2020-04-15 14:37:41 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1 2020-04-15 14:37:41 0 [Note] WSREP: Flow-control interval: [16, 16] 2020-04-15 14:37:41 0 [Note] WSREP: Trying to continue unpaused monitor 2020-04-15 14:37:41 0 [Note] WSREP: Received NON-PRIMARY. 2020-04-15 14:37:41 2 [Note] WSREP: New cluster view: global state: c72cce62-7f21-11ea-9a47-fe4688d64721:15749, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3 2020-04-15 14:37:41 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2020-04-15 14:37:41 0 [Note] WSREP: Received self-leave message. 2020-04-15 14:37:41 0 [Note] WSREP: Flow-control interval: [0, 0] 2020-04-15 14:37:41 0 [Note] WSREP: Trying to continue unpaused monitor 2020-04-15 14:37:41 0 [Note] WSREP: Received SELF-LEAVE. Closing connection. 2020-04-15 14:37:41 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 15749) 2020-04-15 14:37:41 2 [Note] WSREP: New cluster view: global state: c72cce62-7f21-11ea-9a47-fe4688d64721:15749, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3 2020-04-15 14:37:41 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2020-04-15 14:37:41 2 [Note] WSREP: New cluster view: global state: c72cce62-7f21-11ea-9a47-fe4688d64721:15749, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3 2020-04-15 14:37:41 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2020-04-15 14:37:41 0 [Note] WSREP: RECV thread exiting 0: Success 2020-04-15 14:37:41 0 [Note] WSREP: recv_thread() joined. 2020-04-15 14:37:41 0 [Note] WSREP: Closing replication queue. 2020-04-15 14:37:41 0 [Note] WSREP: Closing slave action queue. 2020-04-15 14:37:41 2 [Note] WSREP: applier thread exiting (code:0) 2020-04-15 14:37:43 1 [Note] WSREP: rollbacker thread exiting 2020-04-15 14:37:43 0 [Note] Event Scheduler: Purging the queue. 0 events 2020-04-15 14:37:43 0 [Note] InnoDB: FTS optimize thread exiting. 2020-04-15 14:37:43 0 [Note] WSREP: dtor state: CLOSED 2020-04-15 14:37:43 0 [Note] WSREP: mon: entered 15749 oooe fraction 0 oool fraction 0 2020-04-15 14:37:43 0 [Note] WSREP: mon: entered 15749 oooe fraction 0 oool fraction 0 2020-04-15 14:37:43 0 [Note] WSREP: mon: entered 16009 oooe fraction 0 oool fraction 0 2020-04-15 14:37:43 0 [Note] WSREP: cert index usage at exit 0 2020-04-15 14:37:43 0 [Note] WSREP: cert trx map usage at exit 36 2020-04-15 14:37:43 0 [Note] WSREP: deps set usage at exit 0 2020-04-15 14:37:43 0 [Note] WSREP: avg deps dist 29.5995 2020-04-15 14:37:43 0 [Note] WSREP: avg cert interval 0.0821639 2020-04-15 14:37:43 0 [Note] WSREP: cert index size 141 2020-04-15 14:37:43 0 [Note] WSREP: Service thread queue flushed. 2020-04-15 14:37:43 0 [Note] WSREP: wsdb trx map usage 0 conn query map usage 0 2020-04-15 14:37:43 0 [Note] WSREP: MemPool(LocalTrxHandle): hit ratio: 0, misses: 0, in use: 0, in pool: 0 2020-04-15 14:37:43 0 [Note] WSREP: MemPool(SlaveTrxHandle): hit ratio: 0.991555, misses: 133, in use: 0, in pool: 133 2020-04-15 14:37:43 0 [Note] WSREP: Shifting CLOSED -> DESTROYED (TO: 15749) 2020-04-15 14:37:43 0 [Note] WSREP: Flushing memory map to disk... 2020-04-15 14:37:43 0 [Note] InnoDB: Starting shutdown... 2020-04-15 14:37:43 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool 2020-04-15 14:37:43 0 [Note] InnoDB: Instance 0, restricted to 2048 pages due to innodb_buf_pool_dump_pct=25 2020-04-15 14:37:43 0 [Note] InnoDB: Buffer pool(s) dump completed at 200415 14:37:43 2020-04-15 14:37:45 0 [Note] InnoDB: Shutdown completed; log sequence number 47553255; transaction id 47427 2020-04-15 14:37:45 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" 2020-04-15 14:37:45 0 [Note] /usr/libexec/mysqld: Shutdown complete Looks like they went down after around 35 minute of the first log entry. head controller-2/var/log/containers/mysql/mysqld.log 2020-04-15 14:02:44 0 [Note] WSREP: Read nil XID from storage engines, skipping position init 2020-04-15 14:02:44 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so' 2020-04-15 14:02:44 0 [Note] WSREP: wsrep_load(): Galera 3.26(rXXXX) by Codership Oy <info> loaded successfully. 2020-04-15 14:02:44 0 [Note] WSREP: CRC-32C: using hardware acceleration. 2020-04-15 14:02:44 0 [Warning] WSREP: Could not open state file for reading: '/var/lib/mysql//grastate.dat' 2020-04-15 14:02:44 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 1 2020-04-15 14:02:44 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = controller-2.internalapi.redhat.local; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://172.17.1.138:45 2020-04-15 14:02:44 0 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 00000000-0000-0000-0000-000000000000:-1 2020-04-15 14:02:44 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1 2020-04-15 14:02:44 0 [Note] WSREP: wsrep_sst_grab() Version-Release number of selected component (if applicable): 16.1 RHOS-16.1-RHEL-8-20200414.n.0 How reproducible: Unknown In one job looks like always, however this type of issue may happen in more 3 controller job. The frequent/always job is closest to a new style reference deployment. Actual results: Test are passing, not failing on connecting to keystone. [Wed Apr 15 14:41:57.262644 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] dbapi_connection = rec.get_connection() [Wed Apr 15 14:41:57.262648 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 584, in get_connection [Wed Apr 15 14:41:57.262651 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] self.__connect() [Wed Apr 15 14:41:57.262654 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib64/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect [Wed Apr 15 14:41:57.262657 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] connection = pool._invoke_creator(self) [Wed Apr 15 14:41:57.262661 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect [Wed Apr 15 14:41:57.262663 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] return dialect.connect(*cargs, **cparams) [Wed Apr 15 14:41:57.262667 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect [Wed Apr 15 14:41:57.262669 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] return self.dbapi.connect(*cargs, **cparams) [Wed Apr 15 14:41:57.262673 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib/python3.6/site-packages/pymysql/__init__.py", line 90, in Connect [Wed Apr 15 14:41:57.262676 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] return Connection(*args, **kwargs) [Wed Apr 15 14:41:57.262680 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 699, in __init__ [Wed Apr 15 14:41:57.262682 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] self.connect() [Wed Apr 15 14:41:57.262686 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 935, in connect [Wed Apr 15 14:41:57.262688 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] self._get_server_information() [Wed Apr 15 14:41:57.262692 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1249, in _get_server_information [Wed Apr 15 14:41:57.262696 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] packet = self._read_packet() [Wed Apr 15 14:41:57.262700 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 991, in _read_packet [Wed Apr 15 14:41:57.262703 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] packet_header = self._read_bytes(4) [Wed Apr 15 14:41:57.262707 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1037, in _read_bytes [Wed Apr 15 14:41:57.262709 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] CR.CR_SERVER_LOST, "Lost connection to MySQL server during query") [Wed Apr 15 14:41:57.262738 2020] [wsgi:error] [pid 182] [remote 172.17.1.133:41778] oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
https://bugzilla.redhat.com/show_bug.cgi?id=1821185 may effect the system, however segfault messages are not in the messages log.
TL;DR: the network gets partitionned 30 minutes after the initial services like pacemaker, galera and rabbit have been started. Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] link: host: 3 link: 0 is down Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] link: host: 1 link: 0 is down Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) Apr 15 14:37:31 [38220] controller-1 corosync warning [KNET ] host: host: 3 has no active links Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) Apr 15 14:37:31 [38220] controller-1 corosync warning [KNET ] host: host: 1 has no active links Apr 15 14:37:32 [38220] controller-1 corosync notice [TOTEM ] Token has not been received in 1144 ms Apr 15 14:37:32 [38220] controller-1 corosync notice [TOTEM ] A processor failed, forming new configuration. Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] rx: host: 1 link: 0 is up Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] rx: host: 3 link: 0 is up Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1) Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] host: host: 3 (passive) best link: 0 (pri: 1) Apr 15 14:37:34 [38220] controller-1 corosync notice [TOTEM ] A new membership (2.11) was formed. Members left: 1 3 Apr 15 14:37:34 [38220] controller-1 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1 3 Apr 15 14:37:34 [38220] controller-1 corosync warning [CPG ] downlist left_list: 2 received Apr 15 14:37:34 [38220] controller-1 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services. At this time, pacemaker lost quorum and stopped the resources it manages (like galera). the network disruption seems to keep happening for the next 20 minutes Apr 15 14:37:50 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service. Apr 15 14:37:50 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3 Apr 15 14:39:01 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services. Apr 15 14:39:01 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:39:03 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:39:09 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service. Apr 15 14:39:09 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3 Apr 15 14:44:08 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services. Apr 15 14:44:08 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:44:10 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service. Apr 15 14:44:10 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3 Apr 15 14:46:39 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services. Apr 15 14:46:39 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:46:48 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service. Apr 15 14:46:48 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3 Apr 15 14:52:54 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services. Apr 15 14:52:54 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:52:56 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:52:59 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:53:00 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service. Apr 15 14:53:00 [38282] controller-2 corosync notice [QUORUM] Members[2]: 2 3 Apr 15 14:53:03 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3 Apr 15 14:56:33 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services. Apr 15 14:56:33 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:56:39 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3 Apr 15 14:56:41 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service. Apr 15 14:56:41 [38282] controller-2 corosync notice [QUORUM] Members[2]: 1 3 Apr 15 14:56:46 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3 Ultimately, it doesn't look to me that pacemaker is able to restart galera (i see no trace of restart in the mysqld.log), but the captured tarball are not enough to figure out the pacemaker state. we need sosreports for that. So I'd say the first thing to figure out is why network got cut in the first place. In parallel I'll have a look at the pacemaker behaviour, but I think it belongs to a separate bz.
The reason for the network partition is probably linked to those crashes i can see in the logs: [ospdeploy@pidone-host-1 bz1824847]$ grep 'ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV' controller-*/var/log/messages controller-0/var/log/messages:Apr 15 14:37:39 controller-0 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV controller-0/var/log/messages:Apr 15 14:56:34 controller-0 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV controller-1/var/log/messages:Apr 15 14:44:08 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV controller-1/var/log/messages:Apr 15 14:56:40 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV controller-1/var/log/messages:Apr 15 14:56:41 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV controller-2/var/log/messages:Apr 15 14:44:08 controller-2 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV controller-2/var/log/messages:Apr 15 14:44:10 controller-2 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
Likely dup of https://bugzilla.redhat.com/show_bug.cgi?id=1821185
Confirmed, the linked job is using python3-rhosp-openvswitch-2.13-7.el8ost.noarch.rpm which forces the use of OVS 2.13 *** This bug has been marked as a duplicate of bug 1821185 ***