Bug 1824847
| Summary: | network is getting partitioned in new osp16.1/rhel8.2 deployments | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Attila Fazekas <afazekas> |
| Component: | openstack-tripleo | Assignee: | James Slagle <jslagle> |
| Status: | CLOSED DUPLICATE | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 16.0 (Train) | CC: | dalvarez, dciabrin, lmiccini, mburns, michele |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-17 12:38:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Attila Fazekas
2020-04-16 14:39:00 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1821185 may affect the system; however, the segfault messages are not in the messages log.

TL;DR: the network gets partitioned 30 minutes after the initial services like pacemaker, galera and rabbit have been started.

Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] link: host: 3 link: 0 is down
Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] link: host: 1 link: 0 is down
Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 15 14:37:31 [38220] controller-1 corosync warning [KNET ] host: host: 3 has no active links
Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 15 14:37:31 [38220] controller-1 corosync warning [KNET ] host: host: 1 has no active links
Apr 15 14:37:32 [38220] controller-1 corosync notice [TOTEM ] Token has not been received in 1144 ms
Apr 15 14:37:32 [38220] controller-1 corosync notice [TOTEM ] A processor failed, forming new configuration.
Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] rx: host: 1 link: 0 is up
Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] rx: host: 3 link: 0 is up
Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 15 14:37:34 [38220] controller-1 corosync notice [TOTEM ] A new membership (2.11) was formed. Members left: 1 3
Apr 15 14:37:34 [38220] controller-1 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1 3
Apr 15 14:37:34 [38220] controller-1 corosync warning [CPG ] downlist left_list: 2 received
Apr 15 14:37:34 [38220] controller-1 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.

At this time, pacemaker lost quorum and stopped the resources it manages (like galera).
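(For reference, the membership flapping above can be cross-checked directly on a controller; a minimal sketch, assuming root access and the stock corosync/pacemaker CLI tools present on OSP 16 controllers, and noting that the corosync log path may differ per deployment:)

corosync-cfgtool -s        # knet link status towards the other controllers
corosync-quorumtool -s     # current quorum state and membership
pcs status --full          # pacemaker's view of the cluster and of the resources it stopped
grep -E 'TOTEM|KNET|QUORUM' /var/log/cluster/corosync.log | tail -n 50   # recent membership churn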
The network disruption seems to keep happening for the next 20 minutes:

Apr 15 14:37:50 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:37:50 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:39:01 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:39:01 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:39:03 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:39:09 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:39:09 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:44:08 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:44:08 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:44:10 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:44:10 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:46:39 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:46:39 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:46:48 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:46:48 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:52:54 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:52:54 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:52:56 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:52:59 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:53:00 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:53:00 [38282] controller-2 corosync notice [QUORUM] Members[2]: 2 3
Apr 15 14:53:03 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:56:33 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:56:33 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:56:39 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:56:41 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:56:41 [38282] controller-2 corosync notice [QUORUM] Members[2]: 1 3
Apr 15 14:56:46 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3

Ultimately, it doesn't look to me like pacemaker is able to restart galera (I see no trace of a restart in mysqld.log), but the captured tarballs are not enough to figure out the pacemaker state; we need sosreports for that.

So I'd say the first thing to figure out is why the network got cut in the first place. In parallel I'll have a look at the pacemaker behaviour, but I think that belongs in a separate bz.

The reason for the network partition is probably linked to these crashes I can see in the logs:

[ospdeploy@pidone-host-1 bz1824847]$ grep 'ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV' controller-*/var/log/messages
controller-0/var/log/messages:Apr 15 14:37:39 controller-0 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-0/var/log/messages:Apr 15 14:56:34 controller-0 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-1/var/log/messages:Apr 15 14:44:08 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-1/var/log/messages:Apr 15 14:56:40 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-1/var/log/messages:Apr 15 14:56:41 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-2/var/log/messages:Apr 15 14:44:08 controller-2 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-2/var/log/messages:Apr 15 14:44:10 controller-2 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV

Likely a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1821185.

Confirmed, the linked job is using python3-rhosp-openvswitch-2.13-7.el8ost.noarch.rpm, which forces the use of OVS 2.13.

*** This bug has been marked as a duplicate of bug 1821185 ***
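(To confirm the suspected cause on an affected controller, a minimal sketch; the exact package names and the availability of systemd-coredump are assumptions and may differ on a given deployment:)

rpm -qa | grep -i openvswitch            # confirm which OVS version rhosp-openvswitch pulls in
journalctl -u ovs-vswitchd.service | grep -E 'SEGV|Started|Stopped'   # crash/restart history of ovs-vswitchd
coredumpctl list ovs-vswitchd            # only if systemd-coredump is enabled on the host
coredumpctl info ovs-vswitchd            # backtrace to attach to the OVS crash bug (1821185)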