Bug 1823178 - [OSP16.1] Openvswitch segfaults multiple times causing API failures
Summary: [OSP16.1] Openvswitch segfaults multiple times causing API failures
Keywords:
Status: CLOSED DUPLICATE of bug 1821185
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: RHOS Maint
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-12 11:41 UTC by Roman Safronov
Modified: 2020-04-19 13:13 UTC
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-16 12:16:22 UTC
Target Upstream Version:
Embargoed:



Description Roman Safronov 2020-04-12 11:41:40 UTC
Description of problem:

While working with OSP16.1 (a typical scenario: create a security group, keypair and networks, then launch an instance) I noticed that API requests sometimes fail.
When I rerun the same request it usually succeeds.
In /var/log/messages on the controller I found multiple errors indicating that the MySQL server is frequently not running:

Apr 12 11:11:52 controller-0 galera(galera)[31084]: ERROR: MySQL is not running
Apr 12 11:12:00 controller-0 galera(galera)[31211]: ERROR: MySQL is not running
Apr 12 11:12:04 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:12:05 controller-0 galera(galera)[31338]: ERROR: MySQL is not running
Apr 12 11:12:18 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:12:31 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:15:45 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:15:51 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:15:52 controller-0 galera(galera)[33428]: ERROR: MySQL is not running
Apr 12 11:15:52 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:15:53 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:15:53 controller-0 galera(galera)[33492]: ERROR: MySQL is not running
Apr 12 11:16:15 controller-0 galera(galera)[33556]: ERROR: MySQL is not running
Apr 12 11:16:23 controller-0 galera(galera)[33620]: ERROR: MySQL is not running
Apr 12 11:16:26 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:16:26 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:16:27 controller-0 galera(galera)[33684]: ERROR: MySQL is not running
Apr 12 11:16:27 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:16:40 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:16:40 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:16:40 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:20:28 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:20:29 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:20:30 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:20:30 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:20:32 controller-0 pacemaker-remoted[7]: notice: rabbitmq_stop_0:168013:stderr [ Error: unable to perform an operation on node 'rabbit@controller-0'. Please see diagnostics information and suggestions below. ]
Apr 12 11:20:35 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:20:35 controller-0 galera(galera)[35862]: ERROR: MySQL is not running
Apr 12 11:20:37 controller-0 galera(galera)[35926]: ERROR: MySQL is not running
Apr 12 11:20:50 controller-0 pacemaker-remoted[7]: notice: rabbitmq_start_0:168570:stderr [ Error: unable to perform an operation on node 'rabbit@controller-0'. Please see diagnostics information and suggestions below. ]
Apr 12 11:20:50 controller-0 pacemaker-remoted[7]: notice: rabbitmq_start_0:168570:stderr [ Error: unable to perform an operation on node 'rabbit@controller-0'. Please see diagnostics information and suggestions below. ]
Apr 12 11:20:58 controller-0 galera(galera)[35996]: ERROR: MySQL is not running
Apr 12 11:20:59 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:20:59 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:21:00 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:21:08 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:21:08 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:21:08 controller-0 pacemaker-controld[2778]: error: Failed to retrieve meta-data for ocf:ovn:ovndb-servers
Apr 12 11:21:21 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
Apr 12 11:21:22 controller-0 galera(galera)[36056]: ERROR: MySQL is not running
Apr 12 11:21:39 controller-0 pacemaker-schedulerd[2775]: warning: Unexpected result (error: local node <controller-0> is started, but not in primary mode. Unknown state.) was recorded for monitor of galera:0 on galera-bundle-0 at Apr 12 11:03:13 2020
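
(A filter along these lines surfaces the relevant entries; a rough sketch, the exact patterns may need adjusting:

  sudo grep -E 'galera|ovndb-servers|rabbitmq' /var/log/messages
)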



From mysqld.log:


2020-04-12 11:03:04 0 [ERROR] WSREP: exception from gcomm, backend must be restarted: evs::proto(9dec0f6b, GATHER, view_id(REG,5cb51706,19)) failed to form singleton view after exceeding max
_install_timeouts 3, giving up (FATAL)
         at gcomm/src/evs_proto.cpp:handle_install_timer():742
2020-04-12 11:03:04 0 [Note] WSREP: gcomm: terminating thread
2020-04-12 11:03:04 0 [Note] WSREP: gcomm: joining thread
2020-04-12 11:03:04 0 [Note] WSREP: gcomm: closing backend
2020-04-12 11:03:04 0 [Note] WSREP: Forced PC close
2020-04-12 11:03:04 0 [Warning] WSREP: discarding 18 messages from message index
2020-04-12 11:03:04 0 [Note] WSREP: gcomm: closed
2020-04-12 11:03:04 0 [Note] WSREP: Received self-leave message.
2020-04-12 11:03:04 0 [Note] WSREP: comp msg error in core 103
2020-04-12 11:03:04 0 [Note] WSREP: Closing send monitor...
2020-04-12 11:03:04 2 [Note] WSREP: New cluster view: global state: 00000000-0000-0000-0000-000000000000:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version -1
2020-04-12 11:03:04 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-04-12 11:03:04 2 [Note] WSREP: applier thread exiting (code:6)
2020-04-12 11:03:04 664 [Warning] WSREP: Send action {(nil), 1784, TORDERED} returned -103 (Software caused connection abort)
2020-04-12 11:03:04 0 [Note] WSREP: Closed send monitor.
2020-04-12 11:03:04 0 [Note] WSREP: Closing replication queue.
2020-04-12 11:03:04 0 [Note] WSREP: Closing slave action queue.
2020-04-12 11:03:04 0 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 370206)
2020-04-12 11:03:04 0 [Note] WSREP: RECV thread exiting -103: Software caused connection abort
2020-04-12 11:03:06 685 [Warning] Aborted connection 685 to db: 'cinder' user: 'cinder' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 693 [Warning] Aborted connection 693 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 692 [Warning] Aborted connection 692 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 688 [Warning] Aborted connection 688 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 687 [Warning] Aborted connection 687 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 686 [Warning] Aborted connection 686 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 678 [Warning] Aborted connection 678 to db: 'heat' user: 'heat' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 677 [Warning] Aborted connection 677 to db: 'heat' user: 'heat' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 675 [Warning] Aborted connection 675 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 679 [Warning] Aborted connection 679 to db: 'cinder' user: 'cinder' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 682 [Warning] Aborted connection 682 to db: 'cinder' user: 'cinder' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 684 [Warning] Aborted connection 684 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 667 [Warning] Aborted connection 667 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 666 [Warning] Aborted connection 666 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 665 [Warning] Aborted connection 665 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 664 [Warning] Aborted connection 664 to db: 'ovs_neutron' user: 'neutron' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:07 668 [Warning] Aborted connection 668 to db: 'nova' user: 'nova' host: '172.17.1.118' (Got an error reading communication packets)
2020-04-12 11:03:15 0 [Note] /usr/libexec/mysqld (initiated by: unknown): Normal shutdown





Version-Release number of selected component (if applicable):
16.1-RHEL-8/RHOS-16.1-RHEL-8-20200407.n.0
puppet-mysql-10.4.0-0.20200328015238.95f9b98.el8ost.noarch



How reproducible:
100%

Steps to Reproduce:
Using the openstack CLI client, create a network, a router, a security group with ICMP and SSH rules, and a keypair, then launch an instance (for example, with a command sequence like the one sketched below).
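
For reference, a minimal command sequence along these lines (a sketch; the image, flavor and subnet range are placeholders, and the external network name "nova" is taken from the floating IP command shown further down):

  # network, subnet and router
  openstack network create testnet
  openstack subnet create --network testnet --subnet-range 192.0.2.0/24 testsubnet
  openstack router create testrouter
  openstack router set --external-gateway nova testrouter
  openstack router add subnet testrouter testsubnet
  # security group with ICMP and SSH rules
  openstack security group create testsg
  openstack security group rule create --protocol icmp testsg
  openstack security group rule create --protocol tcp --dst-port 22 testsg
  # keypair and instance
  openstack keypair create --public-key ~/.ssh/id_rsa.pub testkey
  openstack server create --image <image> --flavor <flavor> --network testnet --security-group testsg --key-name testkey testvm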

Actual results:
After 2-3 API calls an error is returned; see [1] and [2] below.
Note: if I rerun the command once again, it works.



Expected results:
No unexpected errors when running CLI commands

Additional info:

[1] (overcloud) [stack@undercloud-0 roman]$ openstack security group create overcloud_sg
Unable to establish connection to http://10.0.0.111:5000/v3/auth/tokens: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

[2] (overcloud) [stack@undercloud-0 roman]$ openstack floating ip create nova
Failed to discover available identity versions when contacting http://10.0.0.111:5000. Attempting to parse version from URL.
Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. Unable to establish connection to http://10.0.0.111:5000: HTTPConnectionPool(host='10.0.0.111', port=5000): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6190957940>: Failed to establish a new connection: [Errno 111] Connection refused',))
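
A quick way to check whether the Keystone endpoint is responding at the time of the failure (a sketch, using the VIP shown in the errors above):

  curl -sS http://10.0.0.111:5000/v3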

Logs available here: http://rhos-release.virt.bos.redhat.com/log/bzosp161_neutron_no_response

Comment 1 Roman Safronov 2020-04-12 11:46:30 UTC
[root@controller-0 mysql]# podman exec -it galera-bundle-podman-0 rpm -qa | grep maria
mariadb-common-10.3.17-1.module+el8.1.0+3974+90eded84.x86_64
mariadb-connector-c-3.0.7-1.el8.x86_64
mariadb-errmsg-10.3.17-1.module+el8.1.0+3974+90eded84.x86_64
mariadb-server-utils-10.3.17-1.module+el8.1.0+3974+90eded84.x86_64
mariadb-server-galera-10.3.17-1.module+el8.1.0+3974+90eded84.x86_64
mariadb-connector-c-config-3.0.7-1.el8.noarch
mariadb-10.3.17-1.module+el8.1.0+3974+90eded84.x86_64
mariadb-server-10.3.17-1.module+el8.1.0+3974+90eded84.x86_64
mariadb-backup-10.3.17-1.module+el8.1.0+3974+90eded84.x86_64

Comment 2 Luca Miccini 2020-04-14 07:03:42 UTC
Hi Roman, FYI the logs in the sosreports do not cover the timestamps in the description, so I am not 100% sure I am looking at the right data. Here is what happens on controller-0:

Apr  9 18:45:21 controller-0 corosync[36483]:  [KNET  ] link: host: 3 link: 0 is down
Apr  9 18:45:21 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:45:21 controller-0 corosync[36483]:  [KNET  ] host: host: 3 has no active links
Apr  9 18:45:22 controller-0 corosync[36483]:  [TOTEM ] Token has not been received in 844 ms
Apr  9 18:45:23 controller-0 corosync[36483]:  [KNET  ] link: host: 2 link: 0 is down
Apr  9 18:45:23 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:45:23 controller-0 corosync[36483]:  [KNET  ] host: host: 2 has no active links
Apr  9 18:45:23 controller-0 corosync[36483]:  [TOTEM ] A processor failed, forming new configuration.
Apr  9 18:45:25 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.d) was formed. Members left: 2 3
Apr  9 18:45:25 controller-0 corosync[36483]:  [TOTEM ] Failed to receive the leave message. failed: 2 3
Apr  9 18:45:26 controller-0 corosync[36483]:  [KNET  ] rx: host: 3 link: 0 is up
Apr  9 18:45:26 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:45:26 controller-0 corosync[36483]:  [KNET  ] rx: host: 2 link: 0 is up
Apr  9 18:45:26 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:45:28 controller-0 corosync[36483]:  [KNET  ] link: host: 3 link: 0 is down
Apr  9 18:45:28 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:45:28 controller-0 corosync[36483]:  [KNET  ] host: host: 3 has no active links
Apr  9 18:45:30 controller-0 corosync[36483]:  [TOTEM ] Token has not been received in 3274 ms
Apr  9 18:45:31 controller-0 corosync[36483]:  [KNET  ] link: host: 2 link: 0 is down
Apr  9 18:45:31 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:45:31 controller-0 corosync[36483]:  [KNET  ] host: host: 2 has no active links
Apr  9 18:45:32 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.19) was formed. Members
Apr  9 18:45:32 controller-0 corosync[36483]:  [KNET  ] rx: host: 3 link: 0 is up
Apr  9 18:45:32 controller-0 corosync[36483]:  [KNET  ] rx: host: 2 link: 0 is up
Apr  9 18:45:32 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:45:32 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:45:33 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.1d) was formed. Members joined: 2
Apr  9 18:45:33 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.21) was formed. Members joined: 3
Apr  9 18:52:45 controller-0 corosync[36483]:  [KNET  ] link: host: 2 link: 0 is down
Apr  9 18:52:45 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:52:45 controller-0 corosync[36483]:  [KNET  ] host: host: 2 has no active links
Apr  9 18:52:46 controller-0 corosync[36483]:  [KNET  ] link: host: 3 link: 0 is down
Apr  9 18:52:46 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:52:46 controller-0 corosync[36483]:  [KNET  ] host: host: 3 has no active links
Apr  9 18:52:46 controller-0 corosync[36483]:  [TOTEM ] Token has not been received in 1237 ms
Apr  9 18:52:47 controller-0 corosync[36483]:  [TOTEM ] A processor failed, forming new configuration.
Apr  9 18:52:49 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.25) was formed. Members left: 2 3
Apr  9 18:52:49 controller-0 corosync[36483]:  [TOTEM ] Failed to receive the leave message. failed: 2 3
Apr  9 18:52:49 controller-0 corosync[36483]:  [KNET  ] rx: host: 3 link: 0 is up
Apr  9 18:52:49 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:52:52 controller-0 corosync[36483]:  [KNET  ] rx: host: 2 link: 0 is up
Apr  9 18:52:52 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:52:52 controller-0 corosync[36483]:  [TOTEM ] Token has not been received in 2186 ms
Apr  9 18:52:52 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.29) was formed. Members joined: 2 3
Apr  9 18:55:29 controller-0 corosync[36483]:  [KNET  ] link: host: 3 link: 0 is down
Apr  9 18:55:29 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:55:29 controller-0 corosync[36483]:  [KNET  ] host: host: 3 has no active links
Apr  9 18:55:30 controller-0 corosync[36483]:  [TOTEM ] Token has not been received in 451 ms
Apr  9 18:55:30 controller-0 corosync[36483]:  [KNET  ] link: host: 2 link: 0 is down
Apr  9 18:55:30 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:55:30 controller-0 corosync[36483]:  [KNET  ] host: host: 2 has no active links
Apr  9 18:55:30 controller-0 corosync[36483]:  [TOTEM ] A processor failed, forming new configuration.
Apr  9 18:55:33 controller-0 corosync[36483]:  [KNET  ] rx: host: 3 link: 0 is up
Apr  9 18:55:33 controller-0 corosync[36483]:  [KNET  ] rx: host: 2 link: 0 is up
Apr  9 18:55:33 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:55:33 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:55:34 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.35) was formed. Members left: 2 3
Apr  9 18:55:34 controller-0 corosync[36483]:  [TOTEM ] Failed to receive the leave message. failed: 2 3
Apr  9 18:55:34 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.39) was formed. Members joined: 2 3
Apr  9 18:56:47 controller-0 corosync[36483]:  [KNET  ] link: host: 3 link: 0 is down
Apr  9 18:56:47 controller-0 corosync[36483]:  [KNET  ] link: host: 2 link: 0 is down
Apr  9 18:56:47 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:56:47 controller-0 corosync[36483]:  [KNET  ] host: host: 3 has no active links
Apr  9 18:56:47 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:56:47 controller-0 corosync[36483]:  [KNET  ] host: host: 2 has no active links
Apr  9 18:56:47 controller-0 corosync[36483]:  [TOTEM ] Token has not been received in 829 ms
Apr  9 18:56:47 controller-0 corosync[36483]:  [TOTEM ] A processor failed, forming new configuration.
Apr  9 18:56:49 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.3d) was formed. Members left: 2 3
Apr  9 18:56:49 controller-0 corosync[36483]:  [TOTEM ] Failed to receive the leave message. failed: 2 3
Apr  9 18:56:51 controller-0 corosync[36483]:  [KNET  ] rx: host: 3 link: 0 is up
Apr  9 18:56:51 controller-0 corosync[36483]:  [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr  9 18:56:52 controller-0 corosync[36483]:  [KNET  ] rx: host: 2 link: 0 is up
Apr  9 18:56:52 controller-0 corosync[36483]:  [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Apr  9 18:56:52 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.41) was formed. Members joined: 2
Apr  9 18:56:53 controller-0 corosync[36483]:  [TOTEM ] A new membership (1.45) was formed. Members joined: 3

The same pattern shows up on the other nodes. The controllers are randomly getting isolated, and since fencing is not configured/enabled, Pacemaker cannot recover promptly (if at all).
It seems to me that the root cause could be load on the environment, but since we don't have enough data in the sosreports I can't be 100% sure. If you happen to reproduce this, please give us a ping and we'll look at the live environment.
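
(Side note: whether fencing is enabled can be confirmed on a controller with something like the following, assuming the pcs CLI is available on the host; a sketch, not output from this environment:

  sudo pcs property show stonith-enabled
  sudo pcs stonith status
)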

Comment 5 Roman Safronov 2020-04-16 11:34:51 UTC
Openvswitch version is openvswitch2.13-2.13.0-0.20200117git8ae6a5f.el8fdp.1.x86_64

[heat-admin@controller-0 ~]$ sudo podman exec -it ovn_controller rpm -qa | grep openvswitch
rhosp-openvswitch-2.13-7.el8ost.noarch
rhosp-openvswitch-ovn-host-2.13-7.el8ost.noarch
openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch
python3-openvswitch2.13-2.13.0-0.20200117git8ae6a5f.el8fdp.1.x86_64
python3-rhosp-openvswitch-2.13-7.el8ost.noarch
network-scripts-openvswitch2.13-2.13.0-0.20200117git8ae6a5f.el8fdp.1.x86_64
openvswitch2.13-2.13.0-0.20200117git8ae6a5f.el8fdp.1.x86_64
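
(For completeness, a sketch of how the ovs-vswitchd segfaults referenced in the bug title could be confirmed on a node, assuming systemd coredump capture is enabled:

  sudo grep -i segfault /var/log/messages | grep -i ovs
  sudo coredumpctl list ovs-vswitchd
)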

Comment 6 Roman Safronov 2020-04-16 12:16:22 UTC

*** This bug has been marked as a duplicate of bug 1821185 ***

