Bug 1840078 - ovs-vswitchd crashes [more info tbd]
Summary: ovs-vswitchd crashes [more info tbd]
Keywords:
Status: CLOSED DUPLICATE of bug 1825334
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: ---
Assignee: Open vSwitch development team
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-26 10:31 UTC by Eduardo Olivares
Modified: 2020-06-17 08:14 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-04 12:37:08 UTC
Target Upstream Version:
Embargoed:


Attachments
ovs-vswitchd coredumps (5.69 MB, application/gzip)
2020-05-27 20:01 UTC, Eduardo Olivares

Description Eduardo Olivares 2020-05-26 10:31:26 UTC
Description of problem:
Many tempest tests are failing during the regular OSP 16.1 OVN job. HTTP requests receive no response during the test cases.
[root@panther23 tempest-dir]# grep "urllib3.exceptions.MaxRetryError.*Max retries exceeded with url" tempest-results-neutron.1.xml | wc -l
53

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.0.0.128', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f07540dd240>: Failed to establish a new connection: [Errno 111] Connection refused',))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.0.0.128', port=9696): Max retries exceeded with url: /v2.0/routers/50ed07f6-238e-4bca-b797-d61f501052bd/remove_router_interface (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0750061a20>: Failed to establish a new connection: [Errno 111] Connection refused',))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.0.0.128', port=9696): Max retries exceeded with url: /v2.0/subnets/d17445e1-f82c-4a1b-9b88-3ed0e1bf1d9b (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0753bbfe80>: Failed to establish a new connection: [Errno 111] Connection refused',))
...


The following traceback is printed many times in controller-0 logs, for many different OSP components:
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py", line 73, in _connect_ping_listener
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     connection.scalar(select([1]))
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 920, in scalar
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     return self.execute(object_, *multiparams, **params).scalar()
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 988, in execute
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     return meth(self, multiparams, params)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     return connection._execute_clauseelement(self, multiparams, params)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     distilled_params,
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     e, statement, parameters, cursor, context
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1464, in _handle_dbapi_exception
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     util.raise_from_cause(newraise, exc_info)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     reraise(type(exception), exception, tb=exc_tb, cause=cause)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     raise value.with_traceback(tb)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     cursor, statement, parameters, context
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     cursor.execute(statement, parameters)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/cursors.py", line 165, in execute
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     result = self._query(query)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/cursors.py", line 321, in _query
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     conn.query(q)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 860, in query
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1061, in _read_query_result
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     result.read()
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1349, in read
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     first_packet = self.connection._read_packet()
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 991, in _read_packet
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     packet_header = self._read_bytes(4)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1037, in _read_bytes
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines [SQL: SELECT 1]
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines (Background on this error at: http://sqlalche.me/e/e3q8)


[root@panther23 containers]# pwd
/tmp/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve/41/controller-0/var/log/containers
[root@panther23 containers]# grep -rl "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')" 
cinder/cinder-scheduler.log
stdouts/neutron_api.log
neutron/server.log
placement/placement.log
nova/nova-conductor.log
nova/nova-scheduler.log
heat/heat-engine.log
httpd/keystone/keystone_wsgi_error.log
keystone/keystone.log
[root@panther23 containers]# grep -r "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')" | wc -l
993

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200520.n.0

How reproducible:
1/1

Steps to Reproduce:
1. Run job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve/

Actual results:
48 tempest tests failed

Expected results:
No tempest tests failing

Additional info:
This bug could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1837235, although that bug reported a problem while performing an OSP update from 16 to 16.1, whereas this bug was reproduced on a regular OSP 16.1 OVN installation.
I downloaded the controller-0 logs from BZ1837235 and found the same MySQL error:
[root@panther23 containers]# grep -rl "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during" 
cinder/cinder-scheduler.log
cinder/cinder-backup.log
neutron/server.log
neutron/server.log.1
placement/placement.log
nova/nova-conductor.log
nova/nova-scheduler.log
heat/heat-engine.log
[root@panther23 containers]# grep -r "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during" | wc -l
207
[root@panther23 containers]# pwd
/tmp/DFG-upgrades-updates-16-to-16.1-from-z1-composable-ipv6/1/controller-0/var/log/containers

Comment 1 Luca Miccini 2020-05-26 14:54:41 UTC
this looks like a network issue between the controllers:

May 25 00:16:02 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:16:03 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:16:06 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:16:07 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:16:11 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:16:12 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:16:14 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:16:15 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:16:20 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:16:21 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:16:46 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:16:50 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:16:52 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:16:53 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:16:55 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:16:58 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:17:32 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:17:37 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:18:17 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:18:18 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:18:20 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:18:21 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:19:47 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:19:47 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:19:49 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:19:53 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:19:54 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:19:59 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:22:42 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:22:42 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:22:45 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:22:47 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:22:49 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:22:51 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:22:52 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:22:53 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:23:26 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:23:26 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:23:27 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:23:28 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:23:32 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:23:34 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:23:35 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:23:37 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:24:56 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:24:56 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:24:59 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:25:01 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:26:17 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:26:17 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:26:22 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:26:24 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:28:19 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:28:20 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:28:21 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:28:24 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:28:27 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:28:28 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:28:31 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:28:31 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:30:41 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:30:41 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:30:43 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:30:44 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:30:48 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:30:48 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:30:52 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:30:52 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:32:03 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:32:03 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:32:07 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:32:08 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:32:12 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:32:12 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:32:16 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:32:17 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:34:16 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:34:17 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:34:24 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:34:27 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:38:16 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:38:16 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:38:20 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:38:20 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:40:02 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:40:02 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:40:06 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:40:06 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:40:11 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:40:12 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:40:16 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:40:17 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:40:46 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:40:46 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:40:50 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:40:51 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:42:35 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:42:35 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:42:39 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:42:40 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:45:56 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:45:57 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:46:03 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:46:03 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:54:05 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:54:06 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:54:08 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
May 25 00:54:14 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:58:18 [38575] controller-0 corosync info    [KNET  ] link: host: 2 link: 0 is down
May 25 00:58:19 [38575] controller-0 corosync info    [KNET  ] link: host: 3 link: 0 is down
May 25 00:58:23 [38575] controller-0 corosync info    [KNET  ] rx: host: 2 link: 0 is up
May 25 00:58:23 [38575] controller-0 corosync info    [KNET  ] rx: host: 3 link: 0 is up
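
(For reference, the link-flap events above can be extracted with something like this; the corosync log path is an assumption, and some deployments log to /var/log/messages instead:)

# list only the KNET link up/down transitions
grep 'KNET' /var/log/cluster/corosync.log | grep -E 'link:|rx:'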

ovs sigsegv?

May 25 00:14:40 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
May 25 00:14:40 controller-1 systemd[1]: ovs-vswitchd.service: Failed with result 'signal'.
May 25 00:14:40 controller-1 systemd[1]: Starting nova_scheduler healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting nova_conductor healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting memcached healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting nova_vnc_proxy healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting clustercheck healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_container_server healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_object_server healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_rsync healthcheck...
May 25 00:14:41 controller-1 systemd[1]: ovs-vswitchd.service: Service RestartSec=100ms expired, scheduling restart.
May 25 00:14:41 controller-1 systemd[1]: ovs-vswitchd.service: Scheduled restart job, restart counter is at 17.
May 25 00:14:41 controller-1 systemd[1]: Stopping Open vSwitch...
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch.
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch Forwarding Unit.
May 25 00:14:41 controller-1 systemd[1]: Stopping Open vSwitch Database Unit...
May 25 00:14:41 controller-1 ovs-ctl[384851]: Exiting ovsdb-server (348892) [  OK  ]
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch Database Unit.
May 25 00:14:41 controller-1 systemd[1]: Starting Open vSwitch Database Unit...

grep ovs-vswitchd.service controller-1/var/log/messages |grep SEGV |wc -l
38

no wonder the cluster breaks.

we had quite a few of these, see:

https://bugzilla.redhat.com/show_bug.cgi?id=1824847
https://bugzilla.redhat.com/show_bug.cgi?id=1823178

and especially:
https://bugzilla.redhat.com/show_bug.cgi?id=1821185
https://bugzilla.redhat.com/show_bug.cgi?id=1821185#c19

this looks like a duplicate of the above bz.

Comment 2 Jakub Libosvar 2020-05-27 09:24:32 UTC
Can you please provide a core-dump of the crashed ovs-vswitchd?

Comment 3 Eduardo Olivares 2020-05-27 20:01:21 UTC
Created attachment 1692830 [details]
ovs-vswitchd coredumps

Comment 4 Eduardo Olivares 2020-05-27 20:02:45 UTC
(In reply to eolivare from comment #3)
> Created attachment 1692830 [details]
> ovs-vswitchd coredumps

I forgot to mention that this is the OVS version:
openvswitch2.13-2.13.0-25.el8fdp.1.x86_64
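
(As a sketch for anyone checking a reproducer: the installed OVS/OVN package versions can be listed on each controller with a plain rpm query, since ovs-vswitchd runs on the host here:)

rpm -qa | grep -E '^(openvswitch|ovn)'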

Comment 5 Jakub Libosvar 2020-05-28 13:48:35 UTC
Raising the severity to urgent as this takes the whole OpenStack deployment down.

Comment 6 Aaron Conole 2020-06-02 17:55:22 UTC
....
#9056 0x0000559e0c50aee8 in xlate_normal (ctx=0x7f256f8bb6e0) at ../ofproto/ofproto-dpif-xlate.c:3166
#9057 xlate_output_action (ctx=ctx@entry=0x7f256f8bb6e0, port=<optimized out>, controller_len=<optimized out>, may_packet_in=may_packet_in@entry=true, is_last_action=<optimized out>, truncate=truncate@entry=false, group_bucket_action=false) at ../ofproto/ofproto-dpif-xlate.c:5190
#9058 0x0000559e0c50b820 in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=<optimized out>, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ../include/openvswitch/ofp-actions.h:1302
#9059 0x0000559e0c511753 in xlate_actions (xin=xin@entry=0x7f256f8bc570, xout=xout@entry=0x7f256f8f78d8) at ../ofproto/ofproto-dpif-xlate.c:7699
#9060 0x0000559e0c500586 in upcall_xlate (wc=0x7f256f8f7930, odp_actions=0x7f256f8f78f0, upcall=0x7f256f8f7870, udpif=0x559e0ef12b80) at ../ofproto/ofproto-dpif-upcall.c:1204
#9061 process_upcall (udpif=udpif@entry=0x559e0ef12b80, upcall=upcall@entry=0x7f256f8f7870, odp_actions=odp_actions@entry=0x7f256f8f78f0, wc=wc@entry=0x7f256f8f7930) at ../ofproto/ofproto-dpif-upcall.c:1420
#9062 0x0000559e0c501183 in recv_upcalls (handler=<optimized out>, handler=<optimized out>) at ../ofproto/ofproto-dpif-upcall.c:842
#9063 0x0000559e0c50164c in udpif_upcall_handler (arg=0x559e0ef75b10) at ../ofproto/ofproto-dpif-upcall.c:759
#9064 0x0000559e0c5c8e03 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:383
#9065 0x00007f2572a212de in start_thread () from /lib64/libpthread.so.0
#9066 0x00007f2571e93e83 in timerfd_create () from /lib64/libc.so.6
#9067 0x0000000000000000 in ?? ()
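
The frame numbers (#9056 and up) point to thousands of nested xlate frames, i.e. the upcall handler thread overran its stack while recursively translating flows. A minimal sketch for inspecting the attached coredumps (package name, paths and core file name are assumptions):

# install matching debug symbols, then open the core
dnf debuginfo-install openvswitch2.13
gdb /usr/sbin/ovs-vswitchd /tmp/core.ovs-vswitchd
(gdb) info threads
(gdb) thread apply all bt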


Looks like it might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1821185. Is this using OVN?

Comment 7 Jakub Libosvar 2020-06-03 09:11:29 UTC
Yes, it is using OVN. Bug 1821185 was caused by bug 1825334 in OVN, which has been fixed in ovn2.13-2.13.0-18.el8fdp.x86_64. We are running with this version, so we should no longer see bug 1821185.

Comment 8 Dumitru Ceara 2020-06-03 09:19:12 UTC
It would be great to have the OVN DBs to determine if this is a packet loop again or just a topology that generates enough resubmits to hit the stack limit in ovs.
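
(In case it helps with collecting them: the NB/SB database files can usually be copied straight from the controller hosting the master DBs. The path below is an assumption and depends on the deployment:)

ls /var/lib/openvswitch/ovn/
# expect ovnnb_db.db and ovnsb_db.db; attach both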

Comment 9 Jakub Libosvar 2020-06-04 10:34:06 UTC
This is a dup of bug 1825334. It reproduced because of a package mismatch in the OVN DBs image:
ovn2.13-central-2.13.0-11.el8fdp.x86_64
ovn2.13-2.13.0-18.el8fdp.x86_64

ovn2.13 is at -18 while ovn2.13-central, the package that contains the fixed code, is at -11, and that version does not contain the fix.
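
(A quick way to confirm such a mismatch inside the running DBs container, as a sketch; the container name is an assumption and varies between deployments:)

podman exec ovn-dbs-bundle-podman-0 rpm -qa | grep '^ovn'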

So now the main question is how the images were built and why there is an OVN version mismatch there.

Comment 12 Jakub Libosvar 2020-06-04 12:37:08 UTC
Marking as duplicate. The problem was caused by using bad repos.

*** This bug has been marked as a duplicate of bug 1825334 ***

Comment 13 Eduardo Olivares 2020-06-17 08:02:19 UTC
(In reply to Jakub Libosvar from comment #9)
> This is a dup of bug 1825334. It reproduced because of a package mismatch in
> the OVN DBs image:
> ovn2.13-central-2.13.0-11.el8fdp.x86_64
> ovn2.13-2.13.0-18.el8fdp.x86_64
> 
> ovn2.13 is at -18 while ovn2.13-central, the package that contains the fixed
> code, is at -11, and that version does not contain the fix.
> 
> So now the main question is how the images were built and why there is an
> OVN version mismatch there.


Sorry, I cannot answer how the images are built or why there is a mismatch. Let's see if Lon can clarify this point.

I can confirm that tests started passing with later puddles, starting with RHOS-16.1-RHEL-8-20200604.n.1
 
I checked an even later puddle and the OVN versions are aligned:
RHOS-16.1-RHEL-8-20200604.n.1
ovn2.13-central-2.13.0-30.el8fdp.x86_64
ovn2.13-2.13.0-30.el8fdp.x86_64

