Description of problem:
Many tempest tests are failing during the regular OSP16.1 OVN job. HTTP requests go unanswered during the test cases.

[root@panther23 tempest-dir]# grep "urllib3.exceptions.MaxRetryError.*Max retries exceeded with url" tempest-results-neutron.1.xml | wc -l
53

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.0.0.128', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f07540dd240>: Failed to establish a new connection: [Errno 111] Connection refused',))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.0.0.128', port=9696): Max retries exceeded with url: /v2.0/routers/50ed07f6-238e-4bca-b797-d61f501052bd/remove_router_interface (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0750061a20>: Failed to establish a new connection: [Errno 111] Connection refused',))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.0.0.128', port=9696): Max retries exceeded with url: /v2.0/subnets/d17445e1-f82c-4a1b-9b88-3ed0e1bf1d9b (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0753bbfe80>: Failed to establish a new connection: [Errno 111] Connection refused',))
...

The following traceback is printed many times in the controller-0 logs, for many different OSP components:

2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py", line 73, in _connect_ping_listener
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     connection.scalar(select([1]))
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 920, in scalar
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     return self.execute(object_, *multiparams, **params).scalar()
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 988, in execute
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     return meth(self, multiparams, params)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     return connection._execute_clauseelement(self, multiparams, params)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     distilled_params,
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     e, statement, parameters, cursor, context
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1464, in _handle_dbapi_exception
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     util.raise_from_cause(newraise, exc_info)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     reraise(type(exception), exception, tb=exc_tb, cause=cause)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     raise value.with_traceback(tb)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     cursor, statement, parameters, context
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     cursor.execute(statement, parameters)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/cursors.py", line 165, in execute
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     result = self._query(query)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/cursors.py", line 321, in _query
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     conn.query(q)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 860, in query
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1061, in _read_query_result
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     result.read()
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1349, in read
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     first_packet = self.connection._read_packet()
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 991, in _read_packet
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     packet_header = self._read_bytes(4)
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines   File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1037, in _read_bytes
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines [SQL: SELECT 1]
2020-05-25 00:16:22.743 28 ERROR oslo_db.sqlalchemy.engines (Background on this error at: http://sqlalche.me/e/e3q8)

[root@panther23 containers]# pwd
/tmp/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve/41/controller-0/var/log/containers
[root@panther23 containers]# grep -rl "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')"
cinder/cinder-scheduler.log
stdouts/neutron_api.log
neutron/server.log
placement/placement.log
nova/nova-conductor.log
nova/nova-scheduler.log
heat/heat-engine.log
httpd/keystone/keystone_wsgi_error.log
keystone/keystone.log
[root@panther23 containers]# grep -r "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')" | wc -l
993
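For reference, a quick way to see which services are hit hardest by the DB errors is to tally the "lost connection" messages per log file. This is a rough sketch only; it assumes it is run from the collected var/log/containers directory shown above:

# Tally the MySQL "lost connection" errors per service log, highest counts first
grep -rc "Lost connection to MySQL server during query" . \
    | awk -F: '$2 > 0 {print $2, $1}' | sort -rn | head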
Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200520.n.0

How reproducible:
1/1

Steps to Reproduce:
1. Run job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve/

Actual results:
48 tempest tests failed

Expected results:
No tempest tests failing

Additional info:
This bug could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1837235, even though that bug reported a problem while performing an OSP update from 16 to 16.1, whereas this one was reproduced on a regular OSP16.1 OVN installation. I downloaded the controller-0 logs from BZ1837235 and found the same MySQL error:

[root@panther23 containers]# grep -rl "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during"
cinder/cinder-scheduler.log
cinder/cinder-backup.log
neutron/server.log
neutron/server.log.1
placement/placement.log
nova/nova-conductor.log
nova/nova-scheduler.log
heat/heat-engine.log
[root@panther23 containers]# grep -r "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during" | wc -l
207
[root@panther23 containers]# pwd
/tmp/DFG-upgrades-updates-16-to-16.1-from-z1-composable-ipv6/1/controller-0/var/log/containers
this looks like a network issue between the controllers:

May 25 00:16:02 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:16:03 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:06 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:16:07 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:16:11 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:12 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:16:14 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:16:15 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:16:20 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:21 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:16:46 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:50 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:16:52 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:16:53 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:16:55 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:16:58 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:17:32 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:17:37 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:18:17 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:18:18 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:18:20 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:18:21 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:19:47 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:19:47 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:19:49 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:19:53 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:19:54 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:19:59 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:22:42 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:22:42 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:22:45 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:22:47 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:22:49 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:22:51 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:22:52 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:22:53 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:23:26 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:23:26 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:23:27 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:23:28 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:23:32 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:23:34 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:23:35 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:23:37 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:24:56 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:24:56 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:24:59 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:25:01 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:26:17 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:26:17 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:26:22 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:26:24 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:28:19 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:28:20 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:28:21 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:28:24 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:28:27 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:28:28 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:28:31 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:28:31 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:30:41 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:30:41 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:30:43 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:30:44 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:30:48 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:30:48 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:30:52 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:30:52 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:32:03 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:32:03 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:32:07 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:32:08 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:32:12 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:32:12 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:32:16 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:32:17 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:34:16 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:34:17 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:34:24 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:34:27 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:38:16 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:38:16 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:38:20 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:38:20 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:40:02 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:40:02 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:40:06 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:40:06 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:40:11 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:40:12 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:40:16 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:40:17 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:40:46 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:40:46 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:40:50 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:40:51 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:42:35 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:42:35 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:42:39 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:42:40 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:45:56 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:45:57 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:46:03 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:46:03 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:54:05 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:54:06 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:54:08 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up
May 25 00:54:14 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:58:18 [38575] controller-0 corosync info [KNET ] link: host: 2 link: 0 is down
May 25 00:58:19 [38575] controller-0 corosync info [KNET ] link: host: 3 link: 0 is down
May 25 00:58:23 [38575] controller-0 corosync info [KNET ] rx: host: 2 link: 0 is up
May 25 00:58:23 [38575] controller-0 corosync info [KNET ] rx: host: 3 link: 0 is up

ovs sigsegv?

May 25 00:14:40 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
May 25 00:14:40 controller-1 systemd[1]: ovs-vswitchd.service: Failed with result 'signal'.
May 25 00:14:40 controller-1 systemd[1]: Starting nova_scheduler healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting nova_conductor healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting memcached healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting nova_vnc_proxy healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting clustercheck healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_container_server healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_object_server healthcheck...
May 25 00:14:40 controller-1 systemd[1]: Starting swift_rsync healthcheck...
May 25 00:14:41 controller-1 systemd[1]: ovs-vswitchd.service: Service RestartSec=100ms expired, scheduling restart.
May 25 00:14:41 controller-1 systemd[1]: ovs-vswitchd.service: Scheduled restart job, restart counter is at 17.
May 25 00:14:41 controller-1 systemd[1]: Stopping Open vSwitch...
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch.
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch Forwarding Unit.
May 25 00:14:41 controller-1 systemd[1]: Stopping Open vSwitch Database Unit...
May 25 00:14:41 controller-1 ovs-ctl[384851]: Exiting ovsdb-server (348892) [ OK ]
May 25 00:14:41 controller-1 systemd[1]: Stopped Open vSwitch Database Unit.
May 25 00:14:41 controller-1 systemd[1]: Starting Open vSwitch Database Unit...

grep ovs-vswitchd.service controller-1/var/log/messages | grep SEGV | wc -l
38

No wonder the cluster breaks. We had quite a few of these, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1824847
https://bugzilla.redhat.com/show_bug.cgi?id=1823178
and especially:
https://bugzilla.redhat.com/show_bug.cgi?id=1821185
https://bugzilla.redhat.com/show_bug.cgi?id=1821185#c19

This looks like a duplicate of the above BZ.
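For anyone re-checking this on similar job logs, a rough sketch of how to summarize the KNET flaps per peer and the number of ovs-vswitchd crashes (the log paths below are assumptions based on the collected controller directories above):

# KNET link-down events per peer host on controller-0
grep -h 'KNET.*link: 0 is down' controller-0/var/log/cluster/corosync.log \
    | grep -o 'host: [0-9]*' | sort | uniq -c

# Number of ovs-vswitchd SEGV exits on controller-1
grep -c 'ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV' \
    controller-1/var/log/messages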
Can you please provide a core-dump of the crashed ovs-vswitchd?
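In case it helps, one way to extract it (a sketch only, assuming systemd-coredump is handling cores on the controller; on setups using abrt the dumps land under /var/spool/abrt instead):

# List and export the most recent ovs-vswitchd core, then open it with debuginfo
coredumpctl list ovs-vswitchd
coredumpctl dump ovs-vswitchd -o /tmp/ovs-vswitchd.core
coredumpctl gdb ovs-vswitchd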
Created attachment 1692830 [details]
ovs-vswitchd coredumps
(In reply to eolivare from comment #3)
> Created attachment 1692830 [details]
> ovs-vswitchd coredumps

I forgot to mention that this is the OVS version:
openvswitch2.13-2.13.0-25.el8fdp.1.x86_64
Raising the severity to urgent as this takes the whole openstack down.
....
#9056 0x0000559e0c50aee8 in xlate_normal (ctx=0x7f256f8bb6e0) at ../ofproto/ofproto-dpif-xlate.c:3166
#9057 xlate_output_action (ctx=ctx@entry=0x7f256f8bb6e0, port=<optimized out>, controller_len=<optimized out>, may_packet_in=may_packet_in@entry=true, is_last_action=<optimized out>, truncate=truncate@entry=false, group_bucket_action=false) at ../ofproto/ofproto-dpif-xlate.c:5190
#9058 0x0000559e0c50b820 in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=<optimized out>, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ../include/openvswitch/ofp-actions.h:1302
#9059 0x0000559e0c511753 in xlate_actions (xin=xin@entry=0x7f256f8bc570, xout=xout@entry=0x7f256f8f78d8) at ../ofproto/ofproto-dpif-xlate.c:7699
#9060 0x0000559e0c500586 in upcall_xlate (wc=0x7f256f8f7930, odp_actions=0x7f256f8f78f0, upcall=0x7f256f8f7870, udpif=0x559e0ef12b80) at ../ofproto/ofproto-dpif-upcall.c:1204
#9061 process_upcall (udpif=udpif@entry=0x559e0ef12b80, upcall=upcall@entry=0x7f256f8f7870, odp_actions=odp_actions@entry=0x7f256f8f78f0, wc=wc@entry=0x7f256f8f7930) at ../ofproto/ofproto-dpif-upcall.c:1420
#9062 0x0000559e0c501183 in recv_upcalls (handler=<optimized out>, handler=<optimized out>) at ../ofproto/ofproto-dpif-upcall.c:842
#9063 0x0000559e0c50164c in udpif_upcall_handler (arg=0x559e0ef75b10) at ../ofproto/ofproto-dpif-upcall.c:759
#9064 0x0000559e0c5c8e03 in ovsthread_wrapper (aux_=<optimized out>) at ../lib/ovs-thread.c:383
#9065 0x00007f2572a212de in start_thread () from /lib64/libpthread.so.0
#9066 0x00007f2571e93e83 in timerfd_create () from /lib64/libc.so.6
#9067 0x0000000000000000 in ?? ()

Looks like it might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1821185?
Is this using OVN?
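The ~9000 frames suggest the handler thread recursed through xlate until it blew its stack. A quick sketch to confirm that from the extracted core (the core path follows the earlier sketch and the binary path is an assumption):

# Count total frames and how many of them are xlate recursion
gdb -batch -ex 'bt' /usr/sbin/ovs-vswitchd /tmp/ovs-vswitchd.core 2>/dev/null | grep -c '^#'
gdb -batch -ex 'bt' /usr/sbin/ovs-vswitchd /tmp/ovs-vswitchd.core 2>/dev/null \
    | grep -c -e 'xlate_normal' -e 'xlate_output_action'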
Yes, it is using OVN. Bug 1821185 was caused by bug 1825334 in OVN, which has been fixed in ovn2.13-2.13.0-18.el8fdp.x86_64. We are running with this version, so we should no longer see bug 1821185.
It would be great to have the OVN DBs to determine if this is a packet loop again or just a topology that generates enough resubmits to hit the stack limit in ovs.
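One possible way to collect them from the controller that runs the OVN DBs (a sketch only; the container name and DB paths below are assumptions and may differ on this deployment):

# Find the OVN DBs container and copy the NB/SB database files out of it
podman ps --format '{{.Names}}' | grep -i ovn
podman cp ovn-dbs-bundle-podman-0:/var/lib/openvswitch/ovn/ovnnb_db.db /tmp/
podman cp ovn-dbs-bundle-podman-0:/var/lib/openvswitch/ovn/ovnsb_db.db /tmp/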
This is a dup of bug 1825334. It reproduced because of a package mismatch in the OVN DBs image:
ovn2.13-central-2.13.0-11.el8fdp.x86_64
ovn2.13-2.13.0-18.el8fdp.x86_64

ovn2.13 is at -18, while ovn2.13-central, the package that contains the fixed code, is at -11, a version that does not contain the fix.

So now the main question is how the images were built and why we have an OVN version mismatch there.
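A quick sanity check for this kind of mismatch is to compare the ovn2.13 packages inside the OVN DBs container with those on the host (sketch only; the container name is an assumption, see the podman ps output on the deployment):

podman exec ovn-dbs-bundle-podman-0 rpm -qa | grep '^ovn2.13'
rpm -qa | grep '^ovn2.13'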
Marking as duplicate. The problem was caused by using bad repos.

*** This bug has been marked as a duplicate of bug 1825334 ***
(In reply to Jakub Libosvar from comment #9)
> This is a dup of bug 1825334. It reproduced because of a package mismatch in
> the OVN DBs image:
> ovn2.13-central-2.13.0-11.el8fdp.x86_64
> ovn2.13-2.13.0-18.el8fdp.x86_64
>
> Ovn is -18 while ovn-central, which is the package that contains the fixed
> code, is on -11 version and this version doesn't contain the fix.
>
> So now the main question is how were the images built and why do we have an
> OVN version mismatch there.

Sorry, I cannot answer how the images were built or why there is an OVN version mismatch. Let's see if Lon can clarify this point.

I can confirm that tests started passing with later puddles, starting with RHOS-16.1-RHEL-8-20200604.n.1.

I checked an even later puddle and the OVN versions are aligned:
RHOS-16.1-RHEL-8-20200604.n.1
ovn2.13-central-2.13.0-30.el8fdp.x86_64
ovn2.13-2.13.0-30.el8fdp.x86_64