Created attachment 1651854 [details]
tempest logs and ovn-metadata-agent logs

Description of problem:

Info obtained from CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/28/:
- 40 failures in the second tempest run (after the update); 32 of the 40 because the VM instance could not be accessed via ssh.
- 35 failures in the third tempest run (after the overcloud reboot); 29 of the 35 because the VM instance could not be accessed via ssh.

I focused on a failing test that looks simple, test_basic_instance. The tempest logs are attached.

After the network, subnet, security group, etc. have been created, a server (instance) is created at 19:44:02:

2020-01-09 19:44:02,817 352985 INFO [tempest.lib.common.rest_client] Request (NetworkBasicTest:test_basic_instance): 202 POST http://10.0.0.117:8774/v2.1/servers 1.601s
Response... Body: b'{"server": {"id": "fd91f455-3a2f-43d6-baea-348826eb8eb4", "links": [{"rel": "self", "href": "http://10.0.0.117:8774/v2.1/servers/fd91f455-3a2f-43d6-baea-348826eb8eb4"}, {"rel": "bookmark", "href": "http://10.0.0.117:8774/servers/fd91f455-3a2f-43d6-baea-348826eb8eb4"}], "OS-DCF:diskConfig": "MANUAL", "security_groups": [{"name": "tempest-secgroup-1181119668"}], "adminPass": "fX3x8b9aYpJM"}}'

According to the tempest logs, the server changes from status BUILD to ACTIVE at 19:44:26 and FIP 10.0.0.214 is assigned at 19:44:28. Immediately afterwards tempest tries to connect via ssh, and all retries fail until 19:49:28.

OVN metadata agent logs at compute-1 (attached):

2020-01-09 19:44:08.937 101840 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=None, old_conditions=None) to row=Port_Binding(tunnel_key=3, parent_port=[], logical_port=cad2bcee-aaa0-4aff-9f0f-1da971ac265f, chassis=[<ovs.db.idl.Row object at 0x7f405d79ae80>], mac=['fa:16:3e:8c:4a:2c 10.100.0.6'], options={'requested-chassis': 'compute-1.redhat.local'}, ha_chassis_group=[], gateway_chassis=[], encap=[], external_ids={'neutron:cidrs': '10.100.0.6/28', 'neutron:device_id': 'fd91f455-3a2f-43d6-baea-348826eb8eb4', 'neutron:device_owner': 'compute:nova', 'neutron:network_name': 'neutron-b32cf836-f6dc-4c7a-b7a6-b91351b5a2bb', 'neutron:port_name': '', 'neutron:project_id': '98143f9ea32643de999dd520d1918385', 'neutron:revision_number': '2', 'neutron:security_group_ids': '1b99205b-1b78-4623-ac29-1d690876418c'}, type=, nat_addresses=[], virtual_parent=[], datapath=1d57661f-61ce-4cc3-87ce-946427c38e6a, tag=[]) old=Port_Binding(chassis=[]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
...
2020-01-09 19:44:09.729 101840 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/1d57661f-61ce-4cc3-87ce-946427c38e6a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
...
Error: unable to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1: unable to pull image: Error initializing source docker://undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1: pinging docker registry returned: Get http://undercloud-0.ctlplane.redhat.local:8787/v2/: dial tcp 192.168.24.1:8787: connect: network is unreachable

The agent crashes at this moment (19:44:10), and there are more crashes related to these values at 19:44:25 and 19:44:29.
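That pull failure can be checked by hand from the affected compute node; the following is a minimal sketch (host, port and image tag are copied from the error above, the insecure-registry flag is an assumption based on the http URL):

# on compute-1: is the undercloud container registry reachable at all?
curl -sv http://undercloud-0.ctlplane.redhat.local:8787/v2/

# retry the exact pull the agent attempted when spawning the haproxy side-car
podman pull --tls-verify=false undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1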
There are other crashes as well, related to other values; I guess that is because other tests are running in parallel. For some reason there are no crashes on compute-0:

[root@titan07 neutron]# grep -c Traceback ../../../../../compute-1/var/log/containers/neutron/*
../../../../../compute-1/var/log/containers/neutron/kill-script.log:0
../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log:159
../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log.1:168
../../../../../compute-1/var/log/containers/neutron/privsep-helper.log:0
[root@titan07 neutron]# grep -c Traceback ../../../../../compute-0/var/log/containers/neutron/*
../../../../../compute-0/var/log/containers/neutron/kill-script.log:0
../../../../../compute-0/var/log/containers/neutron/ovn-metadata-agent.log:0
../../../../../compute-0/var/log/containers/neutron/ovn-metadata-agent.log.1:0
../../../../../compute-0/var/log/containers/neutron/privsep-helper.log:0

Version-Release number of selected component (if applicable):
update from: RHOS_TRUNK-16.0-RHEL-8-20200103.n.1
to: RHOS_TRUNK-16.0-RHEL-8-20200109.n.1
(tested with RHOS_TRUNK-16.0-RHEL-8-20200110.n.3 as well, same failures)

How reproducible:
100% (failed 3 out of 3)

Steps to Reproduce:
1. Run CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

Actual results:
Many tempest tests fail with reason:
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.214 via SSH timed out. User: cirros, Password: None

Expected results:
Tests pass.

Additional info:
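The content of those tracebacks can be pulled out with some context around each match; a minimal sketch, using the same log paths as in the grep above:

grep -A 15 Traceback ../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log | less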
The scenario has been reproduced on titan07.lab.eng.tlv2.redhat.com and will remain available for some time for debugging. It was reproduced while updating from RHOS_TRUNK-16.0-RHEL-8-20200109.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200110.n.3.

The issue only happens for VMs created on compute-0; VM creation on compute-1 succeeds. According to the logs, undercloud-0.ctlplane.redhat.local is unreachable, but I checked that it is reachable from both computes.
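Since the failure appears tied to which compute node hosts the instance, it can help to confirm the hosting node for a failing VM; a minimal sketch (the server id is the one from the tempest log above, the credentials file path is an assumption):

source ~/overcloudrc
# show which compute the instance landed on, plus its status
openstack server show fd91f455-3a2f-43d6-baea-348826eb8eb4 -c OS-EXT-SRV-ATTR:host -c status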
The problem here is that the metadata agent and its side-car containers are not restarted during an update, because a restart would make the metadata API unavailable. That means the metadata agent keeps running with the old bits and tries to spawn a side-car container for a new workload in a given network using the image from the pre-update version. This is not OVN specific, and we think it will also cause issues with ml2/ovs side-car containers.
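One way to see the mismatch described here is to compare the image of the long-running metadata agent container with the images of the haproxy side-car containers it has spawned; a minimal sketch (the name patterns in the grep are assumptions, adjust to the local container naming):

# on the compute node: list metadata-agent and haproxy side-car containers with their images
podman ps -a --format '{{.Names}} {{.Image}}' | grep -Ei 'metadata|haproxy'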
Small update: the containers run on the correct images, but ovn_metadata_agent_wrapper contains the wrong image name. I still need to check why this file is not recreated correctly.
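A quick way to confirm the stale reference would be to search the wrapper script for the image name; a minimal sketch (the path and glob are assumptions, the wrapper file mentioned above may live elsewhere on the compute node):

grep -n 'metadata-agent-ovn' /var/lib/neutron/*wrapper* 2>/dev/null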
*** Bug 1783978 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:0283
I think this bug has been wrongly closed. The original fix was rejected and is not available in OSP 16.0 (see target release = z1). Greg and Brent are discussing how to document it as a known issue. It makes no sense to document it as a known issue and, at the same time, to include it in the errata as a fixed bug.

The status change made by the errata tool (see comment 23) from ON_DEV to RELEASE_PENDING makes no sense to me. Maybe this happened because I moved the bug from ON_QA to ON_DEV (I should have moved it to ASSIGNED). Sorry for that.
This bug is, unfortunately, not fixed as described in comment #24. We have opened a new bug to track this; please refer to bug 1801763 for future updates.

*** This bug has been marked as a duplicate of bug 1801763 ***