+++ This bug was initially created as a clone of Bug #1790467 +++

Description of problem:

Info obtained from CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/28/:

- 40 failures in the second tempest run, after the update. 32 out of 40 because the VM instance could not be accessed via ssh.
- 35 failures in the third tempest run, after the overcloud reboot. 29 out of 35 because the VM instance could not be accessed via ssh.

I have focused on a failing test that looks simple: test_basic_instance. Find the tempest logs attached.

After creating the network, subnet, secgroup, etc., a server (instance) is created at 19:44:02:

2020-01-09 19:44:02,817 352985 INFO [tempest.lib.common.rest_client] Request (NetworkBasicTest:test_basic_instance): 202 POST http://10.0.0.117:8774/v2.1/servers 1.601s
Response... Body: b'{"server": {"id": "fd91f455-3a2f-43d6-baea-348826eb8eb4", "links": [{"rel": "self", "href": "http://10.0.0.117:8774/v2.1/servers/fd91f455-3a2f-43d6-baea-348826eb8eb4"}, {"rel": "bookmark", "href": "http://10.0.0.117:8774/servers/fd91f455-3a2f-43d6-baea-348826eb8eb4"}], "OS-DCF:diskConfig": "MANUAL", "security_groups": [{"name": "tempest-secgroup-1181119668"}], "adminPass": "fX3x8b9aYpJM"}}'

The server changes from status BUILD to ACTIVE at 19:44:26 according to the tempest logs, and FIP=10.0.0.214 is assigned at 19:44:28. Immediately afterwards, tempest tries to connect via ssh and all retries fail until 19:49:28.
OVN logs at compute-1 (find them attached):

2020-01-09 19:44:08.937 101840 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=None, old_conditions=None) to row=Port_Binding(tunnel_key=3, parent_port=[], logical_port=cad2bcee-aaa0-4aff-9f0f-1da971ac265f, chassis=[<ovs.db.idl.Row object at 0x7f405d79ae80>], mac=['fa:16:3e:8c:4a:2c 10.100.0.6'], options={'requested-chassis': 'compute-1.redhat.local'}, ha_chassis_group=[], gateway_chassis=[], encap=[], external_ids={'neutron:cidrs': '10.100.0.6/28', 'neutron:device_id': 'fd91f455-3a2f-43d6-baea-348826eb8eb4', 'neutron:device_owner': 'compute:nova', 'neutron:network_name': 'neutron-b32cf836-f6dc-4c7a-b7a6-b91351b5a2bb', 'neutron:port_name': '', 'neutron:project_id': '98143f9ea32643de999dd520d1918385', 'neutron:revision_number': '2', 'neutron:security_group_ids': '1b99205b-1b78-4623-ac29-1d690876418c'}, type=, nat_addresses=[], virtual_parent=[], datapath=1d57661f-61ce-4cc3-87ce-946427c38e6a, tag=[]) old=Port_Binding(chassis=[]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
...
2020-01-09 19:44:09.729 101840 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/1d57661f-61ce-4cc3-87ce-946427c38e6a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
...
Error: unable to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1: unable to pull image: Error initializing source docker://undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1: pinging docker registry returned: Get http://undercloud-0.ctlplane.redhat.local:8787/v2/: dial tcp 192.168.24.1:8787: connect: network is unreachable

The agent crashes at this moment (19:44:10) and there are more crashes related to these values at 19:44:25 and at 19:44:29.
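The "network is unreachable" failure above can be probed directly from the compute node. A minimal check (illustrative only; it uses bash's /dev/tcp pseudo-device plus coreutils' timeout, so no extra tools are needed on the host) would be:

```shell
#!/bin/bash
# Illustrative probe for the registry endpoint named in the pull error.
# Host and port come from the log excerpt above; the function only
# reports reachability, it does not speak the registry protocol.
probe() {
  local host="$1" port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "registry reachable"
  else
    echo "registry unreachable"
  fi
}

probe undercloud-0.ctlplane.redhat.local 8787
```

Note this only tests the TCP path from the host network namespace; the pull in the log fails from inside the side-car spawn context, which can differ.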
And there are other crashes, related to other values; I guess that's because other tests are running in parallel. For some reason, there are no crashes in compute-0:

[root@titan07 neutron]# grep -c Traceback ../../../../../compute-1/var/log/containers/neutron/*
../../../../../compute-1/var/log/containers/neutron/kill-script.log:0
../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log:159
../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log.1:168
../../../../../compute-1/var/log/containers/neutron/privsep-helper.log:0
[root@titan07 neutron]# grep -c Traceback ../../../../../compute-0/var/log/containers/neutron/*
../../../../../compute-0/var/log/containers/neutron/kill-script.log:0
../../../../../compute-0/var/log/containers/neutron/ovn-metadata-agent.log:0
../../../../../compute-0/var/log/containers/neutron/ovn-metadata-agent.log.1:0
../../../../../compute-0/var/log/containers/neutron/privsep-helper.log:0

Version-Release number of selected component (if applicable):
update from: RHOS_TRUNK-16.0-RHEL-8-20200103.n.1
to: RHOS_TRUNK-16.0-RHEL-8-20200109.n.1
(tested with RHOS_TRUNK-16.0-RHEL-8-20200110.n.3 as well, same failures)

How reproducible:
100% (failed 3/3)

Steps to Reproduce:
1. Run CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

Actual results:
many tempest tests failing with reason: tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.214 via SSH timed out. User: cirros, Password: None

Expected results:
tests passing

Additional info:

--- Additional comment from on 2020-01-14 08:09:18 UTC ---

The scenario is reproduced on titan07.lab.eng.tlv2.redhat.com and will be available for some time for debugging. It was reproduced updating from RHOS_TRUNK-16.0-RHEL-8-20200109.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200110.n.3. The issue only happens for VMs created at compute-0; VM creation at compute-1 is successful.
According to the logs, undercloud-0.ctlplane.redhat.local is unreachable, but I checked that it is reachable from both computes.

--- Additional comment from Jakub Libosvar on 2020-01-14 14:39:11 UTC ---

The problem here is that the metadata agent and its side-car containers are not restarted during an update, because a restart would make the metadata API unavailable. That means the metadata agent runs with the old bits and tries to spawn a side-car container for a new workload in a given network with the image from the pre-update version. This is not OVN specific and we think it will also cause issues with ml2/ovs side-car containers.

--- Additional comment from Kamil Sambor on 2020-01-14 17:26:08 UTC ---

Small update: the containers run on the correct images, but in ovn_metadata_agent_wrapper we have the wrong image name. I still need to check why this file is not recreated correctly.
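To make the failure mode concrete: the side-car is launched through a generated wrapper script in which the image reference is baked in at render time. The sketch below is hypothetical (the real wrapper is rendered by tripleo-heat-templates; the path, flags, and function name here are made up), but it shows why a wrapper that is not regenerated keeps pulling the pre-update image:

```shell
#!/bin/bash
# Hypothetical sketch of the generated haproxy side-car wrapper.
# The image reference is baked in when the file is rendered, so a stale
# wrapper keeps launching side-cars from the old (pre-update) image.
IMAGE="undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1"

launch_sidecar() {
  # The real wrapper would exec podman here; echoed so the sketch is inert.
  echo podman run --detach --net host "${IMAGE}" haproxy -f "$1"
}

launch_sidecar /etc/haproxy/haproxy.cfg
```

Once the old image tag is garbage-collected from (or never pushed to) the undercloud registry, every new side-car spawn fails with the pull error quoted in the description.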
*** Bug 1790467 has been marked as a duplicate of this bug. ***
Hi, the attached review should recreate the hiera earlier in the update process. Currently this is only done during the "converge" phase of the update; the patch adds a specific hiera regeneration as the very first thing done during the update. The attached patch still needs testing, and the master version needs to be merged. Thanks.
Hi, so I've tested the hiera regeneration based on https://review.opendev.org/#/c/705642/:

- start with osp16 GA
- regenerate all images using:
  sudo openstack tripleo container image prepare -e /home/stack/update-container-workaround.yaml --output-env-file /home/stack/update-container-params.yaml
  with a modified yum_install_buildah.yaml so that the images are squashed:
  awk '{if($0 ~/format docker/){print $0 " --squash"}else{print $0}}' \
    /usr/share/ansible/roles/tripleo-modify-image/tasks/yum_install_buildah.yml > /tmp/yum_install_buildah.yml
- start the update process, capturing the wrapper scripts[1] at various points in time:

[root@compute-0 ~]# grep agent-ovn /var/log/extra/*_update_ovn_metadata_haproxy_wrapper_in_container
/var/log/extra/after_update_ovn_metadata_haproxy_wrapper_in_container:    undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix \
/var/log/extra/before_update_ovn_metadata_haproxy_wrapper_in_container:    undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1 \
[root@compute-0 ~]# grep agent-ovn /var/log/extra/*_converge_ovn_metadata_haproxy_wrapper_in_container
/var/log/extra/after_converge_ovn_metadata_haproxy_wrapper_in_container:    undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix \
/var/log/extra/before_converge_ovn_metadata_haproxy_wrapper_in_container:    undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix \

We can see that the wrapper as seen in the container[2] is now correctly changed after the update step, and not only after the converge step. So this patch should resolve the issue. As far as I can see, it's good to go.
[1] the full workaround script used can be found at http://file.rdu.redhat.com/~sathlang/tripleo/osp16/workaround_new_tag_logs.yaml
[2] captured using:
if sudo podman inspect ovn_metadata_agent; then
  sudo podman exec -ti ovn_metadata_agent cat /usr/local/bin/haproxy | \
    sudo tee /var/log/extra/before_update_ovn_metadata_haproxy_wrapper_in_container
fi
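As a quick aid when checking a similar environment, the before/after comparison above boils down to matching tags. A small illustrative helper (not part of any tripleo tooling; the function names are made up, and on a live node the two inputs would come from `podman exec ... cat /usr/local/bin/haproxy` and `podman ps`):

```shell
#!/bin/bash
# Illustrative helper: compare the image tag referenced by the
# in-container wrapper with the tag of the running ovn_metadata_agent
# container image. Both arguments are full image references.
tag_of() { echo "${1##*:}"; }   # keep everything after the last ':'

check_tags() {
  local wrapper_ref="$1" agent_ref="$2"
  if [ "$(tag_of "$wrapper_ref")" = "$(tag_of "$agent_ref")" ]; then
    echo "MATCH"
  else
    echo "MISMATCH: wrapper=$(tag_of "$wrapper_ref") agent=$(tag_of "$agent_ref")"
  fi
}

# With the before-update values seen above, where the wrapper lags behind:
check_tags \
  "undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1" \
  "undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix"
```

A MISMATCH after the update step would indicate the wrapper was not regenerated, i.e. the bug described here.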
One note: I'm not able to reproduce exactly that issue, so if someone from networking QE is willing to try the patch, that would be appreciated. Here is my sequence of podman containers in the "standard" update testing workflow:

[root@compute-0 ~]# cat /var/log/extra/podman-before-overcloud-update-container.log | grep -v Exited
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
1bd12ae362a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1  /bin/bash -c HAPR...  About an hour ago  Up About an hour ago  neutron-haproxy-ovnmeta-93cf7f45-92b2-41fd-a149-3a5864a233d1
5897024baaf2  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1  kolla_start  3 hours ago  Up 3 hours ago  nova_compute
868d6d72b577  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1  kolla_start  3 hours ago  Up 3 hours ago  ovn_metadata_agent
020067ef909f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:20200130.1  kolla_start  3 hours ago  Up 3 hours ago  ovn_controller
2563f91084fb  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1  kolla_start  3 hours ago  Up 3 hours ago  nova_migration_target
a45a6937d4cc  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:20200130.1  kolla_start  3 hours ago  Up 3 hours ago  logrotate_crond
b1485f601600  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:20200130.1  kolla_start  3 hours ago  Up 3 hours ago  iscsid
447a4b8069f5  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1  kolla_start  3 hours ago  Up 3 hours ago  nova_libvirt
fe7f04034024  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1  kolla_start  3 hours ago  Up 3 hours ago  nova_virtlogd

[root@compute-0 ~]# cat /var/log/extra/podman-after-overcloud-update-container.log | grep -v Exited
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
25acd990a6e9  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  nova_compute
279a2f14fee6  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  ovn_metadata_agent
82bc7afff97f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  ovn_controller
f93171151401  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  nova_migration_target
40ac9123bcff  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  logrotate_crond
d264ddf9de87  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  iscsid
6300388c0cba  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  nova_libvirt
29553ea63768  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  nova_virtlogd
1bd12ae362a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1  /bin/bash -c HAPR...  3 hours ago  Up 3 hours ago  neutron-haproxy-ovnmeta-93cf7f45-92b2-41fd-a149-3a5864a233d1

[root@compute-0 ~]# cat /var/log/extra/podman-after-overcloud-converge-container.log | grep -v Exited
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
25acd990a6e9  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  nova_compute
279a2f14fee6  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  ovn_metadata_agent
82bc7afff97f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  ovn_controller
f93171151401  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  nova_migration_target
40ac9123bcff  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  logrotate_crond
d264ddf9de87  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  iscsid
6300388c0cba  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  nova_libvirt
29553ea63768  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix  kolla_start  2 hours ago  Up 2 hours ago  nova_virtlogd
1bd12ae362a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1  /bin/bash -c HAPR...  4 hours ago  Up 4 hours ago  neutron-haproxy-ovnmeta-93cf7f45-92b2-41fd-a149-3a5864a233d1

### AFTER REBOOT

[root@compute-0 ~]# podman ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
925f4e3446b4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix  /bin/bash -c HAPR...  15 hours ago  Up 15 hours ago  neutron-haproxy-ovnmeta-164cff3f-069d-40d3-ab8a-b19580aca6f9
25acd990a6e9  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix  kolla_start  18 hours ago  Up 15 hours ago  nova_compute
279a2f14fee6  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix  kolla_start  18 hours ago  Up 15 hours ago  ovn_metadata_agent
82bc7afff97f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:20200130.1-hotfix  kolla_start  18 hours ago  Up 15 hours ago  ovn_controller
f93171151401  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix  kolla_start  18 hours ago  Up 15 hours ago  nova_migration_target
40ac9123bcff  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:20200130.1-hotfix  kolla_start  18 hours ago  Up 15 hours ago  logrotate_crond
d264ddf9de87  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:20200130.1-hotfix  kolla_start  18 hours ago  Up 15 hours ago  iscsid
6300388c0cba  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix  kolla_start  18 hours ago  Up 15 hours ago  nova_libvirt
29553ea63768  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix  kolla_start  18 hours ago  Up 15 hours ago  nova_virtlogd

We can see that the side-car container is restarted correctly with the new image after the reboot. I think what is missing here to reproduce is the creation of a VM before the converge step.
But again, the wrapper script is now properly updated, so this observation doesn't conflict with the fix working. It would just be good to have the exact reproducer. Thanks,
Hi Sofer,

First, maybe you did not reproduce the issue because you were using only one compute node? When we reproduced this issue with the OVN CI update job, we used two computes, and for some reason the issue only occurs on one of them.

Regarding reproduction/verification using the upstream commit: in the past, Brent asked for something similar. I checked with other colleagues from QE and was told we need the change merged downstream in order to test it. I totally agree with your suggestion (testing the change before porting it downstream), but I don't know how to do it. Thanks!
Hi, hmm, something else must be at work, as production CI uses two computes and we have a VM running on it. Could it be some tempest testing that you trigger on the env that gives rise to that error?

Note: the patch merged upstream, moving to POST. Thanks,
Verified on RHOS_TRUNK-16.0-RHEL-8-20200226.n.1. This puddle includes openstack-tripleo-heat-templates-11.3.2-0.20200211065546.d3d6dc3.el8ost.noarch.rpm.

NOTE: the "Fixed in Version" field in this bug is wrong because the fix is included in an earlier RPM (see https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1102176).

Verified with CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/68/. The second and third Tempest executions (after the update stages) pass, which means port creation and attachment to instances succeed. We can see in the logs that the neutron-metadata-agent-ovn containers used before the update are rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1 (puddle RHOS_TRUNK-16.0-RHEL-8-20200204.n.1), while the containers used after the update are rhosp16-openstack-neutron-metadata-agent-ovn:20200226.1 (puddle RHOS_TRUNK-16.0-RHEL-8-20200226.n.1). No crashes were found in the ovn-metadata-agent logs, for either compute-0 or compute-1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0655