Bug 1801763 - Network unreachable after minor update: unable to pull rhosp16-openstack-neutron-metadata-agent-ovn image [NEEDINFO]
Summary: Network unreachable after minor update: unable to pull rhosp16-openstack-neut...
Keywords:
Status: MODIFIED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z1
Target Release: 16.0
Assignee: Brent Eagles
QA Contact: eolivare
URL:
Whiteboard:
Duplicates: 1790467 (view as bug list)
Depends On: 1790467
Blocks:
 
Reported: 2020-02-11 15:33 UTC by Lon Hohberger
Modified: 2020-02-18 23:21 UTC (History)
CC List: 23 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200218132857.ab1079e.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1790467
Environment:
Last Closed:
Target Upstream Version:
shrjoshi: needinfo? (beagles)


Attachments


Links
System ID Priority Status Summary Last Updated
Launchpad 1860571 None None None 2020-02-11 15:39:23 UTC
Launchpad 1861799 None None None 2020-02-11 15:39:23 UTC
OpenStack gerrit 703855 None MERGED Add DeployIdentifier to extra config containers 2020-02-17 11:10:37 UTC
OpenStack gerrit 707230 None NEW [update] Ensure we get fresh hiera data before running update_steps. 2020-02-17 11:10:37 UTC

Description Lon Hohberger 2020-02-11 15:33:20 UTC
+++ This bug was initially created as a clone of Bug #1790467 +++

Description of problem:
Info obtained from CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/28/:
- 40 failures in the second tempest run, after the update; 32 out of 40 because the VM instance could not be accessed via SSH.
- 35 failures in the third tempest run, after the overcloud reboot; 29 out of 35 because the VM instance could not be accessed via SSH.

I have focused on a failing test that looks simple, test_basic_instance. Find tempest logs attached.


After the network, subnet, secgroup, etc. have been created, a server (instance) is created at 19:44:02:
2020-01-09 19:44:02,817 352985 INFO     [tempest.lib.common.rest_client] Request (NetworkBasicTest:test_basic_instance): 202 POST http://10.0.0.117:8774/v2.1/servers 1.601s
Response...
       Body: b'{"server": {"id": "fd91f455-3a2f-43d6-baea-348826eb8eb4", "links": [{"rel": "self", "href": "http://10.0.0.117:8774/v2.1/servers/fd91f455-3a2f-43d6-baea-348826eb8eb4"}, {"rel": "bookmark", "href": "http://10.0.0.117:8774/servers/fd91f455-3a2f-43d6-baea-348826eb8eb4"}], "OS-DCF:diskConfig": "MANUAL", "security_groups": [{"name": "tempest-secgroup-1181119668"}], "adminPass": "fX3x8b9aYpJM"}}'


According to the tempest logs, it changes from status BUILD to ACTIVE at 19:44:26, and FIP 10.0.0.214 is assigned at 19:44:28. Tempest immediately tries to connect via SSH and all retries fail until 19:49:28.


OVN logs at compute-1 (find them attached):
2020-01-09 19:44:08.937 101840 DEBUG ovsdbapp.backend.ovs_idl.event [-] Matched UPDATE: PortBindingChassisEvent(events=('update',), table='Port_Binding', conditions=None, old_conditions=None) to row=Port_Binding(tunnel_key=3, parent_port=[], logical_port=cad2bcee-aaa0-4aff-9f0f-1da971ac265f, chassis=[<ovs.db.idl.Row object at 0x7f405d79ae80>], mac=['fa:16:3e:8c:4a:2c 10.100.0.6'], options={'requested-chassis': 'compute-1.redhat.local'}, ha_chassis_group=[], gateway_chassis=[], encap=[], external_ids={'neutron:cidrs': '10.100.0.6/28', 'neutron:device_id': 'fd91f455-3a2f-43d6-baea-348826eb8eb4', 'neutron:device_owner': 'compute:nova', 'neutron:network_name': 'neutron-b32cf836-f6dc-4c7a-b7a6-b91351b5a2bb', 'neutron:port_name': '', 'neutron:project_id': '98143f9ea32643de999dd520d1918385', 'neutron:revision_number': '2', 'neutron:security_group_ids': '1b99205b-1b78-4623-ac29-1d690876418c'}, type=, nat_addresses=[], virtual_parent=[], datapath=1d57661f-61ce-4cc3-87ce-946427c38e6a, tag=[]) old=Port_Binding(chassis=[]) matches /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/event.py:44
...
2020-01-09 19:44:09.729 101840 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/external/pids/1d57661f-61ce-4cc3-87ce-946427c38e6a.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
...
Error: unable to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1: unable to pull image: Error initializing source docker://undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1: pinging docker registry returned: Get http://undercloud-0.ctlplane.redhat.local:8787/v2/: dial tcp 192.168.24.1:8787: connect: network is unreachable

It crashes at this moment (19:44:10), and there are more crashes related to these values at 19:44:25 and 19:44:29. There are other crashes as well, related to other values; I guess that is because other tests are running in parallel.
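The pull error itself names both the registry endpoint and the IP address the dial failed on. As a minimal triage sketch (pure text processing; the follow-up commands are only suggestions), one can extract them before checking routing on the affected compute:

```shell
# Extract the registry endpoint and the failing IP from the pull error above.
err='pinging docker registry returned: Get http://undercloud-0.ctlplane.redhat.local:8787/v2/: dial tcp 192.168.24.1:8787: connect: network is unreachable'

registry=$(echo "$err" | sed -n 's|.*Get http://\([^/]*\)/v2/.*|\1|p')
ip=$(echo "$err" | sed -n 's|.*dial tcp \([0-9.]*\):.*|\1|p')
echo "registry endpoint: $registry"
echo "unreachable IP:    $ip"

# On the compute node one would then check, for example:
#   ip route get "$ip"              # is there a route to the ctlplane network?
#   curl -s "http://$registry/v2/"  # does the registry answer at all?
```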


For some reason, there are no crashes on compute-0:
[root@titan07 neutron]# grep -c Traceback ../../../../../compute-1/var/log/containers/neutron/*
../../../../../compute-1/var/log/containers/neutron/kill-script.log:0
../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log:159
../../../../../compute-1/var/log/containers/neutron/ovn-metadata-agent.log.1:168
../../../../../compute-1/var/log/containers/neutron/privsep-helper.log:0
[root@titan07 neutron]# grep -c Traceback ../../../../../compute-0/var/log/containers/neutron/*
../../../../../compute-0/var/log/containers/neutron/kill-script.log:0
../../../../../compute-0/var/log/containers/neutron/ovn-metadata-agent.log:0
../../../../../compute-0/var/log/containers/neutron/ovn-metadata-agent.log.1:0
../../../../../compute-0/var/log/containers/neutron/privsep-helper.log:0








Version-Release number of selected component (if applicable):
update from:
RHOS_TRUNK-16.0-RHEL-8-20200103.n.1
to:
RHOS_TRUNK-16.0-RHEL-8-20200109.n.1
(tested with RHOS_TRUNK-16.0-RHEL-8-20200110.n.3 as well, same fails)

How reproducible:
100% (failed 3/3)


Steps to Reproduce:
1. Run CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

Actual results:
Many tempest tests fail with:
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.214 via SSH timed out.
User: cirros, Password: None

Expected results:
tests passing

Additional info:

--- Additional comment from  on 2020-01-14 08:09:18 UTC ---

Scenario is reproduced in titan07.lab.eng.tlv2.redhat.com and will be available for some time for debugging. It was reproduced updating from RHOS_TRUNK-16.0-RHEL-8-20200109.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200110.n.3

The issue only happens for VMs created at compute-0. VM creation at compute-1 is successful.


According to the logs, undercloud-0.ctlplane.redhat.local is unreachable, but I checked that it is reachable from both computes.

--- Additional comment from Jakub Libosvar on 2020-01-14 14:39:11 UTC ---

The problem here is that the metadata agent and its side-car containers are not restarted during an update, because a restart would make the metadata API unavailable. That means the metadata agent runs with old bits and tries to spawn a side-car container for a new workload in a given network using the image from the pre-update version.

This is not OVN specific and we think it will also cause issues with ml2/ovs side-car containers.
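Jakub's diagnosis can be illustrated with a small sketch (the container lines below are sample data adapted from this report, not live output): find containers whose image tag differs from the one the restarted ovn_metadata_agent runs on.

```shell
# Sample output in `podman ps --format "{{.ID}} {{.Image}} {{.Names}}"` style:
# the agent was restarted onto the new tag, the haproxy side-car was not.
ps_out='868d6d72b577 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200109.1 ovn_metadata_agent
1bd12ae362a4 undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1 neutron-haproxy-ovnmeta-93cf7f45'

# Tag of the image the main agent container runs on
agent_img=$(echo "$ps_out" | awk '$3 == "ovn_metadata_agent" {print $2}')
agent_tag="${agent_img##*:}"

# Any container on a different tag is a candidate stale side-car
stale=$(echo "$ps_out" | while read -r _id img name; do
    tag="${img##*:}"
    [ "$tag" != "$agent_tag" ] && echo "$name ($tag)"
done)
echo "agent tag: $agent_tag"
echo "stale:     $stale"
```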

--- Additional comment from Kamil Sambor on 2020-01-14 17:26:08 UTC ---

Small update: the containers run on the correct images, but in ovn_metadata_agent_wrapper we have the wrong image name. I still need to check why this file is not recreated correctly.
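A quick way to see what Kamil describes (a sketch only; the wrapper content below is a here-doc modeled on the /usr/local/bin/haproxy wrapper quoted later in this bug, and in a real check you would cat the wrapper via podman exec instead): extract the image reference the wrapper would use and compare its tag with the expected one.

```shell
# Hypothetical wrapper content; the stale tag mirrors the pull error above.
wrapper=$(mktemp)
cat > "$wrapper" <<'EOF'
#!/bin/bash
exec podman run --rm \
    undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200103.1 \
    haproxy "$@"
EOF

# Pull out the full image reference, then isolate its tag
image=$(grep -o '[^ ]*rhosp16-openstack-neutron-metadata-agent-ovn:[^ \\]*' "$wrapper")
actual_tag="${image##*:}"
expected_tag="20200109.1"   # tag the updated deployment should reference
echo "wrapper tag: $actual_tag (expected: $expected_tag)"
rm -f "$wrapper"
```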

Comment 3 Lon Hohberger 2020-02-11 15:35:59 UTC
*** Bug 1790467 has been marked as a duplicate of this bug. ***

Comment 4 Sofer Athlan-Guyot 2020-02-11 18:55:41 UTC
Hi,

the attached review should recreate the hiera data earlier in the update process. Currently this is only done during the "converge" phase of the update; the patch adds a specific hiera regeneration as the very first thing done during the update. The attached patch needs testing and a merge to master.

Thanks.

Comment 6 Sofer Athlan-Guyot 2020-02-13 09:43:07 UTC
Hi,

so I've tested the hiera regeneration based on
https://review.opendev.org/#/c/705642/:

 - start with osp16 GA
 - regenerate all images using:

    sudo openstack tripleo container image prepare -e /home/stack/update-container-workaround.yaml --output-env-file /home/stack/update-container-params.yaml

    with a modified yum_install_buildah.yml, so that the images are squashed:

             awk '{if($0 ~/format docker/){print $0 " --squash"}else{print $0}}' \
                /usr/share/ansible/roles/tripleo-modify-image/tasks/yum_install_buildah.yml > /tmp/yum_install_buildah.yml

 - start the update process with logs at various point in time of the
   wrapper scripts[1]

     [root@compute-0 ~]# grep agent-ovn /var/log/extra/*_update_ovn_metadata_haproxy_wrapper_in_container 
     /var/log/extra/after_update_ovn_metadata_haproxy_wrapper_in_container:     undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix \
     /var/log/extra/before_update_ovn_metadata_haproxy_wrapper_in_container:     undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1 \
     [root@compute-0 ~]# grep agent-ovn /var/log/extra/*_converge_ovn_metadata_haproxy_wrapper_in_container 
     /var/log/extra/after_converge_ovn_metadata_haproxy_wrapper_in_container:     undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix \
     /var/log/extra/before_converge_ovn_metadata_haproxy_wrapper_in_container:     undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix \

We can see that the wrapper, as seen in the container[2], is now
correctly updated after the update step, and not only after the
converge.

So this patch should resolve the issue.  As I see it, it's good to go.


[1] full workaround script used can be found there http://file.rdu.redhat.com/~sathlang/tripleo/osp16/workaround_new_tag_logs.yaml
[2] obtained with: if sudo podman inspect ovn_metadata_agent; then sudo podman exec -ti ovn_metadata_agent cat /usr/local/bin/haproxy | sudo tee /var/log/extra/before_update_ovn_metadata_haproxy_wrapper_in_container; fi
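As an aside, the awk rewrite used in comment 6 to append the squash flag can be sanity-checked on sample input (the input lines here are made up, and --squash is assumed to be the intended buildah flag):

```shell
# Lines matching "format docker" get the flag appended; others pass through.
out=$(printf '%s\n' \
    'buildah bud --format docker -t modified-image .' \
    'yum install -y some-package' \
  | awk '{if($0 ~/format docker/){print $0 " --squash"}else{print $0}}')
echo "$out"
```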

Comment 7 Sofer Athlan-Guyot 2020-02-13 10:11:56 UTC
One note: I'm not able to reproduce exactly that issue, so if someone from networking QE is willing to try the patch, that would be appreciated.

Here is my sequence of podman containers in the "standard" update testing workflow:

[root@compute-0 ~]# cat /var/log/extra/podman-before-overcloud-update-container.log   | grep -v Exited
CONTAINER ID  IMAGE                                                                                                    COMMAND               CREATED            STATUS                  PORTS  NAMES
1bd12ae362a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1  /bin/bash -c HAPR...  About an hour ago  Up About an hour ago           neutron-haproxy-ovnmeta-93cf7f45-92b2-41fd-a149-3a5864a233d1
5897024baaf2  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1                kolla_start           3 hours ago        Up 3 hours ago                 nova_compute
868d6d72b577  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1  kolla_start           3 hours ago        Up 3 hours ago                 ovn_metadata_agent
020067ef909f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:20200130.1              kolla_start           3 hours ago        Up 3 hours ago                 ovn_controller
2563f91084fb  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1                kolla_start           3 hours ago        Up 3 hours ago                 nova_migration_target
a45a6937d4cc  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:20200130.1                        kolla_start           3 hours ago        Up 3 hours ago                 logrotate_crond
b1485f601600  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:20200130.1                      kolla_start           3 hours ago        Up 3 hours ago                 iscsid
447a4b8069f5  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1                kolla_start           3 hours ago        Up 3 hours ago                 nova_libvirt
fe7f04034024  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1                kolla_start           3 hours ago        Up 3 hours ago                 nova_virtlogd
[root@compute-0 ~]# cat /var/log/extra/podman-after-overcloud-update-container.log   | grep -v Exited
CONTAINER ID  IMAGE                                                                                                           COMMAND               CREATED      STATUS                  PORTS  NAMES
25acd990a6e9  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix                kolla_start           2 hours ago  Up 2 hours ago                 nova_compute
279a2f14fee6  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix  kolla_start           2 hours ago  Up 2 hours ago                 ovn_metadata_agent
82bc7afff97f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:20200130.1-hotfix              kolla_start           2 hours ago  Up 2 hours ago                 ovn_controller
f93171151401  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix                kolla_start           2 hours ago  Up 2 hours ago                 nova_migration_target
40ac9123bcff  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:20200130.1-hotfix                        kolla_start           2 hours ago  Up 2 hours ago                 logrotate_crond
d264ddf9de87  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:20200130.1-hotfix                      kolla_start           2 hours ago  Up 2 hours ago                 iscsid
6300388c0cba  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix                kolla_start           2 hours ago  Up 2 hours ago                 nova_libvirt
29553ea63768  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix                kolla_start           2 hours ago  Up 2 hours ago                 nova_virtlogd
1bd12ae362a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1         /bin/bash -c HAPR...  3 hours ago  Up 3 hours ago                 neutron-haproxy-ovnmeta-93cf7f45-92b2-41fd-a149-3a5864a233d1
[root@compute-0 ~]# cat /var/log/extra/podman-after-overcloud-converge-container.log   | grep -v Exited
CONTAINER ID  IMAGE                                                                                                           COMMAND               CREATED         STATUS                     PORTS  NAMES
25acd990a6e9  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix                kolla_start           2 hours ago     Up 2 hours ago                    nova_compute
279a2f14fee6  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix  kolla_start           2 hours ago     Up 2 hours ago                    ovn_metadata_agent
82bc7afff97f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:20200130.1-hotfix              kolla_start           2 hours ago     Up 2 hours ago                    ovn_controller
f93171151401  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix                kolla_start           2 hours ago     Up 2 hours ago                    nova_migration_target
40ac9123bcff  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:20200130.1-hotfix                        kolla_start           2 hours ago     Up 2 hours ago                    logrotate_crond
d264ddf9de87  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:20200130.1-hotfix                      kolla_start           2 hours ago     Up 2 hours ago                    iscsid
6300388c0cba  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix                kolla_start           2 hours ago     Up 2 hours ago                    nova_libvirt
29553ea63768  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix                kolla_start           2 hours ago     Up 2 hours ago                    nova_virtlogd
1bd12ae362a4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1         /bin/bash -c HAPR...  4 hours ago     Up 4 hours ago                    neutron-haproxy-ovnmeta-93cf7f45-92b2-41fd-a149-3a5864a233d1

### AFTER REBOOT

[root@compute-0 ~]# podman ps 
CONTAINER ID  IMAGE                                                                                                           COMMAND               CREATED       STATUS           PORTS  NAMES
925f4e3446b4  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix  /bin/bash -c HAPR...  15 hours ago  Up 15 hours ago         neutron-haproxy-ovnmeta-164cff3f-069d-40d3-ab8a-b19580aca6f9
25acd990a6e9  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix                kolla_start           18 hours ago  Up 15 hours ago         nova_compute
279a2f14fee6  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-metadata-agent-ovn:20200130.1-hotfix  kolla_start           18 hours ago  Up 15 hours ago         ovn_metadata_agent
82bc7afff97f  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-ovn-controller:20200130.1-hotfix              kolla_start           18 hours ago  Up 15 hours ago         ovn_controller
f93171151401  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-compute:20200130.1-hotfix                kolla_start           18 hours ago  Up 15 hours ago         nova_migration_target
40ac9123bcff  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cron:20200130.1-hotfix                        kolla_start           18 hours ago  Up 15 hours ago         logrotate_crond
d264ddf9de87  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-iscsid:20200130.1-hotfix                      kolla_start           18 hours ago  Up 15 hours ago         iscsid
6300388c0cba  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix                kolla_start           18 hours ago  Up 15 hours ago         nova_libvirt
29553ea63768  undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1-hotfix                kolla_start           18 hours ago  Up 15 hours ago         nova_virtlogd


We can see that the sidecar container is correctly restarted with the
new image after the reboot.  I think what is missing to reproduce the
issue is creating a VM before the converge step.

In any case, the wrapper script is now properly updated, which is
consistent with the fix working.  It would just be good to have the
exact reproducer.

Thanks,

Comment 8 eolivare 2020-02-14 11:28:54 UTC
Hi Sofer,

First, maybe you did not reproduce the issue because you were using only one compute node? When we reproduced this issue with the OVN CI update job, we used two computes, and for some reason the issue only occurs on one of them.

Regarding reproduction/verification using the upstream commit, Brent asked for something similar in the past. I checked with other colleagues from QE and was told we need the change merged downstream in order to test it.
I totally agree with your suggestion (testing the change before porting it downstream), but I don't know how to do it.

Thanks!

Comment 9 Sofer Athlan-Guyot 2020-02-18 09:41:48 UTC
Hi,

Hmm, something else must be at work, as production CI uses two computes and we have a VM running on them. Could it be some tempest testing that you trigger on the env that gives rise to that error?

Note: the patch merged upstream, moving to POST.

Thanks,

