Bug 1873742

Summary: Unhealthy ovn_metadata_agent during 13-16.1 FFU
Product: Red Hat OpenStack Reporter: Eduardo Olivares <eolivare>
Component: openstack-tripleo-heat-templates Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA QA Contact: Eduardo Olivares <eolivare>
Severity: high Docs Contact:
Priority: urgent    
Version: 16.1 (Train) CC: apevec, ekuris, jlibosva, jpretori, jschluet, lbezdick, lhh, majopela, mburns, scohen
Target Milestone: z3 Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170157.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-12-15 18:36:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eduardo Olivares 2020-08-29 19:38:18 UTC
Description of problem:
The issue happens during the FFU procedure, while the Overcloud FFU Upgrade stage is being performed. The topology used is HA-no-ceph (3 controllers and 2 computes).

The script overcloud_upgrade_run-controller-0.sh upgrades the first controller node to osp16.1, disables controller-1 and controller-2, and creates a Pacemaker cluster containing only controller-0. Meanwhile, both compute nodes are still running osp13.

A workload was created before the upgrade procedure started, and the corresponding VM instance can still be reached.

At this point (with only controller-0 upgraded), a new workload is created successfully, but its FIP is not reachable.

Findings:
1) ovn_metadata_agent (still running the osp13 image) is unhealthy:
38b5aa21f39c        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-metadata-agent-ovn:20200730.1                "dumb-init --singl..."   23 hours ago        Up 7 minutes (unhealthy)                       ovn_metadata_agent
2) The neutron-haproxy-ovnmeta sidecar container is not created for the new workload (a sidecar container exists only for the old workload).
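Both findings can be checked mechanically. The Python sketch below is a hypothetical helper, not something used in this report: it extracts the health state from a `docker ps` STATUS string such as the one shown for ovn_metadata_agent, and lists networks that lack a haproxy sidecar. It assumes sidecars are named `neutron-haproxy-ovnmeta-<network-id>` and that the caller supplies the container names and the network IDs of instances on the compute node; the naming convention, both inputs, and the function names are assumptions.

```python
import re
from typing import Iterable, List, Optional

# Assumed naming convention for the per-network haproxy sidecar containers.
SIDECAR_PREFIX = "neutron-haproxy-ovnmeta-"

# Docker appends "(healthy)", "(unhealthy)" or "(health: starting)" to STATUS
# for containers that define a healthcheck.
_HEALTH_RE = re.compile(r"\((healthy|unhealthy|health: starting)\)\s*$")


def container_health(status: str) -> Optional[str]:
    """Return the health state from a `docker ps` STATUS string, or None."""
    match = _HEALTH_RE.search(status)
    return match.group(1) if match else None


def networks_missing_sidecar(container_names: Iterable[str],
                             network_ids: Iterable[str]) -> List[str]:
    """Return the network IDs that have no matching haproxy sidecar container."""
    covered = {
        name[len(SIDECAR_PREFIX):]
        for name in container_names
        if name.startswith(SIDECAR_PREFIX)
    }
    return sorted(set(network_ids) - covered)
```

For the status shown above, `container_health("Up 7 minutes (unhealthy)")` returns `"unhealthy"`; a container without a healthcheck yields `None`.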


The issue is reproduced on a bare-metal (BM) server: panther23.lab.eng.tlv2.redhat.com
The issue can be reproduced from the undercloud node on panther23 by running /home/stack/workload_launch.sh (the newly created workload can be removed by running /home/stack/workload_launch.sh cleanup).




Version-Release number of selected component (if applicable):
FFU performed from osp13 (2020-08-05.1) to osp16.1 (RHOS-16.1-RHEL-8-20200821.n.0)

How reproducible:
100% - several OVN FFU jobs reproduced it, both with and without DVR enabled.
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv6-geneve-HA-no-ceph-ovn-dvr/
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp-ipv6-geneve-HA-no-ceph-ovn/


Steps to Reproduce:
1. Run an OVN FFU job such as the ones linked above. The issue reproduces after the first controller node is upgraded, during the Overcloud FFU Upgrade Run stage.

Comment 1 Jakub Libosvar 2020-09-08 12:54:38 UTC
This has the same root cause as bug 1871834: the metadata agent config still points to the old VIP, while the OVN Southbound DB (port 6642) is listening on a new one.

[root@controller-0 ~]# ss -putnal | grep 6642
tcp   LISTEN 0      10          [fd00:fd00:fd00:2000::136]:6642            [::]:*                                                                                users:(("ovsdb-server",pid=672283,fd=14))

[root@compute-1 ~]# grep ovn_sb_connection /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/networking-ovn/networking-ovn-metadata-agent.ini
ovn_sb_connection=tcp:[fd00:fd00:fd00:2000::ef]:6642
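The mismatch can be spotted by comparing the two outputs above. A minimal Python sketch, assuming a single `tcp:` or `ssl:` target in `ovn_sb_connection` and the `ss -putnal` column layout shown in this comment; the function names are hypothetical and this is not how the fix was implemented.

```python
import ipaddress
import re
from typing import Tuple, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]

# Matches "tcp:[v6addr]:port" or "tcp:v4addr:port" ("ssl:" is accepted too).
_TARGET_RE = re.compile(r"^(tcp|ssl):(?:\[([0-9A-Fa-f:.]+)\]|([0-9.]+)):(\d+)$")


def parse_sb_connection(line: str) -> Tuple[IPAddress, int]:
    """Parse an ovn_sb_connection value (with or without the 'key=' prefix)."""
    value = line.split("=", 1)[-1].strip()
    match = _TARGET_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported OVSDB target: {value!r}")
    host = match.group(2) or match.group(3)
    return ipaddress.ip_address(host), int(match.group(4))


def parse_ss_listener(ss_line: str) -> Tuple[IPAddress, int]:
    """Return the local address and port from one `ss -putnal` output line."""
    local = ss_line.split()[4]  # columns: Netid State Recv-Q Send-Q Local Peer ...
    host, _, port = local.rpartition(":")
    host = host.strip("[]")
    if host == "*":  # some ss versions print '*' for a wildcard listener
        host = "::"
    return ipaddress.ip_address(host), int(port)


def sb_connection_matches(ini_line: str, ss_line: str) -> bool:
    """True if the configured SB target is served by the listening socket."""
    conf_ip, conf_port = parse_sb_connection(ini_line)
    listen_ip, listen_port = parse_ss_listener(ss_line)
    if conf_port != listen_port:
        return False
    # Treat a wildcard listener (0.0.0.0, :: or '*') as matching;
    # address-family details are ignored in this sketch.
    return listen_ip.is_unspecified or listen_ip == conf_ip
```

Given the two lines from compute-1 and controller-0 above, `sb_connection_matches` returns `False`: the agent targets `fd00:fd00:fd00:2000::ef`, while ovsdb-server listens on `fd00:fd00:fd00:2000::136`.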

Comment 20 errata-xmlrpc 2020-12-15 18:36:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5413