Bug 1813758

Summary: 13z11 (March 10 2020) update breaks neutron_* healthchecks
Product: Red Hat OpenStack
Reporter: Brendan Shephard <bshephar>
Component: openstack-tripleo-common
Assignee: Cédric Jeanneret <cjeanner>
Status: CLOSED ERRATA
QA Contact: David Rosenfeld <drosenfe>
Severity: high
Priority: high
Version: 13.0 (Queens)
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-common-8.7.1-15.el7ost
Doc Type: No Doc Update
CC: adakopou, akaris, amoralej, amuller, asimonel, astupnik, bcafarel, ccopello, chris.smart, cjeanner, ekuris, gkadam, harsh.kotak, jhakimra, juanluis.alarcon, jvisser, knoha, ltamagno, mburns, mmethot, msufiyan, nalmond, rbarrott, rrubins, satmakur, scohen, slinaber, sukar, vkhitrin, vz.mec
Last Closed: 2020-04-02 10:05:44 UTC
Type: Bug

Description Brendan Shephard 2020-03-16 01:12:25 UTC
Description of problem:
After updating OSP13 to z11, the following containers appear as unhealthy:

[root@overcloud-controller-1 ~]# docker ps | grep unhealthy
3353c42ec8d5        192.168.24.1:8787/rhosp13/openstack-neutron-openvswitch-agent:13.0-114   "dumb-init --singl..."   40 minutes ago      Up 6 minutes (unhealthy)                            neutron_ovs_agent
23f1fd176f20        192.168.24.1:8787/rhosp13/openstack-neutron-l3-agent:13.0-113            "dumb-init --singl..."   40 minutes ago      Up 40 minutes (unhealthy)                           neutron_l3_agent
f2b3f4b4a6ec        192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent:13.0-115      "dumb-init --singl..."   40 minutes ago      Up 40 minutes (unhealthy)                           neutron_metadata_agent
10449baa986b        192.168.24.1:8787/rhosp13/openstack-neutron-dhcp-agent:13.0-114          "dumb-init --singl..."   40 minutes ago      Up 40 minutes (unhealthy)                           neutron_dhcp
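
To see why Docker flags a given container, the recorded healthcheck attempts can be dumped with the standard docker CLI (shown here for neutron_ovs_agent from the listing above):

[root@overcloud-controller-1 ~]# docker inspect --format '{{json .State.Health}}' neutron_ovs_agent | python -m json.tool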

Version-Release number of selected component (if applicable):
openstack-tripleo-common-8.7.1-12.el7ost.noarch.rpm

How reproducible:
100%

Steps to Reproduce:
1. Update to z11

Actual results:
The neutron_* containers all show as unhealthy.

Expected results:
The health check should report healthy, since the services are in fact connected to RabbitMQ.

Additional info:
All of the processes now appear as python instead of the process names the health check greps for (see the sketch after the z11 output below):

Running this on a pre-z11 node shows the expected process names:

[heat-admin@overcloud-compute-0 ~]$ sudo ss -ntp | awk '{print $5,"-",$6}' | egrep ":5672"
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=51))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=31))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=42))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=55))
x.x.x.53:5672 - users:(("nova-compute",pid=989999,fd=4))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=43))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=53))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=40))
x.x.x.54:5672 - users:(("nova-compute",pid=989999,fd=57))
x.x.x.54:5672 - users:(("nova-compute",pid=989999,fd=56))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=46))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=50))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=37))
x.x.x.61:5672 - users:(("nova-compute",pid=989999,fd=22))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=49))
x.x.x.54:5672 - users:(("nova-compute",pid=989999,fd=23))
x.x.x.53:5672 - users:(("nova-compute",pid=989999,fd=54))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=47))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=52))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=54))
x.x.x.54:5672 - users:(("ceilometer-poll",pid=545626,fd=169))

Whereas on z11 we only see /usr/bin/python:

[heat-admin@compute-3 ~]$ sudo ss -ntp | awk '{print $5,"-",$6}' | egrep ":5672"
x.x.x.50:5672 - users:(("nova-compute",pid=692482,fd=25))
x.x.x.75:5672 - users:(("nova-compute",pid=692482,fd=37))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=54))
x.x.x.75:5672 - users:(("ceilometer-poll",pid=692078,fd=134))
x.x.x.75:5672 - users:(("nova-compute",pid=692482,fd=38))
x.x.x.80:5672 - users:(("/usr/bin/python",pid=692722,fd=28))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=55))
x.x.x.75:5672 - users:(("nova-compute",pid=692482,fd=41))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692722,fd=31))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=57))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=50))
x.x.x.80:5672 - users:(("/usr/bin/python",pid=692722,fd=30))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692722,fd=29))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=51))
x.x.x.50:5672 - users:(("nova-compute",pid=692482,fd=26))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=40))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=24))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=46))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692914,fd=13))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=56))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692914,fd=15))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=49))
x.x.x.80:5672 - users:(("nova-compute",pid=692482,fd=40))
x.x.x.80:5672 - users:(("/usr/bin/python",pid=692974,fd=52))
x.x.x.80:5672 - users:(("/usr/bin/python",pid=692974,fd=53))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=58))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=29))
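
For context, the pre-fix check is essentially a name-based grep over this ss output, along these lines (a simplified sketch with an illustrative function name, not the exact openstack-tripleo-common code):

healthcheck_port_by_name () {
    process=$1
    port=$2
    # Matches the quoted process name in ss's users:(("...")) column;
    # this breaks once the kernel reports the process as /usr/bin/python.
    ss -ntp | grep ":${port}" | grep -q "\"${process}"
}
healthcheck_port_by_name neutron-openvsw 5672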


It looks like something similar has been discovered for the undercloud in upstream releases:
https://opendev.org/openstack/tripleo-common/commit/5312bf19c8f820ac65514885aebdc2dc4776d72d

I am working on a reproducer environment at the moment and haven't been able to verify the new pgrep from this change yet. If it does fix the issue, I'll update the BZ accordingly.
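
Roughly, the linked change resolves the PIDs with pgrep -f, which matches the full command line (still containing the agent name even when the comm field is just python), and then matches connections by PID rather than by name. A hedged sketch of the idea, with an illustrative function name, not the literal patch:

healthcheck_port_by_pid () {
    process=$1
    port=$2
    # -f matches against the full command line; -d '|' joins the PIDs
    # into an alternation usable in the grep below.
    pids=$(pgrep -d '|' -f "$process")
    [ -n "$pids" ] && ss -ntp | grep ":${port}" | grep -qE "pid=(${pids}),"
}
healthcheck_port_by_pid neutron-openvsw 5672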

Comment 1 Cédric Jeanneret 2020-03-17 06:28:44 UTC
Hello there,

I've just started upstream backports of my old patch; we "just" need to get it into two old releases (Rocky, then Queens) - hopefully it will be done quickly. If needed, I can probably do the downstream work earlier; the change isn't that complicated in the end.

The first backport, to upstream Rocky, is linked here.

Setting blocker? since it might compromise stability in an LTS release.

Cheers,

C.

Comment 4 Saravanan KR 2020-03-18 07:51:32 UTC
*** Bug 1814230 has been marked as a duplicate of this bug. ***

Comment 8 Cédric Jeanneret 2020-03-18 15:30:57 UTC
Dear all,

Care to tell me if the one patch is enough? We might need some others I made about 10 months ago upstream, but I'm pretty sure the main issue is solved with the current one.

Feel free to ping me back if needed!

Cheers,

C.

Comment 13 Bernard Cafarelli 2020-03-19 12:26:01 UTC
*** Bug 1814130 has been marked as a duplicate of this bug. ***

Comment 14 Alex Schultz 2020-03-20 19:53:43 UTC
*** Bug 1815626 has been marked as a duplicate of this bug. ***

Comment 21 Marc Methot 2020-03-24 22:43:35 UTC
Hey @cedric,

We should consider pulling the other patches for https://bugs.launchpad.net/tripleo/+bug/1843555
Examples of what breaks if we don't:
- octavia healthcheck -> https://review.opendev.org/#/c/681989/
- nova will start to have issues without sudo -> https://review.opendev.org/#/c/681525/

Or should we create a preemptive bug and request further backports?
Also, I like your latest commits to tripleo-common upstream, getting rid of `ss` altogether, although I haven't checked whether we'd have issues backporting that.


Cheers,
Marc Methot

Comment 22 Cédric Jeanneret 2020-03-25 07:52:59 UTC
Hello Marc,

Hmmm, I think creating dedicated BZs, linked to this one, is the right thing to do. Here we're talking about neutron, while you're pointing at nova and octavia. Everything is in the same package (tripleo-common), but in order to do a clean backport pointing to the right issue(s), it's better to actually have the right issue(s) filed.

Care to create those BZs and add some "depends-on" or similar so that we can easily find the whole lot of them?
Also, the "sudo" created some nasty issues with the healthcheck hardening; that's why I tried to find a better way - this also allows us to drop some pipes, which is always a good thing. For the record, you're talking about https://review.opendev.org/708339

So, imho, plan is:
- create relevant BZ with info about "how to verify" the change
- if possible, point to the needed backports (since you seem to know what you want, that's easier :))
- link those new BZ to the current one

On my side, I'll create the relevant launchpad issues based on the BZ, since we have to get them for the upstream backports, and manage to actually get those backports done.

How does it sound?

Cheers,

C

Comment 23 Marc Methot 2020-03-26 13:11:35 UTC
Hey Cedric,

I tested backporting just what you selected so far, and I could not reproduce the nova issue seen upstream. I'm unsure whether I made a mistake or whether we run the container slightly differently (I only tested RHOSP 13, not upstream Queens). My main concern was trading one failed healthcheck for another; since we would apparently also need to backport something else (container config) to replicate the issue, we're fine for now.

Octavia is a different issue; the healthcheck is just bad. I don't have any cases against it though, so that's also fine for now.


Thanks,
MM

Comment 28 David Rosenfeld 2020-03-28 01:30:29 UTC
After the fix there are no unhealthy containers, and the four containers listed in the problem description as unhealthy now appear as healthy:

[heat-admin@controller-0 ~]$ sudo docker ps | grep unhealthy
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ sudo docker ps | grep openstack-neutron-openvswitch-agent
9caf57205ad0        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-openvswitch-agent:20200325.1   "dumb-init --singl..."   39 minutes ago      Up 39 minutes (healthy)                          neutron_ovs_agent
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ sudo docker ps | grep openstack-neutron-l3-agent
75e2beb2266e        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-l3-agent:20200325.1            "dumb-init --singl..."   40 minutes ago      Up 40 minutes (healthy)                          neutron_l3_agent
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ sudo docker ps | grep openstack-neutron-metadata-agent
691d585a2ae7        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-metadata-agent:20200325.1      "dumb-init --singl..."   40 minutes ago      Up 40 minutes (healthy)                          neutron_metadata_agent
[heat-admin@controller-0 ~]$ 
[heat-admin@controller-0 ~]$ sudo docker ps | grep openstack-neutron-dhcp-agent
3b70f4f6077f        192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-dhcp-agent:20200325.1          "dumb-init --singl..."   40 minutes ago      Up 40 minutes (healthy)                          neutron_dhcp
[heat-admin@controller-0 ~]$
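
The healthcheck can also be exercised directly for a single container (/openstack/healthcheck is the conventional script location in TripleO images; adjust if it differs), where an exit code of 0 means healthy:

[heat-admin@controller-0 ~]$ sudo docker exec neutron_ovs_agent /openstack/healthcheck; echo $?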

Comment 37 errata-xmlrpc 2020-04-02 10:05:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1297