Bug 1813758
| Summary: | 13z11 (March 10 2020) update breaks neutron_* healthchecks | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Brendan Shephard <bshephar> |
| Component: | openstack-tripleo-common | Assignee: | Cédric Jeanneret <cjeanner> |
| Status: | CLOSED ERRATA | QA Contact: | David Rosenfeld <drosenfe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 13.0 (Queens) | CC: | adakopou, akaris, amoralej, amuller, asimonel, astupnik, bcafarel, ccopello, chris.smart, cjeanner, ekuris, gkadam, harsh.kotak, jhakimra, juanluis.alarcon, jvisser, knoha, ltamagno, mburns, mmethot, msufiyan, nalmond, rbarrott, rrubins, satmakur, scohen, slinaber, sukar, vkhitrin, vz.mec |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-common-8.7.1-15.el7ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-02 10:05:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Hello there,

Just started the upstream backports of my old patch; we "just" need to get it into two old releases (Rocky, then Queens) - hopefully it will be done quickly. If needed, I can probably do the downstream work earlier; the change isn't that complicated in the end. Linked here is the first backport, for upstream Rocky.

Setting the blocker? flag, since this might affect stability in an LTS release.

Cheers,

C.

*** Bug 1814230 has been marked as a duplicate of this bug. ***

Dear all,

Care to tell me if the one patch is enough? We might need some others I made about 10 months ago upstream, but I'm pretty sure the main issue is solved with the current one. Feel free to ping me back if needed!

Cheers,

C.

*** Bug 1814130 has been marked as a duplicate of this bug. ***

*** Bug 1815626 has been marked as a duplicate of this bug. ***

Hey @cedric,

We should consider pulling the other patches for https://bugs.launchpad.net/tripleo/+bug/1843555. Examples of what happens if we don't:

- the octavia healthcheck breaks -> https://review.opendev.org/#/c/681989/
- nova will start to have issues without sudo -> https://review.opendev.org/#/c/681525/

Or should we create a preemptive bug and request further backports?

Also, I like your latest commits to tripleo-common upstream, getting rid of `ss` altogether, although I didn't check whether we'd have issues backporting that.

Cheers,
Marc Methot

Hello Marc,

Hmmm, I think creating dedicated BZs linked to this one is the right thing to do. Here we're talking about neutron, while you're pointing at nova and octavia. All of it lives in the same package (tripleo-common), but in order to do a clean backport pointing to the right issue(s), it's better to actually have the right issue(s). Care to create those BZs and add some "depends-on" or similar, so that we can easily find the whole lot of them?

Also, the "sudo" created some nice issues with the healthcheck hardening; that's why I tried to find a better way - this also allows us to drop some pipes, which is always a good thing. For the record, you're talking about https://review.opendev.org/708339 (a rough sketch of the ss-free idea follows this comment thread below).

So, imho, the plan is:

- create the relevant BZs with info about how to verify the change
- if possible, point to the needed backports (since you seem to know what you want, that's easier :))
- link those new BZs to the current one

On my side, I'll create the relevant Launchpad issues based on the BZs, since we need them for the upstream backports, and make sure those backports actually land.

How does that sound?

Cheers,

C

Hey Cedric,

I tested backporting just what you selected so far and could not reproduce the nova issue they saw upstream. I'm unsure whether I made a mistake or we run the container slightly differently (I only tested RHOSP 13, not upstream Queens). That was my main concern (trading one failed healthcheck for another); since we would apparently need to backport something else as well (the container config) to replicate the issue, we're fine for now. Octavia is a different issue - its healthcheck is just bad - but I don't have any cases against it, so that's also fine.

Thanks,
MM

After the fix there are no unhealthy containers, and the four containers listed in the problem description as unhealthy now appear as healthy:

```
[heat-admin@controller-0 ~]$ sudo docker ps | grep unhealthy
[heat-admin@controller-0 ~]$ sudo docker ps | grep openstack-neutron-openvswitch-agent
9caf57205ad0  192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-openvswitch-agent:20200325.1  "dumb-init --singl..."  39 minutes ago  Up 39 minutes (healthy)  neutron_ovs_agent
[heat-admin@controller-0 ~]$ sudo docker ps | grep openstack-neutron-l3-agent
75e2beb2266e  192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-l3-agent:20200325.1  "dumb-init --singl..."  40 minutes ago  Up 40 minutes (healthy)  neutron_l3_agent
[heat-admin@controller-0 ~]$ sudo docker ps | grep openstack-neutron-metadata-agent
691d585a2ae7  192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-metadata-agent:20200325.1  "dumb-init --singl..."  40 minutes ago  Up 40 minutes (healthy)  neutron_metadata_agent
[heat-admin@controller-0 ~]$ sudo docker ps | grep openstack-neutron-dhcp-agent
3b70f4f6077f  192.168.24.1:8787/rh-osbs/rhosp13-openstack-neutron-dhcp-agent:20200325.1  "dumb-init --singl..."  40 minutes ago  Up 40 minutes (healthy)  neutron_dhcp
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1297
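As a side note on the `ss`-free approach discussed in the comments above: one way a healthcheck can confirm that a service process is connected to a port without shelling out to `ss` (and without the sudo and pipe issues mentioned there) is to match socket inodes from `/proc/net/tcp` against the file descriptors of candidate PIDs. The sketch below only illustrates that idea, using example values from this report; it is not the literal code of https://review.opendev.org/708339 or of the shipped tripleo-common healthcheck.

```bash
#!/bin/bash
# Minimal sketch: does any process matching $process hold a TCP connection
# to $port? Uses only /proc and pgrep, no ss. Needs root (or the process
# owner) to read /proc/<pid>/fd. Example values from this bug report.

process="neutron-l3-agent"
port=5672

# /proc/net/tcp lists the remote endpoint (rem_address, field 3) as
# HEXIP:HEXPORT, and the socket inode in field 10.
hexport=$(printf '%04X' "${port}")
inodes=$(cat /proc/net/tcp /proc/net/tcp6 2>/dev/null \
         | awk -v p=":${hexport}$" '$3 ~ p {print $10}')

# pgrep -f matches the full command line, so it still finds the agent
# even when the process name itself shows up as /usr/bin/python.
for pid in $(pgrep -f "${process}"); do
    for ino in ${inodes}; do
        # Every open socket appears as a "socket:[<inode>]" symlink in fd/.
        if ls -l "/proc/${pid}/fd" 2>/dev/null | grep -q "socket:\[${ino}\]"; then
            echo "healthy: pid ${pid} connected to port ${port}"
            exit 0
        fi
    done
done

echo "unhealthy: no ${process} connection to port ${port} found"
exit 1
```

Nothing here depends on how the process reports its own name, which is exactly what broke in this bug. For verification, the health state shown in the `docker ps` output above can also be queried directly, e.g. `docker inspect --format '{{.State.Health.Status}}' neutron_l3_agent`.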
Description of problem:

After updating OSP 13 to z11, the following containers appear as unhealthy:

```
[root@overcloud-controller-1 ~]# docker ps | grep unhealthy
3353c42ec8d5  192.168.24.1:8787/rhosp13/openstack-neutron-openvswitch-agent:13.0-114  "dumb-init --singl..."  40 minutes ago  Up 6 minutes (unhealthy)   neutron_ovs_agent
23f1fd176f20  192.168.24.1:8787/rhosp13/openstack-neutron-l3-agent:13.0-113  "dumb-init --singl..."  40 minutes ago  Up 40 minutes (unhealthy)  neutron_l3_agent
f2b3f4b4a6ec  192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent:13.0-115  "dumb-init --singl..."  40 minutes ago  Up 40 minutes (unhealthy)  neutron_metadata_agent
10449baa986b  192.168.24.1:8787/rhosp13/openstack-neutron-dhcp-agent:13.0-114  "dumb-init --singl..."  40 minutes ago  Up 40 minutes (unhealthy)  neutron_dhcp
```

Version-Release number of selected component (if applicable):
openstack-tripleo-common-8.7.1-12.el7ost.noarch.rpm

How reproducible:
100%

Steps to Reproduce:
1. Update to z11

Actual results:
The neutron_* containers all show as unhealthy.

Expected results:
The healthcheck should return healthy, since the services are definitely connected to RabbitMQ.

Additional info:
All of the processes now appear as python instead of the respective names the healthcheck is looking for.

On a pre-z11 node we see the process names:

```
[heat-admin@overcloud-compute-0 ~]$ sudo ss -ntp | awk '{print $5,"-",$6}' | egrep ":5672"
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=51))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=31))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=42))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=55))
x.x.x.53:5672 - users:(("nova-compute",pid=989999,fd=4))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=43))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=53))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=40))
x.x.x.54:5672 - users:(("nova-compute",pid=989999,fd=57))
x.x.x.54:5672 - users:(("nova-compute",pid=989999,fd=56))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=46))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=50))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=37))
x.x.x.61:5672 - users:(("nova-compute",pid=989999,fd=22))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=49))
x.x.x.54:5672 - users:(("nova-compute",pid=989999,fd=23))
x.x.x.53:5672 - users:(("nova-compute",pid=989999,fd=54))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=47))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=52))
x.x.x.54:5672 - users:(("neutron-openvsw",pid=59490,fd=54))
x.x.x.54:5672 - users:(("ceilometer-poll",pid=545626,fd=169))
```

Whereas on a z11 node we only get python:

```
[heat-admin@compute-3 ~]$ sudo ss -ntp | awk '{print $5,"-",$6}' | egrep ":5672"
x.x.x.50:5672 - users:(("nova-compute",pid=692482,fd=25))
x.x.x.75:5672 - users:(("nova-compute",pid=692482,fd=37))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=54))
x.x.x.75:5672 - users:(("ceilometer-poll",pid=692078,fd=134))
x.x.x.75:5672 - users:(("nova-compute",pid=692482,fd=38))
x.x.x.80:5672 - users:(("/usr/bin/python",pid=692722,fd=28))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=55))
x.x.x.75:5672 - users:(("nova-compute",pid=692482,fd=41))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692722,fd=31))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=57))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=50))
x.x.x.80:5672 - users:(("/usr/bin/python",pid=692722,fd=30))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692722,fd=29))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=51))
x.x.x.50:5672 - users:(("nova-compute",pid=692482,fd=26))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=40))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=24))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=46))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692914,fd=13))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=56))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692914,fd=15))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=49))
x.x.x.80:5672 - users:(("nova-compute",pid=692482,fd=40))
x.x.x.80:5672 - users:(("/usr/bin/python",pid=692974,fd=52))
x.x.x.80:5672 - users:(("/usr/bin/python",pid=692974,fd=53))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=58))
x.x.x.75:5672 - users:(("/usr/bin/python",pid=692974,fd=29))
```

It looks like something similar has been discovered for the undercloud in upstream releases:
https://opendev.org/openstack/tripleo-common/commit/5312bf19c8f820ac65514885aebdc2dc4776d72d

I am working on a reproducer env atm and haven't been able to verify the new pgrep from this change yet. But if that does indeed fix the issue, I'll update the BZ accordingly.
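For clarity on why the pgrep change matters: the pre-z11 check effectively greps the `ss` output for the agent's process name (truncated to 15 characters, hence "neutron-openvsw" above), which fails once the process reports itself as `/usr/bin/python`; the referenced upstream commit instead resolves PIDs with `pgrep -f`, which matches the full command line where the agent name still appears, and looks for those PIDs in the socket list. A simplified, hedged sketch of the two behaviours - not the exact tripleo-common `healthcheck_port` code:

```bash
#!/bin/bash
# Simplified illustration of the old vs. new healthcheck logic described
# in this bug; not the literal tripleo-common healthcheck code.

process="neutron-openvsw"   # 15-char comm name seen in the ss output above
port=5672                   # RabbitMQ

# Old behaviour: match the process NAME in the ss output. This breaks when
# the kernel reports the command as /usr/bin/python (as seen on z11).
if ss -ntp | grep ":${port}" | grep -q "${process}"; then
    echo "healthy (name match)"
fi

# New behaviour: resolve PIDs from the full command line with pgrep -f,
# then match pid=<PID> entries in the ss output instead of the name.
pids=$(pgrep -d '|' -f "${process}")
if [ -n "${pids}" ] && ss -ntp | grep ":${port}" | grep -qE "pid=(${pids}),"; then
    echo "healthy (pid match)"
else
    echo "unhealthy"
    exit 1
fi
```

The PID-based match keeps working against the z11 output shown above, because `pgrep -f` finds the agent via its command line (`/usr/bin/python ... neutron-openvswitch-agent ...`) even though `ss` now reports the process as `/usr/bin/python`.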