Bug 1723482
| Summary: | Octavia LBs stuck in PENDING_UPDATE state after compute nodes reboot (Nova port detach failure) |
|---|---|
| Product: | Red Hat OpenStack |
| Component: | openstack-octavia |
| Version: | 14.0 (Rocky) |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Reporter: | Bruna Bonguardo <bbonguar> |
| Assignee: | Michael Johnson <michjohn> |
| QA Contact: | Omer Schwartz <oschwart> |
| CC: | averi, batkisso, cgoncalves, igallagh, ihrachys, irichart, jveiraca, lpeer, majopela, mgarciac, mvalsecc, oschwart, scohen |
| Target Milestone: | z13 |
| Target Release: | 13.0 (Queens) |
| Keywords: | Triaged, ZStream |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | openstack-octavia-5.0.3-0.20200724164655.b3113b3.el7ost |
| Doc Type: | Bug Fix |
| Clones: | 1874927 (view as bug list) |
| Bug Blocks: | 1874927 |
| Last Closed: | 2020-10-28 18:34:51 UTC |
| Type: | Bug |

Doc Text:

Before this update, the Compute service (nova) did not release resources, such as network ports, until a Compute node was restored. As a result, failover in the Load-balancing service (octavia) failed when the service was unable to detach a network port from an instance on a Compute node that was down.

With this update, the failover flow in the Load-balancing service works around this Compute service issue. The Load-balancing service now abandons ports that the Compute service will not release, leaving them in a "pending delete" state for the Compute service or Networking service to clean up once the Compute node is restored. This resolves the issue, allowing failover to succeed even while the Compute node is still failed.
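To make the fix easier to follow, here is a rough sketch of the failover behavior the Doc Text describes: request the port detach, wait a bounded time, and if the Compute service never releases the port, abandon it and continue the failover on a fresh port. This is only an illustrative sketch, not the actual Octavia code; the helper callables and the timeout value are hypothetical stand-ins for the real nova/neutron client calls and configuration.

```python
# Hedged sketch of the failover workaround described above -- NOT the actual
# Octavia implementation. `request_port_detach`, `port_is_still_bound`, and
# `mark_port_pending_delete` are hypothetical stand-ins for the nova/neutron
# client calls the real code uses.
import time

PORT_DETACH_TIMEOUT = 300   # assumption: allow nova up to ~5 minutes, as seen in this bug
POLL_INTERVAL = 5


def detach_or_abandon_port(port_id, server_id,
                           request_port_detach,
                           port_is_still_bound,
                           mark_port_pending_delete):
    """Try to detach a VIP/VRRP port from a (possibly dead) amphora instance.

    Returns True if nova released the port, False if the port was abandoned
    so the failover can continue with a freshly created port.
    """
    request_port_detach(server_id, port_id)

    deadline = time.monotonic() + PORT_DETACH_TIMEOUT
    while time.monotonic() < deadline:
        if not port_is_still_bound(port_id):
            return True                      # nova released the port normally
        time.sleep(POLL_INTERVAL)

    # The compute host is down and nova never released the port: abandon it.
    # Nova/neutron will clean it up once the compute node comes back.
    mark_port_pending_delete(port_id)
    return False
```

The key point of the change is that a hung detach no longer aborts the failover; the stuck port is left in a pending-delete state for the Compute or Networking service to clean up once the Compute node returns.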
Description
Bruna Bonguardo
2019-06-24 15:39:53 UTC
Just adding more information: I noticed the compute_id listed for the amphorae is not the same as the compute IDs of compute-0 and compute-1. I guess it is because the compute nodes were recreated:

[2019-06-24 11:49:54] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list

| id | loadbalancer_id | status | role | lb_network_ip | ha_ip |
|---|---|---|---|---|---|
| 848ed1d0-d424-4f36-b7de-acecede5a95b | 789510be-eee5-4055-97c8-917e680b8e0e | ERROR | STANDALONE | 172.24.0.26 | 10.0.1.13 |
| 0b6d11a7-97ed-46cb-8825-2349be8715ee | b94d6658-4266-4051-9674-881d280ac6ea | ERROR | STANDALONE | 172.24.0.7 | 10.0.1.10 |
| 3fb72336-568f-46df-bfa3-907e90be55b5 | 56804bfb-aefa-4569-a2ab-54b8fdde7542 | ALLOCATED | STANDALONE | 172.24.0.22 | 10.0.1.6 |

[2019-06-24 11:50:06] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 848ed1d0-d424-4f36-b7de-acecede5a95b

| Field | Value |
|---|---|
| id | 848ed1d0-d424-4f36-b7de-acecede5a95b |
| loadbalancer_id | 789510be-eee5-4055-97c8-917e680b8e0e |
| compute_id | 1e91544f-eaa5-4504-9363-c1453bbd0ee0 |
| lb_network_ip | 172.24.0.26 |
| vrrp_ip | 10.0.1.7 |
| ha_ip | 10.0.1.13 |
| vrrp_port_id | ee1f6bb7-ff8c-445e-980d-9dc655686c8b |
| ha_port_id | d2b1fa6a-c6d5-4150-94ad-ef93c5837985 |
| cert_expiration | 2021-06-23T12:58:36 |
| cert_busy | False |
| role | STANDALONE |
| status | ERROR |
| vrrp_interface | None |
| vrrp_id | 1 |
| vrrp_priority | None |
| cached_zone | nova |
| created_at | 2019-06-24T12:58:36 |
| updated_at | 2019-06-24T13:09:52 |
| image_id | 95547e16-0770-4982-a04e-539cbce9f6f8 |

[2019-06-24 11:50:18] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 0b6d11a7-97ed-46cb-8825-2349be8715ee

| Field | Value |
|---|---|
| id | 0b6d11a7-97ed-46cb-8825-2349be8715ee |
| loadbalancer_id | b94d6658-4266-4051-9674-881d280ac6ea |
| compute_id | e8fde6ef-d4d4-40a0-a1cd-bfa272425c51 |
| lb_network_ip | 172.24.0.7 |
| vrrp_ip | 10.0.1.23 |
| ha_ip | 10.0.1.10 |
| vrrp_port_id | fe0ddc25-3fd8-4ccb-b131-6819de86f2ef |
| ha_port_id | 44fdd62a-665b-46b1-b233-edef45fddde1 |
| cert_expiration | 2021-06-23T13:00:58 |
| cert_busy | False |
| role | STANDALONE |
| status | ERROR |
| vrrp_interface | None |
| vrrp_id | 1 |
| vrrp_priority | None |
| cached_zone | nova |
| created_at | 2019-06-24T13:00:58 |
| updated_at | 2019-06-24T13:09:54 |
| image_id | 95547e16-0770-4982-a04e-539cbce9f6f8 |

[2019-06-24 11:50:31] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 3fb72336-568f-46df-bfa3-907e90be55b5

| Field | Value |
|---|---|
| id | 3fb72336-568f-46df-bfa3-907e90be55b5 |
| loadbalancer_id | 56804bfb-aefa-4569-a2ab-54b8fdde7542 |
| compute_id | bf07e809-25ce-4260-9b59-255e9f43411a |
| lb_network_ip | 172.24.0.22 |
| vrrp_ip | 10.0.1.16 |
| ha_ip | 10.0.1.6 |
| vrrp_port_id | 669f8919-4428-40d5-8f5e-31e5970b40a9 |
| ha_port_id | 604464a8-0809-4e64-a412-2c59aa0bae66 |
| cert_expiration | 2021-06-23T13:21:41 |
| cert_busy | False |
| role | STANDALONE |
| status | ALLOCATED |
| vrrp_interface | None |
| vrrp_id | 1 |
| vrrp_priority | None |
| cached_zone | nova |
| created_at | 2019-06-24T13:21:41 |
| updated_at | 2019-06-24T13:22:59 |
| image_id | 95547e16-0770-4982-a04e-539cbce9f6f8 |

[2019-06-24 11:50:47] (overcloud) [stack@undercloud-0 ~]$ . stackrc ; openstack server list

| ID | Name | Status | Networks | Image | Flavor |
|---|---|---|---|---|---|
| 4924dc98-f130-4688-9c39-399ad72e70ec | controller-0 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | controller |
| b2c7771c-712c-480e-a1a4-19e99bf4e54c | controller-2 | ACTIVE | ctlplane=192.168.24.8 | overcloud-full | controller |
| d269d56d-3cd5-48b7-a85d-3a0211d6a944 | controller-1 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| e6b1df9e-139b-45f9-8648-c647b8737f63 | compute-1 | ACTIVE | ctlplane=192.168.24.17 | overcloud-full | compute |
| c41285ee-a26d-43fe-b4c5-de2d246185a9 | compute-0 | ACTIVE | ctlplane=192.168.24.7 | overcloud-full | compute |

I can confirm this issue from the sos logs. The controller-1 log contains one of the failures.

Root cause: Neutron/nova failed to detach the port from the instance for up to five minutes after the detach request. This is caused by nova getting stuck while the compute host that contained the instance is down.

This has also been reported to the nova team as an issue: nova will not release port resources if the host is down: https://bugs.launchpad.net/nova/+bug/1827746
Nova has the same defect for volume detach, though that does not impact Octavia.

Upstream there is an open patch, currently with a -1, for this issue: https://review.opendev.org/#/c/585864/
This patch needs work and additional review.

There is also a secondary bug here: the failover process is not properly returning the load balancer object to the proper provisioning status of ERROR. I have opened an upstream story for this issue: https://storyboard.openstack.org/#!/story/2006051

Linked the upstream Octavia patch with a workaround for the nova issue. It is still WIP, but getting closer to being ready for upstream reviews.
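To illustrate the root cause described above (nova not releasing the port while its compute host is down), one can watch whether the amphora's VRRP port ever becomes unbound after the detach request. The following is a hedged sketch using the openstacksdk network API, not part of this bug's fix; the cloud name and the port ID (taken from the vrrp_port_id output above) are placeholders.

```python
# Hedged sketch: watch whether nova/neutron actually release a port after a
# detach request. The cloud name and port ID below are placeholders.
import time

import openstack

conn = openstack.connect(cloud='overcloud')       # assumes a clouds.yaml entry
port_id = 'ee1f6bb7-ff8c-445e-980d-9dc655686c8b'  # e.g. the vrrp_port_id shown above

deadline = time.monotonic() + 300                 # the ~5 minute window seen in this bug
while time.monotonic() < deadline:
    port = conn.network.get_port(port_id)
    print(port.status, port.device_id or '<unbound>')
    if not port.device_id:                        # the detach completed
        break
    time.sleep(10)
else:
    print('Port is still bound to the dead instance; nova never released it '
          '(see https://bugs.launchpad.net/nova/+bug/1827746)')
```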
*** Bug 1853893 has been marked as a duplicate of this bug. ***

The verification process involved these steps:
1) Deployed tripleo + octavia
2) Created an internal tenant network
3) Created 3 load balancers in the internal network
4) Increased the memory and vCPU count of the compute nodes (via virsh), one compute node at a time, first compute-0 and then compute-1.

A more detailed version of the steps:

```
(overcloud) [stack@undercloud-0 ~]$ cat /var/lib/rhos-release/latest-installed
13   -p 2020-09-16.1
```

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list

| id | name | project_id | vip_address | provisioning_status | provider |
|---|---|---|---|---|---|
| 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | test-lb1 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.193 | ACTIVE | amphora |
| effaaa1f-5792-4fdb-bf90-098a908692f1 | test-lb2 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.55 | ACTIVE | amphora |
| 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | test-lb3 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.37 | ACTIVE | amphora |

```
(overcloud) [stack@undercloud-0 ~]$ logout
Connection to undercloud-0 closed.

[root@titan89 ~]# virsh list
 Id   Name           State
------------------------------
 4    undercloud-0   running
 19   controller-0   running
 20   controller-2   running
 21   controller-1   running
 27   compute-1      running
 28   compute-0      running

[root@titan89 ~]# virsh dumpxml compute-0 | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-1 | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-0 | grep emo
  <memory unit='KiB'>12572672</memory>
  <currentMemory unit='KiB'>12572672</currentMemory>
[root@titan89 ~]# virsh dumpxml compute-1 | grep emo
  <memory unit='KiB'>12572672</memory>
  <currentMemory unit='KiB'>12572672</currentMemory>

[root@titan89 ~]# virsh shutdown compute-1
Domain compute-1 is being shutdown
[root@titan89 ~]# virsh edit compute-1
Domain compute-1 XML configuration edited.   <-- I doubled the memory and vcpu
[root@titan89 ~]# virsh create /etc/libvirt/qemu/compute-1.xml
Domain compute-1 created from /etc/libvirt/qemu/compute-1.xml

[root@titan89 ~]# virsh shutdown compute-0
Domain compute-0 is being shutdown
[root@titan89 ~]# virsh edit compute-0
Domain compute-0 XML configuration edited.   <-- I doubled the memory and vcpu
[root@titan89 ~]# virsh create /etc/libvirt/qemu/compute-0.xml
Domain compute-0 created from /etc/libvirt/qemu/compute-0.xml

[root@titan89 ~]# virsh list
 Id   Name           State
------------------------------
 4    undercloud-0   running
 19   controller-0   running
 20   controller-2   running
 21   controller-1   running
 29   compute-1      running
 30   compute-0      running

[root@titan89 ~]# virsh dumpxml compute-0 | grep cpu
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-1 | grep cpu
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-0 | grep emo
  <memory unit='KiB'>25145344</memory>
  <currentMemory unit='KiB'>25145344</currentMemory>
[root@titan89 ~]# virsh dumpxml compute-1 | grep emo
  <memory unit='KiB'>25145344</memory>
  <currentMemory unit='KiB'>25145344</currentMemory>

[root@titan89 ~]# ssh stack@undercloud-0
Warning: Permanently added 'undercloud-0' (ECDSA) to the list of known hosts.
Last login: Sun Oct 4 03:56:12 2020 from 172.16.0.1
[stack@undercloud-0 ~]$ . overcloudrc
```

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list

| id | name | project_id | vip_address | provisioning_status | provider |
|---|---|---|---|---|---|
| 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | test-lb1 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.193 | ACTIVE | amphora |
| effaaa1f-5792-4fdb-bf90-098a908692f1 | test-lb2 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.55 | ACTIVE | amphora |
| 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | test-lb3 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.37 | ACTIVE | amphora |

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list

| id | loadbalancer_id | status | role | lb_network_ip | ha_ip |
|---|---|---|---|---|---|
| 8e72cada-0c42-47cb-b5d7-c2f41526eb79 | 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | ALLOCATED | STANDALONE | 172.24.1.63 | 192.168.1.193 |
| 58f36bd2-58c4-47a7-88a9-985d78c74a55 | effaaa1f-5792-4fdb-bf90-098a908692f1 | ALLOCATED | STANDALONE | 172.24.0.53 | 192.168.1.55 |
| 883b52fb-705a-4dae-8b20-111f4965c8ff | 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | ALLOCATED | STANDALONE | 172.24.0.219 | 192.168.1.37 |

The provisioning_status of all 3 LBs is ACTIVE. The status of all 3 amphorae is ALLOCATED. Looks good to me.
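For reference, the same check (all load balancers ACTIVE, all amphorae ALLOCATED) could also be scripted. Below is a hedged sketch that assumes an openstacksdk clouds.yaml entry named 'overcloud' and the SDK's load_balancer proxy methods load_balancers() and amphorae(); it is not part of the documented verification procedure.

```python
# Hedged sketch of the verification check above: every load balancer should be
# ACTIVE and every amphora ALLOCATED after the compute nodes were rebooted.
import openstack

conn = openstack.connect(cloud='overcloud')   # assumption: clouds.yaml entry name

failures = []
for lb in conn.load_balancer.load_balancers():
    if lb.provisioning_status != 'ACTIVE':
        failures.append((lb.name, lb.provisioning_status))
for amp in conn.load_balancer.amphorae():
    if amp.status != 'ALLOCATED':
        failures.append((amp.id, amp.status))

print('PASS' if not failures else 'FAIL: %s' % (failures,))
```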
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (octavia-train bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4400