Description of problem:
Octavia load balancers are stuck in PENDING_UPDATE state after the compute nodes reboot.

Version-Release number of selected component (if applicable):
[2019-06-24 11:24:52] (overcloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
14 -p 2019-06-19.2
[2019-06-24 11:30:05] (overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep octavia
python2-octaviaclient-1.6.0-0.20180816134808.64d007f.el7ost.noarch
puppet-octavia-13.3.2-0.20190420064721.29482dd.el7ost.noarch
octavia-amphora-image-x86_64-14.0-20190617.1.el7ost.noarch

How reproducible:
Unclear

Steps to Reproduce:
1) Deploy tripleo + octavia
2) Create internal tenant network
3) Create 3 load balancers in internal network
4) Increase memory and vcpu number of compute nodes (virsh), one compute node at a time: first compute-0 and then compute-1.

Actual results:
One of the 3 LBs is reported as ACTIVE and the other two are stuck at PENDING_UPDATE. Two of the amphorae are in ERROR state.

Expected results:
All Load Balancers and Amphorae are ACTIVE and ONLINE.

Steps:
[root@titan10 ~]# virsh shutdown compute-1
Domain compute-1 is being shutdown

[root@titan10 ~]# virsh edit compute-1
Domain compute-1 XML configuration edited.    <------ Added more memory and more vcpus.

[root@titan10 ~]# virsh create /etc/libvirt/qemu/compute-1.xml
Domain compute-1 created from /etc/libvirt/qemu/compute-1.xml

[root@titan10 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     undercloud-0                   running
 20    controller-2                   running
 22    controller-1                   running
 24    controller-0                   running
 25    compute-0                      running
 26    compute-1                      running

[root@titan10 ~]# ssh root@undercloud-0

[2019-06-24 09:34:49] (tester) [stack@undercloud-0 ~]$ openstack loadbalancer listener create --name listenerHTTP-one one --protocol HTTP --protocol-port 80
Load Balancer 789510be-eee5-4055-97c8-917e680b8e0e is immutable and cannot be updated. (HTTP 409) (Request-ID: req-19be68a2-9069-4d7d-b53f-2cbd8271475e)

[2019-06-24 09:35:25] (tester) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+
| id                                   | name  | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+
| 789510be-eee5-4055-97c8-917e680b8e0e | one   | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.13   | PENDING_UPDATE      | amphora  |
| 56804bfb-aefa-4569-a2ab-54b8fdde7542 | two   | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.6    | ACTIVE              | amphora  |
| b94d6658-4266-4051-9674-881d280ac6ea | three | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.10   | PENDING_UPDATE      | amphora  |
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+

[2019-06-24 09:47:55] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip     |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| 848ed1d0-d424-4f36-b7de-acecede5a95b | 789510be-eee5-4055-97c8-917e680b8e0e | ERROR     | STANDALONE | 172.24.0.26   | 10.0.1.13 |
| 0b6d11a7-97ed-46cb-8825-2349be8715ee | b94d6658-4266-4051-9674-881d280ac6ea | ERROR     | STANDALONE | 172.24.0.7    | 10.0.1.10 |
| 3fb72336-568f-46df-bfa3-907e90be55b5 | 56804bfb-aefa-4569-a2ab-54b8fdde7542 | ALLOCATED | STANDALONE | 172.24.0.22   | 10.0.1.6  |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+

[2019-06-24 09:51:39] (overcloud) [stack@undercloud-0 ~]$ openstack server list --all
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+
| ID                                   | Name                                         | Status  | Networks                                                                | Image                                  | Flavor        |
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+
| bf07e809-25ce-4260-9b59-255e9f43411a | amphora-3fb72336-568f-46df-bfa3-907e90be55b5 | ACTIVE  | lb-mgmt-net=172.24.0.22; int_net_1=2001::f816:3eff:fea1:a23f, 10.0.1.16 | octavia-amphora-14.0-20190617.1.x86_64 | octavia_65    |
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+

Am I maybe missing something? Should I have failed over the LBs before the compute node shutdown?
What is the best practice for compute node manipulation with Octavia?

Thank you
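
On the HTTP 409 in the transcript above: Octavia treats a load balancer as immutable while its provisioning_status is one of the PENDING_* states, which is why the listener create against "one" was rejected. The following is a minimal sketch, assuming the ASCII table layout that `openstack loadbalancer list` printed in this report, of how the affected load balancers could be picked out of a captured listing. The function names and the SAMPLE constant are illustrative only and are not part of any OpenStack client.

```python
def parse_table(text):
    """Parse an ASCII table (as printed in this report) into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines, prompts, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data-like line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def immutable_load_balancers(rows):
    """Return names of load balancers whose provisioning_status is PENDING_*."""
    return [r["name"] for r in rows
            if r["provisioning_status"].startswith("PENDING_")]


# Trimmed copy of the listing from this report (project_id column omitted).
SAMPLE = """
+--------------------------------------+-------+-------------+---------------------+
| id                                   | name  | vip_address | provisioning_status |
+--------------------------------------+-------+-------------+---------------------+
| 789510be-eee5-4055-97c8-917e680b8e0e | one   | 10.0.1.13   | PENDING_UPDATE      |
| 56804bfb-aefa-4569-a2ab-54b8fdde7542 | two   | 10.0.1.6    | ACTIVE              |
| b94d6658-4266-4051-9674-881d280ac6ea | three | 10.0.1.10   | PENDING_UPDATE      |
+--------------------------------------+-------+-------------+---------------------+
"""

if __name__ == "__main__":
    print(immutable_load_balancers(parse_table(SAMPLE)))
```

Parsing the printed table keeps the sketch independent of any client library; in practice a machine-readable output format would be less fragile than the table text.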
Just adding more information: I noticed that the compute_id listed for the amphorae is not the same as the IDs of compute-0 and compute-1. I guess it is because the compute nodes were recreated:

[2019-06-24 11:49:54] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip     |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| 848ed1d0-d424-4f36-b7de-acecede5a95b | 789510be-eee5-4055-97c8-917e680b8e0e | ERROR     | STANDALONE | 172.24.0.26   | 10.0.1.13 |
| 0b6d11a7-97ed-46cb-8825-2349be8715ee | b94d6658-4266-4051-9674-881d280ac6ea | ERROR     | STANDALONE | 172.24.0.7    | 10.0.1.10 |
| 3fb72336-568f-46df-bfa3-907e90be55b5 | 56804bfb-aefa-4569-a2ab-54b8fdde7542 | ALLOCATED | STANDALONE | 172.24.0.22   | 10.0.1.6  |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+

[2019-06-24 11:50:06] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 848ed1d0-d424-4f36-b7de-acecede5a95b
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| id              | 848ed1d0-d424-4f36-b7de-acecede5a95b |
| loadbalancer_id | 789510be-eee5-4055-97c8-917e680b8e0e |
| compute_id      | 1e91544f-eaa5-4504-9363-c1453bbd0ee0 |
| lb_network_ip   | 172.24.0.26                          |
| vrrp_ip         | 10.0.1.7                             |
| ha_ip           | 10.0.1.13                            |
| vrrp_port_id    | ee1f6bb7-ff8c-445e-980d-9dc655686c8b |
| ha_port_id      | d2b1fa6a-c6d5-4150-94ad-ef93c5837985 |
| cert_expiration | 2021-06-23T12:58:36                  |
| cert_busy       | False                                |
| role            | STANDALONE                           |
| status          | ERROR                                |
| vrrp_interface  | None                                 |
| vrrp_id         | 1                                    |
| vrrp_priority   | None                                 |
| cached_zone     | nova                                 |
| created_at      | 2019-06-24T12:58:36                  |
| updated_at      | 2019-06-24T13:09:52                  |
| image_id        | 95547e16-0770-4982-a04e-539cbce9f6f8 |
+-----------------+--------------------------------------+

[2019-06-24 11:50:18] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 0b6d11a7-97ed-46cb-8825-2349be8715ee
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| id              | 0b6d11a7-97ed-46cb-8825-2349be8715ee |
| loadbalancer_id | b94d6658-4266-4051-9674-881d280ac6ea |
| compute_id      | e8fde6ef-d4d4-40a0-a1cd-bfa272425c51 |
| lb_network_ip   | 172.24.0.7                           |
| vrrp_ip         | 10.0.1.23                            |
| ha_ip           | 10.0.1.10                            |
| vrrp_port_id    | fe0ddc25-3fd8-4ccb-b131-6819de86f2ef |
| ha_port_id      | 44fdd62a-665b-46b1-b233-edef45fddde1 |
| cert_expiration | 2021-06-23T13:00:58                  |
| cert_busy       | False                                |
| role            | STANDALONE                           |
| status          | ERROR                                |
| vrrp_interface  | None                                 |
| vrrp_id         | 1                                    |
| vrrp_priority   | None                                 |
| cached_zone     | nova                                 |
| created_at      | 2019-06-24T13:00:58                  |
| updated_at      | 2019-06-24T13:09:54                  |
| image_id        | 95547e16-0770-4982-a04e-539cbce9f6f8 |
+-----------------+--------------------------------------+

[2019-06-24 11:50:31] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 3fb72336-568f-46df-bfa3-907e90be55b5
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| id              | 3fb72336-568f-46df-bfa3-907e90be55b5 |
| loadbalancer_id | 56804bfb-aefa-4569-a2ab-54b8fdde7542 |
| compute_id      | bf07e809-25ce-4260-9b59-255e9f43411a |
| lb_network_ip   | 172.24.0.22                          |
| vrrp_ip         | 10.0.1.16                            |
| ha_ip           | 10.0.1.6                             |
| vrrp_port_id    | 669f8919-4428-40d5-8f5e-31e5970b40a9 |
| ha_port_id      | 604464a8-0809-4e64-a412-2c59aa0bae66 |
| cert_expiration | 2021-06-23T13:21:41                  |
| cert_busy       | False                                |
| role            | STANDALONE                           |
| status          | ALLOCATED                            |
| vrrp_interface  | None                                 |
| vrrp_id         | 1                                    |
| vrrp_priority   | None                                 |
| cached_zone     | nova                                 |
| created_at      | 2019-06-24T13:21:41                  |
| updated_at      | 2019-06-24T13:22:59                  |
| image_id        | 95547e16-0770-4982-a04e-539cbce9f6f8 |
+-----------------+--------------------------------------+

[2019-06-24 11:50:47] (overcloud) [stack@undercloud-0 ~]$ . stackrc ; openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 4924dc98-f130-4688-9c39-399ad72e70ec | controller-0 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | controller |
| b2c7771c-712c-480e-a1a4-19e99bf4e54c | controller-2 | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | controller |
| d269d56d-3cd5-48b7-a85d-3a0211d6a944 | controller-1 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| e6b1df9e-139b-45f9-8648-c647b8737f63 | compute-1    | ACTIVE | ctlplane=192.168.24.17 | overcloud-full | compute    |
| c41285ee-a26d-43fe-b4c5-de2d246185a9 | compute-0    | ACTIVE | ctlplane=192.168.24.7  | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
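
A note on the comparison above: an amphora's compute_id is the Nova instance ID of the amphora VM in the overcloud, not the undercloud ID of the compute node hosting it, so it is not expected to match compute-0 or compute-1. In the output captured in this report, bf07e809-25ce-4260-9b59-255e9f43411a appears both as the compute_id of amphora 3fb72336-... and as the only server returned by the overcloud `openstack server list --all`. A minimal sketch of that cross-check, using only IDs copied from this report (the function and constant names are illustrative):

```python
# Amphora ID -> compute_id, copied from the `amphora show` output in this report.
AMPHORA_COMPUTE_IDS = {
    "848ed1d0-d424-4f36-b7de-acecede5a95b": "1e91544f-eaa5-4504-9363-c1453bbd0ee0",
    "0b6d11a7-97ed-46cb-8825-2349be8715ee": "e8fde6ef-d4d4-40a0-a1cd-bfa272425c51",
    "3fb72336-568f-46df-bfa3-907e90be55b5": "bf07e809-25ce-4260-9b59-255e9f43411a",
}

# Server IDs from the overcloud `openstack server list --all` output in this report.
OVERCLOUD_SERVER_IDS = {"bf07e809-25ce-4260-9b59-255e9f43411a"}


def amphorae_without_instance(amphora_compute_ids, server_ids):
    """Return amphora IDs whose compute_id matches no existing server ID."""
    return [amphora_id
            for amphora_id, compute_id in amphora_compute_ids.items()
            if compute_id not in server_ids]


if __name__ == "__main__":
    for amphora_id in amphorae_without_instance(AMPHORA_COMPUTE_IDS,
                                                OVERCLOUD_SERVER_IDS):
        print("no matching server for amphora", amphora_id)
```

With the data from this report, the two amphorae in ERROR are the ones whose compute_id has no matching server in the captured listing.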
I can confirm this issue from the sos logs. The controller-1 log contains one of the failures.

Root cause: Neutron/nova failed to detach the port from the instance for up to five minutes after the detach request. Nova gets stuck on the detach while the compute host that held the instance is down.

This has also been reported to the nova team, since nova will not release port resources if the host is down:
https://bugs.launchpad.net/nova/+bug/1827746
Nova has the same defect for volume detach, though that does not impact Octavia.

Upstream there is an open patch for this issue, currently with a -1:
https://review.opendev.org/#/c/585864/
This patch needs work and additional review.

There is also a secondary bug here: the failover process does not return the load balancer object to the proper provisioning status of ERROR. I have opened an upstream story for this issue:
https://storyboard.openstack.org/#!/story/2006051
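
To illustrate the secondary bug, here is a minimal sketch of the intended behaviour: wait a bounded time for the port detach, and if it never completes, leave the load balancer in ERROR rather than in PENDING_UPDATE. This is not Octavia's code. Every name in it (wait_for_detach, failover, DetachTimeout, the dict standing in for the load balancer record) is hypothetical, and the 300-second default only mirrors the "up to five minutes" mentioned above.

```python
import time


class DetachTimeout(Exception):
    """Raised when the port is still attached after the timeout (hypothetical)."""


def wait_for_detach(is_detached, timeout=300.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll is_detached() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if is_detached():
            return
        if clock() >= deadline:
            raise DetachTimeout("port still attached after %.0f s" % timeout)
        sleep(interval)


def failover(load_balancer, is_detached, **wait_kwargs):
    """Run a (simulated) failover and always leave a terminal status behind."""
    load_balancer["provisioning_status"] = "PENDING_UPDATE"
    try:
        wait_for_detach(is_detached, **wait_kwargs)
    except DetachTimeout:
        # The point of the sketch: a failed failover ends in ERROR,
        # never stuck in PENDING_UPDATE.
        load_balancer["provisioning_status"] = "ERROR"
        return False
    load_balancer["provisioning_status"] = "ACTIVE"
    return True


if __name__ == "__main__":
    lb = {"name": "one", "provisioning_status": "ACTIVE"}
    # Simulate a detach that never completes, with a zero timeout for brevity.
    failover(lb, lambda: False, timeout=0.0, interval=0.0)
    print(lb["provisioning_status"])
```

The sleep and clock parameters are injectable only so that the timeout path can be exercised without actually waiting.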
Linked the upstream Octavia patch with a workaround for the nova issue. It is still WIP, but getting closer to being ready for upstream reviews.
*** Bug 1853893 has been marked as a duplicate of this bug. ***
Verified with the following steps:
1) Deployed tripleo + octavia
2) Created internal tenant network
3) Created 3 load balancers in internal network
4) Increased memory and vcpu number of compute nodes (virsh), one compute node at a time.

A more detailed version of the steps:

(overcloud) [stack@undercloud-0 ~]$ cat /var/lib/rhos-release/latest-installed
13 -p 2020-09-16.1

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
| id                                   | name     | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
| 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | test-lb1 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.193 | ACTIVE              | amphora  |
| effaaa1f-5792-4fdb-bf90-098a908692f1 | test-lb2 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.55  | ACTIVE              | amphora  |
| 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | test-lb3 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.37  | ACTIVE              | amphora  |
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@undercloud-0 ~]$ logout
Connection to undercloud-0 closed.

[root@titan89 ~]# virsh list
 Id   Name           State
------------------------------
 4    undercloud-0   running
 19   controller-0   running
 20   controller-2   running
 21   controller-1   running
 27   compute-1      running
 28   compute-0      running

[root@titan89 ~]# virsh dumpxml compute-0 | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-1 | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-0 | grep emo
  <memory unit='KiB'>12572672</memory>
  <currentMemory unit='KiB'>12572672</currentMemory>
[root@titan89 ~]# virsh dumpxml compute-1 | grep emo
  <memory unit='KiB'>12572672</memory>
  <currentMemory unit='KiB'>12572672</currentMemory>

[root@titan89 ~]# virsh shutdown compute-1
Domain compute-1 is being shutdown

[root@titan89 ~]# virsh edit compute-1
Domain compute-1 XML configuration edited.    <-- I doubled the memory and vcpu

[root@titan89 ~]# virsh create /etc/libvirt/qemu/compute-1.xml
Domain compute-1 created from /etc/libvirt/qemu/compute-1.xml

[root@titan89 ~]# virsh shutdown compute-0
Domain compute-0 is being shutdown

[root@titan89 ~]# virsh edit compute-0
Domain compute-0 XML configuration edited.    <-- I doubled the memory and vcpu

[root@titan89 ~]# virsh create /etc/libvirt/qemu/compute-0.xml
Domain compute-0 created from /etc/libvirt/qemu/compute-0.xml

[root@titan89 ~]# virsh list
 Id   Name           State
------------------------------
 4    undercloud-0   running
 19   controller-0   running
 20   controller-2   running
 21   controller-1   running
 29   compute-1      running
 30   compute-0      running

[root@titan89 ~]# virsh dumpxml compute-0 | grep cpu
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-1 | grep cpu
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-0 | grep emo
  <memory unit='KiB'>25145344</memory>
  <currentMemory unit='KiB'>25145344</currentMemory>
[root@titan89 ~]# virsh dumpxml compute-1 | grep emo
  <memory unit='KiB'>25145344</memory>
  <currentMemory unit='KiB'>25145344</currentMemory>

[root@titan89 ~]# ssh stack@undercloud-0
Warning: Permanently added 'undercloud-0' (ECDSA) to the list of known hosts.
Last login: Sun Oct 4 03:56:12 2020 from 172.16.0.1
[stack@undercloud-0 ~]$ . overcloudrc
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
| id                                   | name     | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
| 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | test-lb1 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.193 | ACTIVE              | amphora  |
| effaaa1f-5792-4fdb-bf90-098a908692f1 | test-lb2 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.55  | ACTIVE              | amphora  |
| 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | test-lb3 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.37  | ACTIVE              | amphora  |
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| 8e72cada-0c42-47cb-b5d7-c2f41526eb79 | 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | ALLOCATED | STANDALONE | 172.24.1.63   | 192.168.1.193 |
| 58f36bd2-58c4-47a7-88a9-985d78c74a55 | effaaa1f-5792-4fdb-bf90-098a908692f1 | ALLOCATED | STANDALONE | 172.24.0.53   | 192.168.1.55  |
| 883b52fb-705a-4dae-8b20-111f4965c8ff | 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | ALLOCATED | STANDALONE | 172.24.0.219  | 192.168.1.37  |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+

The provisioning_status of all 3 LBs is ACTIVE.
The status of all 3 amphorae is ALLOCATED.
Looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (octavia-train bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4400