Bug 1723482 - Octavia LBs stuck in PENDING_UPDATE state after compute nodes reboot (Nova port detach failure)
Summary: Octavia LBs stuck in PENDING_UPDATE state after compute nodes reboot (Nova port detach failure)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z13
Target Release: 13.0 (Queens)
Assignee: Michael Johnson
QA Contact: Omer Schwartz
URL:
Whiteboard:
Duplicates: 1853893
Depends On:
Blocks: 1874927
 
Reported: 2019-06-24 15:39 UTC by Bruna Bonguardo
Modified: 2024-10-01 16:17 UTC
CC List: 13 users

Fixed In Version: openstack-octavia-5.0.3-0.20200724164655.b3113b3.el7ost
Doc Type: Bug Fix
Doc Text:
Before this update, the Compute (nova) service did not release resources, such as network ports, until a failed Compute node was restored. As a result, a Load-balancing service (octavia) failover failed when it was unable to detach a network port from an instance on a Compute node that was down. With this update, the failover flow in the Load-balancing service works around this Compute service issue: the Load-balancing service now abandons ports that the Compute service will not release, leaving them in a "pending delete" state for the Compute service or Networking service to clean up once the Compute node is restored. Failover now succeeds even if the Compute node is still down.
Clone Of:
Clones: 1874927
Environment:
Last Closed: 2020-10-28 18:34:51 UTC
Target Upstream Version:
Embargoed:




Links
OpenStack Storyboard 2003084 (2019-06-26 15:48:42 UTC)
OpenStack Storyboard 2006051 (2019-06-26 15:48:42 UTC)
OpenStack gerrit 739772 - MERGED - Refactor the failover flows (2021-02-14 23:42:22 UTC)
Red Hat Bugzilla 1725189 - high, CLOSED - Port detach fails when compute host is unreachable (2024-03-25 15:19:47 UTC)
Red Hat Issue Tracker OSP-6590 (2021-11-17 18:08:29 UTC)
Red Hat Product Errata RHEA-2020:4400 (2020-10-28 18:35:20 UTC)

Description Bruna Bonguardo 2019-06-24 15:39:53 UTC
Description of problem:
Octavia Load Balancers stuck in PENDING_UPDATE state after compute nodes reboot.

Version-Release number of selected component (if applicable):
[2019-06-24 11:24:52] (overcloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
14  -p 2019-06-19.2


[2019-06-24 11:30:05] (overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep octavia
python2-octaviaclient-1.6.0-0.20180816134808.64d007f.el7ost.noarch
puppet-octavia-13.3.2-0.20190420064721.29482dd.el7ost.noarch
octavia-amphora-image-x86_64-14.0-20190617.1.el7ost.noarch

How reproducible:
Unclear

Steps to Reproduce:
1) Deploy tripleo + octavia
2) Create an internal tenant network
3) Create 3 load balancers in the internal network (an example CLI sequence is shown after this list)
4) Increase the memory and vCPU count of the compute nodes with virsh, one compute node at a time: first compute-0, then compute-1.
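
For reference, the load balancers in step 3 can be created with the standard Octavia CLI. This is only a sketch: the network name matches the int_net_1 network visible in the server list below, but the subnet name and CIDR are illustrative assumptions.

(overcloud) [stack@undercloud-0 ~]$ openstack network create int_net_1
(overcloud) [stack@undercloud-0 ~]$ openstack subnet create int_subnet_1 --network int_net_1 --subnet-range 10.0.1.0/24
(overcloud) [stack@undercloud-0 ~]$ # one standalone load balancer per name; each takes a few minutes to reach ACTIVE
(overcloud) [stack@undercloud-0 ~]$ for name in one two three; do openstack loadbalancer create --name $name --vip-subnet-id int_subnet_1; done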


Actual results:
One of the 3 LBs reports ACTIVE and the other two are stuck in PENDING_UPDATE. Two of the amphorae are in ERROR state.


Expected results:
All Load Balancers and Amphorae are ACTIVE and ONLINE.

Steps:
[root@titan10 ~]# virsh shutdown compute-1
Domain compute-1 is being shutdown

[root@titan10 ~]# virsh edit compute-1
Domain compute-1 XML configuration edited. <------ Added more memory and more vcpus.

[root@titan10 ~]# virsh create /etc/libvirt/qemu/compute-1.xml
Domain compute-1 created from /etc/libvirt/qemu/compute-1.xml


[root@titan10 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     undercloud-0                   running
 20    controller-2                   running
 22    controller-1                   running
 24    controller-0                   running
 25    compute-0                      running
 26    compute-1                      running

[root@titan10 ~]# ssh root@undercloud-0

[2019-06-24 09:34:49] (tester) [stack@undercloud-0 ~]$ openstack loadbalancer listener create --name listenerHTTP-one one --protocol HTTP --protocol-port 80
Load Balancer 789510be-eee5-4055-97c8-917e680b8e0e is immutable and cannot be updated. (HTTP 409) (Request-ID: req-19be68a2-9069-4d7d-b53f-2cbd8271475e)

[2019-06-24 09:35:25] (tester) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+
| id                                   | name  | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+
| 789510be-eee5-4055-97c8-917e680b8e0e | one   | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.13   | PENDING_UPDATE      | amphora  |
| 56804bfb-aefa-4569-a2ab-54b8fdde7542 | two   | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.6    | ACTIVE              | amphora  |
| b94d6658-4266-4051-9674-881d280ac6ea | three | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.10   | PENDING_UPDATE      | amphora  |
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+

[2019-06-24 09:47:55] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip     |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| 848ed1d0-d424-4f36-b7de-acecede5a95b | 789510be-eee5-4055-97c8-917e680b8e0e | ERROR     | STANDALONE | 172.24.0.26   | 10.0.1.13 |
| 0b6d11a7-97ed-46cb-8825-2349be8715ee | b94d6658-4266-4051-9674-881d280ac6ea | ERROR     | STANDALONE | 172.24.0.7    | 10.0.1.10 |
| 3fb72336-568f-46df-bfa3-907e90be55b5 | 56804bfb-aefa-4569-a2ab-54b8fdde7542 | ALLOCATED | STANDALONE | 172.24.0.22   | 10.0.1.6  |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+

[2019-06-24 09:51:39] (overcloud) [stack@undercloud-0 ~]$ openstack server list --all
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+
| ID                                   | Name                                         | Status  | Networks                                                                | Image                                  | Flavor        |
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+
| bf07e809-25ce-4260-9b59-255e9f43411a | amphora-3fb72336-568f-46df-bfa3-907e90be55b5 | ACTIVE  | lb-mgmt-net=172.24.0.22; int_net_1=2001::f816:3eff:fea1:a23f, 10.0.1.16 | octavia-amphora-14.0-20190617.1.x86_64 | octavia_65    |
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+


Am I missing something? Should I have failed over the LBs before the compute node shutdown? What is the best practice for compute node maintenance with Octavia?

Thank you

Comment 2 Bruna Bonguardo 2019-06-24 16:03:04 UTC
Just adding more information: I noticed the compute_id listed for the amphorae is not the same as the IDs of compute-0 and compute-1. I guess that is because the compute nodes were recreated:


[2019-06-24 11:49:54] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip     |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| 848ed1d0-d424-4f36-b7de-acecede5a95b | 789510be-eee5-4055-97c8-917e680b8e0e | ERROR     | STANDALONE | 172.24.0.26   | 10.0.1.13 |
| 0b6d11a7-97ed-46cb-8825-2349be8715ee | b94d6658-4266-4051-9674-881d280ac6ea | ERROR     | STANDALONE | 172.24.0.7    | 10.0.1.10 |
| 3fb72336-568f-46df-bfa3-907e90be55b5 | 56804bfb-aefa-4569-a2ab-54b8fdde7542 | ALLOCATED | STANDALONE | 172.24.0.22   | 10.0.1.6  |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
[2019-06-24 11:50:06] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 848ed1d0-d424-4f36-b7de-acecede5a95b
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| id              | 848ed1d0-d424-4f36-b7de-acecede5a95b |
| loadbalancer_id | 789510be-eee5-4055-97c8-917e680b8e0e |
| compute_id      | 1e91544f-eaa5-4504-9363-c1453bbd0ee0 |
| lb_network_ip   | 172.24.0.26                          |
| vrrp_ip         | 10.0.1.7                             |
| ha_ip           | 10.0.1.13                            |
| vrrp_port_id    | ee1f6bb7-ff8c-445e-980d-9dc655686c8b |
| ha_port_id      | d2b1fa6a-c6d5-4150-94ad-ef93c5837985 |
| cert_expiration | 2021-06-23T12:58:36                  |
| cert_busy       | False                                |
| role            | STANDALONE                           |
| status          | ERROR                                |
| vrrp_interface  | None                                 |
| vrrp_id         | 1                                    |
| vrrp_priority   | None                                 |
| cached_zone     | nova                                 |
| created_at      | 2019-06-24T12:58:36                  |
| updated_at      | 2019-06-24T13:09:52                  |
| image_id        | 95547e16-0770-4982-a04e-539cbce9f6f8 |
+-----------------+--------------------------------------+
[2019-06-24 11:50:18] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 0b6d11a7-97ed-46cb-8825-2349be8715ee
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| id              | 0b6d11a7-97ed-46cb-8825-2349be8715ee |
| loadbalancer_id | b94d6658-4266-4051-9674-881d280ac6ea |
| compute_id      | e8fde6ef-d4d4-40a0-a1cd-bfa272425c51 |
| lb_network_ip   | 172.24.0.7                           |
| vrrp_ip         | 10.0.1.23                            |
| ha_ip           | 10.0.1.10                            |
| vrrp_port_id    | fe0ddc25-3fd8-4ccb-b131-6819de86f2ef |
| ha_port_id      | 44fdd62a-665b-46b1-b233-edef45fddde1 |
| cert_expiration | 2021-06-23T13:00:58                  |
| cert_busy       | False                                |
| role            | STANDALONE                           |
| status          | ERROR                                |
| vrrp_interface  | None                                 |
| vrrp_id         | 1                                    |
| vrrp_priority   | None                                 |
| cached_zone     | nova                                 |
| created_at      | 2019-06-24T13:00:58                  |
| updated_at      | 2019-06-24T13:09:54                  |
| image_id        | 95547e16-0770-4982-a04e-539cbce9f6f8 |
+-----------------+--------------------------------------+
[2019-06-24 11:50:31] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora show 3fb72336-568f-46df-bfa3-907e90be55b5
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| id              | 3fb72336-568f-46df-bfa3-907e90be55b5 |
| loadbalancer_id | 56804bfb-aefa-4569-a2ab-54b8fdde7542 |
| compute_id      | bf07e809-25ce-4260-9b59-255e9f43411a |
| lb_network_ip   | 172.24.0.22                          |
| vrrp_ip         | 10.0.1.16                            |
| ha_ip           | 10.0.1.6                             |
| vrrp_port_id    | 669f8919-4428-40d5-8f5e-31e5970b40a9 |
| ha_port_id      | 604464a8-0809-4e64-a412-2c59aa0bae66 |
| cert_expiration | 2021-06-23T13:21:41                  |
| cert_busy       | False                                |
| role            | STANDALONE                           |
| status          | ALLOCATED                            |
| vrrp_interface  | None                                 |
| vrrp_id         | 1                                    |
| vrrp_priority   | None                                 |
| cached_zone     | nova                                 |
| created_at      | 2019-06-24T13:21:41                  |
| updated_at      | 2019-06-24T13:22:59                  |
| image_id        | 95547e16-0770-4982-a04e-539cbce9f6f8 |
+-----------------+--------------------------------------+
[2019-06-24 11:50:47] (overcloud) [stack@undercloud-0 ~]$ . stackrc ; openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 4924dc98-f130-4688-9c39-399ad72e70ec | controller-0 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | controller |
| b2c7771c-712c-480e-a1a4-19e99bf4e54c | controller-2 | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | controller |
| d269d56d-3cd5-48b7-a85d-3a0211d6a944 | controller-1 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| e6b1df9e-139b-45f9-8648-c647b8737f63 | compute-1    | ACTIVE | ctlplane=192.168.24.17 | overcloud-full | compute    |
| c41285ee-a26d-43fe-b4c5-de2d246185a9 | compute-0    | ACTIVE | ctlplane=192.168.24.7  | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

Comment 3 Michael Johnson 2019-06-26 15:32:25 UTC
I can confirm this issue from the sos logs. The controller-1 log contains one of the failures.

Root cause: Neutron/Nova failed to detach the port from the instance for up to five minutes after the detach request. This is caused by Nova getting stuck while the compute host that contains the instance is down. This has also been reported to the Nova team as an issue: Nova will not release port resources if the host is down: https://bugs.launchpad.net/nova/+bug/1827746
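
As an illustration of how the symptom can be observed from the CLI (commands only, not taken from the sos report; the port ID is the vrrp_port_id from comment 2):

(overcloud) [stack@undercloud-0 ~]$ # confirm the nova-compute service on the rebooted host is reported down
(overcloud) [stack@undercloud-0 ~]$ openstack compute service list --service nova-compute
(overcloud) [stack@undercloud-0 ~]$ # while the host is down, the port stays bound to the old amphora instance
(overcloud) [stack@undercloud-0 ~]$ openstack port show ee1f6bb7-ff8c-445e-980d-9dc655686c8b -c status -c device_id -c binding_host_id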

Nova has the same defect for volume detach, though that does not impact Octavia.

Upstream there is an open patch with a -1 for this issue:
https://review.opendev.org/#/c/585864/
This patch needs work and additional review.


There is also a secondary bug here: the failover process does not properly return the load balancer object to the ERROR provisioning status.
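
Until that is fixed, the usual manual recovery is to reset the stuck record and retry the failover. This is only a sketch, assuming direct access to the Octavia database on a controller; the load balancer ID is the stuck one from the report above.

[root@controller-0 ~]# # move the stuck load balancer out of PENDING_UPDATE so it can be operated on again
[root@controller-0 ~]# mysql octavia -e "UPDATE load_balancer SET provisioning_status='ERROR' WHERE id='789510be-eee5-4055-97c8-917e680b8e0e';"
(overcloud) [stack@undercloud-0 ~]$ # then trigger a failover so Octavia rebuilds the amphora
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer failover 789510be-eee5-4055-97c8-917e680b8e0e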

I have opened an upstream story for this issue: https://storyboard.openstack.org/#!/story/2006051

Comment 14 Michael Johnson 2020-03-03 15:59:04 UTC
Linked the upstream Octavia patch with a workaround for the nova issue. It is still WIP, but getting closer to being ready for upstream reviews.
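
Once the patch lands, the scenario in this bug can be exercised directly. A minimal check (host and load balancer names are the ones from the original report; outputs omitted):

[root@titan10 ~]# # take the compute host down hard, as in the original report
[root@titan10 ~]# virsh destroy compute-1
(overcloud) [stack@undercloud-0 ~]$ # with the workaround, the failover should complete instead of hanging in PENDING_UPDATE
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer failover one
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer show one -c provisioning_status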

Comment 16 Bruna Bonguardo 2020-07-08 14:05:33 UTC
*** Bug 1853893 has been marked as a duplicate of this bug. ***

Comment 28 Omer Schwartz 2020-10-04 08:12:46 UTC
The verification process involved these steps:

1) Deployed tripleo + octavia
2) Created an internal tenant network
3) Created 3 load balancers in the internal network
4) Increased the memory and vCPU count of the compute nodes with virsh, one compute node at a time: first compute-0, then compute-1.

A more detailed version of the steps:

(overcloud) [stack@undercloud-0 ~]$ cat /var/lib/rhos-release/latest-installed
13  -p 2020-09-16.1
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
| id                                   | name     | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
| 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | test-lb1 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.193 | ACTIVE              | amphora  |
| effaaa1f-5792-4fdb-bf90-098a908692f1 | test-lb2 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.55  | ACTIVE              | amphora  |
| 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | test-lb3 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.37  | ACTIVE              | amphora  |
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@undercloud-0 ~]$ logout
Connection to undercloud-0 closed.
[root@titan89 ~]# virsh list
 Id   Name           State
------------------------------
 4    undercloud-0   running
 19   controller-0   running
 20   controller-2   running
 21   controller-1   running
 27   compute-1      running
 28   compute-0      running

[root@titan89 ~]# virsh dumpxml compute-0 | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-1 | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-0 | grep emo
  <memory unit='KiB'>12572672</memory>
  <currentMemory unit='KiB'>12572672</currentMemory>
[root@titan89 ~]# virsh dumpxml compute-1 | grep emo
  <memory unit='KiB'>12572672</memory>
  <currentMemory unit='KiB'>12572672</currentMemory>
[root@titan89 ~]# virsh shutdown compute-1
Domain compute-1 is being shutdown

[root@titan89 ~]# virsh edit compute-1
Domain compute-1 XML configuration edited.  <-- I doubled the memory and vcpu

[root@titan89 ~]# virsh create /etc/libvirt/qemu/compute-1.xml
Domain compute-1 created from /etc/libvirt/qemu/compute-1.xml

[root@titan89 ~]# virsh shutdown compute-0
Domain compute-0 is being shutdown

[root@titan89 ~]# virsh edit compute-0
Domain compute-0 XML configuration edited. <-- I doubled the memory and vcpu

[root@titan89 ~]# virsh create /etc/libvirt/qemu/compute-0.xml
Domain compute-0 created from /etc/libvirt/qemu/compute-0.xml

[root@titan89 ~]# virsh list
 Id   Name           State
------------------------------
 4    undercloud-0   running
 19   controller-0   running
 20   controller-2   running
 21   controller-1   running
 29   compute-1      running
 30   compute-0      running

[root@titan89 ~]# virsh dumpxml compute-0 | grep cpu
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-1 | grep cpu
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-passthrough' check='none'/>
[root@titan89 ~]# virsh dumpxml compute-0 | grep emo
  <memory unit='KiB'>25145344</memory>
  <currentMemory unit='KiB'>25145344</currentMemory>
[root@titan89 ~]# virsh dumpxml compute-1 | grep emo
  <memory unit='KiB'>25145344</memory>
  <currentMemory unit='KiB'>25145344</currentMemory>

[root@titan89 ~]# ssh stack@undercloud-0
Warning: Permanently added 'undercloud-0' (ECDSA) to the list of known hosts.
Last login: Sun Oct  4 03:56:12 2020 from 172.16.0.1
[stack@undercloud-0 ~]$ . overcloudrc 
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
| id                                   | name     | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
| 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | test-lb1 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.193 | ACTIVE              | amphora  |
| effaaa1f-5792-4fdb-bf90-098a908692f1 | test-lb2 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.55  | ACTIVE              | amphora  |
| 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | test-lb3 | 60c7cfeb082f416aa5c8a651276c959e | 192.168.1.37  | ACTIVE              | amphora  |
+--------------------------------------+----------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| 8e72cada-0c42-47cb-b5d7-c2f41526eb79 | 9b6cc8eb-63ab-4787-bfe1-cf95bd33eb06 | ALLOCATED | STANDALONE | 172.24.1.63   | 192.168.1.193 |
| 58f36bd2-58c4-47a7-88a9-985d78c74a55 | effaaa1f-5792-4fdb-bf90-098a908692f1 | ALLOCATED | STANDALONE | 172.24.0.53   | 192.168.1.55  |
| 883b52fb-705a-4dae-8b20-111f4965c8ff | 00330dc3-4d16-42d5-869c-92cf14e6b7c2 | ALLOCATED | STANDALONE | 172.24.0.219  | 192.168.1.37  |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
(overcloud) [stack@undercloud-0 ~]$

The provisioning_status of all 3 LBs is ACTIVE.
The status of all 3 amphorae is ALLOCATED.

Looks good to me.
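
As an additional sanity check (not part of the transcript above), creating a listener on one of the load balancers, the operation that returned HTTP 409 in the original report, should now succeed:

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer listener create --name listener1 --protocol HTTP --protocol-port 80 test-lb1
(overcloud) [stack@undercloud-0 ~]$ # the LB should return to ACTIVE once the listener is created
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer show test-lb1 -c provisioning_status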

Comment 34 errata-xmlrpc 2020-10-28 18:34:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (octavia-train bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4400

