Bug 1874927

Summary: Octavia LBs stuck in PENDING_UPDATE state after compute nodes reboot (Nova port detach failure)
Product: Red Hat OpenStack
Component: openstack-octavia
Version: 16.1 (Train)
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: Triaged, ZStream
Reporter: Gregory Thiemonge <gthiemon>
Assignee: Gregory Thiemonge <gthiemon>
QA Contact: Omer Schwartz <oschwart>
CC: averi, batkisso, bbonguar, cgoncalves, ihrachys, lpeer, majopela, mgarciac, michjohn, mvalsecc, njohnston, oschwart, scohen
Fixed In Version: openstack-octavia-5.0.3-0.20200725113403.68c0285.el8ost
Clone Of: 1723482
Last Closed: 2020-10-28 15:39:26 UTC
Bug Depends On: 1723482    

Description Gregory Thiemonge 2020-09-02 15:13:00 UTC
+++ This bug was initially created as a clone of Bug #1723482 +++

Description of problem:
Octavia Load Balancers stuck in PENDING_UPDATE state after compute nodes reboot.

Version-Release number of selected component (if applicable):
[2019-06-24 11:24:52] (overcloud) [stack@undercloud-0 ~]$ cat /etc/yum.repos.d/latest-installed
14  -p 2019-06-19.2


[2019-06-24 11:30:05] (overcloud) [stack@undercloud-0 ~]$ rpm -qa | grep octavia
python2-octaviaclient-1.6.0-0.20180816134808.64d007f.el7ost.noarch
puppet-octavia-13.3.2-0.20190420064721.29482dd.el7ost.noarch
octavia-amphora-image-x86_64-14.0-20190617.1.el7ost.noarch

How reproducible:
Unclear

Steps to Reproduce:
1) Deploy tripleo + octavia
2) Create an internal tenant network
3) Create 3 load balancers on the internal network (see the command sketch after this list)
4) Increase the memory and vCPU count of the compute nodes (virsh), one compute node at a time: first compute-0, then compute-1.
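A minimal sketch of the commands for steps 2 and 3, assuming the int_net_1 network and the one/two/three load balancer names seen in the outputs below; the subnet name int_subnet_1 and the 10.0.1.0/24 range are only inferred from the VIP addresses and are otherwise hypothetical:

openstack network create int_net_1
openstack subnet create --network int_net_1 --subnet-range 10.0.1.0/24 int_subnet_1
openstack loadbalancer create --name one --vip-subnet-id int_subnet_1
openstack loadbalancer create --name two --vip-subnet-id int_subnet_1
openstack loadbalancer create --name three --vip-subnet-id int_subnet_1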


Actual results:
One of the 3 LBs is ACTIVE and the other two are stuck in PENDING_UPDATE. Two of the amphorae are in ERROR state.


Expected results:
All Load Balancers and Amphorae are ACTIVE and ONLINE.

Steps:
[root@titan10 ~]# virsh shutdown compute-1
Domain compute-1 is being shutdown

[root@titan10 ~]# virsh edit compute-1
Domain compute-1 XML configuration edited. <------ Added more memory and more vcpus.
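
For reference, the virsh edit step changes the standard libvirt domain XML elements shown below; the values here are only an example and follow the doubling done later during verification (comments 8 and 9), not the exact numbers used on this host:

  <vcpu placement='static'>8</vcpu>                    <!-- was 4 -->
  <memory unit='KiB'>25145344</memory>                 <!-- was 12572672 -->
  <currentMemory unit='KiB'>25145344</currentMemory>   <!-- was 12572672 -->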

[root@titan10 ~]# virsh create /etc/libvirt/qemu/compute-1.xml
Domain compute-1 created from /etc/libvirt/qemu/compute-1.xml


[root@titan10 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     undercloud-0                   running
 20    controller-2                   running
 22    controller-1                   running
 24    controller-0                   running
 25    compute-0                      running
 26    compute-1                      running

[root@titan10 ~]# ssh root@undercloud-0

[2019-06-24 09:34:49] (tester) [stack@undercloud-0 ~]$ openstack loadbalancer listener create --name listenerHTTP-one one --protocol HTTP --protocol-port 80
Load Balancer 789510be-eee5-4055-97c8-917e680b8e0e is immutable and cannot be updated. (HTTP 409) (Request-ID: req-19be68a2-9069-4d7d-b53f-2cbd8271475e)

[2019-06-24 09:35:25] (tester) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+
| id                                   | name  | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+
| 789510be-eee5-4055-97c8-917e680b8e0e | one   | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.13   | PENDING_UPDATE      | amphora  |
| 56804bfb-aefa-4569-a2ab-54b8fdde7542 | two   | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.6    | ACTIVE              | amphora  |
| b94d6658-4266-4051-9674-881d280ac6ea | three | 635e7c28cd8e416cbc3225f642b8d28b | 10.0.1.10   | PENDING_UPDATE      | amphora  |
+--------------------------------------+-------+----------------------------------+-------------+---------------------+----------+

[2019-06-24 09:47:55] (overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip     |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+
| 848ed1d0-d424-4f36-b7de-acecede5a95b | 789510be-eee5-4055-97c8-917e680b8e0e | ERROR     | STANDALONE | 172.24.0.26   | 10.0.1.13 |
| 0b6d11a7-97ed-46cb-8825-2349be8715ee | b94d6658-4266-4051-9674-881d280ac6ea | ERROR     | STANDALONE | 172.24.0.7    | 10.0.1.10 |
| 3fb72336-568f-46df-bfa3-907e90be55b5 | 56804bfb-aefa-4569-a2ab-54b8fdde7542 | ALLOCATED | STANDALONE | 172.24.0.22   | 10.0.1.6  |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+-----------+

[2019-06-24 09:51:39] (overcloud) [stack@undercloud-0 ~]$ openstack server list --all
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+
| ID                                   | Name                                         | Status  | Networks                                                                | Image                                  | Flavor        |
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+
| bf07e809-25ce-4260-9b59-255e9f43411a | amphora-3fb72336-568f-46df-bfa3-907e90be55b5 | ACTIVE  | lb-mgmt-net=172.24.0.22; int_net_1=2001::f816:3eff:fea1:a23f, 10.0.1.16 | octavia-amphora-14.0-20190617.1.x86_64 | octavia_65    |
+--------------------------------------+----------------------------------------------+---------+-------------------------------------------------------------------------+----------------------------------------+---------------+


Am I missing something? Should I have failed over the LBs before the compute node shutdown? What is the best practice for compute node maintenance with Octavia?

Thank you
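
(For reference: before planned compute node maintenance, an ACTIVE load balancer can be rebuilt on a new amphora with a manual failover through the Octavia CLI. A minimal example, using one of the LB IDs from the listing above:

openstack loadbalancer failover 56804bfb-aefa-4569-a2ab-54b8fdde7542

This only works while the LB is still mutable, i.e. not already stuck in PENDING_UPDATE, as the 409 response above shows.)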

Comment 3 Gregory Thiemonge 2020-09-02 15:17:54 UTC
*** Bug 1874542 has been marked as a duplicate of this bug. ***

Comment 8 Omer Schwartz 2020-09-24 15:08:17 UTC
Verification involved the following steps:

1) Deployed tripleo + octavia
2) Created an internal tenant network
3) Created 3 load balancers on the internal network
4) Increased the memory and vCPU count of the compute nodes (virsh), one compute node at a time: first compute-0, then compute-1.


A more detailed version of the steps:

(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+--------------------------------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name                                 | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+--------------------------------------+----------------------------------+---------------+---------------------+----------+
| 62385ae9-3a52-41d4-87c8-4abd481e6261 | test_tree-loadbalancer-4t7iwpemjra2  | d3f6513b54b34c3bb4bf6ee58f78d030 | 192.168.1.29  | ACTIVE              | amphora  |
| 78c30ea1-c6b9-41e8-9e8d-61497c8f7734 | test_tree2-loadbalancer-atzbw64lfivs | d3f6513b54b34c3bb4bf6ee58f78d030 | 192.168.1.252 | ACTIVE              | amphora  |
| 83c3027c-df67-41d4-b5df-793fe0da370c | test_tree3-loadbalancer-f7arjwyhl2bh | d3f6513b54b34c3bb4bf6ee58f78d030 | 192.168.1.96  | ACTIVE              | amphora  |
+--------------------------------------+--------------------------------------+----------------------------------+---------------+---------------------+----------+

(overcloud) [stack@undercloud-0 ~]$ logout
Connection to undercloud-0 closed.

[root@seal19 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     undercloud-0                   running
 17    compute-1                      running
 18    compute-0                      running
 19    controller-0                   running
 20    controller-2                   running
 21    controller-1                   running

[root@seal19 ~]# virsh dumpxml compute-0 | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cpu mode='host-passthrough' check='none'/>

[root@seal19 ~]# virsh dumpxml compute-1 | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cpu mode='host-passthrough' check='none'/>

[root@seal19 ~]# virsh dumpxml compute-0 | grep emo
  <memory unit='KiB'>12572672</memory>
  <currentMemory unit='KiB'>12572672</currentMemory>

[root@seal19 ~]# virsh dumpxml compute-1 | grep emo
  <memory unit='KiB'>12572672</memory>
  <currentMemory unit='KiB'>12572672</currentMemory>

[root@seal19 ~]# virsh shutdown compute-1
Domain compute-1 is being shutdown

[root@seal19 ~]# virsh edit compute-1
Domain compute-1 XML configuration edited.   <-- I doubled the memory and vcpu

[root@seal19 ~]# virsh create /etc/libvirt/qemu/compute-1.xml
Domain compute-1 created from /etc/libvirt/qemu/compute-1.xml

[root@seal19 ~]# virsh shutdown compute-0
Domain compute-0 is being shutdown

[root@seal19 ~]# virsh edit compute-0
Domain compute-0 XML configuration edited.  <-- I doubled the memory and vcpu

[root@seal19 ~]# virsh create /etc/libvirt/qemu/compute-0.xml
Domain compute-0 created from /etc/libvirt/qemu/compute-0.xml

[root@seal19 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 4     undercloud-0                   running
 19    controller-0                   running
 20    controller-2                   running
 21    controller-1                   running
 22    compute-1                      running
 23    compute-0                      running

[root@seal19 ~]# ssh stack@undercloud-0
Warning: Permanently added 'undercloud-0' (ECDSA) to the list of known hosts.
Activate the web console with: systemctl enable --now cockpit.socket

This system is not registered to Red Hat Insights. See https://cloud.redhat.com/
To register this system, run: insights-client --register

Last login: Thu Sep 24 09:39:20 2020 from 172.16.0.1
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer list
+--------------------------------------+--------------------------------------+----------------------------------+---------------+---------------------+----------+
| id                                   | name                                 | project_id                       | vip_address   | provisioning_status | provider |
+--------------------------------------+--------------------------------------+----------------------------------+---------------+---------------------+----------+
| 62385ae9-3a52-41d4-87c8-4abd481e6261 | test_tree-loadbalancer-4t7iwpemjra2  | d3f6513b54b34c3bb4bf6ee58f78d030 | 192.168.1.29  | ACTIVE              | amphora  |
| 78c30ea1-c6b9-41e8-9e8d-61497c8f7734 | test_tree2-loadbalancer-atzbw64lfivs | d3f6513b54b34c3bb4bf6ee58f78d030 | 192.168.1.252 | ACTIVE              | amphora  |
| 83c3027c-df67-41d4-b5df-793fe0da370c | test_tree3-loadbalancer-f7arjwyhl2bh | d3f6513b54b34c3bb4bf6ee58f78d030 | 192.168.1.96  | ACTIVE              | amphora  |
+--------------------------------------+--------------------------------------+----------------------------------+---------------+---------------------+----------+
(overcloud) [stack@undercloud-0 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip         |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
| fbeac58b-2cfc-4ffd-b93a-cf2fd6832b1d | 78c30ea1-c6b9-41e8-9e8d-61497c8f7734 | ALLOCATED | STANDALONE | 172.24.3.138  | 192.168.1.252 |
| 2d02afbe-cbc7-4fa3-92e2-3b2b00acde32 | 83c3027c-df67-41d4-b5df-793fe0da370c | ALLOCATED | STANDALONE | 172.24.3.215  | 192.168.1.96  |
| bd2156cd-3e8f-4148-86e3-f539696c3617 | 62385ae9-3a52-41d4-87c8-4abd481e6261 | ALLOCATED | STANDALONE | 172.24.3.156  | 192.168.1.29  |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+---------------+
(overcloud) [stack@undercloud-0 ~]$ 

The provisioning_status of all 3 LBs is ACTIVE.
The status of all 3 amphorae is ALLOCATED.


The puddle I checked is:
(overcloud) [stack@undercloud-0 virt]$ cat /var/lib/rhos-release/latest-installed
16.1  -p RHOS-16.1-RHEL-8-20200917.n.3

Looks good to me.

Comment 9 Omer Schwartz 2020-09-24 15:12:48 UTC
I forgot to attach the details of the updated compute servers, so here they are:

[root@seal19 ~]# virsh dumpxml compute-0 | grep cpu
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-passthrough' check='none'/>

[root@seal19 ~]# virsh dumpxml compute-1 | grep cpu
  <vcpu placement='static'>8</vcpu>
  <cpu mode='host-passthrough' check='none'/>

[root@seal19 ~]# virsh dumpxml compute-0 | grep emo
  <memory unit='KiB'>25145344</memory>
  <currentMemory unit='KiB'>25145344</currentMemory>

[root@seal19 ~]# virsh dumpxml compute-1 | grep emo
  <memory unit='KiB'>25145344</memory>
  <currentMemory unit='KiB'>25145344</currentMemory>

[root@seal19 ~]#

Comment 15 errata-xmlrpc 2020-10-28 15:39:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284