Bug 1973674

Summary: Stack reconfiguration failed because ha-proxy container crashed during reconfiguration
Product: Red Hat OpenStack
Reporter: tmicheli
Component: openstack-tripleo-heat-templates
Assignee: Bogdan Dobrelya <bdobreli>
Status: CLOSED ERRATA
QA Contact: Joe H. Rahme <jhakimra>
Severity: low
Priority: medium
Version: 16.1 (Train)
CC: bdobreli, dabarzil, enothen, ggrasza, jfrancoa, jpretori, lmiccini, mburns, mkrcmari
Target Milestone: z8
Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: All
OS: Linux
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20220112143355.29a02c1.el8ost
Doc Type: If docs needed, set a value
Clones: 2008418 (view as bug list)
Last Closed: 2022-03-24 10:59:23 UTC
Type: Bug
Bug Blocks: 2008418

Description tmicheli 2021-06-18 13:02:12 UTC
Description of problem:
During a scale-out of the overcloud (following the procedure in [1]), the haproxy container crashed and was restarted. After reconfiguring the certificates, the deployment scripts reload haproxy by sending `podman kill --signal=HUP <container-id>`. Because the container had been restarted, the recorded container ID was no longer valid, the signal could not be delivered, and the scale-out failed. One possible approach to solve this would be to address the container by its name rather than its ID.
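
A minimal sketch of that name-based approach (illustrative only, not the actual tripleo-heat-templates task; it assumes the haproxy container name starts with "haproxy-bundle", as in the Pacemaker-managed deployments shown in the logs below):

    # Look up the current haproxy container by name on each run; a restart
    # changes the container ID but not its name, so the reload still works.
    name=$(podman ps --format '{{.Names}}' | grep '^haproxy-bundle' | head -1)
    if [ -n "$name" ]; then
      podman cp /etc/pki/tls/private/overcloud_endpoint.pem "$name:/etc/pki/tls/private/overcloud_endpoint.pem"
      podman exec --user root "$name" chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem
      podman kill --signal=HUP "$name"
    fi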

changed: [overcloud-test-controller-1] => (item=6eb0c3cb45d7) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 6eb0c3cb45d7:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 6eb0c3cb45d7 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 6eb0c3cb45d7\n", "delta": "0:00:00.549954", "end": "2021-06-17 17:02:44.448680", "item": "6eb0c3cb45d7", "rc": 0, "start": "2021-06-17 17:02:43.898726", "stderr": "", "stderr_lines": [], "stdout": "6eb0c3cb45d7301e20dcf440821990ae135e7c8ac7d99c9ebb59f40f41867c3a", "stdout_lines": ["6eb0c3cb45d7301e20dcf440821990ae135e7c8ac7d99c9ebb59f40f41867c3a"]}
changed: [overcloud-test-controller-2] => (item=794b10fc1006) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 794b10fc1006:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 794b10fc1006 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 794b10fc1006\n", "delta": "0:00:00.517081", "end": "2021-06-17 17:02:45.127154", "item": "794b10fc1006", "rc": 0, "start": "2021-06-17 17:02:44.610073", "stderr": "", "stderr_lines": [], "stdout": "794b10fc1006f987be67d68b25b6afe7a2e197a2fea0d9351ee389ee129e6337", "stdout_lines": ["794b10fc1006f987be67d68b25b6afe7a2e197a2fea0d9351ee389ee129e6337"]}
changed: [overcloud-test-controller-0] => (item=7318348dda5b) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 7318348dda5b:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 7318348dda5b chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 7318348dda5b\n", "delta": "0:00:00.512512", "end": "2021-06-17 17:02:45.138861", "item": "7318348dda5b", "rc": 0, "start": "2021-06-17 17:02:44.626349", "stderr": "", "stderr_lines": [], "stdout": "7318348dda5b509989fd68a55fe44425213a8a3ecdd23d68dab1a6bf79664b48", "stdout_lines": ["7318348dda5b509989fd68a55fe44425213a8a3ecdd23d68dab1a6bf79664b48"]}
changed: [overcloud-test-controller-1] => (item=e979156f2e99) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem e979156f2e99:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root e979156f2e99 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP e979156f2e99\n", "delta": "0:00:00.530241", "end": "2021-06-17 17:02:45.191649", "item": "e979156f2e99", "rc": 0, "start": "2021-06-17 17:02:44.661408", "stderr": "", "stderr_lines": [], "stdout": "e979156f2e991927f38ebad1b58bd7d6211601664daacddaeee99d2eff8bef97", "stdout_lines": ["e979156f2e991927f38ebad1b58bd7d6211601664daacddaeee99d2eff8bef97"]}
changed: [overcloud-test-controller-0] => (item=c648f98d658f) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem c648f98d658f:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root c648f98d658f chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP c648f98d658f\n", "delta": "0:00:00.660171", "end": "2021-06-17 17:02:46.000114", "item": "c648f98d658f", "rc": 0, "start": "2021-06-17 17:02:45.339943", "stderr": "", "stderr_lines": [], "stdout": "c648f98d658f4fc0025e9c58ee35b37028fcca0006a1e8059e69113112272fa3", "stdout_lines": ["c648f98d658f4fc0025e9c58ee35b37028fcca0006a1e8059e69113112272fa3"]}
changed: [overcloud-test-controller-1] => (item=be1df05c896e) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem be1df05c896e:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root be1df05c896e chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP be1df05c896e\n", "delta": "0:00:00.649566", "end": "2021-06-17 17:02:46.041566", "item": "be1df05c896e", "rc": 0, "start": "2021-06-17 17:02:45.392000", "stderr": "", "stderr_lines": [], "stdout": "be1df05c896e7e593b073d4e14c7a438a6420336ba3ce0e4182ad3d1e09b6e85", "stdout_lines": ["be1df05c896e7e593b073d4e14c7a438a6420336ba3ce0e4182ad3d1e09b6e85"]}
failed: [overcloud-test-controller-2] (item=02cd82dac029) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 02cd82dac029:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 02cd82dac029 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 02cd82dac029\n", "delta": "0:00:00.889173", "end": "2021-06-17 17:02:46.221427", "item": "02cd82dac029", "msg": "non-zero return code", "rc": 125, "start": "2021-06-17 17:02:45.332254", "stderr": "Error: container 02cd82dac029f2bba136f0b7bc315da10bdda77db0f1e214cea65ddc9c830df6 does not exist in database: no such container", "stderr_lines": ["Error: container 02cd82dac029f2bba136f0b7bc315da10bdda77db0f1e214cea65ddc9c830df6 does not exist in database: no such container"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
overcloud-test-compute-2   : ok=80   changed=23   unreachable=0    failed=0    skipped=70   rescued=0    ignored=0
overcloud-test-compute-3   : ok=75   changed=42   unreachable=0    failed=0    skipped=73   rescued=0    ignored=0
overcloud-test-controller-0 : ok=97   changed=29   unreachable=0    failed=0    skipped=53   rescued=0    ignored=0
overcloud-test-controller-1 : ok=94   changed=29   unreachable=0    failed=0    skipped=53   rescued=0    ignored=0
overcloud-test-controller-2 : ok=93   changed=28   unreachable=0    failed=1    skipped=53   rescued=0    ignored=0
undercloud                 : ok=9    changed=5    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

[1]: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/director_installation_and_usage/scaling-overcloud-nodes

Version-Release number of selected component (if applicable):
RHOSP 16.1

How reproducible:


Steps to Reproduce:
1. Scale down the overcloud
2. Scale up the overcloud
3. Restart the haproxy container while the scale-up is running (see the example below)
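
For step 3, a restart on any controller is enough to change the container ID out from under the running deployment; the command below mirrors the one used during verification in comment 21 (the "podman-0" replica suffix may differ per node):

    [root@controller-0 ~]# podman restart haproxy-bundle-podman-0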

Actual results:
* The scale-out failed and had to be restarted

Expected results:
* The deployment should be robust against such failures


Additional info:

Comment 5 Grzegorz Grasza 2021-07-27 19:19:32 UTC
*** Bug 1979840 has been marked as a duplicate of this bug. ***

Comment 6 Grzegorz Grasza 2021-07-29 08:48:02 UTC
jfrancoa got an additional error, which might be related:

    2021-07-28 16:15:18 | TASK [copy certificate, chgrp, restart haproxy] ********************************
    2021-07-28 16:15:18 | Wednesday 28 July 2021  16:15:13 +0000 (0:00:00.104)       0:02:02.950 ********
    2021-07-28 16:15:18 | skipping: [controller-1] => (item=)  => {"ansible_loop_var": "item", "changed": false, "item": "", "skip_reason": "Conditional result was False"}
    2021-07-28 16:15:18 | changed: [controller-0] => (item=40edb9869494) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\nif podman ps -f \"id=40edb9869494\" --format \"{{.Names}}\" | grep -q \"^haproxy-bundle\"; then\n  tar -c /etc/pki/tls/private/overcloud_endpoint.pem | podman exec -i 40edb9869494 tar -C / -xv\nelse\n  podman cp /etc/pki/tls/private/overcloud_endpoint.pem 40edb9869494:/etc/pki/tls/private/overcloud_endpoint.pem\nfi\npodman exec --user root 40edb9869494 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 40edb9869494\n", "delta": "0:00:00.943979", "end": "2021-07-28 16:15:14.860717", "item": "40edb9869494", "rc": 0, "start": "2021-07-28 16:15:13.916738", "stderr": "", "stderr_lines": [], "stdout": "40edb9869494f55d9f81ceb5f83ab20cf5e221872c521e84caa3c40adaa481e5", "stdout_lines": ["40edb9869494f55d9f81ceb5f83ab20cf5e221872c521e84caa3c40adaa481e5"]}
    2021-07-28 16:15:18 | changed: [controller-0] => (item=ce11eae222e0) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\nif podman ps -f \"id=ce11eae222e0\" --format \"{{.Names}}\" | grep -q \"^haproxy-bundle\"; then\n  tar -c /etc/pki/tls/private/overcloud_endpoint.pem | podman exec -i ce11eae222e0 tar -C / -xv\nelse\n  podman cp /etc/pki/tls/private/overcloud_endpoint.pem ce11eae222e0:/etc/pki/tls/private/overcloud_endpoint.pem\nfi\npodman exec --user root ce11eae222e0 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP ce11eae222e0\n", "delta": "0:00:00.909961", "end": "2021-07-28 16:15:16.085611", "item": "ce11eae222e0", "rc": 0, "start": "2021-07-28 16:15:15.175650", "stderr": "", "stderr_lines": [], "stdout": "ce11eae222e0e564021a1594d8ee8d8afd793235cb323136a6f857ded75e28d5", "stdout_lines": ["ce11eae222e0e564021a1594d8ee8d8afd793235cb323136a6f857ded75e28d5"]}
    2021-07-28 16:15:18 | changed: [controller-0] => (item=dde4c18bb88f) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\nif podman ps -f \"id=dde4c18bb88f\" --format \"{{.Names}}\" | grep -q \"^haproxy-bundle\"; then\n  tar -c /etc/pki/tls/private/overcloud_endpoint.pem | podman exec -i dde4c18bb88f tar -C / -xv\nelse\n  podman cp /etc/pki/tls/private/overcloud_endpoint.pem dde4c18bb88f:/etc/pki/tls/private/overcloud_endpoint.pem\nfi\npodman exec --user root dde4c18bb88f chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP dde4c18bb88f\n", "delta": "0:00:00.999568", "end": "2021-07-28 16:15:17.389579", "item": "dde4c18bb88f", "rc": 0, "start": "2021-07-28 16:15:16.390011", "stderr": "", "stderr_lines": [], "stdout": "dde4c18bb88f050f7631d4c2f00ede1ab812c6f5fcde999b241ce2ab48cb0b13", "stdout_lines": ["dde4c18bb88f050f7631d4c2f00ede1ab812c6f5fcde999b241ce2ab48cb0b13"]}
    2021-07-28 16:15:18 |
    2021-07-28 16:15:18 | failed: [controller-0] (item=cbd66eb3e0ce) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\nif podman ps -f \"id=cbd66eb3e0ce\" --format \"{{.Names}}\" | grep -q \"^haproxy-bundle\"; then\n  tar -c /etc/pki/tls/private/overcloud_endpoint.pem | podman exec -i cbd66eb3e0ce tar -C / -xv\nelse\n  podman cp /etc/pki/tls/private/overcloud_endpoint.pem cbd66eb3e0ce:/etc/pki/tls/private/overcloud_endpoint.pem\nfi\npodman exec --user root cbd66eb3e0ce chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP cbd66eb3e0ce\n", "delta": "0:00:00.536844", "end": "2021-07-28 16:15:18.252000", "item": "cbd66eb3e0ce", "msg": "non-zero return code", "rc": 2, "start": "2021-07-28 16:15:17.715156", "stderr": "tar: Removing leading `/' from member names\ntar: This does not look like a tar archive\ntar: Exiting with failure status due to previous errors\ntime=\"2021-07-28T16:15:18Z\" level=error msg=\"read unixpacket @->/var/run/libpod/socket/2642350cdc5fbc67a2061a1471ec74c55f17e16bf4e474510fa0d819dada1628/attach: read: connection reset by peer\"\nError: non zero exit code: 2: OCI runtime error", "stderr_lines": ["tar: Removing leading `/' from member names", "tar: This does not look like a tar archive", "tar: Exiting with failure status due to previous errors", "time=\"2021-07-28T16:15:18Z\" level=error msg=\"read unixpacket @->/var/run/libpod/socket/2642350cdc5fbc67a2061a1471ec74c55f17e16bf4e474510fa0d819dada1628/attach: read: connection reset by peer\"", "Error: non zero exit code: 2: OCI runtime error"], "stdout": "", "stdout_lines": []}
    2021-07-28 16:15:18 |
    2021-07-28 16:15:18 | NO MORE HOSTS LEFT *************************************************************
    2021-07-28 16:15:18 |
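
The cmd embedded in those task results, rendered readably (this is the updated task's shell exactly as captured in the log above, using the ID from the failed item):

    set -e
    # Pacemaker-managed haproxy bundles get the PEM via a tar pipe;
    # everything else falls back to a plain podman cp.
    if podman ps -f "id=cbd66eb3e0ce" --format "{{.Names}}" | grep -q "^haproxy-bundle"; then
      tar -c /etc/pki/tls/private/overcloud_endpoint.pem | podman exec -i cbd66eb3e0ce tar -C / -xv
    else
      podman cp /etc/pki/tls/private/overcloud_endpoint.pem cbd66eb3e0ce:/etc/pki/tls/private/overcloud_endpoint.pem
    fi
    podman exec --user root cbd66eb3e0ce chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem
    podman kill --signal=HUP cbd66eb3e0ce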


I can trigger the tar error by truncating the tar archive, but this does not trigger the "read: connection reset by peer" error.
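
A rough sketch of that reproduction (illustrative only; `head -c` simply cuts the stream so no complete tar header reaches the container, and the bundle name is taken from comment 21):

    # Feed a deliberately truncated tar stream into the bundle container; the
    # receiving tar then exits with "This does not look like a tar archive".
    tar -c /etc/pki/tls/private/overcloud_endpoint.pem | head -c 100 | \
      podman exec -i haproxy-bundle-podman-0 tar -C / -xv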

Comment 11 Marian Krcmarik 2021-07-30 10:49:58 UTC
I've filed https://bugzilla.redhat.com/show_bug.cgi?id=1988330 for the error from comment #6

Comment 13 Eric Nothen 2022-01-11 18:09:33 UTC
I can see on BZ #2008418 (the 16.2 version of this bug) that it has a target milestone of z2. Can we update this BZ with the same information for 16.1, so I can tell the customer when to expect it?

Thanks.

Comment 21 dabarzil 2022-03-02 08:53:00 UTC
Tested with:
$ rpm -qa|grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-11.3.2-1.20220114223345.el8ost.noarch

Before scale down:
(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
| f66b055d-b7d8-4ccf-a2b1-958497a79855 | compute-0    | ACTIVE | ctlplane=192.168.24.42 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+


After scale down:
(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+


During scale up:
[root@controller-0 ~]# podman restart haproxy-bundle-podman-0
b40a511b3a094e0f6cb12d81393c7a44bb1f88a2cff5c0047925d5e9a60fa04c

After scale up:
(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| ID                                   | Name         | Status | Networks               | Image          | Flavor     |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+
| 81e8bc2f-a845-4ecb-a0f4-ab423aa5ecce | compute-2    | ACTIVE | ctlplane=192.168.24.34 | overcloud-full | compute    |
| b59ee548-2f6b-4131-b915-a2b56065798d | controller-2 | ACTIVE | ctlplane=192.168.24.9  | overcloud-full | controller |
| 1ee28c02-504a-40d8-b8a5-d2c12db204c8 | controller-0 | ACTIVE | ctlplane=192.168.24.10 | overcloud-full | controller |
| d7fa4f32-79fd-4844-92ed-9f2178af8e54 | controller-1 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | controller |
| 1427a856-6371-49b1-9898-3f68592dff8b | compute-1    | ACTIVE | ctlplane=192.168.24.22 | overcloud-full | compute    |
+--------------------------------------+--------------+--------+------------------------+----------------+------------+

Comment 26 errata-xmlrpc 2022-03-24 10:59:23 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986