Description of problem:

After the fix for BZ1849211 (https://review.opendev.org/737486) merged in the OSP16.1 puddle, we have seen the OSP13 to OSP16.1 CI job failing with:

2021-07-06 20:18:03 | TASK [copy certificate, chgrp, restart haproxy] ********************************
2021-07-06 20:18:03 | Tuesday 06 July 2021 20:18:01 +0000 (0:00:00.141) 0:01:50.266 **********
2021-07-06 20:18:03 | skipping: [controller-2] => (item=) => {"ansible_loop_var": "item", "changed": false, "item": "", "skip_reason": "Conditional result was False"}
2021-07-06 20:18:03 | changed: [controller-1] => (item=1ddad457523d) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 1ddad457523d:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 1ddad457523d chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 1ddad457523d\n", "delta": "0:00:01.157213", "end": "2021-07-06 20:18:02.606124", "item": "1ddad457523d", "rc": 0, "start": "2021-07-06 20:18:01.448911", "stderr": "", "stderr_lines": [], "stdout": "1ddad457523de1b081c77d6c75f8fcd5bbd7c2e7a928145b395213487a9914f0", "stdout_lines": ["1ddad457523de1b081c77d6c75f8fcd5bbd7c2e7a928145b395213487a9914f0"]}
2021-07-06 20:18:03 | failed: [controller-0] (item=595bf6a82a2c) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 595bf6a82a2c:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 595bf6a82a2c chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 595bf6a82a2c\n", "delta": "0:00:01.495510", "end": "2021-07-06 20:18:02.923306", "item": "595bf6a82a2c", "msg": "non-zero return code", "rc": 125, "start": "2021-07-06 20:18:01.427796", "stderr": "Error: container 595bf6a82a2cef33afef55b41b4737c74747f37597a0462ccba5c2bfd4303b9b does not exist in database: no such container", "stderr_lines": ["Error: container 595bf6a82a2cef33afef55b41b4737c74747f37597a0462ccba5c2bfd4303b9b does not exist in database: no such container"], "stdout": "", "stdout_lines": []}
2021-07-06 20:18:03 |
2021-07-06 20:18:03 | NO MORE HOSTS LEFT *************************************************************
2021-07-06 20:18:03 |
2021-07-06 20:18:03 | PLAY RECAP *********************************************************************
2021-07-06 20:18:03 | controller-0 : ok=107 changed=26 unreachable=0 failed=1 skipped=51 rescued=0 ignored=0
2021-07-06 20:18:03 | controller-1 : ok=101 changed=27 unreachable=0 failed=0 skipped=51 rescued=0 ignored=0
2021-07-06 20:18:03 | controller-2 : ok=101 changed=53 unreachable=0 failed=0 skipped=51 rescued=0 ignored=0

Full log:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_1comp-ipv4-ovs_lvm/11/undercloud-0/home/stack/overcloud_upgrade_run-controller-2,controller-1,controller-0.log.gz

Version-Release number of selected component (if applicable):

How reproducible:
Run CI job https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_1comp-ipv4-ovs_lvm/

Steps to Reproduce:
1.
2.
3.

Actual results:
FFWD 13 to 16.1 fails when upgrading the third controller

Expected results:
FFWD 13 to 16.1 succeeds

Additional info:
Created attachment 1799651 [details]
Patch moving the "get container_id" task just before the failing task

It looks like a race condition - the container_id is retrieved much too early. I'm attaching a patch to test this.
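For context, a minimal sketch of the idea behind the patch: the lookup is moved so it runs immediately before the task that consumes the ID, leaving no window for the container to be replaced in between. The task names and shell are taken from the log above; the register name and loop source are my assumptions, the real change is in attachment 1799651:

    # assumed register name: container_id
    - name: get container_id
      command: podman ps -q -f name=haproxy
      register: container_id

    # no other tasks in between, so the ID cannot go stale
    - name: copy certificate, chgrp, restart haproxy
      shell: |
        set -e
        podman cp /etc/pki/tls/private/overcloud_endpoint.pem {{ item }}:/etc/pki/tls/private/overcloud_endpoint.pem
        podman exec --user root {{ item }} chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem
        podman kill --signal=HUP {{ item }}
      loop: "{{ [container_id.stdout] }}"   # loop source assumed; the log shows item="" being skipped
      when: item != ''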
Hey Jose, were you able to test my patch?
Ok, I'll close it for now.
Hello Grzegorz, I was about to tell you the same... the run with your patch passed:

2021-07-19 16:03:41 | TASK [get container_id] ********************************************************
2021-07-19 16:03:41 | Monday 19 July 2021 16:03:38 +0000 (0:00:00.136) 0:01:43.577 ***********
2021-07-19 16:03:41 | changed: [controller-2] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.085396", "end": "2021-07-19 16:03:39.230789", "rc": 0, "start": "2021-07-19 16:03:39.145393", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2021-07-19 16:03:41 | changed: [controller-0] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.150820", "end": "2021-07-19 16:03:39.287115", "rc": 0, "start": "2021-07-19 16:03:39.136295", "stderr": "", "stderr_lines": [], "stdout": "9ff4f1a24ea5", "stdout_lines": ["9ff4f1a24ea5"]}
2021-07-19 16:03:41 | changed: [controller-1] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.138453", "end": "2021-07-19 16:03:39.313995", "rc": 0, "start": "2021-07-19 16:03:39.175542", "stderr": "", "stderr_lines": [], "stdout": "b01d1bb1f5b5", "stdout_lines": ["b01d1bb1f5b5"]}
2021-07-19 16:03:41 |
2021-07-19 16:03:41 | TASK [copy certificate, chgrp, restart haproxy] ********************************
2021-07-19 16:03:41 | Monday 19 July 2021 16:03:39 +0000 (0:00:00.596) 0:01:44.174 ***********
2021-07-19 16:03:41 | skipping: [controller-2] => (item=) => {"ansible_loop_var": "item", "changed": false, "item": "", "skip_reason": "Conditional result was False"}
2021-07-19 16:03:41 | changed: [controller-1] => (item=b01d1bb1f5b5) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem b01d1bb1f5b5:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root b01d1bb1f5b5 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP b01d1bb1f5b5\n", "delta": "0:00:01.198461", "end": "2021-07-19 16:03:41.012301", "item": "b01d1bb1f5b5", "rc": 0, "start": "2021-07-19 16:03:39.813840", "stderr": "", "stderr_lines": [], "stdout": "b01d1bb1f5b5eea73e783d5f1876c6e092dc15587f538531d3de43be9feef644", "stdout_lines": ["b01d1bb1f5b5eea73e783d5f1876c6e092dc15587f538531d3de43be9feef644"]}
2021-07-19 16:03:55 |
2021-07-19 16:03:55 | changed: [controller-0] => (item=9ff4f1a24ea5) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 9ff4f1a24ea5:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 9ff4f1a24ea5 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 9ff4f1a24ea5\n", "delta": "0:00:01.243203", "end": "2021-07-19 16:03:41.011685", "item": "9ff4f1a24ea5", "rc": 0, "start": "2021-07-19 16:03:39.768482", "stderr": "", "stderr_lines": [], "stdout": "9ff4f1a24ea5e64db402ecef61a4873e65c9da0281c4e439e8a22278a1f16a8f", "stdout_lines": ["9ff4f1a24ea5e64db402ecef61a4873e65c9da0281c4e439e8a22278a1f16a8f"]}
2021-07-19 16:03:55 |

But it is also passing without it, so let's close it for the time being. Thanks for having a look.
It happened again, this time the container was present, but stopped:

2021-07-26 21:24:54 | TASK [get container_id] ********************************************************
2021-07-26 21:24:54 | Monday 26 July 2021 21:24:50 +0000 (0:00:00.629) 0:01:50.342 ***********
2021-07-26 21:24:54 | changed: [controller-0] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.137879", "end": "2021-07-26 21:24:50.905193", "rc": 0, "start": "2021-07-26 21:24:50.767314", "stderr": "", "stderr_lines": [], "stdout": "6bfc44160912", "stdout_lines": ["6bfc44160912"]}
2021-07-26 21:24:54 | changed: [controller-1] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.119417", "end": "2021-07-26 21:24:50.904214", "rc": 0, "start": "2021-07-26 21:24:50.784797", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2021-07-26 21:24:54 |
2021-07-26 21:24:54 | TASK [get pcs resource name for haproxy container] *****************************
2021-07-26 21:24:54 | Monday 26 July 2021 21:24:50 +0000 (0:00:00.564) 0:01:50.907 ***********
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 |
2021-07-26 21:24:54 | TASK [remove DeployedSSLCertificatePath if is dir] *****************************
2021-07-26 21:24:54 | Monday 26 July 2021 21:24:51 +0000 (0:00:00.107) 0:01:51.015 ***********
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 |
2021-07-26 21:24:54 | TASK [push certificate content] ************************************************
2021-07-26 21:24:54 | Monday 26 July 2021 21:24:51 +0000 (0:00:00.106) 0:01:51.121 ***********
2021-07-26 21:24:54 | ok: [controller-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2021-07-26 21:24:54 | changed: [controller-1] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
2021-07-26 21:24:54 |
2021-07-26 21:24:54 | TASK [set certificate ownership] ***********************************************
2021-07-26 21:24:54 | Monday 26 July 2021 21:24:52 +0000 (0:00:01.225) 0:01:52.346 ***********
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 |
2021-07-26 21:24:54 | TASK [reload haproxy if enabled] ***********************************************
2021-07-26 21:24:54 | Monday 26 July 2021 21:24:52 +0000 (0:00:00.106) 0:01:52.453 ***********
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 |
2021-07-26 21:24:54 | TASK [restart pacemaker resource for haproxy] **********************************
2021-07-26 21:24:54 | Monday 26 July 2021 21:24:52 +0000 (0:00:00.104) 0:01:52.557 ***********
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 |
2021-07-26 21:24:54 | TASK [copy certificate, chgrp, restart haproxy] ********************************
2021-07-26 21:24:54 | Monday 26 July 2021 21:24:52 +0000 (0:00:00.107) 0:01:52.664 ***********
2021-07-26 21:24:54 | skipping: [controller-1] => (item=) => {"ansible_loop_var": "item", "changed": false, "item": "", "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | failed: [controller-0] (item=6bfc44160912) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 6bfc44160912:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 6bfc44160912 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 6bfc44160912\n", "delta": "0:00:01.540407", "end": "2021-07-26 21:24:54.671104", "item": "6bfc44160912", "msg": "non-zero return code", "rc": 125, "start": "2021-07-26 21:24:53.130697", "stderr": "Error: can only kill running containers. 6bfc4416091244b37457180ea5c0d26887d2e7a9e8384a5e7457bba93045f326 is in state stopped: container state improper", "stderr_lines": ["Error: can only kill running containers. 6bfc4416091244b37457180ea5c0d26887d2e7a9e8384a5e7457bba93045f326 is in state stopped: container state improper"], "stdout": "", "stdout_lines": []}