Bug 1979840 - [FFWD 13->16.1] Upgrade fails with "container does not exist in database: no such container"
Summary: [FFWD 13->16.1] Upgrade fails with "container does not exist in database: no such container"
Keywords:
Status: CLOSED DUPLICATE of bug 1973674
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Grzegorz Grasza
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-07 08:43 UTC by Jose Luis Franco
Modified: 2022-08-11 10:56 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 19:19:36 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch moving the "get container_id" task just before the failing task (1.50 KB, patch)
2021-07-08 13:17 UTC, Grzegorz Grasza


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1935045 0 None None None 2021-07-08 13:27:55 UTC
OpenStack gerrit 802436 0 None NEW Get the container ID just before using it 2021-07-27 08:34:55 UTC
Red Hat Issue Tracker OSP-5925 0 None None None 2022-08-11 10:56:34 UTC

Description Jose Luis Franco 2021-07-07 08:43:49 UTC
Description of problem:

After the fix for BZ1849211 (https://review.opendev.org/737486) merged into the OSP16.1 puddle, we have seen the OSP13 to OSP16.1 CI job failing with:

2021-07-06 20:18:03 | TASK [copy certificate, chgrp, restart haproxy] ********************************
2021-07-06 20:18:03 | Tuesday 06 July 2021  20:18:01 +0000 (0:00:00.141)       0:01:50.266 ********** 
2021-07-06 20:18:03 | skipping: [controller-2] => (item=)  => {"ansible_loop_var": "item", "changed": false, "item": "", "skip_reason": "Conditional result was False"}
2021-07-06 20:18:03 | changed: [controller-1] => (item=1ddad457523d) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 1ddad457523d:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 1ddad457523d chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 1ddad457523d\n", "delta": "0:00:01.157213", "end": "2021-07-06 20:18:02.606124", "item": "1ddad457523d", "rc": 0, "start": "2021-07-06 20:18:01.448911", "stderr": "", "stderr_lines": [], "stdout": "1ddad457523de1b081c77d6c75f8fcd5bbd7c2e7a928145b395213487a9914f0", "stdout_lines": ["1ddad457523de1b081c77d6c75f8fcd5bbd7c2e7a928145b395213487a9914f0"]}
2021-07-06 20:18:03 | failed: [controller-0] (item=595bf6a82a2c) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 595bf6a82a2c:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 595bf6a82a2c chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 595bf6a82a2c\n", "delta": "0:00:01.495510", "end": "2021-07-06 20:18:02.923306", "item": "595bf6a82a2c", "msg": "non-zero return code", "rc": 125, "start": "2021-07-06 20:18:01.427796", "stderr": "Error: container 595bf6a82a2cef33afef55b41b4737c74747f37597a0462ccba5c2bfd4303b9b does not exist in database: no such container", "stderr_lines": ["Error: container 595bf6a82a2cef33afef55b41b4737c74747f37597a0462ccba5c2bfd4303b9b does not exist in database: no such container"], "stdout": "", "stdout_lines": []}
2021-07-06 20:18:03 | 
2021-07-06 20:18:03 | NO MORE HOSTS LEFT *************************************************************
2021-07-06 20:18:03 | 
2021-07-06 20:18:03 | PLAY RECAP *********************************************************************
2021-07-06 20:18:03 | controller-0               : ok=107  changed=26   unreachable=0    failed=1    skipped=51   rescued=0    ignored=0   
2021-07-06 20:18:03 | controller-1               : ok=101  changed=27   unreachable=0    failed=0    skipped=51   rescued=0    ignored=0   
2021-07-06 20:18:03 | controller-2               : ok=101  changed=53   unreachable=0    failed=0    skipped=51   rescued=0    ignored=0   

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_1comp-ipv4-ovs_lvm/11/undercloud-0/home/stack/overcloud_upgrade_run-controller-2,controller-1,controller-0.log.gz
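
For reference, the failing pattern can be reconstructed from the log output in this report as roughly the following Ansible tasks. The shell commands are taken verbatim from the log; the surrounding YAML, the registered variable name, and the "when" condition are assumptions for illustration, not the actual tripleo-heat-templates source:

    - name: get container_id
      command: podman ps -q -f name=haproxy
      register: container_id

    # ...other upgrade tasks run here; the haproxy container can be
    # recreated in the meantime, leaving the registered id stale...

    - name: copy certificate, chgrp, restart haproxy
      shell: |
        set -e
        podman cp /etc/pki/tls/private/overcloud_endpoint.pem {{ item }}:/etc/pki/tls/private/overcloud_endpoint.pem
        podman exec --user root {{ item }} chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem
        podman kill --signal=HUP {{ item }}
      loop: "{{ container_id.stdout_lines }}"
      when: item != ''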


Version-Release number of selected component (if applicable):


How reproducible:

Run CI job https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-16.1-from-13-latest_cdn-3cont_1comp-ipv4-ovs_lvm/

Steps to Reproduce:
1.
2.
3.

Actual results:

FFWD 13 to 16.1 fails when upgrading the third controller

Expected results:


FFWD 13 to 16.1 succeeds

Additional info:

Comment 1 Grzegorz Grasza 2021-07-08 13:17:04 UTC
Created attachment 1799651
Patch moving the "get container_id" task just before the failing task

It looks like a race condition: the container_id is retrieved much too early. I'm attaching a patch to test this.
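
Schematically, the patch reorders the tasks as follows (task names are from the logs; the intermediate tasks are placeholders):

    # Before: the id is captured early and can go stale while other
    # upgrade tasks recreate the container.
    - name: get container_id
    - name: <other certificate/haproxy tasks>
    - name: copy certificate, chgrp, restart haproxy

    # After: the id is captured immediately before it is used,
    # shrinking the race window.
    - name: <other certificate/haproxy tasks>
    - name: get container_id
    - name: copy certificate, chgrp, restart haproxy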

Comment 2 Grzegorz Grasza 2021-07-19 08:04:37 UTC
Hey Jose, were you able to test my patch?

Comment 4 Grzegorz Grasza 2021-07-19 13:03:01 UTC
Ok, I'll close it for now.

Comment 5 Jose Luis Franco 2021-07-20 08:30:19 UTC
Hello Grzegorz,

I was about to tell you the same: the run with your patch passed:

2021-07-19 16:03:41 | TASK [get container_id] ********************************************************
2021-07-19 16:03:41 | Monday 19 July 2021  16:03:38 +0000 (0:00:00.136)       0:01:43.577 *********** 
2021-07-19 16:03:41 | changed: [controller-2] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.085396", "end": "2021-07-19 16:03:39.230789", "rc": 0, "start": "2021-07-19 16:03:39.145393", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2021-07-19 16:03:41 | changed: [controller-0] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.150820", "end": "2021-07-19 16:03:39.287115", "rc": 0, "start": "2021-07-19 16:03:39.136295", "stderr": "", "stderr_lines": [], "stdout": "9ff4f1a24ea5", "stdout_lines": ["9ff4f1a24ea5"]}
2021-07-19 16:03:41 | changed: [controller-1] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.138453", "end": "2021-07-19 16:03:39.313995", "rc": 0, "start": "2021-07-19 16:03:39.175542", "stderr": "", "stderr_lines": [], "stdout": "b01d1bb1f5b5", "stdout_lines": ["b01d1bb1f5b5"]}
2021-07-19 16:03:41 | 
2021-07-19 16:03:41 | TASK [copy certificate, chgrp, restart haproxy] ********************************
2021-07-19 16:03:41 | Monday 19 July 2021  16:03:39 +0000 (0:00:00.596)       0:01:44.174 *********** 
2021-07-19 16:03:41 | skipping: [controller-2] => (item=)  => {"ansible_loop_var": "item", "changed": false, "item": "", "skip_reason": "Conditional result was False"}
2021-07-19 16:03:41 | changed: [controller-1] => (item=b01d1bb1f5b5) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem b01d1bb1f5b5:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root b01d1bb1f5b5 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP b01d1bb1f5b5\n", "delta": "0:00:01.198461", "end": "2021-07-19 16:03:41.012301", "item": "b01d1bb1f5b5", "rc": 0, "start": "2021-07-19 16:03:39.813840", "stderr": "", "stderr_lines": [], "stdout": "b01d1bb1f5b5eea73e783d5f1876c6e092dc15587f538531d3de43be9feef644", "stdout_lines": ["b01d1bb1f5b5eea73e783d5f1876c6e092dc15587f538531d3de43be9feef644"]}
2021-07-19 16:03:55 | 
2021-07-19 16:03:55 | changed: [controller-0] => (item=9ff4f1a24ea5) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 9ff4f1a24ea5:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 9ff4f1a24ea5 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 9ff4f1a24ea5\n", "delta": "0:00:01.243203", "end": "2021-07-19 16:03:41.011685", "item": "9ff4f1a24ea5", "rc": 0, "start": "2021-07-19 16:03:39.768482", "stderr": "", "stderr_lines": [], "stdout": "9ff4f1a24ea5e64db402ecef61a4873e65c9da0281c4e439e8a22278a1f16a8f", "stdout_lines": ["9ff4f1a24ea5e64db402ecef61a4873e65c9da0281c4e439e8a22278a1f16a8f"]}
2021-07-19 16:03:55 | 

But it is also passing without it, so let's close it for the time being.

Thanks for having a look.

Comment 6 Grzegorz Grasza 2021-07-27 08:23:06 UTC
It happened again; this time the container was present, but stopped:


2021-07-26 21:24:54 | TASK [get container_id] ********************************************************
2021-07-26 21:24:54 | Monday 26 July 2021  21:24:50 +0000 (0:00:00.629)       0:01:50.342 *********** 
2021-07-26 21:24:54 | changed: [controller-0] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.137879", "end": "2021-07-26 21:24:50.905193", "rc": 0, "start": "2021-07-26 21:24:50.767314", "stderr": "", "stderr_lines": [], "stdout": "6bfc44160912", "stdout_lines": ["6bfc44160912"]}
2021-07-26 21:24:54 | changed: [controller-1] => {"changed": true, "cmd": ["podman", "ps", "-q", "-f", "name=haproxy"], "delta": "0:00:00.119417", "end": "2021-07-26 21:24:50.904214", "rc": 0, "start": "2021-07-26 21:24:50.784797", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
2021-07-26 21:24:54 | 
2021-07-26 21:24:54 | TASK [get pcs resource name for haproxy container] *****************************
2021-07-26 21:24:54 | Monday 26 July 2021  21:24:50 +0000 (0:00:00.564)       0:01:50.907 *********** 
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | 
2021-07-26 21:24:54 | TASK [remove DeployedSSLCertificatePath if is dir] *****************************
2021-07-26 21:24:54 | Monday 26 July 2021  21:24:51 +0000 (0:00:00.107)       0:01:51.015 *********** 
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | 
2021-07-26 21:24:54 | TASK [push certificate content] ************************************************
2021-07-26 21:24:54 | Monday 26 July 2021  21:24:51 +0000 (0:00:00.106)       0:01:51.121 *********** 
2021-07-26 21:24:54 | ok: [controller-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2021-07-26 21:24:54 | changed: [controller-1] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
2021-07-26 21:24:54 | 
2021-07-26 21:24:54 | TASK [set certificate ownership] ***********************************************
2021-07-26 21:24:54 | Monday 26 July 2021  21:24:52 +0000 (0:00:01.225)       0:01:52.346 *********** 
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | 
2021-07-26 21:24:54 | TASK [reload haproxy if enabled] ***********************************************
2021-07-26 21:24:54 | Monday 26 July 2021  21:24:52 +0000 (0:00:00.106)       0:01:52.453 *********** 
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | 
2021-07-26 21:24:54 | TASK [restart pacemaker resource for haproxy] **********************************
2021-07-26 21:24:54 | Monday 26 July 2021  21:24:52 +0000 (0:00:00.104)       0:01:52.557 *********** 
2021-07-26 21:24:54 | skipping: [controller-0] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | skipping: [controller-1] => {"changed": false, "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | 
2021-07-26 21:24:54 | TASK [copy certificate, chgrp, restart haproxy] ********************************
2021-07-26 21:24:54 | Monday 26 July 2021  21:24:52 +0000 (0:00:00.107)       0:01:52.664 *********** 
2021-07-26 21:24:54 | skipping: [controller-1] => (item=)  => {"ansible_loop_var": "item", "changed": false, "item": "", "skip_reason": "Conditional result was False"}
2021-07-26 21:24:54 | failed: [controller-0] (item=6bfc44160912) => {"ansible_loop_var": "item", "changed": true, "cmd": "set -e\npodman cp /etc/pki/tls/private/overcloud_endpoint.pem 6bfc44160912:/etc/pki/tls/private/overcloud_endpoint.pem\npodman exec --user root 6bfc44160912 chgrp haproxy /etc/pki/tls/private/overcloud_endpoint.pem\npodman kill --signal=HUP 6bfc44160912\n", "delta": "0:00:01.540407", "end": "2021-07-26 21:24:54.671104", "item": "6bfc44160912", "msg": "non-zero return code", "rc": 125, "start": "2021-07-26 21:24:53.130697", "stderr": "Error: can only kill running containers. 6bfc4416091244b37457180ea5c0d26887d2e7a9e8384a5e7457bba93045f326 is in state stopped: container state improper", "stderr_lines": ["Error: can only kill running containers. 6bfc4416091244b37457180ea5c0d26887d2e7a9e8384a5e7457bba93045f326 is in state stopped: container state improper"], "stdout": "", "stdout_lines": []}
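
Since podman ps without -a lists only running containers, the id above was valid when it was captured; the container then stopped in the window before podman kill ran. One conceivable guard, shown purely as an illustration and not as the change proposed in gerrit 802436, would be to re-check the container state immediately before signalling it:

    # Hypothetical guard, not the merged fix: only send SIGHUP if the
    # container is still running at this point ($container_id is a
    # placeholder for the id captured earlier).
    state=$(podman inspect --format '{{.State.Status}}' "$container_id")
    if [ "$state" = "running" ]; then
        podman kill --signal=HUP "$container_id"
    fi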

