Description of problem:

I have seen this problem only twice so far. The symptoms are a hanging deployment; the hanging resource in my case was the HostPrepDeployment one. We see that the deployment is stuck on some Hosts* resource that was scheduled on a node, but if you go on said node, you'll see nothing happening. For example, we can see the resources hanging:

| AllNodesDeploySteps        | b58bee24-76d5-4a53-8d48-3e76b91407b7 | OS::TripleO::PostDeploySteps      | CREATE_IN_PROGRESS | 2018-03-23T12:00:52Z | overcloud                                  |
| DatabaseHostPrepDeployment | 92469eaf-98be-4fbd-9d98-4101659cceb6 | OS::Heat::SoftwareDeploymentGroup | CREATE_IN_PROGRESS | 2018-03-23T12:16:50Z | overcloud-AllNodesDeploySteps-sapadflv4r6r |
| 2                          | af13b8fb-5926-4167-b5f7-16354864dca6 | OS::Heat::SoftwareDeployment      | CREATE_IN_PROGRESS | 2018-03-23T12:16:54Z | overcloud-AllNodesDeploySteps-sapadflv4r6r-DatabaseHostPrepDeployment-kd4qwvkin4zk |

Now if we hop on all nodes to check whether os-collect-config did indeed get this resource to be completed on the node, we get the following:

$ for i in $(nova list | grep ctlplane | awk '{ print $12 }' | cut -f2 -d=); do echo $i; ssh $i "sudo os-collect-config --print | grep -e 'name.*HostPrepDepl'"; done 2> /dev/null
192.168.24.10
    "name": "CephStorageHostPrepDeployment",
192.168.24.8
    "name": "ControllerHostPrepDeployment",
192.168.24.21
    "name": "ControllerHostPrepDeployment",
192.168.24.12
    "name": "ControllerHostPrepDeployment",
192.168.24.14
    "name": "DatabaseHostPrepDeployment",
192.168.24.16
    "name": "DatabaseHostPrepDeployment",
192.168.24.18   <-- BOOOM overcloud-database-2 has not gotten the memo
192.168.24.6
    "name": "MessagingHostPrepDeployment",
192.168.24.9
    "name": "MessagingHostPrepDeployment",
192.168.24.11
    "name": "MessagingHostPrepDeployment",
192.168.24.19
    "name": "ComputeInstanceHAHostPrepDeployment",

So we're basically in the following state:
A) Heat on the undercloud waits for DatabaseHostPrepDeployment to be completed on overcloud-database-2
B) overcloud-database-2 simply has no idea it needs to run that task

(A sketch for dumping that node's cached metadata directly is at the end of this description.)

A redeploy will usually not show this (rather rare) problem.

Undercloud sosreport:
http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/sosreport-undercloud-0.redhat.local-20180323085448.tar.xz

DB dump:
http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/db.dump.gz

All other sosreports here:
http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/

Version-Release number of selected component (if applicable):

$ rpm -qa | grep heat
openstack-heat-api-cfn-10.0.1-0.20180302152334.c3bd928.el7ost.noarch
python-heat-agent-1.5.4-0.20180301153730.ecf43c7.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-10.0.1-0.20180302152334.c3bd928.el7ost.noarch
puppet-heat-12.3.1-0.20180221104603.27feed4.el7ost.noarch
python2-heatclient-1.14.0-0.20180213175737.2ce6aa1.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180302152334.c3bd928.el7ost.noarch
openstack-heat-common-10.0.1-0.20180302152334.c3bd928.el7ost.noarch
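A minimal sketch for checking what the stuck node actually cached, instead of grepping the whole fleet. The 'request' collector key and the 'deployments' list under it are assumptions based on the --print output from the healthy nodes; adjust if the layout differs on your deployment:

$ ssh 192.168.24.18 "sudo os-collect-config --print" > /tmp/occ.json
$ python - <<'EOF'
import json
# 'request' is assumed to be the collector that polls the Swift temp URL;
# 'deployments' is assumed to be the list Heat publishes into that object
d = json.load(open('/tmp/occ.json'))
print([x.get('name') for x in d.get('request', {}).get('deployments', [])])
EOF

On the broken node this list should come back without DatabaseHostPrepDeployment (or empty), matching the BOOOM above.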
Chatted with therve, this is likely https://bugs.launchpad.net/tripleo/+bug/1731032
(In reply to Michele Baldessari from comment #1)
> Chatted with therve, this is likely
> https://bugs.launchpad.net/tripleo/+bug/1731032

Seems we already have the patches from the LP bug in this package (i.e. https://review.openstack.org/#/c/521468/), so this is likely something else.
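One way to double-check that a given review really is in an installed build: the RDO NVR embeds the git snapshot hash (c3bd928 in the versions above), so an ancestry check against the upstream repo settles it. A sketch, assuming the review is against openstack/heat (substitute the right repo otherwise) and with <fix-sha> standing in for the review's merge commit, taken from its Gerrit page:

$ git clone https://git.openstack.org/openstack/heat && cd heat
$ # <fix-sha> = merge commit of https://review.openstack.org/#/c/521468/
$ git merge-base --is-ancestor <fix-sha> c3bd928 \
    && echo "fix is in the c3bd928 snapshot" \
    || echo "fix is NOT in the snapshot"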
I looked at it a little bit, and it really looks like https://bugs.launchpad.net/tripleo/+bug/1731032, which ought to be fixed. The database is correct and contains the deployment metadata, but the node doesn't get it. I don't have the full Heat logs; those would be useful. Having the swift logs would help too.
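For reference, one way to verify the deployment rows on the DB side, a sketch only: the table and column names are from the Queens-era Heat schema and <nova-uuid-of-overcloud-database-2> is a placeholder, so check them against the dump above first:

$ # list the software deployments Heat thinks are pending for the stuck node
$ mysql heat -e "SELECT id, action, status, status_reason
                 FROM software_deployment
                 WHERE server_id = '<nova-uuid-of-overcloud-database-2>';"

If the DatabaseHostPrepDeployment row is there with the expected status, the DB side is fine and the gap is between Heat's metadata push and the node's polling.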
I managed to take /var/log from the undercloud before the whole system got reprovisioned:
http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/var-log.tgz

Maybe it helps a bit?
Ah, found the swift logs in the journal; it's definitely this issue:

Mar 23 08:17:06 undercloud-0 proxy-server: 192.168.24.1 192.168.24.1 23/Mar/2018/12/17/06 PUT /v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3D394269aa972c830c59b731546d34e94d15d49738 HTTP/1.0 201 - python-requests/2.14.2 - 159751 - - txd4639a87568f40ffa6f23-005ab4f042 - 0.0108 - - 1521807426.890296936 1521807426.901119947 0
Mar 23 08:17:07 undercloud-0 proxy-server: 192.168.24.1 192.168.24.1 23/Mar/2018/12/17/07 PUT /v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3D394269aa972c830c59b731546d34e94d15d49738 HTTP/1.0 201 - python-requests/2.14.2 - 139580 - - tx8af2d2746ec54f86a2809-005ab4f043 - 0.0190 - - 1521807427.385188103 1521807427.404186010 0

We see two PUTs on the server metadata, and the second is smaller than the first (139580 bytes vs 159751), i.e. the later write clobbered the fuller metadata. It looks like https://review.openstack.org/#/c/521468/ didn't work.
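The object the node polls can also be fetched directly with the temp URL from the log line (percent-decoded; assumes the default swift proxy endpoint on port 8080) to see which deployments the second, smaller PUT left behind:

$ # list the deployment names in the polled metadata (assumes a top-level
$ # 'deployments' key, as seen in the os-collect-config --print output)
$ curl -s "http://192.168.24.1:8080/v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744?temp_url_expires=2147483586&temp_url_sig=394269aa972c830c59b731546d34e94d15d49738" \
    | python -c "import json, sys; print([d['name'] for d in json.load(sys.stdin).get('deployments', [])])"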
Looking at the code for the Swift object server, it looks like Swift supports If-None-Match on PUT, but not If-Match:
http://git.openstack.org/cgit/openstack/swift/tree/swift/obj/server.py#n738

This is also consistent with the API docs (although I don't put too much weight on the docs):
https://developer.openstack.org/api-ref/object-store/

So it's likely that https://review.openstack.org/#/c/521468/ didn't achieve anything. We could resurrect https://review.openstack.org/#/c/521204/ to be sure.
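A quick sketch to confirm that behaviour against a live proxy; $TOKEN, $STORAGE_URL and the test container/object names are assumptions (e.g. from 'openstack token issue' and the object-store endpoint in the catalog):

$ curl -i -X PUT -H "X-Auth-Token: $TOKEN" "$STORAGE_URL/test-container"
$ # First PUT with If-None-Match: * returns 201, the object is new
$ curl -i -X PUT -H "X-Auth-Token: $TOKEN" -H "If-None-Match: *" \
      "$STORAGE_URL/test-container/test-object" -d 'v1'
$ # Repeating it returns 412, so Swift does honour If-None-Match on PUT
$ curl -i -X PUT -H "X-Auth-Token: $TOKEN" -H "If-None-Match: *" \
      "$STORAGE_URL/test-container/test-object" -d 'v1-again'
$ # A bogus If-Match still returns 201: If-Match is ignored on PUT, which is
$ # why the optimistic-locking attempt in 521468 can't have taken effect
$ curl -i -X PUT -H "X-Auth-Token: $TOKEN" -H "If-Match: deadbeefdeadbeefdeadbeefdeadbeef" \
      "$STORAGE_URL/test-container/test-object" -d 'v2'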
The issue blocks overcloud deployment. It does not happen all the time, but it reproduced downstream on the latest OSP13 puddle 2018-03-29.1.

Raising Severity, adding the Blocker(?) flag.
unable to reproduce with: openstack-heat-engine-10.0.1-0.20180314232330.c2a66b1.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086