Bug 1559889

Summary: [OSP13] HostPrepDeployment not always pushed on all nodes
Product: Red Hat OpenStack Reporter: Michele Baldessari <michele>
Component: openstack-heat Assignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA QA Contact: Ronnie Rasouli <rrasouli>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 13.0 (Queens) CC: aschultz, dpeacock, jschluet, mburns, mcornea, michele, ohochman, rhel-osp-director-maint, sbaker, shardy, srevivo, therve
Target Milestone: beta Keywords: Triaged
Target Release: 13.0 (Queens)   
Hardware: All   
OS: Linux   
Fixed In Version: openstack-heat-10.0.1-0.20180314232330.c2a66b1.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:48:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Michele Baldessari 2018-03-23 13:54:56 UTC
Description of problem:
I have seen this problem only twice so far. The symptom is a hanging deployment; in my case the hanging resource was HostPrepDeployment.
We see that the deployment is stuck on some Host* resource that was scheduled on a node, but if you log on to that node, you'll see nothing happening.

For example, we can see the hanging resources:
| AllNodesDeploySteps                           | b58bee24-76d5-4a53-8d48-3e76b91407b7                                                                                                                                                 | OS::TripleO::PostDeploySteps                                                                                                     | CREATE_IN_PROGRESS | 2018-03-23T12:00:52Z | overcloud                                                                                                                                                |
| DatabaseHostPrepDeployment                    | 92469eaf-98be-4fbd-9d98-4101659cceb6                                                                                                                                                 | OS::Heat::SoftwareDeploymentGroup                                                                                                | CREATE_IN_PROGRESS | 2018-03-23T12:16:50Z | overcloud-AllNodesDeploySteps-sapadflv4r6r                                                                                                               |
| 2                                             | af13b8fb-5926-4167-b5f7-16354864dca6                                                                                                                                                 | OS::Heat::SoftwareDeployment                                                                                                     | CREATE_IN_PROGRESS | 2018-03-23T12:16:54Z | overcloud-AllNodesDeploySteps-sapadflv4r6r-DatabaseHostPrepDeployment-kd4qwvkin4zk

Now, if we hop onto all the nodes to check whether os-collect-config actually received this resource to be run on the node, we get the following:
$ for i in $(nova list |grep ctlplane | awk '{ print $12 }' | cut -f2 -d=); do echo $i; ssh $i "sudo os-collect-config --print |grep -e 'name.*HostPrepDepl'"; done 2> /dev/null
    "name": "CephStorageHostPrepDeployment",
    "name": "ControllerHostPrepDeployment",
    "name": "ControllerHostPrepDeployment",
    "name": "ControllerHostPrepDeployment",
    "name": "DatabaseHostPrepDeployment",
    "name": "DatabaseHostPrepDeployment", <-- BOOOM overcloud-database-2 has not gotten the memo
    "name": "MessagingHostPrepDeployment",
    "name": "MessagingHostPrepDeployment",
    "name": "MessagingHostPrepDeployment",
    "name": "ComputeInstanceHAHostPrepDeployment", 

So we're basically in the following state:
A) Heat on the undercloud waits for DatabaseHostPrepDeployment to be completed on overcloud-database-2
B) overcloud-database-2 simply has no idea it needs to run that task

A redeploy will usually not show this (rather rare) problem.
Undercloud sosreport http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/sosreport-undercloud-0.redhat.local-20180323085448.tar.xz
DB dump: http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/db.dump.gz
All other sosreports here:

Version-Release number of selected component (if applicable):
$ rpm -qa |grep heat

Comment 1 Michele Baldessari 2018-03-23 14:48:04 UTC
Chatted with therve, this is likely https://bugs.launchpad.net/tripleo/+bug/1731032

Comment 2 Michele Baldessari 2018-03-23 15:01:22 UTC
(In reply to Michele Baldessari from comment #1)
> Chatted with therve, this is likely
> https://bugs.launchpad.net/tripleo/+bug/1731032

It seems we already have the patches from the LP bug in this package (i.e. https://review.openstack.org/#/c/521468/), so this is likely something else.

Comment 3 Thomas Hervé 2018-03-23 15:34:52 UTC
I looked at it a little bit, and it really looks like https://bugs.launchpad.net/tripleo/+bug/1731032, which ought to be fixed. The database is correct and contains the deployment metadata, but the node doesn't get it. I don't have the full Heat logs; those would be useful. Having the Swift logs would help too.

Comment 4 Michele Baldessari 2018-03-23 15:44:05 UTC
I managed to take /var/log from the undercloud before the whole system got reprovisioned: http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/var-log.tgz maybe it helps a bit?

Comment 5 Thomas Hervé 2018-03-23 15:46:43 UTC
Ah, I found the Swift logs in the journal; this is definitely the same issue:

Mar 23 08:17:06 undercloud-0 proxy-server: 23/Mar/2018/12/17/06 PUT /v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3D394269aa972c830c59b731546d34e94d15d49738 HTTP/1.0 201 - python-requests/2.14.2 - 159751 - - txd4639a87568f40ffa6f23-005ab4f042 - 0.0108 - - 1521807426.890296936 1521807426.901119947 0

Mar 23 08:17:07 undercloud-0 proxy-server: 23/Mar/2018/12/17/07 PUT /v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3D394269aa972c830c59b731546d34e94d15d49738 HTTP/1.0 201 - python-requests/2.14.2 - 139580 - - tx8af2d2746ec54f86a2809-005ab4f043 - 0.0190 - - 1521807427.385188103 1521807427.404186010 0

We see two PUTs of the server metadata, and the second is smaller than the first (139580 bytes vs 159751).

It looks like https://review.openstack.org/#/c/521468/ didn't work.
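Those two overlapping PUTs are a classic lost-update race: each writer GETs the full metadata object, adds its own deployment locally, and PUTs the whole object back unconditionally, so whichever write lands second silently erases the other writer's deployment. A minimal sketch of that interleaving (hypothetical toy store, not Heat's actual code; the deployment names are just illustrative):

```python
import json

# Toy metadata store: a plain PUT replaces the whole object, last write
# wins -- there is no compare-and-swap, mirroring Swift's PUT behavior.
store = {"deployments": ["ExistingDeployment"]}

def get():
    """Simulate GET: return a deep copy of the full metadata object."""
    return json.loads(json.dumps(store))

def put(obj):
    """Simulate PUT: unconditionally replace the whole object."""
    store.clear()
    store.update(obj)

# Two writers interleave: both read BEFORE either one writes.
a = get()
b = get()
a["deployments"].append("DatabaseHostPrepDeployment")   # writer 1's addition
b["deployments"].append("MessagingHostPrepDeployment")  # writer 2's addition

put(a)  # first PUT (the larger object in the log above)
put(b)  # second PUT overwrites it with stale, smaller data

# DatabaseHostPrepDeployment has been lost: the node never sees it.
print(store["deployments"])  # ['ExistingDeployment', 'MessagingHostPrepDeployment']
```

This matches the symptom exactly: the database says the deployment exists, but the metadata object the node polls no longer contains it.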

Comment 6 Zane Bitter 2018-03-23 16:15:50 UTC
Looking at the code for the Swift object server, it appears that on PUT Swift supports If-None-Match but not If-Match:


This is also consistent with the API docs (although I don't put too much weight on the docs): https://developer.openstack.org/api-ref/object-store/

So it's likely that https://review.openstack.org/#/c/521468/ didn't achieve anything. We could resurrect https://review.openstack.org/#/c/521204/ to be sure.
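To illustrate the distinction: If-None-Match: * makes a PUT create-only, while an honored If-Match would give true compare-and-swap on the ETag, which is what a read-modify-write of shared metadata needs. A toy object server honoring both headers per standard HTTP conditional semantics (a sketch of what the fix relies on, not Swift's actual code):

```python
import hashlib

class ToyObjectServer:
    """Toy store honoring both conditional-PUT headers.

    Per the comment above, Swift's object server honors If-None-Match
    on PUT but ignores If-Match; this shows what honoring both buys.
    """

    def __init__(self):
        self.objects = {}  # name -> body (bytes)

    def etag(self, name):
        body = self.objects.get(name)
        return hashlib.md5(body).hexdigest() if body is not None else None

    def put(self, name, body, if_match=None, if_none_match=None):
        current = self.etag(name)
        if if_none_match == "*" and current is not None:
            return 412  # object already exists: create-only PUT refused
        if if_match is not None and if_match != current:
            return 412  # ETag changed since we read: lost update prevented
        self.objects[name] = body
        return 201

server = ToyObjectServer()
assert server.put("meta", b"v1") == 201
tag = server.etag("meta")          # read, remembering the ETag
server.put("meta", b"v2")          # a concurrent writer updates the object
# Our stale write is rejected instead of clobbering v2:
assert server.put("meta", b"stale", if_match=tag) == 412
```

Since real Swift ignores If-Match on PUT, the 412 in the last step never happens there, and the stale write wins, which is the overwrite seen in the proxy-server log.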

Comment 10 Omri Hochman 2018-04-03 16:35:38 UTC
The issue blocks overcloud deployment. It does not happen every time, but it was reproduced downstream on the latest OSP13 puddle, 2018-03-29.1.
Raising Severity, Adding Blocker(?) flag.

Comment 13 Omri Hochman 2018-04-16 19:01:17 UTC
Unable to reproduce with:

Comment 15 errata-xmlrpc 2018-06-27 13:48:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.