Bug 1559889 - [OSP13] HostPrepDeployment not always pushed on all nodes
Summary: [OSP13] HostPrepDeployment not always pushed on all nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 13.0 (Queens)
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Zane Bitter
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-03-23 13:54 UTC by Michele Baldessari
Modified: 2018-06-27 13:48 UTC (History)

Fixed In Version: openstack-heat-10.0.1-0.20180314232330.c2a66b1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:48:15 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 557311 | 0 | None | MERGED | Resolve race in providing deployment data to Swift | 2021-02-15 14:56:51 UTC
Red Hat Product Errata | RHEA-2018:2086 | 0 | None | None | None | 2018-06-27 13:48:48 UTC

Description Michele Baldessari 2018-03-23 13:54:56 UTC
Description of problem:
I have seen this problem only twice so far. The symptom is a hanging deployment; in my case the hanging resource was the HostPrepDeployment one.
The deployment is stuck on some Host* resource that was scheduled on a node, but if you log in to that node, you'll see nothing happening.

For example, we can see the resources hanging:
| AllNodesDeploySteps | b58bee24-76d5-4a53-8d48-3e76b91407b7 | OS::TripleO::PostDeploySteps | CREATE_IN_PROGRESS | 2018-03-23T12:00:52Z | overcloud |
| DatabaseHostPrepDeployment | 92469eaf-98be-4fbd-9d98-4101659cceb6 | OS::Heat::SoftwareDeploymentGroup | CREATE_IN_PROGRESS | 2018-03-23T12:16:50Z | overcloud-AllNodesDeploySteps-sapadflv4r6r |
| 2 | af13b8fb-5926-4167-b5f7-16354864dca6 | OS::Heat::SoftwareDeployment | CREATE_IN_PROGRESS | 2018-03-23T12:16:54Z | overcloud-AllNodesDeploySteps-sapadflv4r6r-DatabaseHostPrepDeployment-kd4qwvkin4zk |


Now if we hop onto every node to check whether os-collect-config actually received this deployment, we get the following:
$ for i in $(nova list |grep ctlplane | awk '{ print $12 }' | cut -f2 -d=); do echo $i; ssh $i "sudo os-collect-config --print |grep -e 'name.*HostPrepDepl'"; done 2> /dev/null
192.168.24.10
    "name": "CephStorageHostPrepDeployment", 
192.168.24.8
    "name": "ControllerHostPrepDeployment", 
192.168.24.21
    "name": "ControllerHostPrepDeployment", 
192.168.24.12
    "name": "ControllerHostPrepDeployment", 
192.168.24.14
    "name": "DatabaseHostPrepDeployment", 
192.168.24.16
    "name": "DatabaseHostPrepDeployment", 
192.168.24.18 <-- BOOOM overcloud-database-2 has not gotten the memo
192.168.24.6
    "name": "MessagingHostPrepDeployment", 
192.168.24.9
    "name": "MessagingHostPrepDeployment", 
192.168.24.11
    "name": "MessagingHostPrepDeployment", 
192.168.24.19
    "name": "ComputeInstanceHAHostPrepDeployment", 



So we're basically in the following state:
A) Heat on the undercloud waits for DatabaseHostPrepDeployment to be completed on overcloud-database-2
B) overcloud-database-2 simply has no idea it needs to run that task
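
The per-node check above can also be scripted. The following is a minimal sketch, not part of the original report: it assumes you have already captured the text printed by `os-collect-config --print` on each node into a dict keyed by node address (the function name and the input shape are hypothetical), and it reports the nodes whose output names no HostPrepDeployment.

```python
import re

# Matches lines such as:    "name": "DatabaseHostPrepDeployment",
NAME_RE = re.compile(r'"name":\s*"(\w*HostPrepDeployment)"')


def nodes_missing_hostprep(outputs):
    """Return the nodes whose collected config names no HostPrepDeployment.

    `outputs` maps a node address to the text printed by
    `os-collect-config --print` on that node (assumed input shape).
    """
    return sorted(node for node, text in outputs.items()
                  if not NAME_RE.search(text))
```

Applied to the output shown above, only 192.168.24.18 would be reported, since it is the only node with no matching line.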

A redeploy will usually not show this (rather rare) problem.
Undercloud sosreport http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/sosreport-undercloud-0.redhat.local-20180323085448.tar.xz
DB dump: http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/db.dump.gz
All other sosreports here:
http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/

Version-Release number of selected component (if applicable):
$ rpm -qa |grep heat
openstack-heat-api-cfn-10.0.1-0.20180302152334.c3bd928.el7ost.noarch
python-heat-agent-1.5.4-0.20180301153730.ecf43c7.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180304031148.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-10.0.1-0.20180302152334.c3bd928.el7ost.noarch
puppet-heat-12.3.1-0.20180221104603.27feed4.el7ost.noarch
python2-heatclient-1.14.0-0.20180213175737.2ce6aa1.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180302152334.c3bd928.el7ost.noarch
openstack-heat-common-10.0.1-0.20180302152334.c3bd928.el7ost.noarch

Comment 1 Michele Baldessari 2018-03-23 14:48:04 UTC
Chatted with therve, this is likely https://bugs.launchpad.net/tripleo/+bug/1731032

Comment 2 Michele Baldessari 2018-03-23 15:01:22 UTC
(In reply to Michele Baldessari from comment #1)
> Chatted with therve, this is likely
> https://bugs.launchpad.net/tripleo/+bug/1731032

It seems we already have the patch from the LP bug in this package (https://review.openstack.org/#/c/521468/), so this is likely something else.

Comment 3 Thomas Hervé 2018-03-23 15:34:52 UTC
I looked at it a little bit, and it really looks like https://bugs.launchpad.net/tripleo/+bug/1731032, which ought to be fixed. The database is correct and contains the deployments metadata, but the node doesn't get it. I don't have the full Heat logs, which would be useful. Having Swift logs would help too.

Comment 4 Michele Baldessari 2018-03-23 15:44:05 UTC
I managed to grab /var/log from the undercloud before the whole system got reprovisioned: http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/var-log.tgz. Maybe it helps a bit?

Comment 5 Thomas Hervé 2018-03-23 15:46:43 UTC
Ah found swift logs in journal, definitely this issue:

Mar 23 08:17:06 undercloud-0 proxy-server: 192.168.24.1 192.168.24.1 23/Mar/2018/12/17/06 PUT /v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3D394269aa972c830c59b731546d34e94d15d49738 HTTP/1.0 201 - python-requests/2.14.2 - 159751 - - txd4639a87568f40ffa6f23-005ab4f042 - 0.0108 - - 1521807426.890296936 1521807426.901119947 0

Mar 23 08:17:07 undercloud-0 proxy-server: 192.168.24.1 192.168.24.1 23/Mar/2018/12/17/07 PUT /v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3D394269aa972c830c59b731546d34e94d15d49738 HTTP/1.0 201 - python-requests/2.14.2 - 139580 - - tx8af2d2746ec54f86a2809-005ab4f043 - 0.0190 - - 1521807427.385188103 1521807427.404186010 0

We see 2 PUTs on the server metadata, and the 2nd is smaller than the first (139580 vs 159751).

It looks like https://review.openstack.org/#/c/521468/ didn't work.
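
To look for the same pattern in other proxy-server logs, something like the sketch below may help. It is an illustration only: the regular expression assumes the field order seen in the two lines above (method, path, protocol, status, three further fields, then the number of bytes received), and `shrinking_puts` is a hypothetical helper name. A smaller later PUT is only a hint that stale metadata may have overwritten newer metadata, not proof.

```python
import re

# Assumes the layout of the proxy-server lines quoted above:
# ... PUT <path> HTTP/1.0 <status> <referer> <user-agent> <token> <bytes received> ...
PUT_RE = re.compile(
    r' PUT (?P<path>\S+) HTTP/\d\.\d (?P<status>\d{3}) '
    r'\S+ \S+ \S+ (?P<received>\d+) '
)


def shrinking_puts(log_lines):
    """Return (path, previous_size, new_size) for each successful PUT that
    is smaller than the previous successful PUT to the same path."""
    last_size = {}
    findings = []
    for line in log_lines:
        match = PUT_RE.search(line)
        if match is None or not match.group('status').startswith('2'):
            continue  # not a PUT line in the assumed layout, or not a 2xx
        path = match.group('path')
        size = int(match.group('received'))
        previous = last_size.get(path)
        if previous is not None and size < previous:
            findings.append((path, previous, size))
        last_size[path] = size
    return findings
```

Fed the two lines above in order, it would report one finding for the shared object path, with sizes 159751 and 139580.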

Comment 6 Zane Bitter 2018-03-23 16:15:50 UTC
Looking at the code for the Swift object server, it looks like on PUT Swift supports If-None-Match, but not If-Match:

http://git.openstack.org/cgit/openstack/swift/tree/swift/obj/server.py#n738

This is also consistent with the API docs (although I don't put too much weight on the docs): https://developer.openstack.org/api-ref/object-store/

So it's likely that https://review.openstack.org/#/c/521468/ didn't achieve anything. We could resurrect https://review.openstack.org/#/c/521204/ to be sure to be sure.
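
To illustrate why an ignored If-Match leaves the race open, here is a toy in-memory model. It is not Swift code and makes no claim about Swift's internals: `ToyObjectStore`, its method names, and the object name are all hypothetical, SHA-256 stands in for whatever ETag a real store would compute, and the interleaving of the two writers is an assumption chosen to match the two PUTs seen in the logs. The store honours `If-None-Match: *` on PUT and ignores If-Match, as described above.

```python
import hashlib


class ToyObjectStore:
    """In-memory toy, NOT Swift: honours If-None-Match: * on PUT and
    ignores If-Match, mirroring the behaviour described in this comment."""

    def __init__(self):
        self._objects = {}

    def etag(self, name):
        body = self._objects.get(name)
        return None if body is None else hashlib.sha256(body).hexdigest()

    def put(self, name, body, if_none_match=None, if_match=None):
        if if_none_match == '*' and name in self._objects:
            return 412  # Precondition Failed: the object already exists
        # if_match is deliberately not checked. A store that supported it
        # would return 412 when the stored ETag differs from if_match.
        self._objects[name] = body
        return 201

    def get(self, name):
        return self._objects[name]


def demo_lost_update():
    store = ToyObjectStore()
    store.put('server-metadata', b'{"deployments": []}')
    seen = store.etag('server-metadata')  # both writers read this version

    # Writer A publishes metadata that includes the new deployment.
    a = store.put('server-metadata',
                  b'{"deployments": ["HostPrep"]}', if_match=seen)
    # Writer B, still holding the old view, writes stale (smaller) metadata.
    b = store.put('server-metadata',
                  b'{"deployments": []}', if_match=seen)
    return a, b, store.get('server-metadata')
```

In this model both writes return 201 and the stale body wins, so a reader polling the object never sees the new deployment. `If-None-Match: *` does not help here because it only guards against overwriting an object that already exists.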

Comment 10 Omri Hochman 2018-04-03 16:35:38 UTC
The issue blocks overcloud deployment. It does not happen every time, but it reproduced downstream on the latest OSP13 puddle, 2018-03-29.1.

Raising severity and adding the Blocker(?) flag.

Comment 13 Omri Hochman 2018-04-16 19:01:17 UTC
Unable to reproduce with:
openstack-heat-engine-10.0.1-0.20180314232330.c2a66b1.el7ost.noarch

Comment 15 errata-xmlrpc 2018-06-27 13:48:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

