Bug 1236167 - CephStorageNodesPostDeployment fails with "Deployment exited with non-zero status code: 6"
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ga
: Director
Assignee: Jay Dobies
QA Contact: Amit Ugol
URL:
Whiteboard:
Keywords: Triaged
Duplicates: 1236969
Depends On:
Blocks: 1191185 1243520
Reported: 2015-06-26 18:03 UTC by Dan Sneddon
Modified: 2015-09-09 13:54 UTC
CC List: 13 users
Clone Of:
Last Closed: 2015-08-05 13:57:00 UTC


Attachments


External Trackers
Tracker ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHEA-2015:1549 | normal | SHIPPED_LIVE | Red Hat Enterprise Linux OpenStack Platform director Release | 2015-08-05 17:49:10 UTC
Red Hat Bugzilla 1235908 | None | None | None | Never

Description Dan Sneddon 2015-06-26 18:03:56 UTC
Description of problem:
When installing the latest poodle today, I ran into a situation where Ceph would not finish post-deployment. This happened on multiple runs, and Mike Orazi ran into the same thing last night.

Version-Release number of selected component (if applicable):
2015-06-26-poodle

How reproducible:
Unknown, but happening to multiple testers

Steps to Reproduce:
1. Deploy overcloud with "openstack overcloud deploy --plan-uuid <UUID>"
2.
3.

Actual results:
CREATE_FAILED due to CephStorageNodesPostDeployment error:
ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6""
resource_type: OS::TripleO::CephStoragePostDeployment

Expected results:
The overcloud should deploy.

Additional info:
This was tested on a fresh instack VM using the poodle from mid-day today.

Comment 3 Dan Sneddon 2015-06-26 19:05:26 UTC
Note that I just tried doing a deployment with --ceph-storage-scale 0 and it still bombed out on CephStorageNodesPostDeployment.

Comment 4 Dan Sneddon 2015-06-27 21:38:11 UTC
I confirmed that this is also happening on bare metal, at least with network isolation enabled. Here is the error from /var/log/messages on the Ceph node:

Jun 27 17:10:29 localhost os-collect-config: -prepare-/srv/data]/returns: + test -b /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + mkdir -p /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + ceph-disk prepare /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: executed successfully
Notice: Finished catalog run in 302.67 seconds
"deploy_stderr": Error: Command exceeded timeout
Wrapped exception: execution expired
Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout
"deploy_status_code": 6
Jun 27 17:10:29 localhost os-collect-config: [2015-06-27 17:10:29,895] (heat-config) [INFO] Error: Command exceeded timeout
Jun 27 17:10:29 localhost os-collect-config: Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout
Jun 27 17:10:29 localhost os-collect-config: [2015-06-27 17:10:29,895] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-puppet/aab751ec-95dc-40e0-ae22-db4d452084b3.pp. [6]

Comment 5 Mike Burns 2015-06-30 10:43:42 UTC
*** Bug 1236969 has been marked as a duplicate of this bug. ***

Comment 6 Dan Sneddon 2015-06-30 23:46:16 UTC
Just did a bare metal deployment with keystone's auth token timeout increased to 7200 seconds. It got to CREATE_COMPLETE and the errors with Ceph were not seen.

I think that means that when this patch lands we are golden:
https://code.engineering.redhat.com/gerrit/#/c/51898/2

The other bug to track along with this (the patch should fix both) is https://bugzilla.redhat.com/show_bug.cgi?id=1235908
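
The token timeout mentioned above is a keystone setting on the undercloud. A minimal sketch of the change, assuming the standard config path and the usual 3600-second default (neither is confirmed anywhere in this bug):

```ini
# /etc/keystone/keystone.conf on the undercloud -- sketch only;
# the path and the 3600s default are assumptions, not taken from this bug
[token]
# raise the token lifetime so long overcloud deployments do not outlive it
expiration = 7200
```

Keystone would need a restart after the edit for the new lifetime to take effect.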

Comment 8 Steven Hardy 2015-07-23 09:15:32 UTC
I'm not sure the "Command exceeded timeout" is actually related to a token (or heat) timeout. It looks to me more like the command on the box itself is timing out, e.g. due to either a puppet or ceph timeout.

For example see this upstream bug related to driving ceph-deploy via puppet:

https://bugs.launchpad.net/fuel/+bug/1304268

It exhibits the same symptoms, so it may be that the command failure is unrelated to the heat/token timeouts.

Comment 9 Amit Ugol 2015-07-23 11:46:59 UTC
This exact bug still happens when I deploy without tuskar.

http://pastebin.test.redhat.com/299709

Comment 10 Udi 2015-07-23 11:49:33 UTC
This happened to me also when I wasn't trying to use network isolation.

Comment 11 Giulio Fidente 2015-07-23 14:51:36 UTC
I was unable to reproduce. Can someone who did check NTP on the nodes (controllers and cephstorage) and attach the output of ceph -s?
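
A quick sketch of the NTP half of this check (the helper name is ours, not a standard tool): `ntpq -p` marks the peer the daemon is actually synchronized to with a leading `*`, so the absence of such a line means the clock is not synced. Cluster health and clock-skew warnings would come from `ceph -s` as requested above.

```shell
# Report whether ntpq-style peer output shows a selected sync peer.
# On a node you would run:  ntpq -pn | ntp_synced
ntp_synced() {
    # ntpq prefixes the currently selected peer with '*'
    if grep -q '^\*'; then echo "synchronized"; else echo "NOT synchronized"; fi
}

# Demonstration on canned ntpq-style output:
printf '*10.16.255.2 .GPS. 1 u 33 64 377 0.5 0.01 0.02\n' | ntp_synced
```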

Comment 14 Amit Ugol 2015-07-27 14:34:27 UTC
The deployment works well now. Note that I tested with only a few nodes and have no idea what happens when deploying a large number of nodes. The specific error does not reproduce, though, so this specific issue is verified from my POV.

Comment 16 errata-xmlrpc 2015-08-05 13:57:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549

Comment 17 jliberma@redhat.com 2015-08-07 17:42:50 UTC
You can see this error when deploying ceph with tuskar, or when deploying with templates but omitting the -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml parameter.

Error in undercloud heat-engine.log:
2015-08-07 01:01:09.691 17016 INFO heat.engine.stack [-] Stack CREATE FAILED (overcloud): Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6"

Correct deployment command syntax:
openstack overcloud deploy \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/network-environment.yaml \
  --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph \
  --ntp-server 10.16.255.2 \
  --control-scale 3 --compute-scale 4 --ceph-storage-scale 4 \
  --block-storage-scale 0 --swift-storage-scale 0 \
  -t 90 --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml

Comment 18 jliberma@redhat.com 2015-08-09 05:55:08 UTC
You can also get this error when using hiera to customize the Ceph OSD disks, if the existing disks are either tagged for LVM or carry non-GPT disk labels.
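
One way to spot the non-GPT case before deploying (a sketch of ours, not part of the director tooling; device names are assumptions): a GPT label begins with the ASCII signature "EFI PART" at byte offset 512, so a disk lacking it would need relabeling (e.g. with sgdisk) before ceph-disk can prepare it.

```shell
# Hypothetical helper: report whether a block device (or image file)
# already carries a GPT label, since ceph-disk expects GPT.
check_gpt() {
    # a GPT header starts with the ASCII signature "EFI PART" at byte 512
    sig=$(dd if="$1" bs=1 skip=512 count=8 2>/dev/null)
    if [ "$sig" = "EFI PART" ]; then echo "GPT"; else echo "not GPT"; fi
}

# Demonstration on a scratch image; on a real node you would pass the
# OSD device instead, e.g.  check_gpt /dev/sdb  (device name assumed)
img=$(mktemp)
truncate -s 4096 "$img"
check_gpt "$img"    # a blank image carries no GPT label
```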

