Description of problem:
When installing the latest poodle today, I ran into a situation where Ceph would not finish post-deployment. This happened on multiple runs, and Mike Orazi ran into the same thing last night.

Version-Release number of selected component (if applicable):
2015-06-26-poodle

How reproducible:
Unknown, but happening to multiple testers

Steps to Reproduce:
1. Deploy overcloud with "openstack overcloud deploy --plan-uuid <UUID>"

Actual results:
CREATE_FAILED due to a CephStorageNodesPostDeployment error:

ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6""

resource_type: OS::TripleO::CephStoragePostDeployment

Expected results:
The overcloud should deploy.

Additional info:
This was tested on a fresh instack VM using the poodle from mid-day today.
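For anyone hitting the same CREATE_FAILED, the following commands can help drill down from the nested stack error to the actual deployment output. This is a sketch of a typical debugging session on the undercloud, assuming the stack is named "overcloud"; the deployment id placeholder must be taken from your own failed resource.

```shell
# List all nested resources and filter for failures to find the
# failing software deployment under CephStorageNodesPostDeployment:
heat resource-list --nested-depth 5 overcloud | grep -i failed

# Show the failing deployment's stdout/stderr and status code
# (replace <deployment-id> with the id from the failed resource):
heat deployment-show <deployment-id>

# On the Ceph node itself, the same output lands in the
# os-collect-config journal:
sudo journalctl -u os-collect-config | tail -n 50
```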
Note that I just tried doing a deployment with --ceph-storage-scale 0 and it still bombed out on CephStorageNodesPostDeployment.
I confirmed that this is also happening on bare metal, at least with network isolation enabled. Here is the error from /var/log/messages on the Ceph node (ANSI escape sequences stripped):

Jun 27 17:10:29 localhost os-collect-config: -prepare-/srv/data]/returns: + test -b /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + mkdir -p /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + ceph-disk prepare /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: executed successfully
Notice: Finished catalog run in 302.67 seconds
", "deploy_stderr": "Error: Command exceeded timeout
Wrapped exception:
execution expired
Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout
", "deploy_status_code": 6}
Jun 27 17:10:29 localhost os-collect-config: [2015-06-27 17:10:29,895] (heat-config) [INFO] Error: Command exceeded timeout
Jun 27 17:10:29 localhost os-collect-config: Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout
Jun 27 17:10:29 localhost os-collect-config: [2015-06-27 17:10:29,895] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-puppet/aab751ec-95dc-40e0-ae22-db4d452084b3.pp. [6]
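The failing step above is ceph-disk activate inside a Puppet Exec, and the catalog run finishing at ~302 seconds lines up with Puppet's default 300-second Exec timeout. As a sketch for narrowing this down (not a fix), one can run the activate step by hand on the Ceph node; the /srv/data path is the directory-backed OSD from the log above.

```shell
# Run the step that timed out directly; if it hangs, the problem is
# on the Ceph side rather than in heat:
sudo ceph-disk activate /srv/data

# If it blocks, check whether the node can reach the monitors at all;
# --connect-timeout bounds the wait so the check itself doesn't hang:
sudo ceph -s --connect-timeout 10
```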
*** Bug 1236969 has been marked as a duplicate of this bug. ***
Just did a bare metal deployment with keystone's auth token timeout increased to 7200 seconds. It got to CREATE_COMPLETE and the Ceph errors were not seen. I think that means we are golden once this patch lands: https://code.engineering.redhat.com/gerrit/#/c/51898/2

The other bug to track along with this (the patch should fix both) is https://bugzilla.redhat.com/show_bug.cgi?id=1235908
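For reference, the interim workaround described above can be applied on the undercloud roughly as follows. This is a sketch assuming an RDO-packaged undercloud; openstack-config is the crudini-based helper from openstack-utils, and the config file path may differ on other setups.

```shell
# Raise keystone's token lifetime to 7200s so long-running
# post-deployment steps don't outlive their auth token:
sudo openstack-config --set /etc/keystone/keystone.conf token expiration 7200

# Restart keystone so the new lifetime takes effect:
sudo systemctl restart openstack-keystone
```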
I'm not sure the "Command exceeded timeout" is actually related to a token (or heat) timeout. It looks to me more like the command on the box itself is timing out, e.g. due to either a puppet or ceph timeout. For example, see this upstream bug related to driving ceph-deploy via puppet: https://bugs.launchpad.net/fuel/+bug/1304268 It exhibits the same symptoms, so it may be that the command failure is unrelated to the heat/token timeouts.
This exact bug still happens when I deploy without tuskar: http://pastebin.test.redhat.com/299709
This happened to me also when I wasn't trying to use network isolation.
I was unable to reproduce. Can someone who did reproduce it check NTP on the nodes (controllers and cephstorage) and attach the output of ceph -s?
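To gather the information requested above, something like the following on each controller and Ceph node should do; clock skew between monitors is worth ruling out since ceph reports it explicitly in its health status.

```shell
# Check NTP sync status; an asterisk in the peers list marks the
# currently selected (synced) peer:
ntpq -p
# (or, for a one-line summary: ntpstat)

# Cluster status; look for "clock skew detected" or stuck/unfound
# OSDs in the health section:
sudo ceph -s
```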
Should be fixed by https://github.com/rdo-management/python-rdomanager-oscplugin/commit/ae39af33200b171be4dbac72ee2b91ad83e85abd
The deployment works well now. Note that I tested with only a few nodes, and have no idea what happens if we're trying to deploy a large number of nodes. The specific error does not reproduce though, so this specific issue is verified from my POV.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1549
You can see this error when deploying Ceph with tuskar, or when deploying with templates but missing the -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml parameter.

Error in undercloud heat-engine.log:

2015-08-07 01:01:09.691 17016 INFO heat.engine.stack [-] Stack CREATE FAILED (overcloud): Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6"

Correct deployment command syntax:

openstack overcloud deploy -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph --ntp-server 10.16.255.2 --control-scale 3 --compute-scale 4 --ceph-storage-scale 4 --block-storage-scale 0 --swift-storage-scale 0 -t 90 --templates -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml
You can also get this error if using hiera to customize Ceph OSD disks, and the existing disks are either tagged for LVM or have non-GPT disk labels.
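In that case, wiping the stale labels before deployment lets ceph-disk create a fresh GPT label. A sketch of the cleanup, where /dev/sdb is a hypothetical example device; both commands DESTROY all data on the disk, so double-check the device name first.

```shell
# Remove all filesystem/LVM/RAID signatures from the disk:
sudo wipefs --all /dev/sdb

# Destroy any existing GPT and MBR partition tables:
sudo sgdisk --zap-all /dev/sdb
```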