Description of problem:
When installing the latest poodle today, I ran into a situation where Ceph would not finish post-deployment. This happened on multiple runs, and Mike Orazi ran into the same thing last night.

Version-Release number of selected component (if applicable):
2015-06-26-poodle

How reproducible:
Unknown, but happening to multiple testers

Steps to Reproduce:
1. Deploy overcloud with "openstack overcloud deploy --plan-uuid <UUID>"

Actual results:
CREATE_FAILED due to a CephStorageNodesPostDeployment error:

ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6""

resource_type: OS::TripleO::CephStoragePostDeployment

Expected results:
The overcloud should deploy.

Additional info:
This was tested on a fresh instack VM using the poodle from mid-day today.
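For anyone hitting the same CREATE_FAILED, the following commands can help drill down from the nested stack error to the actual deployment output. This is a sketch of a typical debugging session on the undercloud, assuming the stack is named "overcloud"; the deployment id placeholder must be taken from your own failed resource.

```shell
# List all nested resources and filter for failures to find the
# failing software deployment under CephStorageNodesPostDeployment:
heat resource-list --nested-depth 5 overcloud | grep -i failed

# Show the failing deployment's stdout/stderr and status code
# (replace <deployment-id> with the id from the failed resource):
heat deployment-show <deployment-id>

# On the Ceph node itself, the same output lands in the
# os-collect-config journal:
sudo journalctl -u os-collect-config | tail -n 50
```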
Note that I just tried doing a deployment with --ceph-storage-scale 0 and it still bombed out on CephStorageNodesPostDeployment.
I confirmed that this is also happening on bare metal, at least with network isolation enabled. Here is the error from /var/log/messages on the Ceph node (ANSI escape sequences stripped):

Jun 27 17:10:29 localhost os-collect-config: -prepare-/srv/data]/returns: + test -b /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + mkdir -p /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + ceph-disk prepare /srv/data
Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: executed successfully
Notice: Finished catalog run in 302.67 seconds
", "deploy_stderr": "Error: Command exceeded timeout
Wrapped exception:
execution expired
Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout
", "deploy_status_code": 6}
Jun 27 17:10:29 localhost os-collect-config: [2015-06-27 17:10:29,895] (heat-config) [INFO] Error: Command exceeded timeout
Jun 27 17:10:29 localhost os-collect-config: Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout
Jun 27 17:10:29 localhost os-collect-config: [2015-06-27 17:10:29,895] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-puppet/aab751ec-95dc-40e0-ae22-db4d452084b3.pp. [6]
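The failing step above is ceph-disk activate inside a Puppet Exec, and the catalog run finishing at ~302 seconds lines up with Puppet's default 300-second Exec timeout. As a sketch for narrowing this down (not a fix), one can run the activate step by hand on the Ceph node; the /srv/data path is the directory-backed OSD from the log above.

```shell
# Run the step that timed out directly; if it hangs, the problem is
# on the Ceph side rather than in heat:
sudo ceph-disk activate /srv/data

# If it blocks, check whether the node can reach the monitors at all;
# --connect-timeout bounds the wait so the check itself doesn't hang:
sudo ceph -s --connect-timeout 10
```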
*** Bug 1236969 has been marked as a duplicate of this bug. ***
Just did a bare metal deployment with keystone's auth token timeout increased to 7200 seconds. It got to CREATE_COMPLETE and the Ceph errors were not seen. I think that means we are golden once this patch lands: https://code.engineering.redhat.com/gerrit/#/c/51898/2

The other bug to track along with this (the patch should fix both) is https://bugzilla.redhat.com/show_bug.cgi?id=1235908
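For reference, the interim workaround described above can be applied on the undercloud roughly as follows. This is a sketch assuming an RDO-packaged undercloud; openstack-config is the crudini-based helper from openstack-utils, and the config file path may differ on other setups.

```shell
# Raise keystone's token lifetime to 7200s so long-running
# post-deployment steps don't outlive their auth token:
sudo openstack-config --set /etc/keystone/keystone.conf token expiration 7200

# Restart keystone so the new lifetime takes effect:
sudo systemctl restart openstack-keystone
```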
I'm not sure the "Command exceeded timeout" is actually related to a token (or heat) timeout. It looks to me more like the command on the box itself is timing out, e.g. due to either a puppet or ceph timeout. For example, see this upstream bug related to driving ceph-deploy via puppet: https://bugs.launchpad.net/fuel/+bug/1304268 It exhibits the same symptoms, so it may be that the command failure is unrelated to the heat/token timeouts.
This exact bug still happens when I deploy without tuskar: http://pastebin.test.redhat.com/299709
This happened to me also when I wasn't trying to use network isolation.
I was unable to reproduce. Can someone who did reproduce it check NTP on the nodes (controllers and cephstorage) and attach the output of ceph -s?
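To gather the information requested above, something like the following on each controller and Ceph node should do; clock skew between monitors is worth ruling out since ceph reports it explicitly in its health status.

```shell
# Check NTP sync status; an asterisk in the peers list marks the
# currently selected (synced) peer:
ntpq -p
# (or, for a one-line summary: ntpstat)

# Cluster status; look for "clock skew detected" or stuck/unfound
# OSDs in the health section:
sudo ceph -s
```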
Should be fixed by https://github.com/rdo-management/python-rdomanager-oscplugin/commit/ae39af33200b171be4dbac72ee2b91ad83e85abd
The deployment works well now. Note that I tested with only a few nodes, and have no idea what happens if we're trying to deploy a large number of nodes. The specific error does not reproduce though, so this specific issue is verified from my POV.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1549
You can see this error when deploying Ceph with tuskar, or when deploying with templates but missing the -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml parameter.

Error in undercloud heat-engine.log:

2015-08-07 01:01:09.691 17016 INFO heat.engine.stack [-] Stack CREATE FAILED (overcloud): Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6"

Correct deployment command syntax:

openstack overcloud deploy -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/network-environment.yaml --control-flavor control --compute-flavor compute --ceph-storage-flavor ceph --ntp-server 10.16.255.2 --control-scale 3 --compute-scale 4 --ceph-storage-scale 4 --block-storage-scale 0 --swift-storage-scale 0 -t 90 --templates -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml
You can also get this error if using hiera to customize Ceph OSD disks, and the existing disks are either tagged for LVM or have non-GPT disk labels.
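In that case, wiping the stale labels before deployment lets ceph-disk create a fresh GPT label. A sketch of the cleanup, where /dev/sdb is a hypothetical example device; both commands DESTROY all data on the disk, so double-check the device name first.

```shell
# Remove all filesystem/LVM/RAID signatures from the disk:
sudo wipefs --all /dev/sdb

# Destroy any existing GPT and MBR partition tables:
sudo sgdisk --zap-all /dev/sdb
```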