Bug 1250654

Summary: rhel-osp-director: overcloud deployment fails on "CephStorageDeployment_Step1", Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout.
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: rhosp-director
Assignee: Jiri Stransky <jstransk>
Status: CLOSED DUPLICATE
QA Contact: yeylon <yeylon>
Severity: high
Docs Contact:
Priority: high
Version: unspecified
CC: djuran, jdonohue, jstransk, mburns, morazi, rhel-osp-director-maint, rnishtal, sasha, srevivo
Target Milestone: y2
Keywords: Reopened, ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-04 17:20:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1245737
Bug Blocks: 1191185, 1243520
Attachments:
Description                                                  Flags
heat-engine from the undercloud                              none
messages file from ceph and heat logs from the undercloud.   none

Description Alexander Chuzhoy 2015-08-05 16:53:37 UTC
rhel-osp-director: overcloud deployment fails on "CephStorageDeployment_Step1", Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout.


Environment:
ceph-osd-0.94.1-13.el7cp.x86_64
ceph-0.94.1-13.el7cp.x86_64
ceph-common-0.94.1-13.el7cp.x86_64
ceph-mon-0.94.1-13.el7cp.x86_64
instack-undercloud-2.1.2-22.el7ost.noarch

Steps to reproduce:
1. Deploy the undercloud.
2. Attempt to deploy the overcloud with 1 controller, 1 compute and 1 ceph storage.

Result:
The deployment fails.
+--------------------------------+--------------------------------------+----------------------------------------+-----------------+----------------------+--------------------------------+
| resource_name                  | physical_resource_id                 | resource_type                          | resource_status | updated_time         | parent_resource                |
+--------------------------------+--------------------------------------+----------------------------------------+-----------------+----------------------+--------------------------------+
| CephStorageNodesPostDeployment | ee3b8caa-6a5c-48e0-a3f6-5849bdeddb52 | OS::TripleO::CephStoragePostDeployment | CREATE_FAILED   | 2015-08-05T15:34:18Z |                                |
| CephStorageDeployment_Step1    | af269a91-6213-4f4b-9a83-694155b1d84b | OS::Heat::StructuredDeployments        | CREATE_FAILED   | 2015-08-05T15:59:37Z | CephStorageNodesPostDeployment |
| 0                              | d0114bb3-5bd9-487a-bb2d-67b3c4cc7336 | OS::Heat::StructuredDeployment         | CREATE_FAILED   | 2015-08-05T15:59:38Z | CephStorageDeployment_Step1    |
+--------------------------------+--------------------------------------+----------------------------------------+-----------------+----------------------+--------------------------------+




[root@overcloud-cephstorage-0 ~]# journalctl -u os-collect-config|grep -i error
Aug 05 12:05:09 overcloud-cephstorage-0.localdomain os-collect-config[4840]: d 'refresh' from 1 events\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph/Ceph_config[global/osd_pool_default_size]/ensure: created\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Keys/Ceph::Key[client.admin]/Exec[ceph-key-client.admin]/returns: + ceph-authtool /etc/ceph/ceph.client.admin.keyring --name client.admin --add-key AQDzLMJVAAAAABAAYgFxSJn0uFTEqet5IACsLw== --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Keys/Ceph::Key[client.admin]/Exec[ceph-key-client.admin]/returns: added entity client.admin auth auth(auid = 18446744073709551615 key=AQDzLMJVAAAAABAAYgFxSJn0uFTEqet5IACsLw== with 0 caps)\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Keys/Ceph::Key[client.admin]/Exec[ceph-key-client.admin]/returns: executed successfully\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph/Ceph_config[global/osd_pool_default_pg_num]/ensure: created\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph/Ceph_config[global/public_network]/ensure: created\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + test -b /srv/data\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + mkdir -p /srv/data\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: + ceph-disk prepare /srv/data\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-prepare-/srv/data]/returns: executed successfully\u001b[0m\n\u001b[mNotice: Finished catalog run in 303.12 seconds\u001b[0m\n", "deploy_stderr": "\u001b[1;31mError: Command exceeded timeout\nWrapped exception:\nexecution expired\u001b[0m\n\u001b[1;31mError: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout\u001b[0m\n", "deploy_status_code": 6}
Aug 05 12:05:09 overcloud-cephstorage-0.localdomain os-collect-config[4840]: [2015-08-05 12:05:09,453] (heat-config) [INFO] Error: Command exceeded timeout
Aug 05 12:05:09 overcloud-cephstorage-0.localdomain os-collect-config[4840]: Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/srv/data]/Exec[ceph-osd-activate-/srv/data]/returns: change from notrun to 0 failed: Command exceeded timeout
Aug 05 12:05:09 overcloud-cephstorage-0.localdomain os-collect-config[4840]: [2015-08-05 12:05:09,453] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-puppet/8295d161-635a-4edb-8d4a-ede3e76b073c.pp. [6]
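
The journal output above is hard to read because the Puppet messages are embedded in a JSON string, so the ANSI color codes appear as literal \u001b[...m sequences and line breaks appear as literal \n. A minimal sketch for cleaning this up, assuming GNU sed; clean_puppet_log is a hypothetical helper name, not part of os-collect-config or heat-config:

```shell
#!/bin/sh
# Hypothetical helper (not shipped with any tool): makes JSON-escaped
# Puppet output readable when it is piped in on stdin.
clean_puppet_log() {
    # 1st expression: drop escaped ANSI color codes such as \u001b[0m
    #                 or \u001b[1;31m.
    # 2nd expression: turn each literal "\n" into a real line break
    #                 (newline in the replacement is a GNU sed feature).
    sed -e 's/\\u001b\[[0-9;]*m//g' -e 's/\\n/\n/g'
}

# Usage on the Ceph storage node (commented out so that sourcing this
# file has no side effects):
# journalctl -u os-collect-config | grep -i error | clean_puppet_log
```

With the escapes removed, the "Error: Command exceeded timeout" lines from deploy_stderr appear on their own lines.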


Expected result:
The deployment shouldn't fail on ceph storage.

Comment 3 Alexander Chuzhoy 2015-08-05 17:02:30 UTC
Created attachment 1059571 [details]
heat-engine from the undercloud

Comment 4 Alexander Chuzhoy 2015-08-05 17:05:50 UTC
Created attachment 1059572 [details]
messages file from ceph and heat logs from the undercloud.

Comment 5 Mike Burns 2015-08-05 21:17:50 UTC
I think this is related to the CLI/template changes that jistr put in for bug 1247585.

Comment 6 Jiri Stransky 2015-08-06 08:20:58 UTC
@mburns yeah it could be.

@sasha what was the command line you used to deploy? Please try passing the environment file as described here:

https://bugzilla.redhat.com/show_bug.cgi?id=1247585#c6

Comment 7 Alexander Chuzhoy 2015-08-06 13:06:49 UTC
Here's the command I use (same as on the last puddle):
openstack overcloud deploy --plan overcloud --control-scale 1  --compute-scale 1  --ceph-storage-scale 1 --block-storage-scale 0 --swift-storage-scale 0 -e /home/stack/network-environment.yaml --ntp-server [IP] --timeout 90

No yaml file for cinder.

Comment 8 Mike Burns 2015-08-06 13:11:35 UTC
@sasha -- can you try passing -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml as well and see if that works?

Comment 9 Alexander Chuzhoy 2015-08-06 13:22:38 UTC
Environment: openstack-tripleo-heat-templates-0.8.6-45.el7ost.noarch

The file /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml doesn't exist.

Comment 10 Mike Burns 2015-08-06 21:51:07 UTC
(In reply to Alexander Chuzhoy from comment #9)
> Environment: openstack-tripleo-heat-templates-0.8.6-45.el7ost.noarch
> 
> The file /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml doesn't exist.

Note:  this was resolved in a conversation.  The fix requires 0.8.6-46, not -45.

Comment 11 Alexander Chuzhoy 2015-08-07 16:15:08 UTC
Was able to deploy the overcloud using this command:
openstack overcloud deploy --templates --control-scale 1  --compute-scale 1  --ceph-storage-scale 1 --block-storage-scale 0 --swift-storage-scale 0 -e /home/stack/network-environment.yaml --ntp-server [IP] --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml


Using this THT build:
openstack-tripleo-heat-templates-0.8.6-46.el7ost.noarch
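
Since the environment file is missing in the -45 build (comment 9) and present in -46, it may be worth checking for it before starting a deployment. A minimal sketch, assuming the path from comment 8; require_env_file is a hypothetical helper, not part of the openstack CLI:

```shell
#!/bin/sh
# Sketch only: require_env_file is a hypothetical helper, not part of the
# openstack CLI or of openstack-tripleo-heat-templates.
STORAGE_ENV=/usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml

require_env_file() {
    # Succeed only if the given environment file exists and is readable.
    if [ -r "$1" ]; then
        return 0
    fi
    echo "Missing environment file: $1" >&2
    return 1
}

# Usage (commented out so that sourcing this file has no side effects);
# the remaining deploy options are the ones shown in comment 11:
# require_env_file "$STORAGE_ENV" || exit 1
# openstack overcloud deploy --templates -e "$STORAGE_ENV"
```

Failing early this way reports a missing file directly instead of surfacing later as a Puppet timeout on the Ceph node.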

Comment 12 Mike Burns 2015-08-13 06:51:05 UTC
Based on comment 11, this is notabug

Comment 13 David Juran 2015-09-04 13:46:20 UTC
Re-opening this bug. As discussed on IRC, if a user chooses to install a Ceph node, we should provide a reasonable default, or at least point out that the template is needed. Failing the deployment with a non-obvious error message is not a good option.

Comment 14 Jiri Stransky 2015-09-15 12:53:12 UTC
We already had a smart default, but it wasn't overridable, causing a number of storage configurations to be impossible (see bug 1247585). We had to remove the smart default in favor of configurability. Re-adding that smart default should be possible once we have parameter overridability on CLI (bug 1245737).

Comment 15 Mike Orazi 2015-11-04 17:20:26 UTC
We are planning to provide this functionality via the parameter override work in https://bugzilla.redhat.com/show_bug.cgi?id=1245737, and we should track it there. If this solution is insufficient, please feel free to reopen this bug so we can track it separately.

*** This bug has been marked as a duplicate of bug 1245737 ***

Comment 16 Rama 2015-11-04 17:27:04 UTC
The Exec timeout was hardcoded in the following files in puppet/manifests for the Ceph installation to go through.

overcloud_cephstorage.pp

Exec {
  timeout => 9000,
}

if str2bool(hiera('ceph_osd_selinux_permissive', true)) {

overcloud_controller.pp

Exec {
  timeout => 9000,
}

overcloud_controller_pacemaker.pp

Exec {
  timeout => 9000,
}

if hiera('step') >= 1 {

The timeout has been increased to 9000.
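
For context, Puppet's exec resource type has a default timeout of 300 seconds, which is consistent with the "Finished catalog run in 303.12 seconds" line in the description. The Exec { timeout => 9000 } resource defaults above raise that limit for the execs in each manifest. A minimal sketch for checking which manifests carry a timeout setting; list_exec_timeouts is a hypothetical helper, and the directory passed to it is an assumption about where the manifests are installed:

```shell
#!/bin/sh
# Hypothetical helper: lists every "timeout => N" attribute found in the
# *.pp files of the given directory, as file:line:text. It matches any
# timeout attribute, not only those inside an "Exec { ... }" default.
list_exec_timeouts() {
    grep -H -n 'timeout[[:space:]]*=>' "$1"/*.pp
}

# Usage (the path is an assumption, adjust to the local install):
# list_exec_timeouts /usr/share/openstack-tripleo-heat-templates/puppet/manifests
```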