Bug 1370439
Summary: | Puppet should exit with error if disk activate fails | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
Component: | puppet-ceph | Assignee: | John Fulton <johfulto> |
Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 10.0 (Newton) | CC: | bengland, dbecker, dmacpher, gfidente, jefbrown, jjoyce, johfulto, jschluet, jslagle, mburns, morazi, nyechiel, rhel-osp-director-maint, sasha, sclewis, scohen, seb, slinaber, tvignaud, yrabl |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | 10.0 (Newton) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | puppet-ceph-2.1.0-0.20160926220714.c764ef8.el7ost | Doc Type: | Bug Fix |
Doc Text: |
Reusing Ceph nodes from a previous cluster in a new overcloud caused the new Ceph cluster to fail without any indication during the overcloud deployment process. This was because the old Ceph OSD node disks needed cleaning before being reused. This fix adds a check to the Ceph OpenStack Puppet module to make sure the disks are clean, as per the instructions in the OpenStack Platform documentation [1]. Now the overcloud deployment process properly fails if it detects non-clean OSD disks. The 'openstack stack failures list overcloud' command indicates the disks which have an FSID mismatch.
[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT
|
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2016-12-14 15:53:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Marius Cornea
2016-08-26 10:35:54 UTC
Hi Marius, we don't overwrite disks which were previously used for Ceph, so on the second attempt puppet is just skipping them.

If the goal is to make it fail in such a circumstance, can we update the subject to match what is in https://bugs.launchpad.net/puppet-ceph/+bug/1604728 ?

(In reply to Giulio Fidente from comment #2)
> Hi Marius, we don't overwrite disks which were previously used for Ceph so
> on the second attempt puppet is just skipping them.
>
> If the goal is to make it fail in such a circumstance, can we update the
> subject to match what is in
> https://bugs.launchpad.net/puppet-ceph/+bug/1604728 ?

I changed the subject. A small note though - the behavior I'm seeing now is a bit different from the one I reported in the upstream bug, where ceph-osd-activate showed an error. Now I can see ceph-osd-activate completing successfully without error.

It looks like the problem is as follows: after an initial deployment using dedicated disks for Ceph, if we repeat a deployment trying to re-use those same disks without cleaning them up, the 'ceph-disk prepare' command from puppet-ceph at [1] will exit 0, 'ceph-disk activate' is skipped (it is supposed to be triggered via udev when using block devices), and the final 'systemctl start ceph-osd' will also exit 0 (making puppet think everything went fine), except the ceph-osd daemon will later die.

1. https://github.com/openstack/puppet-ceph/blob/master/manifests/osd.pp#L102

To be clear, the problem only occurs when re-deploying on disks previously used for another Ceph cluster. The OSD activation (and deployment) does fail as intended in other circumstances.

Filed an RFE for automating the zapping with an optional arg: https://bugzilla.redhat.com/show_bug.cgi?id=1377867

Thanks.

Upstream change merged: https://review.openstack.org/#/c/371756/

You may want to use ceph-disk zap instead of sgdisk directly, because there are other cleanups that may be needed (systemctl, etc.). This reduces the amount you have to know about Ceph internals. Or you could just use ceph-ansible: with ceph-ansible, I usually run purge-cluster.yml to clean out a previous deployment, and it uses ceph-disk zap. https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/purge-cluster.yml#L404 (A cleanup sketch is included at the end of this report.)

Verified on version puppet-ceph-2.2.1-3.el7ost.noarch. Ran an overcloud deployment on Ceph storage nodes with disks that already had OSDs installed on them.

I added doc text for this bug fix. I'll also include what the new error message looks like if a redeployment did not follow the docs [1].

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT

The new error (users should use the command below to get error details; a hand-run version of the failing check is sketched at the end of this report):

[stack@hci-director ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.CephStorageDeployment_Step3.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: bddf058f-3852-42d6-a0a2-153cb3ae5db5
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sde] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdl] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdj] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdk] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdh] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdi] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdf] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdg] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdd] has failures: true
    Notice: Finished catalog run in 237.16 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    returned 1 instead of one of [0]
    Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-check-fsid-mismatch-/dev/sdh]/returns: change from notrun to 0 failed: /bin/true # comment to satisfy puppet syntax requirements
    set -ex
    test 17d9a5a2-a061-11e6-a8e1-525400330666 = $(ceph-disk list /dev/sdh | egrep -o '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}')
    returned 1 instead of one of [0]
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-prepare-/dev/sdh]: Skipping because of failed dependencies
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[fcontext_/dev/sdh]: Skipping because of failed dependencies
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-activate-/dev/sdh]: Skipping because of failed dependencies
    Warning: /Firewall[998 log all]: Skipping because of failed dependencies
    Warning: /Firewall[999 drop all]: Skipping because of failed dependencies
    (truncated, view all with --long)
[stack@hci-director ~]$

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
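For reference, the failing check shown in deploy_stderr above is just a shell test comparing the FSID the new deployment expects against the FSID already stamped on the disk. A rough, hand-run version of that test is sketched below; the device name is an example taken from the output above, and reading the expected FSID from /etc/ceph/ceph.conf is an assumption about where that value lives on the node.

    #!/bin/bash
    # Sketch only: manually reproduce the ceph-osd-check-fsid-mismatch test from
    # the deploy_stderr output above. /dev/sdh and the ceph.conf lookup are
    # examples/assumptions; adjust for your node.
    set -ex

    disk=/dev/sdh

    # FSID the new deployment expects (assumed to be recorded in /etc/ceph/ceph.conf).
    expected_fsid=$(awk -F' *= *' '/^fsid/ {print $2}' /etc/ceph/ceph.conf)

    # FSID left on the disk by whichever cluster prepared it last.
    on_disk_fsid=$(ceph-disk list "$disk" | egrep -o '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}')

    # A non-zero exit here (like the puppet Exec) means the disk still belongs
    # to another Ceph cluster and the deployment should stop.
    test "$expected_fsid" = "$on_disk_fsid"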
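If the deployment does fail with the FSID mismatch above, the disks from the old cluster have to be cleaned before redeploying, as described in the GPT-formatting documentation linked in the doc text. The following is a minimal, destructive sketch of the two cleanup routes mentioned in the comments; the device names are examples only, and the exact sgdisk sequence should be taken from the linked documentation.

    #!/bin/bash
    # DESTRUCTIVE: wipes Ceph metadata / partition tables on the listed disks.
    # Device names are examples only; run on each Ceph storage node.
    set -ex

    for disk in /dev/sdd /dev/sde /dev/sdf; do
        # Route 1 (suggested in the comments): let ceph-disk handle the cleanup.
        ceph-disk zap "$disk"

        # Route 2 (lower level, per the GPT-formatting docs): destroy the GPT/MBR
        # structures directly and relabel the disk as GPT, e.g.:
        # sgdisk --zap-all "$disk"
    done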