1370439 – Puppet should exit with error if disk activate fails

Summary: Puppet should exit with error if disk activate fails

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	puppet-ceph
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	10.0 (Newton)
Assignee:	John Fulton
QA Contact:	Yogev Rabl
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-08-26 10:35 UTC by Marius Cornea
Modified:	2023-02-22 23:02 UTC
CC List:	20 users
Fixed In Version:	puppet-ceph-2.1.0-0.20160926220714.c764ef8.el7ost
Doc Type:	Bug Fix
Doc Text:	Reusing Ceph nodes from a previous cluster in a new overcloud caused the new Ceph cluster to fail without any indication during the overcloud deployment process. This was because the old Ceph OSD node disks needed cleaning before reusing them. This fix adds a check to the Ceph OpenStack Puppet module to make sure the disks are clean as per the instructions in the OpenStack Platform documentation [1]. Now the overcloud deployment process properly fails if it detects non-clean OSD disks. The 'openstack stack failures list overcloud' command indicates the disks which have an FSID mismatch. [1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT
Clone Of:
Environment:
Last Closed:	2016-12-14 15:53:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1604728	None	None	None	2016-08-26 12:23:45 UTC
OpenStack gerrit	371756	None	MERGED	Deployment should fail when trying to add another Ceph cluster's OSD	2020-06-09 06:14:40 UTC
Red Hat Product Errata	RHEA-2016:2948	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 enhancement update	2016-12-14 19:55:27 UTC

Internal Links: 1252158 1312190

Description Marius Cornea 2016-08-26 10:35:54 UTC

Description of problem:
Ceph OSDs don't get created on a 2nd overcloud deploy run even though the overcloud deploy finishes successfully and nothing indicates an error.

Deploy command:

source ~/stackrc
#export THT=/usr/share/openstack-tripleo-heat-templates
export THT=~/templates/tht/
openstack overcloud deploy --templates $THT \
-e $THT/environments/network-isolation.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com 

[stack@undercloud ~]$ cat templates/disk-layout.yaml 
parameter_defaults:
  ExtraConfig:
    ceph::profile::params::osds:
        '/dev/vdb': {}
        '/dev/vdc': {}


Version-Release number of selected component (if applicable):
puppet-ceph-2.0.0-0.20160813061329.aa78806.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20160817161003.bacc2c6.1.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud with ceph nodes with OSD disks
2. Delete deployment
3. Redeploy 

Actual results:
Successful deployment but no OSD created:

[root@overcloud-cephstorage-0 ~]# ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1      0 root default                                   

Expected results:
OSDs get created, or the deployment fails and indicates why the OSDs were not created.
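
Until the deployment itself reports the failure, a post-deploy check can catch the empty OSD tree. A minimal sketch, assuming OSD rows in the 'ceph osd tree' output carry names of the form osd.N; the function name has_osds is hypothetical and not part of any Ceph or TripleO tooling:

```shell
#!/bin/sh
# Hypothetical helper: read `ceph osd tree` output on stdin and succeed
# only if at least one row names an OSD (osd.0, osd.1, ...).
has_osds() {
    grep -Eq '(^|[[:space:]])osd\.[0-9]+([[:space:]]|$)'
}
```

On a Ceph node this could be run as: ceph osd tree | has_osds || echo "no OSDs were created" >&2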

Additional info:
In the logs the OSD activation appears to be successful:

[root@overcloud-cephstorage-0 heat-admin]# journalctl -l -u os-collect-config | grep activate
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: 01b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb1\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -f /usr/lib/udev/rules.d/95-ceph-osd.rules.disabled\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: executed successfully\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[fcontext_/dev/vdc]/returns: executed successfully\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc1\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -f /usr/lib/udev/rules.d/95-ceph-osd.rules.disabled\u001b[0m\n\u001b[mNotice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: executed successfully\u001b[0m\n\u001b[mNotice: /File[/etc/localtime]/seltype: seltype changed 'locale_t' to 'etc_t'\u001b[0m\n\u001b[mNotice: Finished catalog run in 4.91 seconds\u001b[0m\n", "deploy_stderr": "\u001b[1;31mWarning: Scope(Class[Ntp]): deprecation. puppet_3_type_check. This method is deprecated, please use the stdlib validate_legacy function, with Stdlib::Compat::Bool. 
There is further documentation for validate_legacy function in the README.\u001b[0m\n\u001b[1;31mWarning: Scope(Class[Ntp]): deprecation. puppet_3_type_check. Thi
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb1
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -f /usr/lib/udev/rules.d/95-ceph-osd.rules.disabled
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: executed successfully
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc1
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -f /usr/lib/udev/rules.d/95-ceph-osd.rules.disabled
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: executed successfully


As a workaround, you need to zap the OSD disks before doing a subsequent deployment:

sgdisk --zap /dev/vdb
sgdisk --zap /dev/vdc

This looks pretty much the same as this upstream bug with the difference that I couldn't find any errors within the os-collect-config journal:
https://bugs.launchpad.net/puppet-ceph/+bug/1604728

Comment 2 Giulio Fidente 2016-08-26 12:23:45 UTC

Hi Marius, we don't overwrite disks which were previously used for Ceph so on the second attempt puppet is just skipping them.

If the goal is to make it fail in such a circumstance, can we update the subject to match what is in https://bugs.launchpad.net/puppet-ceph/+bug/1604728 ?

Comment 3 Marius Cornea 2016-08-26 12:52:44 UTC

(In reply to Giulio Fidente from comment #2)
> Hi Marius, we don't overwrite disks which were previously used for Ceph so
> on the second attempt puppet is just skipping them.
> 
> If the goal is to make it fail in such a circumstance, can we update the
> subject to match what is in
> https://bugs.launchpad.net/puppet-ceph/+bug/1604728 ?

I changed the subject. A small note though - the behavior that I'm seeing now is a bit different than the one I reported in the upstream bug where ceph-osd-activate showed an error. Now I can see ceph-osd-activate completing successfully without error.

Comment 4 Giulio Fidente 2016-09-16 12:14:39 UTC

It looks like the problem is as follows:

After an initial deployment using dedicated disks for Ceph, if we repeat a deployment trying to re-use those same disks without cleaning them up, the 'ceph-disk prepare' command from puppet-ceph at [1] will exit 0 and continue, skipping 'ceph-disk activate' (supposed to be triggered via udev when using block devices), and finally attempt a 'systemctl start ceph-osd', which will also exit 0 (making puppet think everything went fine), except the ceph-osd daemon will later die.

1. https://github.com/openstack/puppet-ceph/blob/master/manifests/osd.pp#L102

Comment 5 Giulio Fidente 2016-09-16 15:08:55 UTC

To be clear, the problem only occurs when re-deploying on disks previously used for another Ceph cluster. OSD activation (and the deployment) do fail as intended in other circumstances.

Comment 6 Alexander Chuzhoy 2016-09-20 20:35:07 UTC

Filed an RFE for automating the zapping with optional arg:
https://bugzilla.redhat.com/show_bug.cgi?id=1377867

Thanks.

Comment 7 John Fulton 2016-09-23 19:28:23 UTC

Upstream change merged. 

https://review.openstack.org/#/c/371756/

Comment 10 Ben England 2016-10-12 19:39:23 UTC

You may want to use ceph-disk zap instead of sgdisk directly, because there are other cleanups that may be needed (systemctl, etc). This reduces the amount that you have to know about ceph internals. Or you could just use ceph-ansible.

With ceph-ansible, I usually run purge-cluster.yml to clean out the previous deployment. It uses ceph-disk zap.

https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/purge-cluster.yml#L404
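
For a manual cleanup along these lines, it can help to review the zap commands before running anything destructive. A minimal sketch; the function name zap_commands is hypothetical, and it only prints one 'ceph-disk zap' command per device rather than executing it:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a `ceph-disk zap` command for
# each device given as an argument, so the list can be reviewed first.
zap_commands() {
    for zc_dev in "$@"; do
        printf 'ceph-disk zap %s\n' "$zc_dev"
    done
}
```

For the disks in this report, zap_commands /dev/vdb /dev/vdc prints the two commands; once reviewed, they would need to be run as root on each Ceph storage node.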

Comment 13 Yogev Rabl 2016-11-01 12:40:58 UTC

Verified on version puppet-ceph-2.2.1-3.el7ost.noarch.

Ran overcloud deployment on ceph storage nodes with disks that already had OSDs installed on them.

Comment 14 John Fulton 2016-11-18 15:29:58 UTC

I added doctext for this bug fix. I'll also include what the new error message looks like if a redeployment did not follow the docs [1]. 

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT

The new error (users should use the command below to get error details): 

[stack@hci-director ~]$ openstack stack failures list overcloud 
overcloud.AllNodesDeploySteps.CephStorageDeployment_Step3.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: bddf058f-3852-42d6-a0a2-153cb3ae5db5
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sde] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdl] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdj] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdk] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdh] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdi] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdf] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdg] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdd] has failures: true
    Notice: Finished catalog run in 237.16 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
     returned 1 instead of one of [0]
    Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-check-fsid-mismatch-/dev/sdh]/returns: change from notrun to 0 failed: /bin/true # comment to satisfy puppet syntax requirements
    set -ex
    test 17d9a5a2-a061-11e6-a8e1-525400330666 = $(ceph-disk list /dev/sdh | egrep -o '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}')
     returned 1 instead of one of [0]
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-prepare-/dev/sdh]: Skipping because of failed dependencies
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[fcontext_/dev/sdh]: Skipping because of failed dependencies
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-activate-/dev/sdh]: Skipping because of failed dependencies
    Warning: /Firewall[998 log all]: Skipping because of failed dependencies
    Warning: /Firewall[999 drop all]: Skipping because of failed dependencies
    (truncated, view all with --long)
[stack@hci-director ~]$
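
The failing Exec in the output above compares the expected cluster FSID with the UUID that 'ceph-disk list' reports for the device. A minimal shell sketch of that comparison, reading the listing from stdin so it can be tried without a Ceph node. The function name check_fsid, the use of only the first UUID found, and the treatment of a listing with no UUID as a clean disk are assumptions for illustration, not the puppet-ceph implementation:

```shell
#!/bin/sh
# Hypothetical helper: $1 is the expected cluster FSID; stdin is the
# output of `ceph-disk list DEV`. Returns 0 on match (or clean disk),
# non-zero on mismatch.
check_fsid() {
    cf_expected=$1
    # Extract the first UUID-shaped string from the listing, if any.
    cf_found=$(grep -Eo '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}' | head -n 1) || true
    # Assumption: no UUID in the listing means no leftover cluster data.
    [ -z "$cf_found" ] && return 0
    [ "$cf_found" = "$cf_expected" ]
}
```

It could be used as: ceph-disk list /dev/sdh | check_fsid "$FSID" || echo "FSID mismatch on /dev/sdh" >&2, where FSID holds the FSID of the cluster being deployed.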

Comment 16 errata-xmlrpc 2016-12-14 15:53:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
