Description of problem:
Ceph OSDs don't get created on a second overcloud deploy run, even though the overcloud deploy finishes successfully and nothing indicates an error.

Deploy command:

source ~/stackrc
#export THT=/usr/share/openstack-tripleo-heat-templates
export THT=~/templates/tht/
openstack overcloud deploy --templates $THT \
  -e $THT/environments/network-isolation.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 3 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com

[stack@undercloud ~]$ cat templates/disk-layout.yaml
parameter_defaults:
  ExtraConfig:
    ceph::profile::params::osds:
      '/dev/vdb': {}
      '/dev/vdc': {}

Version-Release number of selected component (if applicable):
puppet-ceph-2.0.0-0.20160813061329.aa78806.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20160817161003.bacc2c6.1.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud with Ceph nodes that have dedicated OSD disks
2. Delete the deployment
3. Redeploy

Actual results:
The deployment succeeds but no OSDs are created:

[root@overcloud-cephstorage-0 ~]# ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1      0 root default

Expected results:
OSDs get created, or the deployment fails and indicates the cause of the OSDs not being created.

Additional info:
In the logs the OSD activation appears to be successful:

[root@overcloud-cephstorage-0 heat-admin]# journalctl -l -u os-collect-config | grep activate
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -b /dev/vdb1
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: + test -f /usr/lib/udev/rules.d/95-ceph-osd.rules.disabled
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdb]/Exec[ceph-osd-activate-/dev/vdb]/returns: executed successfully
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -b /dev/vdc1
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: + test -f /usr/lib/udev/rules.d/95-ceph-osd.rules.disabled
Aug 26 09:28:35 overcloud-cephstorage-0.localdomain os-collect-config[4065]: Notice: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/vdc]/Exec[ceph-osd-activate-/dev/vdc]/returns: executed successfully

As a workaround, you need to zap the OSD disks before doing a subsequent deployment:

sgdisk --zap /dev/vdb
sgdisk --zap /dev/vdc

This looks pretty much the same as the upstream bug below, with the difference that I couldn't find any errors in the os-collect-config journal:
https://bugs.launchpad.net/puppet-ceph/+bug/1604728
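For convenience, a minimal sketch of applying that workaround to all Ceph storage nodes from the undercloud before redeploying; the 'cephstorage' name filter, the heat-admin user and the ctlplane address format are assumptions about this environment, and the device names come from disk-layout.yaml above:

source ~/stackrc
# Zap the leftover Ceph partitions on every Ceph storage node before the next deploy.
for ip in $(openstack server list --name cephstorage -f value -c Networks | cut -d= -f2); do
  ssh heat-admin@"$ip" 'sudo sgdisk --zap /dev/vdb && sudo sgdisk --zap /dev/vdc'
done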
Hi Marius, we don't overwrite disks which were previously used for Ceph so on the second attempt puppet is just skipping them. If the goal is to make it fail in such a circumstance, can we update the subject to match what is in https://bugs.launchpad.net/puppet-ceph/+bug/1604728 ?
(In reply to Giulio Fidente from comment #2)
> Hi Marius, we don't overwrite disks which were previously used for Ceph so
> on the second attempt puppet is just skipping them.
>
> If the goal is to make it fail in such a circumstance, can we update the
> subject to match what is in
> https://bugs.launchpad.net/puppet-ceph/+bug/1604728 ?

I changed the subject. A small note though - the behavior I'm seeing now is a bit different from the one I reported in the upstream bug, where ceph-osd-activate showed an error. Now I can see ceph-osd-activate completing successfully, without any error.
It looks like the problem is as follows: after an initial deployment using dedicated disks for Ceph, if we repeat a deployment trying to re-use those same disks without cleaning them up, the 'ceph-disk prepare' command from puppet-ceph at [1] will exit 0, skip 'ceph-disk activate' (which is supposed to be triggered via udev when using block devices), and finally attempt a 'systemctl start ceph-osd' which will also exit 0 (making puppet think everything went fine), except the ceph-osd daemon will later die.

1. https://github.com/openstack/puppet-ceph/blob/master/manifests/osd.pp#L102
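A quick way to see the leftover state that makes the prepare/activate sequence a no-op is to inspect the disk on one of the Ceph storage nodes; a rough sketch (the OSD id in the last command is only an example):

# On an affected Ceph storage node:
ceph-disk list /dev/vdb      # should still show the "ceph data" partition from the previous cluster
sgdisk --print /dev/vdb      # GPT partition table left over from the first deployment
systemctl status ceph-osd@0  # the start "succeeds" from puppet's point of view, but the daemon later dies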
To be clear, the problem only occurs when re-deploying on disks previously used by another Ceph cluster. OSD activation (and the deployment) does fail as intended in other circumstances.
Filed an RFE for automating the zapping via an optional argument: https://bugzilla.redhat.com/show_bug.cgi?id=1377867 Thanks.
Upstream change merged. https://review.openstack.org/#/c/371756/
You may want to use 'ceph-disk zap' instead of calling sgdisk directly, because other cleanups may also be needed (systemctl units, etc.); this reduces how much you need to know about Ceph internals. Or you could just use ceph-ansible: with ceph-ansible, I usually run purge-cluster.yml to clean out a previous deployment, and it uses ceph-disk zap under the hood: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/purge-cluster.yml#L404
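For example (the inventory path is a placeholder; run the zap commands on the storage node itself, and the playbook from wherever ceph-ansible is checked out):

# Wipe a previously used OSD disk, including Ceph-specific cleanup:
ceph-disk zap /dev/vdb
ceph-disk zap /dev/vdc

# Or purge the whole previous cluster with ceph-ansible:
ansible-playbook -i <your-inventory> infrastructure-playbooks/purge-cluster.yml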
Verified on puppet-ceph-2.2.1-3.el7ost.noarch. Ran an overcloud deployment on Ceph storage nodes with disks that already had OSDs installed on them.
I added doctext for this bug fix. I'll also include what the new error message looks like when a redeployment did not follow the docs [1].

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT

The new error (users should use the command below to get the error details):

[stack@hci-director ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.CephStorageDeployment_Step3.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: bddf058f-3852-42d6-a0a2-153cb3ae5db5
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sde] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdl] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdj] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdk] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdh] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdi] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdf] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdg] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdd] has failures: true
    Notice: Finished catalog run in 237.16 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    returned 1 instead of one of [0]
    Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-check-fsid-mismatch-/dev/sdh]/returns: change from notrun to 0 failed: /bin/true # comment to satisfy puppet syntax requirements
    set -ex
    test 17d9a5a2-a061-11e6-a8e1-525400330666 = $(ceph-disk list /dev/sdh | egrep -o '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}')
     returned 1 instead of one of [0]
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-prepare-/dev/sdh]: Skipping because of failed dependencies
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[fcontext_/dev/sdh]: Skipping because of failed dependencies
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-activate-/dev/sdh]: Skipping because of failed dependencies
    Warning: /Firewall[998 log all]: Skipping because of failed dependencies
    Warning: /Firewall[999 drop all]: Skipping because of failed dependencies
    (truncated, view all with --long)
[stack@hci-director ~]$
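For anyone hitting this, a small sketch of reproducing the failed check by hand on the affected node; the FSID and device name are taken from the output above and will differ per deployment:

# Re-run the check that failed in the puppet Exec above:
ceph-disk list /dev/sdh | egrep -o '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}'
# If the printed FSID is not 17d9a5a2-a061-11e6-a8e1-525400330666 (the FSID the
# new deployment expects, per the output above), the disk still carries data from
# an old cluster; GPT-format/zap it as described in [1], e.g.:
#   sgdisk --zap /dev/sdh
# and then redeploy.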
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html