Bug 1370439
Summary: | Puppet should exit with error if disk activate fails | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
Component: | puppet-ceph | Assignee: | John Fulton <johfulto> |
Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 10.0 (Newton) | CC: | bengland, dbecker, dmacpher, gfidente, jefbrown, jjoyce, johfulto, jschluet, jslagle, mburns, morazi, nyechiel, rhel-osp-director-maint, sasha, sclewis, scohen, seb, slinaber, tvignaud, yrabl |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | 10.0 (Newton) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | puppet-ceph-2.1.0-0.20160926220714.c764ef8.el7ost | Doc Type: | Bug Fix |
Doc Text: |
Reusing Ceph nodes from a previous cluster in a new overcloud caused the new Ceph cluster to fail without any indication during the overcloud deployment process. This was because the old Ceph OSD node disks needed cleaning before being reused. This fix adds a check to the Ceph OpenStack Puppet module to make sure the disks are clean, as per the instructions in the OpenStack Platform documentation [1]. Now the overcloud deployment process properly fails if it detects non-clean OSD disks. The 'openstack stack failures list overcloud' command indicates the disks which have an FSID mismatch.
[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/10/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT
|
Story Points: | --- |
Clone Of: | | Environment: |
Last Closed: | 2016-12-14 15:53:47 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Marius Cornea
2016-08-26 10:35:54 UTC
Hi Marius, we don't overwrite disks which were previously used for Ceph, so on the second attempt puppet is just skipping them.

If the goal is to make it fail in such a circumstance, can we update the subject to match what is in https://bugs.launchpad.net/puppet-ceph/+bug/1604728 ?

(In reply to Giulio Fidente from comment #2)
> Hi Marius, we don't overwrite disks which were previously used for Ceph so
> on the second attempt puppet is just skipping them.
>
> If the goal is to make it fail in such a circumstance, can we update the
> subject to match what is in
> https://bugs.launchpad.net/puppet-ceph/+bug/1604728 ?

I changed the subject. A small note though - the behavior I'm seeing now is a bit different from the one I reported in the upstream bug, where ceph-osd-activate showed an error. Now I can see ceph-osd-activate completing successfully without error.

It looks like the problem is as follows: after an initial deployment using dedicated disks for Ceph, if we repeat a deployment trying to re-use those same disks without cleaning them up, the 'ceph-disk prepare' command from puppet-ceph at [1] will exit 0, 'ceph-disk activate' is skipped (it is supposed to be triggered via udev when using block devices), and the final 'systemctl start ceph-osd' will also exit 0 (making puppet think everything went fine), except the ceph-osd daemon will later die.

1. https://github.com/openstack/puppet-ceph/blob/master/manifests/osd.pp#L102

To be clear, the problem only occurs when re-deploying on disks previously used for another Ceph cluster. The OSD activation (and deployment) does fail as intended in other circumstances.

Filed an RFE for automating the zapping with an optional arg: https://bugzilla.redhat.com/show_bug.cgi?id=1377867

Thanks.

Upstream change merged: https://review.openstack.org/#/c/371756/

You may want to use ceph-disk zap instead of sgdisk directly, because there are other cleanups that may be needed (systemctl, etc.). This reduces the amount you have to know about Ceph internals. Or you could just use ceph-ansible: with ceph-ansible, I usually run purge-cluster.yml to clean out a previous deployment, and it uses ceph-disk zap. https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/purge-cluster.yml#L404 (A cleanup sketch is included at the end of this report.)

Verified on version puppet-ceph-2.2.1-3.el7ost.noarch. Ran an overcloud deployment on Ceph storage nodes with disks that already had OSDs installed on them.

I added doc text for this bug fix. I'll also include what the new error message looks like if a redeployment did not follow the docs [1].

[1] https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/red-hat-ceph-storage-for-the-overcloud/#Formatting_Ceph_Storage_Nodes_Disks_to_GPT

The new error (users should use the command below to get error details; a hand-run version of the failing check is sketched at the end of this report):

[stack@hci-director ~]$ openstack stack failures list overcloud
overcloud.AllNodesDeploySteps.CephStorageDeployment_Step3.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: bddf058f-3852-42d6-a0a2-153cb3ae5db5
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sde] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdl] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdj] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdk] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdh] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdi] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdf] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdg] has failures: true
    Notice: /Firewall[999 drop all]: Dependency Exec[ceph-osd-check-fsid-mismatch-/dev/sdd] has failures: true
    Notice: Finished catalog run in 237.16 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    returned 1 instead of one of [0]
    Error: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-check-fsid-mismatch-/dev/sdh]/returns: change from notrun to 0 failed: /bin/true # comment to satisfy puppet syntax requirements
    set -ex
    test 17d9a5a2-a061-11e6-a8e1-525400330666 = $(ceph-disk list /dev/sdh | egrep -o '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}')
    returned 1 instead of one of [0]
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-prepare-/dev/sdh]: Skipping because of failed dependencies
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[fcontext_/dev/sdh]: Skipping because of failed dependencies
    Warning: /Stage[main]/Ceph::Osds/Ceph::Osd[/dev/sdh]/Exec[ceph-osd-activate-/dev/sdh]: Skipping because of failed dependencies
    Warning: /Firewall[998 log all]: Skipping because of failed dependencies
    Warning: /Firewall[999 drop all]: Skipping because of failed dependencies
    (truncated, view all with --long)
[stack@hci-director ~]$

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
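For reference, the failing check shown in deploy_stderr above is just a shell test comparing the FSID the new deployment expects against the FSID already stamped on the disk. A rough, hand-run version of that test is sketched below; the device name is an example taken from the output above, and reading the expected FSID from /etc/ceph/ceph.conf is an assumption about where that value lives on the node.

    #!/bin/bash
    # Sketch only: manually reproduce the ceph-osd-check-fsid-mismatch test from
    # the deploy_stderr output above. /dev/sdh and the ceph.conf lookup are
    # examples/assumptions; adjust for your node.
    set -ex

    disk=/dev/sdh

    # FSID the new deployment expects (assumed to be recorded in /etc/ceph/ceph.conf).
    expected_fsid=$(awk -F' *= *' '/^fsid/ {print $2}' /etc/ceph/ceph.conf)

    # FSID left on the disk by whichever cluster prepared it last.
    on_disk_fsid=$(ceph-disk list "$disk" | egrep -o '[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}')

    # A non-zero exit here (like the puppet Exec) means the disk still belongs
    # to another Ceph cluster and the deployment should stop.
    test "$expected_fsid" = "$on_disk_fsid"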
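If the deployment does fail with the FSID mismatch above, the disks from the old cluster have to be cleaned before redeploying, as described in the GPT-formatting documentation linked in the doc text. The following is a minimal, destructive sketch of the two cleanup routes mentioned in the comments; the device names are examples only, and the exact sgdisk sequence should be taken from the linked documentation.

    #!/bin/bash
    # DESTRUCTIVE: wipes Ceph metadata / partition tables on the listed disks.
    # Device names are examples only; run on each Ceph storage node.
    set -ex

    for disk in /dev/sdd /dev/sde /dev/sdf; do
        # Route 1 (suggested in the comments): let ceph-disk handle the cleanup.
        ceph-disk zap "$disk"

        # Route 2 (lower level, per the GPT-formatting docs): destroy the GPT/MBR
        # structures directly and relabel the disk as GPT, e.g.:
        # sgdisk --zap-all "$disk"
    done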