Bug 1414647

Summary: [ceph-ansible]: purge cluster fails to zap OSD disks having encrypted OSDs
Product: [Red Hat Storage] Red Hat Storage Console
Reporter: Tejas <tchandra>
Component: ceph-ansible
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Tejas <tchandra>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2
CC: adeza, aschoen, ceph-eng-bugs, edonnell, gmeno, hnallurv, kdreyer, nthomas, sankarshan, seb, shan
Target Milestone: ---
Target Release: 2
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: ceph-ansible-2.1.9-1.el7scon
Doc Type: Bug Fix
Doc Text:
Previously, the ceph-ansible utility was unable to purge a cluster with encrypted OSD devices because the underlying ceph-disk utility was unable to destroy the partition table on an encrypted device by using the "--zap-disk" option. The underlying source code has been fixed allowing ceph-disk to use the "--zap-disk" option on encrypted devices. As a result, ceph-ansible can purge clusters with encrypted OSD devices as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-14 15:53:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
ansible playbook log (flags: none)
ansible playbook log (flags: none)

Description Tejas 2017-01-19 06:48:16 UTC
Created attachment 1242381 [details]
ansible playbook  log

Description of problem:
I have a cluster with 8 colocated encrypted OSDs and 2 encrypted OSDs with dedicated journals. purge-cluster.yml fails on this cluster:

TASK [zap osd disks] ***********************************************************

failed: [magna058] (item=/dev/sdb) => {"changed": true, "cmd": "ceph-disk zap \"/dev/sdb\"", "delta": "0:05:08.418620", "end": "2017-01-19 06:06:24.523658", "failed": true, "item": "/dev/sdb", "rc": 1, "start": "2017-01-19 06:01:16.105038", "stderr": "\u0007Caution: invalid backup GPT header, but valid main header; regenerating\nbackup header from main header.\n\nWarning! Main and backup partition tables differ! Use the 'c' and 'e' options\non the recovery & transformation menu to examine the two tables.\n\nWarning! One or more CRCs don't match. You should repair the disk!\n\nceph-disk: Error: partprobe /dev/sdb failed : Error: Partition(s) 1, 2 on /dev/sdb have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.", "stdout": "\u0007\u0007****************************************************************************\nCaution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk\nverification and recovery are STRONGLY recommended.\n****************************************************************************\nWarning: The kernel is still using the old partition table.\nThe new table will be used at the next reboot.\nGPT data structures destroyed! You may now partition the disk using fdisk or\nother utilities.\nCreating new GPT entries.\nWarning: The kernel is still using the old partition table.\nThe new table will be used at the next reboot.\nThe operation has completed successfully.", "stdout_lines": ["\u0007\u0007****************************************************************************", "Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk", "verification and recovery are STRONGLY recommended.", "****************************************************************************", "Warning: The kernel is still using the old partition table.", "The new table will be used at the next reboot.", "GPT data structures destroyed! You may now partition the disk using fdisk or", "other utilities.", "Creating new GPT entries.", "Warning: The kernel is still using the old partition table.", "The new table will be used at the next reboot.", "The operation has completed successfully."], "warnings": []}
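
The "partprobe /dev/sdb failed" error above is the usual symptom of the kernel still holding the old partitions through the open dm-crypt mappings, so the zap cannot take effect. As a rough check (commands are illustrative, not part of the playbook), the open mappings on the OSD node can be inspected with:

 ~]# lsblk /dev/sdb                # an open mapping shows up as a crypt child under sdb1
 ~]# dmsetup ls --target crypt     # lists the active dm-crypt mappings by name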

Version-Release number of selected component (if applicable):
ceph-ansible-2.1.3-1.el7scon.noarch
ansible-2.2.1.0-1.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy a Ceph cluster with a mix of colocated and dedicated-journal encrypted OSDs.
2. Purge the cluster (see the example invocation below).
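
A typical purge invocation looks something like this (playbook location and inventory path are illustrative and may differ per setup):

 ~]# ansible-playbook purge-cluster.yml -i hosts   # run from the ceph-ansible directory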



Additional info:

The encrypted partitions are not purged:

 ~]# ceph-disk list
/dev/dm-0 other, unknown
/dev/dm-1 other, xfs
/dev/dm-2 other, unknown
/dev/dm-3 other, xfs
/dev/sda :
 /dev/sda1 other, ext4, mounted on /
/dev/sdb :
 /dev/sdb1 other, crypto_LUKS
/dev/sdc :
 /dev/sdc1 other, crypto_LUKS
/dev/sdd :
 /dev/sdd1 ceph journal (dmcrypt LUKS /dev/dm-0)
 /dev/sdd2 ceph journal (dmcrypt LUKS /dev/dm-2)
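
For reference, the leftover LUKS mappings can be torn down by hand before re-running the purge. This is only an illustrative sketch based on the listing above, not what ceph-ansible itself does; the mapping name is a placeholder:

 ~]# dmsetup ls --target crypt            # find the names of the open dm-crypt mappings
 ~]# cryptsetup luksClose <mapping-name>  # close each mapping reported above
 ~]# sgdisk --zap-all /dev/sdb            # destroy the GPT/MBR structures on the freed disk
 ~]# partprobe /dev/sdb                   # have the kernel re-read the now-empty table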


The hosts file:

[mons]
magna028 monitor_address="10.8.128.28"
magna031
magna046

[osds]
magna046 devices="[ '/dev/sdb', '/dev/sdc', '/dev/sdd' ]"
magna052 devices="[ '/dev/sdb', '/dev/sdc', '/dev/sdd' ]"
magna058 devices="[ '/dev/sdb', '/dev/sdc' ]"
magna031 devices="[ '/dev/sdb', '/dev/sdc' ]"

[mdss]
magna061

[rgws]
magna061

[clients]
magna061



Attached the playbook log here.

Comment 3 seb 2017-01-19 14:07:05 UTC
I can almost reproduce this. The real fix belongs in ceph-disk, but we can work around it in ceph-ansible; I'm working on a fix.

Comment 4 seb 2017-01-19 14:30:55 UTC
The fix is part of this PR: https://github.com/ceph/ceph-ansible/pull/1235

Comment 6 Ken Dreyer (Red Hat) 2017-02-07 17:24:10 UTC
Sebastien, what is the next step with this BZ?

I'm guessing QE should re-test with ceph-ansible-2.1.6-1.el7scon?

Comment 7 seb 2017-02-07 21:19:28 UTC
Correct, Ken.

Comment 9 Tejas 2017-02-13 07:04:51 UTC
Created attachment 1249770 [details]
ansible playbook  log

Comment 11 seb 2017-02-13 10:09:51 UTC
Can you share the ansible play of the purge playbook?

Comment 12 Tejas 2017-02-13 10:12:28 UTC
It's in attachment 1249770 [details].

Comment 13 seb 2017-02-15 14:31:07 UTC
Can you run ansible with -vvvv and paste the debug output into a log file?
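For example, something along these lines (playbook and inventory names are illustrative):

 ~]# ansible-playbook purge-cluster.yml -i hosts -vvvv 2>&1 | tee purge-debug.log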
Thanks!

Comment 14 Andrew Schoen 2017-02-15 15:56:41 UTC
From looking at the log it seems like the 'zap ceph journal partitions' task was skipped. I've opened an upstream PR that I think might address this: https://github.com/ceph/ceph-ansible/pull/1311

Comment 15 Andrew Schoen 2017-02-15 19:18:29 UTC
The PR (#1311) has been merged and backported to stable-2.1; v2.1.9 of ceph-ansible will contain the fix for this issue.

Comment 18 Tejas 2017-02-17 11:10:37 UTC
Verified in version:
ceph-ansible-2.1.9-1.el7scon.noarch

Comment 20 errata-xmlrpc 2017-03-14 15:53:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:0515