Bug 1414647

Summary: [ceph-ansible]: purge cluster fails to zap OSD disks having encrypted OSDs
Product: [Red Hat Storage] Red Hat Storage Console
Reporter: Tejas <tchandra>
Component: ceph-ansible
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Tejas <tchandra>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2
CC: adeza, aschoen, ceph-eng-bugs, edonnell, gmeno, hnallurv, kdreyer, nthomas, sankarshan, seb, shan
Target Milestone: ---
Target Release: 2
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: ceph-ansible-2.1.9-1.el7scon
Doc Type: Bug Fix
Doc Text:
Previously, the ceph-ansible utility was unable to purge a cluster with encrypted OSD devices because the underlying ceph-disk utility was unable to destroy the partition table on an encrypted device by using the "--zap-disk" option. The underlying source code has been fixed allowing ceph-disk to use the "--zap-disk" option on encrypted devices. As a result, ceph-ansible can purge clusters with encrypted OSD devices as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-14 15:53:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
ansible playbook log (flags: none)
ansible playbook log (flags: none)

Description Tejas 2017-01-19 06:48:16 UTC
Created attachment 1242381 [details]
ansible playbook  log

Description of problem:
I have a cluster with 8 colocated encrypted OSDs and 2 encrypted OSDs with dedicated journals. purge-cluster.yml fails on this cluster:

TASK [zap osd disks] ***********************************************************

failed: [magna058] (item=/dev/sdb) => {"changed": true, "cmd": "ceph-disk zap \"/dev/sdb\"", "delta": "0:05:08.418620", "end": "2017-01-19 06:06:24.523658", "failed": true, "item": "/dev/sdb", "rc": 1, "start": "2017-01-19 06:01:16.105038", "stderr": "\u0007Caution: invalid backup GPT header, but valid main header; regenerating\nbackup header from main header.\n\nWarning! Main and backup partition tables differ! Use the 'c' and 'e' options\non the recovery & transformation menu to examine the two tables.\n\nWarning! One or more CRCs don't match. You should repair the disk!\n\nceph-disk: Error: partprobe /dev/sdb failed : Error: Partition(s) 1, 2 on /dev/sdb have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use.  As a result, the old partition(s) will remain in use.  You should reboot now before making further changes.", "stdout": "\u0007\u0007****************************************************************************\nCaution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk\nverification and recovery are STRONGLY recommended.\n****************************************************************************\nWarning: The kernel is still using the old partition table.\nThe new table will be used at the next reboot.\nGPT data structures destroyed! You may now partition the disk using fdisk or\nother utilities.\nCreating new GPT entries.\nWarning: The kernel is still using the old partition table.\nThe new table will be used at the next reboot.\nThe operation has completed successfully.", "stdout_lines": ["\u0007\u0007****************************************************************************", "Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk", "verification and recovery are STRONGLY recommended.", "****************************************************************************", "Warning: The kernel is still using the old partition table.", "The new table will be used at the next reboot.", "GPT data structures destroyed! You may now partition the disk using fdisk or", "other utilities.", "Creating new GPT entries.", "Warning: The kernel is still using the old partition table.", "The new table will be used at the next reboot.", "The operation has completed successfully."], "warnings": []}
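
The "partprobe /dev/sdb failed" error above is the usual symptom of the kernel still holding the old partitions through the open dm-crypt mappings, so the zap cannot take effect. As a rough check (commands are illustrative, not part of the playbook), the open mappings on the OSD node can be inspected with:

 ~]# lsblk /dev/sdb                # an open mapping shows up as a crypt child under sdb1
 ~]# dmsetup ls --target crypt     # lists the active dm-crypt mappings by name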

Version-Release number of selected component (if applicable):
ceph-ansible-2.1.3-1.el7scon.noarch
ansible-2.2.1.0-1.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy a Ceph cluster with a mix of colocated and dedicated-journal encrypted OSDs.
2. Purge the cluster (see the example invocation below).
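
A typical purge invocation looks something like this (playbook location and inventory path are illustrative and may differ per setup):

 ~]# ansible-playbook purge-cluster.yml -i hosts   # run from the ceph-ansible directory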



Additional info:

The encrypted partitions are not purged:

 ~]# ceph-disk list
/dev/dm-0 other, unknown
/dev/dm-1 other, xfs
/dev/dm-2 other, unknown
/dev/dm-3 other, xfs
/dev/sda :
 /dev/sda1 other, ext4, mounted on /
/dev/sdb :
 /dev/sdb1 other, crypto_LUKS
/dev/sdc :
 /dev/sdc1 other, crypto_LUKS
/dev/sdd :
 /dev/sdd1 ceph journal (dmcrypt LUKS /dev/dm-0)
 /dev/sdd2 ceph journal (dmcrypt LUKS /dev/dm-2)
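
For reference, the leftover LUKS mappings can be torn down by hand before re-running the purge. This is only an illustrative sketch based on the listing above, not what ceph-ansible itself does; the mapping name is a placeholder:

 ~]# dmsetup ls --target crypt            # find the names of the open dm-crypt mappings
 ~]# cryptsetup luksClose <mapping-name>  # close each mapping reported above
 ~]# sgdisk --zap-all /dev/sdb            # destroy the GPT/MBR structures on the freed disk
 ~]# partprobe /dev/sdb                   # have the kernel re-read the now-empty table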


The hosts file:

[mons]
magna028 monitor_address="10.8.128.28"
magna031
magna046

[osds]
magna046 devices="[ '/dev/sdb', '/dev/sdc', '/dev/sdd' ]"
magna052 devices="[ '/dev/sdb', '/dev/sdc', '/dev/sdd' ]"
magna058 devices="[ '/dev/sdb', '/dev/sdc' ]"
magna031 devices="[ '/dev/sdb', '/dev/sdc' ]"

[mdss]
magna061

[rgws]
magna061

[clients]
magna061



Attached the playbook log here.

Comment 3 seb 2017-01-19 14:07:05 UTC
I can almost reproduce this. The real fix belongs in ceph-disk, but we can work around it in ceph-ansible; I'm working on a fix.

Comment 4 seb 2017-01-19 14:30:55 UTC
The fix is part of this PR: https://github.com/ceph/ceph-ansible/pull/1235

Comment 6 Ken Dreyer (Red Hat) 2017-02-07 17:24:10 UTC
Sebastien, what is the next step with this BZ?

I'm guessing QE should re-test with ceph-ansible-2.1.6-1.el7scon?

Comment 7 seb 2017-02-07 21:19:28 UTC
Correct, Ken.

Comment 9 Tejas 2017-02-13 07:04:51 UTC
Created attachment 1249770 [details]
ansible playbook  log

Comment 11 seb 2017-02-13 10:09:51 UTC
Can you share the ansible play of the purge playbook?

Comment 12 Tejas 2017-02-13 10:12:28 UTC
It's in attachment 1249770 [details].

Comment 13 seb 2017-02-15 14:31:07 UTC
Can you run ansible with -vvvv and paste the debug output into a log file?
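For example, something along these lines (playbook and inventory names are illustrative):

 ~]# ansible-playbook purge-cluster.yml -i hosts -vvvv 2>&1 | tee purge-debug.log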
Thanks!

Comment 14 Andrew Schoen 2017-02-15 15:56:41 UTC
From looking at the log it seems like the 'zap ceph journal partitions' task was skipped. I've opened an upstream PR that I think might address this: https://github.com/ceph/ceph-ansible/pull/1311

Comment 15 Andrew Schoen 2017-02-15 19:18:29 UTC
The PR (#1311) has been merged and backported to stable-2.1; v2.1.9 of ceph-ansible will contain the fix for this issue.

Comment 18 Tejas 2017-02-17 11:10:37 UTC
Verified in version:
ceph-ansible-2.1.9-1.el7scon.noarch

Comment 20 errata-xmlrpc 2017-03-14 15:53:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:0515