Bug 1413985
| Field | Value |
|---|---|
| Summary | [ceph-ansible]: purge-cluster.yml fails with "Partition number 2 out of range!" on RHEL |
| Product | [Red Hat Storage] Red Hat Storage Console |
| Component | ceph-ansible |
| Version | 2 |
| Target Release | 2 |
| Hardware | Unspecified |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Type | Bug |
| Reporter | Tejas <tchandra> |
| Assignee | Andrew Schoen <aschoen> |
| QA Contact | ceph-qe-bugs <ceph-qe-bugs> |
| CC | adeza, aschoen, ceph-eng-bugs, edonnell, gmeno, kdreyer, nthomas, sankarshan, seb |
| Fixed In Version | ceph-ansible-2.1.4-1.el7scon |
| Doc Type | If docs needed, set a value |
| Last Closed | 2017-03-14 15:53:45 UTC |
| Attachments | playbook log (attachment 1241780) |
---

Was this cluster set up with collocated journals? It seems that way from this output:

```
TASK [zap osd disks] ***********************************************************
changed: [magna052] => (item=/dev/sdb)
```

And then:

```
TASK [zap ceph journal partitions] *********************************************
failed: [magna052] (item=/dev/sdb2)
```

If you are using Ansible 2.2, can you set this environment variable to get more readable output? `ANSIBLE_STDOUT_CALLBACK=debug`

---

Hi Alfredo, yes, it had collocated journals. Sure, I will set the variable.

Thanks,
Tejas

---

This is fixed upstream by commit 321cea8ba96cbca19b58aa9bbb76a584c268e2b1. I will port that to the stable-2.1 branch to fix it. It will need a re-spin of ceph-ansible.

---

Upstream pull request with the cherry-picked fixes: https://github.com/ceph/ceph-ansible/pull/1231

---

Not a bug: for a collocated scenario you need to pass `-e zap_block_devs=false` and the purge should go fine. However, this BZ is a good opportunity to refactor some logic, so let's keep it open.

---

(In reply to seb from comment #7)
> Not a bug, for a collocated scenario, you need to pass -e
> zap_block_devs=false and this should go fine. However this BZ is a good
> opportunity to refactor some logic, so let's keep it open.

I still think we should get the PR from comment 6 merged and downstream. We'll also need these additional commits, or the upstream CI tests for purge-cluster won't pass on the stable-2.1 branch: https://github.com/ceph/ceph-ansible/pull/1221/commits

Without those commits there is also a chance that including default variables in the playbook will override the values set in group_vars.

---

Right Andrew, I'm not saying 1221 shouldn't go through; I'm offering a workaround for this issue. :)

---

1221 will be merged.

---

Sounds like we might have this fixed in the RH Ceph Storage 2.2 timeframe. Re-targeting so we track it.

---

The workaround has been merged to the stable-2.1 branch upstream: https://github.com/ceph/ceph-ansible/pull/1235

---

A refactor has been introduced in https://github.com/ceph/ceph-ansible/pull/1235 by Sebastian, but I don't think we need to hold this bug while we wait for that to be merged.

---

This issue was seen with non-encrypted OSDs. Not seeing it anymore with:

ceph-ansible-2.1.4-1.el7scon.noarch
ansible-2.2.1.0-1.el7.noarch

There is a separate bug to track the purge of encrypted OSDs, so moving this to Verified.

Thanks,
Tejas

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:0515
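---

For reference, the two suggestions from the thread (the debug callback and seb's workaround) combine into a single purge invocation. A minimal sketch, assuming the playbook is run from a ceph-ansible checkout with `purge-cluster.yml` at the location this release ships it (the path varies between ceph-ansible versions):

```sh
# Use Ansible 2.2's "debug" stdout callback for readable task output, and
# skip the disk-zapping tasks entirely (seb's workaround for collocated
# journals, where "zap osd disks" destroys the journal partitions before
# "zap ceph journal partitions" tries to delete them).
ANSIBLE_STDOUT_CALLBACK=debug \
    ansible-playbook purge-cluster.yml -e zap_block_devs=false
```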
Created attachment 1241780 [details]
playbook log

Description of problem:

Tried to purge a RHEL cluster running Jewel. The playbook fails at this task:

```
TASK [zap ceph journal partitions] *********************************************
failed: [magna052] (item=/dev/sdb2) => {"changed": true, "cmd": "# if the disk passed is a raw device AND the boot system disk\n if echo \"/dev/sdb2\" | egrep -sq '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}$' && parted -s $(echo \"/dev/sdb2\" | egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}') print | grep -sq boot; then\n echo \"Looks like /dev/sdb2 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n raw_device=$(echo \"/dev/sdb2\" | egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}')\n partition_nb=$(echo \"/dev/sdb2\" | egrep -o '[0-9]{1,2}$')\n sgdisk --delete $partition_nb $raw_device", "delta": "0:00:00.055806", "end": "2017-01-17 12:36:22.468337", "failed": true, "item": "/dev/sdb2", "rc": 4, "start": "2017-01-17 12:36:22.412531", "stderr": "Partition number 2 out of range!\nError 0 deleting partition!\nError encountered; not saving changes.", "stdout": "", "stdout_lines": [], "warnings": []}
```

Version-Release number of selected component (if applicable):

ceph-ansible-2.1.1-1.el7scon.noarch
ansible-2.2.1.0-1.el7.noarch
ceph version 10.2.5-7.el7cp (59e9fee4a935fdd2bc8197e07596dc4313c410a3)

How reproducible: Always

Steps to Reproduce:
1. Create a RHEL cluster running Jewel.
2. Run purge-cluster.yml.

Additional info:

Cluster state before the purge:

```
# ceph -s --cluster ceph12
    cluster 38717a85-f857-4fc6-a863-265043248ce1
     health HEALTH_OK
     monmap e1: 3 mons at {magna028=10.8.128.28:6789/0,magna031=10.8.128.31:6789/0,magna046=10.8.128.46:6789/0}
            election epoch 6, quorum 0,1,2 magna028,magna031,magna046
      fsmap e6: 1/1/1 up {0=magna061=up:active}
     osdmap e40: 9 osds: 9 up, 9 in
            flags sortbitwise,require_jewel_osds
      pgmap v80: 360 pgs, 8 pools, 3656 bytes data, 191 objects
            320 MB used, 8291 GB / 8291 GB avail
                 360 active+clean
```

The same task also failed on a second host (the magna052 failure is identical to the one quoted above):

```
failed: [magna046] (item=/dev/sdc2) => {"changed": true, "cmd": "# if the disk passed is a raw device AND the boot system disk\n if echo \"/dev/sdc2\" | egrep -sq '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}$' && parted -s $(echo \"/dev/sdc2\" | egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}') print | grep -sq boot; then\n echo \"Looks like /dev/sdc2 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n raw_device=$(echo \"/dev/sdc2\" | egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}')\n partition_nb=$(echo \"/dev/sdc2\" | egrep -o '[0-9]{1,2}$')\n sgdisk --delete $partition_nb $raw_device", "delta": "0:00:00.073971", "end": "2017-01-17 12:36:22.478779", "failed": true, "item": "/dev/sdc2", "rc": 4, "start": "2017-01-17 12:36:22.404808", "stderr": "Partition number 2 out of range!\nError 0 deleting partition!\nError encountered; not saving changes.", "stdout": "", "stdout_lines": [], "warnings": []}
```

State of the cluster after the purge:

```
[root@magna028 ~]# ceph -s --cluster ceph12
2017-01-17 13:40:08.286310 7f16f8579700  0 -- :/3637730168 >> 10.8.128.46:6789/0 pipe(0x7f16fc05d8d0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f16fc05eb90).fault
    cluster 38717a85-f857-4fc6-a863-265043248ce1
     health HEALTH_ERR
            269 pgs are stuck inactive for more than 300 seconds
            269 pgs stale
            269 pgs stuck stale
            too many PGs per OSD (540 > max 300)
            mds rank 0 has failed
            mds cluster is degraded
            1 mons down, quorum 0,1 magna028,magna031
     monmap e1: 3 mons at {magna028=10.8.128.28:6789/0,magna031=10.8.128.31:6789/0,magna046=10.8.128.46:6789/0}
            election epoch 8, quorum 0,1 magna028,magna031
      fsmap e8: 0/1/1 up, 1 failed
     osdmap e45: 9 osds: 2 up, 2 in; 207 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v175: 360 pgs, 8 pools, 3656 bytes data, 191 objects
            74100 kB used, 1842 GB / 1842 GB avail
                 269 stale+active+clean
                  91 active+clean
```

The partitions do appear to be purged, but the failure is still reported:

```
[root@magna052 ~]# fdisk -l

WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes, 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt

#         Start          End    Size  Type            Name

WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes, 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt

#         Start          End    Size  Type            Name

WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes, 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
```

I will attach the playbook log to this bug.
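The error itself follows from the task ordering quoted in the comments above: `zap osd disks` wipes the whole raw device (`/dev/sdb`) first, so by the time `zap ceph journal partitions` runs `sgdisk --delete 2 /dev/sdb`, partition 2 is already gone from the GPT and sgdisk reports "Partition number 2 out of range!", which is consistent with the empty partition tables in the `fdisk -l` output. Below is a minimal sketch of the kind of existence check that would make the journal-zap step idempotent; it is illustrative only, not the actual upstream fix from commit 321cea8:

```sh
#!/bin/bash
# Guard sgdisk --delete so it becomes a no-op when the partition has
# already been removed (e.g. by a prior whole-device zap).
dev="/dev/sdb2"

# Same raw-device / partition-number extraction as the playbook's shell task.
raw_device=$(echo "$dev" | egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}')
partition_nb=$(echo "$dev" | egrep -o '[0-9]{1,2}$')

# sgdisk --print lists existing partitions with their numbers in column 1;
# only attempt the delete if our partition number is still listed.
if sgdisk --print "$raw_device" 2>/dev/null \
        | awk '$1 ~ /^[0-9]+$/ {print $1}' \
        | grep -qx "$partition_nb"; then
    sgdisk --delete "$partition_nb" "$raw_device"
else
    echo "Partition $partition_nb no longer exists on $raw_device; skipping"
fi
```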