Bug 1384846
Summary: [ceph-ansible]: can fail with "Invalid partition data!"

| Field | Value |
|---|---|
| Product | [Red Hat Storage] Red Hat Storage Console |
| Component | ceph-ansible |
| Version | 2 |
| Target Release | 2 |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Reporter | John Harrigan <jharriga> |
| Assignee | Sébastien Han <shan> |
| QA Contact | Tejas <tchandra> |
| CC | adeza, aschoen, bengland, ceph-eng-bugs, ddharwar, gmeno, hnallurv, icolle, jharriga, kdreyer, nthomas, sankarshan, seb, shan, tchandra |
| Fixed In Version | ceph-ansible-2.1.9-1.el7scon |
| Last Closed | 2017-06-19 13:15:40 UTC |
| Attachments | purge-cluster behavior (attachment 1212132) |
Description (John Harrigan, 2016-10-14 09:17:24 UTC)
The approach of running a separate instance of "sgdisk --zap-all" is described here: https://bugs.launchpad.net/ubuntu/+source/gdisk/+bug/1303903 (see comment #4).

I'm confused, the error below doesn't purge anything but tries to create a partition. Where is the actual error from purge-cluster while trying to purge the NVMe device? Thanks!

I have attached the output from purge-cluster.yml. I also added Ben England to the CC since he helped me debug this.

Created attachment 1212132 [details]
purge-cluster behavior
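
For reference, the "double zap" workaround described in the Launchpad bug above boils down to invoking sgdisk a second time when the first pass fails. A minimal shell sketch, assuming the affected device is /dev/nvme0n1 (the device name and the follow-up partprobe are illustrative, not the exact ceph-ansible task):

```sh
# Zap GPT/MBR metadata; if gdisk trips over stale data ("Invalid partition
# data!"), simply run the same zap a second time.
sgdisk --zap-all -- /dev/nvme0n1 || sgdisk --zap-all -- /dev/nvme0n1

# Ask the kernel to re-read the (now empty) partition table.
partprobe /dev/nvme0n1
```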
This looks like ceph-disk should also get this ticket. ceph-ansible already does a "double-down" on calling zap in other places: https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/tasks/check_devices_static.yml#L18-L24

I wanted to add that this cluster was failing deployment due to firewalld settings, getting stuck at the 'activate OSD devices' task. Once the firewalld service was stopped, ceph-ansible deployed successfully. I don't believe my errors in the 'prepare OSD devices' task cited in this BZ were due to firewall misconfiguration, but I wanted to add the information.

Can you try with the latest version of purge-cluster.yml? https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/purge-cluster.yml Thanks!

I downloaded the new version of *only* that file and ran it. Errors resulted. Note this is on an existing RHCS 2.0 cluster.

-------------------------
# ansible-playbook purge-cluster.yml
Are you sure you want to purge the cluster? [no]: yes

PLAY [confirm whether user really meant to purge the cluster] *****************

TASK: [exit playbook, if user did not mean to purge cluster] ******************
skipping: [localhost]

PLAY [gather facts and check if using systemd] ********************************

GATHERING FACTS ***************************************************************
ok: [gprfc092.sbu.lab.eng.bos.redhat.com]
ok: [gprfs044.sbu.lab.eng.bos.redhat.com]
ok: [gprfs042.sbu.lab.eng.bos.redhat.com]
ok: [gprfs041.sbu.lab.eng.bos.redhat.com]

TASK: [are we using systemd] **************************************************
changed: [gprfs042.sbu.lab.eng.bos.redhat.com]
changed: [gprfs041.sbu.lab.eng.bos.redhat.com]
changed: [gprfc092.sbu.lab.eng.bos.redhat.com]
changed: [gprfs044.sbu.lab.eng.bos.redhat.com]

PLAY [purge ceph mds cluster] *************************************************
skipping: no hosts matched

PLAY [purge ceph rgw cluster] *************************************************
skipping: no hosts matched

PLAY [purge ceph rbd-mirror cluster] ******************************************
skipping: no hosts matched

PLAY [purge ceph nfs cluster] *************************************************
skipping: no hosts matched

PLAY [purge ceph osd cluster] *************************************************

TASK: [include_vars ../roles/ceph-common/defaults/main.yml] *******************
failed: [gprfs041.sbu.lab.eng.bos.redhat.com] => {"failed": true, "file": "/usr/share/roles/ceph-common/defaults/main.yml"}
msg: Source file not found.
failed: [gprfs042.sbu.lab.eng.bos.redhat.com] => {"failed": true, "file": "/usr/share/roles/ceph-common/defaults/main.yml"}
msg: Source file not found.
failed: [gprfs044.sbu.lab.eng.bos.redhat.com] => {"failed": true, "file": "/usr/share/roles/ceph-common/defaults/main.yml"}
msg: Source file not found.

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/purge-cluster.retry

gprfc092.sbu.lab.eng.bos.redhat.com : ok=2 changed=1 unreachable=0 failed=0
gprfs041.sbu.lab.eng.bos.redhat.com : ok=2 changed=1 unreachable=0 failed=1
gprfs042.sbu.lab.eng.bos.redhat.com : ok=2 changed=1 unreachable=0 failed=1
gprfs044.sbu.lab.eng.bos.redhat.com : ok=2 changed=1 unreachable=0 failed=1
localhost : ok=0 changed=0 unreachable=0 failed=0
-----------------------------------

I then reverted to the original version of 'purge-cluster.yml' and got a clean run, including the 'zapping' tasks.
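
(As an aside, the include_vars failure above is consistent with running the playbook outside of a full ceph-ansible checkout, since the ../roles/ path resolves relative to the playbook's directory, hence the /usr/share/roles/ lookup. A minimal sketch of running the upstream playbook from a complete clone; the inventory path is an assumption, not taken from this report:)

```sh
# Clone the full repository so ../roles/ceph-common/defaults/main.yml exists
# relative to infrastructure-playbooks/purge-cluster.yml.
git clone https://github.com/ceph/ceph-ansible.git
cd ceph-ansible

# Run the purge playbook against an existing inventory (path is hypothetical).
ansible-playbook -i /etc/ansible/hosts infrastructure-playbooks/purge-cluster.yml
```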
At this point I need to take the cluster to the latest version of RHCS 2.1, so I am re-installing.

We think this is fixed in the latest builds currently undergoing testing (ceph-ansible-2.1.9-1.el7scon as of this writing). Would you please retest with these?

Hello Ken, I am out of the office early next week. I expect I can take a look at this at the end of next week. - John

I will not be able to verify this in the near term. The condition that triggered the failure requires considerable setup time, namely starting with a RHCS 2.0 cluster, purging it, and then installing RHCS 2.1. Based on other project priorities and limited hardware resources I cannot reproduce this now. sorry, John

Hi Seb, this issue was seen only on NVMe disks. Since we don't have the hardware, is there any other way I can verify this, or at least simulate it? Thanks, Tejas

Hum, you can try to symlink any given partition to /dev/nvme0n1. Something like: ln -f /dev/sdb1 /dev/nvme0n1. Then add /dev/nvme0n1 to your device list.

Hi Seb, creating a link to a normal hard disk doesn't work. I tried creating the /dev/nvme0n1 link file, but ceph-ansible fails to read the partition table of this disk. Any other way to verify this BZ? Thanks, Tejas

Can I see the error?

Ok, it seems that this little hack won't work then. It looks like we might have to wait for an NVMe drive...

cc'ing Deepthi Dharwar. She has NVMe drives in her ceph-ansible BAGL configuration; she has used NVMe partitions both as SSD journal devices and as OSDs. cc'ing her to see if she has observed this problem.

We have NVMe drives in the scale lab. From what I'm told, NVMe drives as SSD journals work fine in the scale lab - all the storage servers there have at least 1 NVMe drive, I think. So we can try out ceph-ansible when we get a pike build there and see. Or we can just run ceph-ansible directly on these machines.

Deepthi, can you please check as per comment 23 and help us verify this bug?

Thanks for jumping in Ben :) I have been using NVMe drives both as SSD journal devices and as OSDs for my benchmark runs. I have purged the cluster a few times in the past but not seen this issue. At present I do not want to tear down my cluster as I am in the middle of runs. I will definitely keep you updated if I do so in the near future. Running:

RHEL 7.3
ceph-ansible-1.0.5-34.el7scon.noarch
ceph version 10.2.3-2.el7cp (e3499ea386b9456f7e17417e091f0a1fefddb3f5)

(In reply to John Harrigan from comment #14)
> I will not be able to verify this in the near term.
> The condition that triggered the failure requires considerable setup
> time, namely starting with a RHCS 2.0 cluster, purging it and then
> installing RHCS 2.1. Based on other project priorities and limited
> hardware resources I cannot reproduce this now.
>
> sorry,
> John

Hi John, would it be possible for you to test this BZ with the latest versions of ceph-ansible (>= ceph-ansible-2.1.9-1.el7scon) and rhceph (10.2.7-x)? Please let me know. Regards, Harish

Actually, I just finished installing the RHCS 2.3 pre-release on a cluster which was previously running RHCS 2.2.

I grabbed the bits from here:

baseurl=http://download-node-02.eng.bos.redhat.com/rcm-guest/ceph-drops/auto/ceph-2-rhel-7-compose/latest-RHCEPH-2-RHEL-7/compose

which installed:

# yum list installed | grep ceph
ceph-ansible.noarch          2.2.6-1.el7scon     @RHSCON-2_3
ceph-common.x86_64           1:10.2.7-19.el7cp   @RHCEPH-23-MON
ceph-iscsi-ansible.noarch    1.5-4.el7scon       installed
libcephfs1.x86_64            1:10.2.7-19.el7cp   @RHCEPH-23-MON
python-cephfs.x86_64         1:10.2.7-19.el7cp   @RHCEPH-23-MON

The TASK: [ceph-osd | prepare osd disk(s)] completed with no issues. - John

Thanks, John. If the issue is resolved, could you please move the defect to the VERIFIED state?

This should not be part of the release note.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1496
I grabbed the bits from here: baseurl=http://download-node-02.eng.bos.redhat.com/rcm-guest/ceph-drops/auto/ceph-2-rhel-7-compose/latest-RHCEPH-2-RHEL-7/compose which installed: # yum list installed | grep ceph ceph-ansible.noarch 2.2.6-1.el7scon @RHSCON-2_3 ceph-common.x86_64 1:10.2.7-19.el7cp @RHCEPH-23-MON ceph-iscsi-ansible.noarch 1.5-4.el7scon installed libcephfs1.x86_64 1:10.2.7-19.el7cp @RHCEPH-23-MON python-cephfs.x86_64 1:10.2.7-19.el7cp @RHCEPH-23-MON The TASK: [ceph-osd | prepare osd disk(s)] completed with no issues. - John Thanks John. If issue is resolved, could you please move the defect to VERIFIED state? This should not be part of the release note. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1496 |