Bug 1498303
Summary: Safety net option to prevent clobbering of existing partitions by OSD creation
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Justin Bautista <jbautist>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: subhash <vpoliset>
Severity: high
Priority: high
Docs Contact:
Version: 2.4
CC: adeza, agunn, anharris, aschoen, ceph-eng-bugs, gmeno, jbautist, jbrier, kdreyer, linuxkidd, mhackett, nthomas, rperiyas, sankarshan, shan, vpoliset
Target Milestone: rc
Target Release: 3.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.rc3.el7cp; Ubuntu: ceph-ansible_3.1.0~rc3-2redhat1
Doc Type: Bug Fix
Doc Text:
.Ceph Ansible no longer overwrites existing OSD partitions
On an OSD node reboot, it is possible that disk devices will get a different device path. For example, prior to restarting the OSD node, `/dev/sda` was an OSD, but after a reboot the same OSD is now `/dev/sdb`. Previously, a disk was treated as a valid OSD disk as long as no "ceph" partition was found on it. With this release, if any partition is found on a disk, the disk will not be used as an OSD.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-26 18:16:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1584264
Attachments: Ceph-Ansible Playbook log (attachment 1455486)
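The device renaming described in the Doc Text above can be observed through the persistent symlinks under /dev/disk/, which stay stable across reboots even when the /dev/sdX names get reshuffled. The following is a small illustrative sketch only; the by-path name shown is an example and depends on the hardware:

    # Persistent, controller-port-based names; each symlink points at the
    # /dev/sdX node the disk currently has.
    ls -l /dev/disk/by-path/

    # Resolve a persistent name to the kernel device it maps to right now.
    readlink -f /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0   # example path, hardware dependent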
Description (Justin Bautista, 2017-10-03 22:50:13 UTC)
I guess the best solution would be to pass devices with /dev/disk/by-path to avoid this kind of issue. /dev/disk/by-path is the best we can do now. It'll remain consistent unless the device is plugged into a different port of the controller. To be honest, I'd like to close this with a doc fix. Does that sound reasonable to you? Thanks!

I'm not sure how using /dev/disk/by-path would help. Documentation doesn't help either if the disks are enumerated differently on boot-up than they were during the last boot (e.g. an unclean hotswap of a disk). Would it not be simpler to:

* look for a partition structure on the disk; if /dev/sda is specified, check for /dev/sda[1-9] or similar
* if partitions are present, and a new option (force_osd_partition_overwrite or whatever) is false, error out for that disk?

Thoughts?

Thanks for the heads-up, I just wrote a patch for this. Sorry for the wait. It will be in v3.1.0beta7.

Present in v3.1.0rc3.
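As a rough illustration only, the safety net boils down to refusing to prepare a device that already carries partitions. A minimal shell sketch of that logic, using the hypothetical option name force_osd_partition_overwrite floated in the comment above (the shipped fix simply skips such disks, per the Doc Text field; this is not the actual ceph-ansible implementation):

    # Sketch of the check, not the real ceph-ansible task.
    device=/dev/sdb
    force_osd_partition_overwrite=false   # hypothetical override from the comment above

    # Count existing partitions on the device; anything above zero means the
    # disk is already in use and must not be clobbered.
    partitions=$(lsblk -n -o TYPE "$device" | grep -c part)

    if [ "$partitions" -gt 0 ] && [ "$force_osd_partition_overwrite" != "true" ]; then
        echo "ERROR: $device already has $partitions partition(s); refusing to prepare it as an OSD" >&2
        exit 1
    fi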
@leseb, @Justin Bautista, can you help with the steps needed to verify this bz? What I have tried: I created a partition on one of the disks (which is supposed to be an OSD) before running the playbook (site.yml).

1. Created a partition on the sdb disk (of magna028) before running the playbook.
2. Ran the playbook site.yml; the playbook ran with 1 TASK FAILURE (the cluster gets deployed fine without the sdb disk/OSD of magna028).

Inventory file:

    [mons]
    magna021

    [osds]
    magna028 dedicated_devices="['/dev/sdd','/dev/sdd']" devices="['/dev/sdb','/dev/sdc']" osd_scenario="non-collocated"
    magna031 dedicated_devices="['/dev/sdd','/dev/sdd']" devices="['/dev/sdb','/dev/sdc']" osd_scenario="non-collocated"
    magna030 devices="['/dev/sdb','/dev/sdc','/dev/sdd']" osd_scenario="collocated" dmcrypt="true"

    [mgrs]
    magna021

FAILED TASK:

    TASK [ceph-osd : activate osd(s) when device is a disk] ****************************************************************
    task path: /usr/share/ceph-ansible/roles/ceph-osd/tasks/activate_osds.yml:5
    Thursday 28 June 2018 22:59:34 +0000 (0:00:00.281) 0:13:29.920 *********
    skipping: [magna030] => (item=/dev/sdb) => {
        "changed": false,
        "item": "/dev/sdb",
        "skip_reason": "Conditional result was False",
        "skipped": true
    }
    skipping: [magna030] => (item=/dev/sdc) => {
        "changed": false,
        "item": "/dev/sdc",
        "skip_reason": "Conditional result was False",
        "skipped": true
    }
    skipping: [magna030] => (item=/dev/sdd) => {
        "changed": false,
        "item": "/dev/sdd",
        "skip_reason": "Conditional result was False",
        "skipped": true
    }
    Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
    <magna028> ESTABLISH SSH CONNECTION FOR USER: None
    <magna028> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/home/ubuntu/.ansible/cp/%h-%r-%p magna028 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-yfdrrgjfqnrijotnaqzuejotgvwioegb; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
    Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
    <magna031> ESTABLISH SSH CONNECTION FOR USER: None
    <magna031> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/home/ubuntu/.ansible/cp/%h-%r-%p magna031 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-entgqwhgobfekncajjsncdudckkqspya; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
    <magna028> (1, '\n{"changed": true, "end": "2018-06-28 22:59:43.180496", "stdout": "", "cmd": ["ceph-disk", "activate", "/dev/sdb1"], "failed": true, "delta": "0:00:07.908368", "stderr": "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command \'/sbin/blkid\' returned non-zero exit status 2", "rc": 1, "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "ceph-disk activate \\"/dev/sdb1\\"", "removes": null, "creates": null, "chdir": null, "stdin": null}}, "start": "2018-06-28 22:59:35.272128", "msg": "non-zero return code"}\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017\r\ndebug1: Reading configuration data /home/ubuntu/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 8: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 4787\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 1\r\n')
    failed: [magna028] (item=/dev/sdb) => {
        "changed": false,
        "cmd": [
            "ceph-disk",
            "activate",
            "/dev/sdb1"
        ],
        "delta": "0:00:07.908368",
        "end": "2018-06-28 22:59:43.180496",
        "failed": true,
        "invocation": {
            "module_args": {
                "_raw_params": "ceph-disk activate \"/dev/sdb1\"",
                "_uses_shell": false,
                "chdir": null,
                "creates": null,
                "executable": null,
                "removes": null,
                "stdin": null,
                "warn": true
            }
        },
        "item": "/dev/sdb",
        "msg": "non-zero return code",
        "rc": 1,
        "start": "2018-06-28 22:59:35.272128",
        "stderr": "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command '/sbin/blkid' returned non-zero exit status 2",
        "stderr_lines": [
            "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command '/sbin/blkid' returned non-zero exit status 2"
        ],
        "stdout": "",
        "stdout_lines": []
    }

Verified with version: ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch, ansible-2.4.5.0-1.el7ae.noarch

Created attachment 1455486 [details]
Ceph-Ansible Playbook log
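For context on the failure in the log above: blkid exits with status 2 when it cannot identify a filesystem on the given device, which is exactly what happens on a partition that was created by hand and never prepared by ceph-disk. A quick way to confirm this on the node (a sketch; the device name is taken from the log above):

    # blkid prints nothing and exits with status 2 for an unidentifiable partition.
    blkid /dev/sdb1; echo "blkid exit status: $?"

    # lsblk -f shows an empty FSTYPE column for the unprepared partition.
    lsblk -f /dev/sdb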
Why would you create a partition manually? The way to verify this is to run the first deployment, then run the playbook again; the playbook should not prepare the devices on the second run. This BZ is hard to test because the bug appeared when device names changed, so we just added a safety net to prevent this. The best way to verify is to run the playbook twice normally and confirm you see no errors. The error you see is expected: the playbook tries to activate the OSD, but there is nothing to activate since you created the partition manually and did not prepare the OSD.

Verified with version: ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch, ansible-2.4.5.0-1.el7ae.noarch

Ran ansible-playbook site.yml twice and it passed without affecting the cluster state. Moving to verified. Thanks Sebastien!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819
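For reference, the verification flow described above amounts to the following. This is a sketch that assumes the playbook is run from the /usr/share/ceph-ansible location seen in the logs and that the inventory is already configured:

    cd /usr/share/ceph-ansible

    # First run: deploys the cluster and prepares the OSD devices.
    ansible-playbook site.yml

    # Second run: must complete without preparing or overwriting any existing
    # OSD partitions; the cluster state should be unchanged afterwards.
    ansible-playbook site.yml

    # Sanity check of the cluster state on a monitor node.
    ceph -s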