Bug 1498303

Summary: Safety net option to prevent clobbering of existing partitions by OSD creation
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Justin Bautista <jbautist>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: subhash <vpoliset>
Severity: high
Priority: high
Version: 2.4
CC: adeza, agunn, anharris, aschoen, ceph-eng-bugs, gmeno, jbautist, jbrier, kdreyer, linuxkidd, mhackett, nthomas, rperiyas, sankarshan, shan, vpoliset
Target Milestone: rc
Target Release: 3.1
Hardware: x86_64
OS: Linux
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.rc3.el7cp; Ubuntu: ceph-ansible_3.1.0~rc3-2redhat1
Doc Type: Bug Fix
Doc Text:
.Ceph Ansible no longer overwrites existing OSD partitions
On an OSD node reboot, it is possible that disk devices will get a different device path. For example, prior to restarting the OSD node, `/dev/sda` was an OSD, but after a reboot the same OSD is now `/dev/sdb`. Previously, a disk was treated as a valid OSD disk as long as no "ceph" partition was found on it. With this release, if any partition is found on the disk, the disk will not be used as an OSD.
Last Closed: 2018-09-26 18:16:44 UTC
Type: Bug
Bug Blocks: 1584264    
Attachments: Ceph-Ansible Playbook log

Description Justin Bautista 2017-10-03 22:50:13 UTC
Description of problem:

A customer encountered an issue where existing disk partitions were overwritten by ceph-ansible during OSD creation. After a reboot, the disks were enumerated differently than expected, and disks that were not previously used as OSDs were mistakenly zapped and repurposed as OSDs.

Technically, ceph-ansible worked as expected because these devices were listed in the osds.yml file, but I'm filing this request on the customer's behalf in the hope that a safeguard can be added to avoid issues like this in the future.

Version-Release number of selected component (if applicable):
ansible 2.2.1.0

Comment 3 Sébastien Han 2017-10-04 11:48:28 UTC
I guess the best solution would be to pass devices using /dev/disk/by-path to avoid this kind of issue. /dev/disk/by-path is the best we can do for now; it will remain consistent unless the device is plugged into a different port of the controller.
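For illustration only (the path names below are made up and will differ per controller and slot), the stable by-path names can be listed and then used in the devices list in place of the sdX names:

ls -l /dev/disk/by-path/
# example output, names are hypothetical:
# pci-0000:00:1f.2-ata-2 -> ../../sdb
# pci-0000:00:1f.2-ata-3 -> ../../sdc
# the devices list would then reference /dev/disk/by-path/pci-0000:00:1f.2-ata-2, and so on.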

To be honest, I'd like to close this with a doc fix.
Does that sound reasonable to you?

Thanks!

Comment 4 Michael J. Kidd 2017-10-16 21:09:19 UTC
I'm not sure how using /dev/disk/by-path would help. Documentation doesn't help either if the disks are enumerated differently on boot-up than they were during the last boot (e.g., after an unclean hot-swap of a disk).

Would it not be simple to:
* look for a partition structure on the disk
  - if /dev/sda is specified, check for /dev/sda[1-9] or similar
  - if partitions are present, and a new option (force_osd_partition_overwrite, or whatever we call it) is false, error out for that disk?

Thoughts?
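
A rough shell sketch of the kind of check I mean (the disk and the force_osd_partition_overwrite option are only placeholders, not existing ceph-ansible settings):

# Hypothetical pre-flight check; device and option names are illustrative.
disk=/dev/sdb
force_osd_partition_overwrite=false

# lsblk prints the disk itself plus one line per partition, so more than
# one line means the disk already carries partitions.
if [ "$(lsblk --noheadings --output NAME "$disk" | wc -l)" -gt 1 ] && \
   [ "$force_osd_partition_overwrite" != "true" ]; then
    echo "ERROR: $disk already has partitions, refusing to use it as an OSD" >&2
    exit 1
fi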

Comment 6 Sébastien Han 2018-04-13 14:41:22 UTC
Thanks for the heads-up, I just wrote a patch for this.
Sorry for the wait.

Comment 7 Ken Dreyer (Red Hat) 2018-04-16 22:54:46 UTC
Will be in v3.1.0beta7

Comment 8 Sébastien Han 2018-05-18 11:46:34 UTC
Present in v3.1.0rc3.

Comment 10 subhash 2018-06-29 09:48:45 UTC
@leseb, @Justin Bautista

Can you help with the steps needed to verify this BZ?

What I have tried: I created a partition on one of the disks (which is supposed to become an OSD) before running the playbook (site.yml).

1. Created a partition on the sdb disk (of magna028) before running the playbook.

2. Ran the playbook site.yml; it completed with 1 TASK FAILURE (the cluster gets deployed fine without the sdb disk/OSD of magna028).
* Inventory file:
[mons]
magna021

[osds]
magna028 dedicated_devices="['/dev/sdd','/dev/sdd']" devices="['/dev/sdb','/dev/sdc']" osd_scenario="non-collocated"
magna031 dedicated_devices="['/dev/sdd','/dev/sdd']" devices="['/dev/sdb','/dev/sdc']" osd_scenario="non-collocated"
magna030 devices="['/dev/sdb','/dev/sdc','/dev/sdd']" osd_scenario="collocated" dmcrypt="true"

[mgrs]
magna021

FAILED TASK:

*****
TASK [ceph-osd : activate osd(s) when device is a disk] ****************************************************************************************************************
task path: /usr/share/ceph-ansible/roles/ceph-osd/tasks/activate_osds.yml:5
Thursday 28 June 2018  22:59:34 +0000 (0:00:00.281)       0:13:29.920 ********* 
skipping: [magna030] => (item=/dev/sdb)  => {
    "changed": false, 
    "item": "/dev/sdb", 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
skipping: [magna030] => (item=/dev/sdc)  => {
    "changed": false, 
    "item": "/dev/sdc", 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
skipping: [magna030] => (item=/dev/sdd)  => {
    "changed": false, 
    "item": "/dev/sdd", 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<magna028> ESTABLISH SSH CONNECTION FOR USER: None
<magna028> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/home/ubuntu/.ansible/cp/%h-%r-%p magna028 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-yfdrrgjfqnrijotnaqzuejotgvwioegb; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<magna031> ESTABLISH SSH CONNECTION FOR USER: None
<magna031> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/home/ubuntu/.ansible/cp/%h-%r-%p magna031 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-entgqwhgobfekncajjsncdudckkqspya; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<magna028> (1, '\n{"changed": true, "end": "2018-06-28 22:59:43.180496", "stdout": "", "cmd": ["ceph-disk", "activate", "/dev/sdb1"], "failed": true, "delta": "0:00:07.908368", "stderr": "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command \'/sbin/blkid\' returned non-zero exit status 2", "rc": 1, "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "ceph-disk activate \\"/dev/sdb1\\"", "removes": null, "creates": null, "chdir": null, "stdin": null}}, "start": "2018-06-28 22:59:35.272128", "msg": "non-zero return code"}\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017\r\ndebug1: Reading configuration data /home/ubuntu/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 8: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 4787\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 1\r\n')
failed: [magna028] (item=/dev/sdb) => {
    "changed": false, 
    "cmd": [
        "ceph-disk", 
        "activate", 
        "/dev/sdb1"
    ], 
    "delta": "0:00:07.908368", 
    "end": "2018-06-28 22:59:43.180496", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "ceph-disk activate \"/dev/sdb1\"", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "item": "/dev/sdb", 
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2018-06-28 22:59:35.272128", 
    "stderr": "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command '/sbin/blkid' returned non-zero exit status 2", 
    "stderr_lines": [
        "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command '/sbin/blkid' returned non-zero exit status 2"
    ], 
    "stdout": "", 
    "stdout_lines": []
}


Verified with version: 
ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch 
ansible-2.4.5.0-1.el7ae.noarch

Comment 11 subhash 2018-06-29 09:50:11 UTC
Created attachment 1455486 [details]
Ceph-Ansible Playbook log

Comment 12 Sébastien Han 2018-06-29 10:00:24 UTC
Why would you create a partition manually?

To verify this, run the first deployment, then run the playbook again. The playbook should not prepare the devices on the second run.

This BZ is hard to test because the bug appeared on devices whose names changed, so we just added a safety net to prevent this. The best way to verify it is to run the playbook twice normally; you should not see any errors.


The error you see is expected: the playbook tries to activate the OSD, but there is nothing to activate because you created a partition manually and never prepared the OSD.
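
In other words, the verification amounts to something like the following (the ceph osd tree check is just one illustrative way to confirm the cluster is untouched):

# First run: deploys the cluster and prepares the OSD devices.
ansible-playbook site.yml
# Second run: must not zap or re-prepare any device that already hosts an OSD.
ansible-playbook site.yml
# The OSD layout should be identical before and after the second run.
ceph osd tree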

Comment 13 subhash 2018-06-29 11:31:17 UTC
Verified with version: 
ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch 
ansible-2.4.5.0-1.el7ae.noarch

Ran ansible-playbook site.yml twice and it passed without affecting the cluster state. Moving to verified.

Comment 15 John Brier 2018-08-31 16:45:15 UTC
Thanks Sebastien!

Comment 17 errata-xmlrpc 2018-09-26 18:16:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819