Bug 1498303 - Safety net option to prevent clobbering of existing partitions by OSD creation
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Ansible
Version: 2.4
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.1
Assigned To: leseb
QA Contact: subhash
Depends On:
Blocks: 1584264
Reported: 2017-10-03 18:50 EDT by Justin Bautista
Modified: 2018-09-26 14:18 EDT
CC List: 16 users

See Also:
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.rc3.el7cp Ubuntu: ceph-ansible_3.1.0~rc3-2redhat1
Doc Type: Bug Fix
Doc Text:
.Ceph Ansible no longer overwrites existing OSD partitions

After an OSD node reboot, it is possible that disk devices will get a different device path. For example, prior to restarting the OSD node, `/dev/sda` was an OSD, but after a reboot the same OSD is now `/dev/sdb`. Previously, a disk was treated as a valid OSD disk as long as no "ceph" partition was found on it. With this release, if any partition is found on the disk, the disk will not be used as an OSD.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-26 14:16:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Ceph-Ansible Playbook log (1.09 MB, text/plain)
2018-06-29 05:50 EDT, subhash


External Trackers
Tracker ID Priority Status Summary Last Updated
Github ceph/ceph-ansible/pull/2523 None None None 2018-04-13 10:41 EDT
Red Hat Product Errata RHBA-2018:2819 None None None 2018-09-26 14:18 EDT

Description Justin Bautista 2017-10-03 18:50:13 EDT
Description of problem:

A customer encountered an issue where existing disk partitions were overwritten by ceph-ansible during OSD creation. After a reboot, the disks were enumerated differently than expected, and disks that were not previously used as OSDs were mistakenly zapped and repurposed as OSDs.

Technically, ceph-ansible worked as expected because these devices were listed in the osds.yml file, but I'm filing this request on the customer's behalf in the hope that a solution can be provided to avoid issues like this in the future.

Version-Release number of selected component (if applicable):
ansible 2.2.1.0
Comment 3 leseb 2017-10-04 07:48:28 EDT
I guess the best solution would be to pass devices using /dev/disk/by-path to avoid this kind of issue. /dev/disk/by-path is the best we can do right now; it will remain consistent unless the device is plugged into a different port of the controller.
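
For illustration only, a group_vars/osds.yml snippet along these lines would pin the OSDs to their controller paths (the by-path names below are invented for this example; the real ones come from ls -l /dev/disk/by-path on the node):

devices:
  # stable controller-port identifiers instead of /dev/sdX names
  - /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0
  - /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:2:0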

To be honest, I'd like to close this with a doc fix.
Does that sound reasonable to you?

Thanks!
Comment 4 Michael J. Kidd 2017-10-16 17:09:19 EDT
I'm not sure how using /dev/disk/by-path would help. Documentation doesn't help either if the disks are enumerated differently at boot than they were during the previous boot (e.g., after an unclean hot-swap of a disk).

Would it not be simple to:
* look for a partition structure on the disk
  - if /dev/sda is specified, check for /dev/sda[1-9] or similar
  - if partitions are present, and a new option (force_osd_partition_overwrite or whatever) is false, error out for that disk? (See the sketch below.)
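
A minimal Ansible-flavored sketch of that idea, purely illustrative (the task wording and the force_osd_partition_overwrite variable are hypothetical; the actual ceph-ansible fix may be implemented differently):

# hypothetical safety-net tasks, not the actual ceph-ansible patch
- name: check for existing partitions on the listed devices
  command: lsblk --noheadings --output TYPE {{ item }}
  register: device_layout
  changed_when: false
  with_items: "{{ devices }}"

- name: refuse to touch a device that already has partitions
  fail:
    msg: "{{ item.item }} already contains partitions, refusing to overwrite"
  when:
    - not force_osd_partition_overwrite | default(false) | bool
    - "'part' in item.stdout"
  with_items: "{{ device_layout.results }}"

With force_osd_partition_overwrite left at false, any device that already shows partitions in lsblk would make the play fail instead of being zapped.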

Thoughts?
Comment 6 leseb 2018-04-13 10:41:22 EDT
Thanks for the heads-up, I just wrote a patch for this.
Sorry for the wait.
Comment 7 Ken Dreyer (Red Hat) 2018-04-16 18:54:46 EDT
Will be in v3.1.0beta7
Comment 8 leseb 2018-05-18 07:46:34 EDT
Present in v3.1.0rc3.
Comment 10 subhash 2018-06-29 05:48:45 EDT
@leseb, @Justin Bautista

Can you help with the steps needed to verify this BZ?

What I have tried: I created a partition on one of the disks (which is supposed to become an OSD) before running the playbook (site.yml).

1. Created a partition on the sdb disk (of magna028) before running the playbook.

2. Ran the playbook site.yml; it ran with 1 TASK FAILURE (the cluster gets deployed fine without the sdb disk/OSD of magna028).
* inventory file
[mons]
magna021

[osds]
magna028 dedicated_devices="['/dev/sdd','/dev/sdd']" devices="['/dev/sdb','/dev/sdc']" osd_scenario="non-collocated"
magna031 dedicated_devices="['/dev/sdd','/dev/sdd']" devices="['/dev/sdb','/dev/sdc']" osd_scenario="non-collocated"
magna030 devices="['/dev/sdb','/dev/sdc','/dev/sdd']" osd_scenario="collocated" dmcrypt="true"

[mgrs]
magna021

FAILED TASK:

*****
TASK [ceph-osd : activate osd(s) when device is a disk] ****************************************************************************************************************
task path: /usr/share/ceph-ansible/roles/ceph-osd/tasks/activate_osds.yml:5
Thursday 28 June 2018  22:59:34 +0000 (0:00:00.281)       0:13:29.920 ********* 
skipping: [magna030] => (item=/dev/sdb)  => {
    "changed": false, 
    "item": "/dev/sdb", 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
skipping: [magna030] => (item=/dev/sdc)  => {
    "changed": false, 
    "item": "/dev/sdc", 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
skipping: [magna030] => (item=/dev/sdd)  => {
    "changed": false, 
    "item": "/dev/sdd", 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<magna028> ESTABLISH SSH CONNECTION FOR USER: None
<magna028> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/home/ubuntu/.ansible/cp/%h-%r-%p magna028 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-yfdrrgjfqnrijotnaqzuejotgvwioegb; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<magna031> ESTABLISH SSH CONNECTION FOR USER: None
<magna031> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=30 -o ControlPath=/home/ubuntu/.ansible/cp/%h-%r-%p magna031 '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-entgqwhgobfekncajjsncdudckkqspya; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<magna028> (1, '\n{"changed": true, "end": "2018-06-28 22:59:43.180496", "stdout": "", "cmd": ["ceph-disk", "activate", "/dev/sdb1"], "failed": true, "delta": "0:00:07.908368", "stderr": "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command \'/sbin/blkid\' returned non-zero exit status 2", "rc": 1, "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "ceph-disk activate \\"/dev/sdb1\\"", "removes": null, "creates": null, "chdir": null, "stdin": null}}, "start": "2018-06-28 22:59:35.272128", "msg": "non-zero return code"}\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips  26 Jan 2017\r\ndebug1: Reading configuration data /home/ubuntu/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 8: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 4787\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 1\r\n')
failed: [magna028] (item=/dev/sdb) => {
    "changed": false, 
    "cmd": [
        "ceph-disk", 
        "activate", 
        "/dev/sdb1"
    ], 
    "delta": "0:00:07.908368", 
    "end": "2018-06-28 22:59:43.180496", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "ceph-disk activate \"/dev/sdb1\"", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "item": "/dev/sdb", 
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2018-06-28 22:59:35.272128", 
    "stderr": "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command '/sbin/blkid' returned non-zero exit status 2", 
    "stderr_lines": [
        "ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command '/sbin/blkid' returned non-zero exit status 2"
    ], 
    "stdout": "", 
    "stdout_lines": []
}


Verified with version: 
ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch 
ansible-2.4.5.0-1.el7ae.noarch
Comment 11 subhash 2018-06-29 05:50 EDT
Created attachment 1455486 [details]
Ceph-Ansible Playbook log
Comment 12 leseb 2018-06-29 06:00:24 EDT
Why would you create a partition manually?

The way to verify this is to run the first deployment, then run the playbook again. The playbook should not prepare the devices on the second run.

This BZ is hard to test because the bug appeared on devices whose names changed, so we just added a safety net to prevent that. The best way to verify is to run the playbook twice normally; you should not see any errors.

The error you see is expected: the playbook tries to activate the OSD, but there is nothing to activate since you created the partition manually and did not prepare the OSD.
Comment 13 subhash 2018-06-29 07:31:17 EDT
Verified with version: 
ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch 
ansible-2.4.5.0-1.el7ae.noarch

Ran ansible-playbook site.yml twice and it passed without affecting the cluster state. Moving to verified.
Comment 15 John Brier 2018-08-31 12:45:15 EDT
Thanks Sebastien!
Comment 17 errata-xmlrpc 2018-09-26 14:16:44 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819
