Bug 1389484
| Summary: | Ceph OSD disks failed to auto-mount after reboot | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Hemanth Kumar <hyelloji> |
| Component: | RBD | Assignee: | Mike Christie <mchristi> |
| Status: | CLOSED WONTFIX | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | Bara Ancincova <bancinco> |
| Priority: | unspecified | | |
| Version: | 2.1 | CC: | ceph-eng-bugs, hnallurv, jdillama, kdreyer, mchristi, pcuzner |
| Target Milestone: | rc | | |
| Target Release: | 2.2 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |

Doc Text:
.Ceph OSD daemons fail to initialize and DM-Multipath disks are not automatically mounted on iSCSI nodes
The `ceph-iscsi-gw.yml` Ansible playbook enables device mapper multipathing (DM-Multipath) and disables the `kpartx` utility. This behavior causes the multipath layer to claim a device before Ceph can, and it disables automatic partition setup for other system disks that use DM-Multipath. Consequently, after a reboot, Ceph OSD daemons fail to initialize, and system disks that use DM-Multipath with partitions are not automatically mounted. Because of this, the system can fail to boot.
To work around this problem:
. After executing the `ceph-iscsi-gw.yml` playbook, log in to each node that runs an iSCSI target and display the current multipath configuration:
+
----
$ multipath -ll
----
. If you see any devices that you did not intend to be used by DM-Multipath, for example OSD disks, remove them from the DM-Multipath configuration.
.. Remove the devices' World Wide Identifiers (WWIDs) from the WWIDs file:
+
----
$ multipath -w <device_name>
----
.. Flush the devices' multipath device maps:
+
----
$ multipath -f <device_name>
----
. Edit the `/etc/multipath.conf` file on each node that runs an iSCSI target (a combined example of the resulting configuration is shown after this procedure).
.. Comment out the `skip_kpartx` variable.
.. Set the `user_friendly_names` variable to `yes`:
+
----
defaults {
user_friendly_names yes
find_multipaths no
}
----
.. Blacklist all devices:
+
----
blacklist {
devnode ".*"
}
----
.. DM-Multipath is used with Ceph Block Devices; therefore, you must add an exception for them. Edit `^rbd[0-9]` as needed:
+
----
blacklist_exceptions {
devnode "^rbd[0-9]"
}
----
.. Add the following entry for the Ceph Block Devices:
+
----
devices {
device {
vendor "Ceph"
product "RBD"
skip_kpartx yes
user_friendly_names no
}
}
----
. Reboot the nodes. The OSD and iSCSI gateway services will initialize automatically after the reboot.
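
For reference, after the preceding edits the relevant sections of `/etc/multipath.conf` might look like the following example. This is illustrative only: keep any other site-specific settings that your file already contains, and adjust the `blacklist_exceptions` entry if other devices on the node must remain under DM-Multipath.

----
# Example /etc/multipath.conf after applying the workaround (illustrative)
defaults {
        user_friendly_names yes
        find_multipaths no
}

# Keep DM-Multipath away from all devices by default ...
blacklist {
        devnode ".*"
}

# ... except the Ceph Block Devices used by the iSCSI gateway
blacklist_exceptions {
        devnode "^rbd[0-9]"
}

devices {
        device {
                vendor "Ceph"
                product "RBD"
                skip_kpartx yes
                user_friendly_names no
        }
}
----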

| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-04 21:11:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1383917 | | |
| Attachments: | | | |
Description
Hemanth Kumar
2016-10-27 17:26:30 UTC
multipathd claimed the sdb, sdc, and sdd devices and prevented them from being used directly:

# multipath -ll
Hitachi_HUA722010CLA330_JPW9M0N20D247E dm-1 ATA ,Hitachi HUA72201
size=932G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 1:0:0:0 sdb 8:16 active ready running
Hitachi_HUA722010CLA330_JPW9J0N20BMZHC dm-2 ATA ,Hitachi HUA72201
size=932G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 3:0:0:0 sdd 8:48 active ready running
Hitachi_HUA722010CLA330_JPW9M0N20D268E dm-0 ATA ,Hitachi HUA72201
size=932G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 2:0:0:0 sdc 8:32 active ready running

Either these disks need to be blacklisted from multipath, or the ceph-disk systemd units need to use the multipath device.

To be completely safe with users' preferences, in case they are using dm-multipath for some system or OSD disks, we probably want find_multipaths = yes. Users seem to prefer this, and for RHEL we override the upstream default and set it to yes. The ceph iscsi tools were setting it back to no because you cannot set it at the per-device level like other settings.
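
For illustration only (not part of the original report), a defaults stanza with the RHEL-style setting restored would look roughly like this; with find_multipaths set to yes, multipath does not claim single-path disks such as the OSD drives above:

# example only -- RHEL-style default restored
defaults {
        user_friendly_names yes
        find_multipaths yes
}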
In the ceph iscsi config modules we will want to add code (I made a patch for this) to run /sbin/multipath device_name for the specific rbd images, so it will not matter what the user has set for find_multipaths.
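A rough sketch of that approach (hypothetical shell, not the actual patch):

# After the iscsi config module maps an image, trigger multipath on just the
# rbd block devices it created, regardless of the global find_multipaths value.
for dev in /dev/rbd[0-9]*; do
    [ -b "$dev" ] && /sbin/multipath "$dev"
done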
Assuming we cannot make any code changes, here are the manual instructions we could add to the ceph iscsi ansible doc to work around the bug; they will continue to work after we release a code fix later:
1. After ansible-playbook ceph-iscsi-gw.yml is run, log into each node running an iSCSI target and run
multipath -ll
if there are disks that the user did not intend to be used by dm-multipath, for example disks being used by OSDs, run
multipath -w device_name
multipath -f device_name
example:
multipath -w mpatha
multipath -f mpatha
2. Open /etc/multipath.conf on each node running an iSCSI target and, in the defaults section, remove the global skip_kpartx and change the global user_friendly_names value to yes:
defaults {
user_friendly_names yes
find_multipaths no
}
3. By default, the ansible iscsi modules unblacklisted everything. Unless you are using dm-multipath for specific devices, you can blacklist everything again by adding
devnode ".*"
to the uncommented blacklist {} section at the bottom of the file, so that it looks like this:
blacklist {
devnode ".*"
}
4. We do want dm-multipath for rbd devices, so add an exception for them by adding the following to multipath.conf:
blacklist_exceptions {
devnode "^rbd[0-9]"
}
5. For rbd devices, add the following to multipath.conf:
devices {
device {
vendor "Ceph"
product "RBD"
skip_kpartx yes
user_friendly_names no
}
}
6. Reload the new settings (a verification check follows these steps):
systemctl reload multipathd
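
To verify (an extra check beyond the steps above), list the remaining maps after the reload; only the Ceph/RBD-backed entries should still appear, and any leftover non-rbd map can be removed with the -w/-f commands from step 1:

multipath -ll
multipath -w mpatha    # example: drop a leftover non-rbd map from the wwids file
multipath -f mpatha    # and flush its device map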
Hemanth, if you want I can run those commands on your system for you.
(In reply to Mike Christie from comment #9)
> [full workaround steps quoted; see comment #9 above]

Mike,

The machines are no longer in the state needed to run the commands. If these are the steps we are documenting as a workaround for customers, then I am okay with running them. If not, I will wait for the fix.

As the BZ is moved to 2.2, can comment #9 be documented in the Known Issues of 2.1?

(In reply to Hemanth Kumar from comment #12)
> As the BZ is moved to 2.2, can comment #9 be documented in the Known Issues of 2.1?

Yes. I am working on it. For the most part the instructions are what you will do. I am just trying to test and add more info about how to handle the case where the user is using dm-multipath for root or other disks on the OSD/gateway machine.

Created attachment 1219550 [details]
Failover on N/W Failure
Hi Paul,
Failed the primary GW node's network, and the failover happened within 15 seconds.
Refer to the attachment for the Performance Monitor stats on Windows.
Will update the same after reboot.
(In reply to Hemanth Kumar from comment #14)
> [failover test results quoted; see comment #14 above]

Ignore comment #14; it was updated in the wrong BZ.

Looks good. Thanks Bara.

This issue only affects RHCS 2.1, since RHCS 2.2 will utilize a new approach for incorporating RBD-backed iSCSI.