Description of problem:

I have these disks:

$ lsblk -a -t
NAME         ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA WSAME
sda                  0    512 126976     512     512    1 deadline     128 4096   32M
└─sda1               0    512 126976     512     512    1 deadline     128 4096   32M
sdb                  0    512 126976     512     512    1 deadline     128 4096   32M
vda                  0    512      0     512     512    1              128 4096    0B
├─vda1               0    512      0     512     512    1              128 4096    0B
└─vda2               0    512      0     512     512    1              128 4096    0B
vdb                  0    512      0     512     512    0              128 4096    0B
└─vdb1               0    512      0     512     512    0              128 4096    0B
vdc                  0    512      0     512     512    1              128 4096    0B
loop0                0    512      0     512     512    1              128 4096    0B
└─bad_disk           0    512      0     512     512    1              128 4096    0B
loop1                0    512      0     512     512    1              128 4096    0B
└─bad_disk2          0    512      0     512     512    1              128  128    0B

$ lsblk -a -t -S
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED    RQ-SIZE   RA WSAME
NAME HCTL      TYPE VENDOR   MODEL            REV TRAN
sda          0    512 126976     512     512    1 deadline     128 4096   32M
sda  9:0:0:0   disk LIO-ORG  block1           4.0 iscsi
sdb          0    512 126976     512     512    1 deadline     128 4096   32M
sdb  10:0:0:0  disk LIO-ORG  block2           4.0 iscsi

$ lsblk -a
NAME         MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda            8:0    0   5G  0 disk
└─sda1         8:1    0   5G  0 part /var/lib/ceph/osd/cluster1-0
sdb            8:16   0   5G  0 disk
vda          253:0    0  20G  0 disk
├─vda1       253:1    0   2G  0 part
└─vda2       253:2    0  18G  0 part /
vdb          253:16   0  10G  0 disk
└─vdb1       253:17   0   5G  0 part
vdc          253:32   0  10G  0 disk
loop0          7:0    0   5G  0 loop
└─bad_disk   252:0    0   5G  0 dm
loop1          7:1    0   5G  0 loop
└─bad_disk2  252:1    0   5G  0 dm

I tried to list information about the disks with ceph-disk, but it hangs, so I killed it with Ctrl-C:

$ ceph-disk list
^CTraceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 9, in <module>
    load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4994, in run
    main(sys.argv[1:])
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4945, in main
    args.func(args)
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4157, in main_list
    activate_lock.acquire()  # noqa
  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 253, in acquire
    fcntl.lockf(self.fd, fcntl.LOCK_EX)
KeyboardInterrupt

I don't see any errors in the logs.
Version-Release number of selected component (if applicable):

ceph-base-10.2.2-5.el7cp.x86_64
ceph-common-10.2.2-5.el7cp.x86_64
ceph-osd-10.2.2-5.el7cp.x86_64
ceph-selinux-10.2.2-5.el7cp.x86_64
libcephfs1-10.2.2-5.el7cp.x86_64
python-cephfs-10.2.2-5.el7cp.x86_64

[root@dhcp-126-35 log]# rpm -qa | egrep "ceph|python" | sort
abrt-addon-python-2.1.11-35.el7.x86_64
abrt-python-2.1.11-35.el7.x86_64
audit-libs-python-2.4.1-5.el7.x86_64
ceph-base-10.2.2-5.el7cp.x86_64
ceph-common-10.2.2-5.el7cp.x86_64
ceph-osd-10.2.2-5.el7cp.x86_64
ceph-selinux-10.2.2-5.el7cp.x86_64
dbus-python-1.1.1-9.el7.x86_64
langtable-python-0.0.31-3.el7.noarch
libcephfs1-10.2.2-5.el7cp.x86_64
libreport-python-2.1.11-31.el7.x86_64
libselinux-python-2.2.2-6.el7.x86_64
libsemanage-python-2.1.10-18.el7.x86_64
libstoragemgmt-python-1.2.3-4.el7.noarch
libxml2-python-2.9.1-6.el7_2.3.x86_64
newt-python-0.52.15-4.el7.x86_64
policycoreutils-python-2.2.5-20.el7.x86_64
python-2.7.5-34.el7.x86_64
python-augeas-0.5.0-2.el7.noarch
python-babel-0.9.6-8.el7.noarch
python-backports-1.0-8.el7.x86_64
python-backports-ssl_match_hostname-3.4.0.2-4.el7.noarch
python-cephfs-10.2.2-5.el7cp.x86_64
python-chardet-2.2.1-1.el7_1.noarch
python-configobj-4.7.2-7.el7.noarch
python-configshell-1.1.fb18-1.el7.noarch
python-cpopen-1.4-0.1.el7.x86_64
python-crypto-2.6.1-1.1.el7.x86_64
python-dateutil-1.5-7.el7.noarch
python-decorator-3.4.0-3.el7.noarch
python-dmidecode-3.10.13-11.el7.x86_64
python-ethtool-0.8-5.el7.x86_64
python-gudev-147.2-7.el7.x86_64
python-hwdata-1.7.3-4.el7.noarch
python-iniparse-0.4-9.el7.noarch
python-IPy-0.75-6.el7.noarch
python-jinja2-2.7.2-2.1.el7.noarch
python-kitchen-1.1.1-5.el7.noarch
python-kmod-0.9-4.el7.x86_64
python-libs-2.7.5-34.el7.x86_64
python-lxml-3.2.1-5.el7ost.x86_64
python-magic-5.11-31.el7.noarch
python-markupsafe-0.11-10.el7.x86_64
python-msgpack-0.4.5-1.1.el7.x86_64
python-perf-3.10.0-327.22.2.el7.x86_64
python-pycurl-7.19.0-17.el7.x86_64
python-pyudev-0.15-7.el7_2.1.noarch
python-rados-10.2.2-5.el7cp.x86_64
python-rbd-10.2.2-5.el7cp.x86_64
python-requests-2.7.0-1.el7ost.noarch
python-rhsm-1.15.4-5.el7.x86_64
python-rtslib-2.1.fb57-3.el7.noarch
python-setuptools-0.9.8-4.el7.noarch
python-six-1.9.0-2.el7.noarch
python-slip-0.4.0-2.el7.noarch
python-slip-dbus-0.4.0-2.el7.noarch
python-urlgrabber-3.10-7.el7.noarch
python-urllib3-1.10.4-1.20150503gita91975b.el7ost.noarch
python-urwid-1.1.1-3.el7.x86_64
python-zmq-14.3.1-1.1.el7.x86_64
redhat-support-lib-python-0.9.7-3.el7.noarch
rpm-python-4.11.3-17.el7.x86_64
systemd-python-219-19.el7_2.11.x86_64

How reproducible:
I think 100%.

Steps to Reproduce:
1. Create some iSCSI disks.
2. Create OSDs and journals on those disks.
3. Run the "ceph-disk list" command (a hedged sketch of these steps follows below).

Actual results:
ceph-disk hangs.

Expected results:
"ceph-disk list" times out and prints an error message in case of error.
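A minimal sketch of these steps, assuming the iSCSI-backed disks are already logged in as /dev/sda and /dev/sdb and the cluster is named cluster1 as in the lsblk output above (device names, cluster name, and the idea that an earlier ceph-disk run is still holding the activate lock are assumptions, not part of the original report):

# OSD data + journal on the iscsi-backed disks; runs against the bad sectors
$ sudo ceph-disk prepare --cluster cluster1 /dev/sda
$ sudo ceph-disk prepare --cluster cluster1 /dev/sdb

# Blocks indefinitely while an earlier ceph-disk invocation (e.g. an activate
# stuck on one of the bad disks) still holds the activate lock
$ sudo ceph-disk list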
This is because another process is holding the activate lock. Did you try to verify whether a ceph-disk process is hung?
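A sketch of how that can be checked, assuming the activate lock file of this ceph-disk version lives at /var/lib/ceph/tmp/ceph-disk.activate.lock (the path and exact commands are assumptions, not part of the original report):

# Any earlier ceph-disk invocations still running?
$ ps aux | grep [c]eph-disk

# Which process has the lock file open (the holder keeps it open while locked)?
$ sudo lsof /var/lib/ceph/tmp/ceph-disk.activate.lock

# POSIX (fcntl) locks currently held system-wide, with the holders' PIDs
$ cat /proc/locks

If a hung ceph-disk process (for example an activate run stuck on one of the bad disks) shows up there, killing it should let "ceph-disk list" continue.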
Some ceph-disk commands were stuck, so locks were being held for the devices. But even when a lock is held, ceph-disk should print an error message and exit after a reasonable timeout. I think this is the correct behaviour for "ceph-disk list" for sure; it's up to you whether it is also the correct behaviour for the other subcommands. The command should not hang or exit without a proper error message.
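Until such a timeout exists in ceph-disk itself, one way to approximate the expected behaviour from the outside is to bound the call with coreutils timeout (a workaround sketch only; the 120-second limit is an arbitrary choice):

$ sudo timeout 120 ceph-disk list \
    || echo "ceph-disk list timed out or failed - is another ceph-disk process holding the activate lock?"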
@mkudlej you're right. I created http://tracker.ceph.com/issues/16580 to fix this.
Would you please explain how to induce the backend disk failure? Does it have something to do with the iscsi environment?
Because the Console works only with empty disks and I cannot use partitions, I created the structure described in comment #1:

$ lsblk -a
NAME         MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda            8:0    0   5G  0 disk
└─sda1         8:1    0   5G  0 part /var/lib/ceph/osd/cluster1-0
sdb            8:16   0   5G  0 disk
vda          253:0    0  20G  0 disk
├─vda1       253:1    0   2G  0 part
└─vda2       253:2    0  18G  0 part /
vdb          253:16   0  10G  0 disk
└─vdb1       253:17   0   5G  0 part
vdc          253:32   0  10G  0 disk
loop0          7:0    0   5G  0 loop
└─bad_disk   252:0    0   5G  0 dm
loop1          7:1    0   5G  0 loop
└─bad_disk2  252:1    0   5G  0 dm

Where:
loop0 is a device created by losetup from a file.
bad_disk is a device created by device mapper; roughly in the middle of its table it includes an "error" mapping target (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/device_mapper.html, section A.1.5). I then created a SCSI block device from bad_disk (exported over iSCSI), which appears as sda. loop1 -> bad_disk2 -> sdb was created the same way (a sketch of this setup follows below).

I created these disks with bad sectors for testing purposes, to simulate a disk with bad sectors, and I hoped to see proper error messages in Ceph and the Console.

If you would like to look at it, please contact me. I will rebuild that cluster in about a week.
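For reference, a minimal sketch of how such a bad_disk can be built; the backing-file path, loop device, and sector layout are illustrative assumptions, not the exact table used in this cluster:

# 5 GiB backing file attached to a loop device
$ dd if=/dev/zero of=/var/tmp/disk0.img bs=1M count=5120
$ sudo losetup /dev/loop0 /var/tmp/disk0.img

# 5 GiB = 10485760 512-byte sectors; map a small "error" target roughly in the middle
$ sudo dmsetup create bad_disk <<'EOF'
0       5242880 linear /dev/loop0 0
5242880 1024    error
5243904 5241856 linear /dev/loop0 5243904
EOF

# /dev/mapper/bad_disk is then exported over iSCSI with LIO (e.g. via targetcli)
# and shows up on the initiator as sda; loop1 -> bad_disk2 -> sdb is built the same way.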
timeout PR undergoing review upstream: https://github.com/ceph/ceph/pull/10262
Discussed this with Product Management (Neil) today. Since this is a negative test case where failures are intentionally induced into the loopback devices, we are going to re-target this to RHCS 2.1. (This will provide more time for a ceph-disk solution to stabilize upstream.)
Moving the bug to verified state. Verified with the ceph and RHEL kernel versions below.

[user@node ~]$ ceph -v
ceph version 10.2.3-4.el7cp (852125d923e43802a51f681ca2ae9e721eec91ca)

[user@node ~]$ uname -a
Linux magna109 3.10.0-511.el7.x86_64 #1 SMP Wed Sep 28 12:25:44 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

RHEL 7.3

[user@node ~]$ lsblk -t -a
NAME   ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
sda            0    512      0     512     512    1 cfq       128 4096    0B
└─sda1         0    512      0     512     512    1 cfq       128 4096    0B
sdb            0    512      0     512     512    1 cfq       128 4096    0B
├─sdb1         0    512      0     512     512    1 cfq       128 4096    0B
└─sdb2         0    512      0     512     512    1 cfq       128 4096    0B
sdc            0    512      0     512     512    1 cfq       128 4096    0B
├─sdc1         0    512      0     512     512    1 cfq       128 4096    0B
└─sdc2         0    512      0     512     512    1 cfq       128 4096    0B
sdd            0    512      0     512     512    1 cfq       128 4096    0B
├─sdd1         0    512      0     512     512    1 cfq       128 4096    0B
└─sdd2         0    512      0     512     512    1 cfq       128 4096    0B

[user@node ~]$ lsblk -a
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 931.5G  0 disk
└─sda1   8:1    0 931.5G  0 part /
sdb      8:16   0 931.5G  0 disk
├─sdb1   8:17   0 921.5G  0 part /var/lib/ceph/osd/ceph-2
└─sdb2   8:18   0    10G  0 part
sdc      8:32   0 931.5G  0 disk
├─sdc1   8:33   0 921.5G  0 part /var/lib/ceph/osd/ceph-5
└─sdc2   8:34   0    10G  0 part
sdd      8:48   0 931.5G  0 disk
├─sdd1   8:49   0 921.5G  0 part /var/lib/ceph/osd/ceph-7
└─sdd2   8:50   0    10G  0 part

[user@node ~]$ sudo ceph-disk list
/dev/sda :
 /dev/sda1 other, ext4, mounted on /
/dev/sdb :
 /dev/sdb2 ceph journal, for /dev/sdb1
 /dev/sdb1 ceph data, active, cluster ceph, osd.2, journal /dev/sdb2
/dev/sdc :
 /dev/sdc2 ceph journal, for /dev/sdc1
 /dev/sdc1 ceph data, active, cluster ceph, osd.5, journal /dev/sdc2
/dev/sdd :
 /dev/sdd2 ceph journal, for /dev/sdd1
 /dev/sdd1 ceph data, active, cluster ceph, osd.7, journal /dev/sdd2
The corresponding patch that resolves this problem is at https://github.com/ceph/ceph/commit/430ab1b83e67dfb697b034e669b06b7a600bcc6b
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html