Bug 1351484
| Summary: | ceph-disk should timeout when a lock cannot be acquired | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Martin Kudlej <mkudlej> |
| Component: | Ceph-Disk | Assignee: | Loic Dachary <ldachary> |
| Status: | CLOSED ERRATA | QA Contact: | Ramakrishnan Periyasamy <rperiyas> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 2.0 | CC: | flucifre, hnallurv, icolle, kdreyer, mkudlej, uboppana |
| Target Milestone: | rc | | |
| Target Release: | 2.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: ceph-10.2.3-2.el7cp Ubuntu: ceph_10.2.3-3redhat1xenial | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-22 19:28:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Martin Kudlej
2016-06-30 07:39:57 UTC
This is because another process is holding the activate lock. Did you try to verify whether a ceph-disk process was hung?

Some ceph-disk commands were stuck, so locks were being held for the devices. Even so, when a lock already exists, ceph-disk should print an error message and exit after a reasonable timeout. That is certainly the correct behaviour for "ceph-disk list"; whether it is also correct for the other subcommands is up to you. A command should not hang, and it should not exit without a proper error message.

@mkudlej you're right. I created http://tracker.ceph.com/issues/16580 to fix this.

Would you please explain how to induce the backend disk failure? Does it have something to do with the iSCSI environment?

Because the Console works only with empty disks and I cannot use partitions, I created the structure described in comment #1:

```
$ lsblk -a
NAME          MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda             8:0    0   5G  0 disk
└─sda1          8:1    0   5G  0 part /var/lib/ceph/osd/cluster1-0
sdb             8:16   0   5G  0 disk
vda           253:0    0  20G  0 disk
├─vda1        253:1    0   2G  0 part
└─vda2        253:2    0  18G  0 part /
vdb           253:16   0  10G  0 disk
└─vdb1        253:17   0   5G  0 part
vdc           253:32   0  10G  0 disk
loop0           7:0    0   5G  0 loop
└─bad_disk    252:0    0   5G  0 dm
loop1           7:1    0   5G  0 loop
└─bad_disk2   252:1    0   5G  0 dm
```

Where: loop0 is a device created by losetup from a file, and bad_disk is a Device Mapper device that includes an "error" mapping target roughly in the middle (see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/device_mapper.html, section A.1.5). I then created a SCSI block device from bad_disk; loop1 -> bad_disk2 -> sdb is similar. I created these disks with bad sectors to simulate failing media, hoping to see proper error messages in Ceph and the Console (a minimal reproduction sketch follows this comment thread). If you would like to look at it, please contact me; I will rebuild that cluster in about a week.

timeout PR undergoing review upstream: https://github.com/ceph/ceph/pull/10262

Discussed this with Product Management (Neil) today. Since this is a negative test case where failures are intentionally induced into the loopback devices, we are going to re-target this to RHCS 2.1. (This will provide more time for a ceph-disk solution to stabilize upstream.)

Moving the bug to verified state.
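For reference, a bad-sector device of the kind described above can be put together with losetup and a Device Mapper "error" target. This is a minimal sketch under my own assumptions (the backing file path, sizes, and device name are illustrative, not the reporter's actual values); it requires root:

```bash
# Back a 5 GiB loop device with a sparse file (path and size illustrative).
dd if=/dev/zero of=/var/tmp/bad_disk.img bs=1 count=0 seek=5G
LOOP=$(losetup --find --show /var/tmp/bad_disk.img)

# Total device size in 512-byte sectors.
SECTORS=$(blockdev --getsz "$LOOP")
HALF=$((SECTORS / 2))

# Map the device linearly, except for a 1024-sector hole in the middle
# backed by the dm "error" target, which fails every I/O it receives.
dmsetup create bad_disk <<EOF
0 $HALF linear $LOOP 0
$HALF 1024 error
$((HALF + 1024)) $((SECTORS - HALF - 1024)) linear $LOOP $((HALF + 1024))
EOF
```

Reads or writes that cross the middle 1024 sectors of /dev/mapper/bad_disk then fail with an I/O error, simulating bad sectors without touching real hardware.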
Verified with the ceph and RHEL kernel versions below.

```
[user@node ~]$ ceph -v
ceph version 10.2.3-4.el7cp (852125d923e43802a51f681ca2ae9e721eec91ca)
[user@node ~]$ uname -a
Linux magna109 3.10.0-511.el7.x86_64 #1 SMP Wed Sep 28 12:25:44 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
```

RHEL 7.3

```
[user@node ~]$ lsblk -t -a
NAME   ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
sda            0    512      0     512     512    1 cfq       128 4096    0B
└─sda1         0    512      0     512     512    1 cfq       128 4096    0B
sdb            0    512      0     512     512    1 cfq       128 4096    0B
├─sdb1         0    512      0     512     512    1 cfq       128 4096    0B
└─sdb2         0    512      0     512     512    1 cfq       128 4096    0B
sdc            0    512      0     512     512    1 cfq       128 4096    0B
├─sdc1         0    512      0     512     512    1 cfq       128 4096    0B
└─sdc2         0    512      0     512     512    1 cfq       128 4096    0B
sdd            0    512      0     512     512    1 cfq       128 4096    0B
├─sdd1         0    512      0     512     512    1 cfq       128 4096    0B
└─sdd2         0    512      0     512     512    1 cfq       128 4096    0B

[user@node ~]$ lsblk -a
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 931.5G  0 disk
└─sda1   8:1    0 931.5G  0 part /
sdb      8:16   0 931.5G  0 disk
├─sdb1   8:17   0 921.5G  0 part /var/lib/ceph/osd/ceph-2
└─sdb2   8:18   0    10G  0 part
sdc      8:32   0 931.5G  0 disk
├─sdc1   8:33   0 921.5G  0 part /var/lib/ceph/osd/ceph-5
└─sdc2   8:34   0    10G  0 part
sdd      8:48   0 931.5G  0 disk
├─sdd1   8:49   0 921.5G  0 part /var/lib/ceph/osd/ceph-7
└─sdd2   8:50   0    10G  0 part

[user@node ~]$ sudo ceph-disk list
/dev/sda :
 /dev/sda1 other, ext4, mounted on /
/dev/sdb :
 /dev/sdb2 ceph journal, for /dev/sdb1
 /dev/sdb1 ceph data, active, cluster ceph, osd.2, journal /dev/sdb2
/dev/sdc :
 /dev/sdc2 ceph journal, for /dev/sdc1
 /dev/sdc1 ceph data, active, cluster ceph, osd.5, journal /dev/sdc2
/dev/sdd :
 /dev/sdd2 ceph journal, for /dev/sdd1
 /dev/sdd1 ceph data, active, cluster ceph, osd.7, journal /dev/sdd2
```

The corresponding patch that resolves this problem is at https://github.com/ceph/ceph/commit/430ab1b83e67dfb697b034e669b06b7a600bcc6b

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2815.html
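A closing note on the behaviour the fix targets: instead of blocking indefinitely when another process holds the activation lock, ceph-disk is expected to give up after a bounded wait and report an error. For illustration only, the same semantics can be sketched from the shell with flock(1); the lock file path, timeout, and messages below are made up for the example and are not taken from ceph-disk:

```bash
#!/bin/bash
# Sketch of lock acquisition with a timeout (illustrative path and timeout).
LOCKFILE=/var/tmp/ceph-disk-activate.lock

exec 9>"$LOCKFILE"                      # open the lock file on fd 9
if flock --exclusive --timeout 10 9; then
    echo "lock acquired, running critical section"
    # ... work that must not run concurrently goes here ...
else
    # flock exits non-zero once the 10-second wait expires
    echo "error: could not acquire $LOCKFILE within 10s" >&2
    exit 1
fi
```

Running two copies of this concurrently makes the second fail after 10 seconds with a clear error instead of hanging, which is exactly what the reporter asked of "ceph-disk list".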