Bug 684343 - deadlock during multiple leg device failures of an exclusively activated cmirror containing a snapshot
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jonathan Earl Brassow
QA Contact: Corey Marthaler
URL:
Whiteboard:
Depends On:
Blocks: 756082
Reported: 2011-03-11 20:51 UTC by Corey Marthaler
Modified: 2012-04-27 18:51 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-04-27 18:51:27 UTC
Target Upstream Version:

Description Corey Marthaler 2011-03-11 20:51:49 UTC
Description of problem:
Scenario: Kill multiple legs of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_multiple_legs_4legs_1
* sync:               1
* leg devices:        /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdb1
* log devices:        /dev/sdc1
* no MDA devices:     
* failpv(s):          /dev/sdf1 /dev/sdb1
* failnode(s):        taft-02
* additional snap:    /dev/sde1
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-02...
taft-02: lvcreate -m 3 -n syncd_multiple_legs_4legs_1 -L 600M helter_skelter /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-1000 /dev/sdb1:0-1000 /dev/sdc1:0-150

EXCLUSIVELY ACTIVATING CMIRROR on taft-02
Creating a snapshot volume of each of the mirrors

PV=/dev/sdb1
        syncd_multiple_legs_4legs_1_mimage_3: 3.1
PV=/dev/sdf1
        syncd_multiple_legs_4legs_1_mimage_0: 4
PV=/dev/sdb1
        syncd_multiple_legs_4legs_1_mimage_3: 3.1
PV=/dev/sdf1
        syncd_multiple_legs_4legs_1_mimage_0: 4

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 85.83% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on taft-02...
Mounting mirrored gfs2 filesystems on taft-02...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-02 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-02 ----

Disabling device sdf on taft-02
Disabling device sdb on taft-02

Attempting I/O to cause mirror down conversion(s) on taft-02
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.310101 s, 135 MB/s


# before the failure
[root@taft-02 ~]# lvs -a -o +devices
 LV         VG             Attr   LSize   Origin                      Snap%  Devices
 hs_snap1   helter_skelter swi-a- 252.00m syncd_multiple_legs_4legs_1  13.24 /dev/sde1(150)

 LV                                     VG             Attr   LSize   Log                              Copy%  Devices
 syncd_multiple_legs_4legs_1            helter_skelter owi-ao 600.00m syncd_multiple_legs_4legs_1_mlog 100.00 syncd_multiple_legs_4legs_1_mimage_0(0),syncd_multiple_legs_4legs_1_mimage_1(0),syncd_multiple_legs_4legs_1_mimage_2(0),syncd_multiple_legs_4legs_1_mimage_3(0)
 [syncd_multiple_legs_4legs_1_mimage_0] helter_skelter iwi-ao 600.00m                                         /dev/sdf1(0)
 [syncd_multiple_legs_4legs_1_mimage_1] helter_skelter iwi-ao 600.00m                                         /dev/sde1(0)
 [syncd_multiple_legs_4legs_1_mimage_2] helter_skelter iwi-ao 600.00m                                         /dev/sdg1(0)
 [syncd_multiple_legs_4legs_1_mimage_3] helter_skelter iwi-ao 600.00m                                         /dev/sdb1(0)
 [syncd_multiple_legs_4legs_1_mlog]     helter_skelter lwi-ao   4.00m                                         /dev/sdc1(0)

# after the failure
[root@taft-02 ~]# lvs -a -o +devices
[STUCK]



Version-Release number of selected component (if applicable):
2.6.32-94.el6.x86_64

lvm2-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
udev-147-2.31.el6    BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
cmirror-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011

Comment 1 Corey Marthaler 2011-03-11 21:00:06 UTC
Doesn't look like anything died on any of these nodes.

[root@taft-02 ~]# ps -ef | grep dmeventd
root      1738  1722  0 14:52 pts/1    00:00:00 grep dmeventd
root      3085     1  0 Mar10 ?        00:03:34 /sbin/dmeventd
[root@taft-02 ~]# ps -ef | grep cmirrord
root      1740  1722  0 14:52 pts/1    00:00:00 grep cmirrord
root     30959     1  0 13:36 ?        00:00:05 cmirrord
[root@taft-02 ~]# ps -ef | grep udev
root       576     1  0 Mar10 ?        00:00:30 /sbin/udevd -d
root      1586   576  0 14:05 ?        00:00:00 /sbin/udevd -d
root      1587   576  0 14:05 ?        00:00:00 /sbin/udevd -d
root      1742  1722  0 14:52 pts/1    00:00:00 grep udev


The main mirror volume is marked SUSPENDED, but none of its sub-devices are:
Name:              helter_skelter-syncd_multiple_legs_4legs_1
State:             SUSPENDED
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-0Yyo33rKjwXbNuZEDXnFij2HauvM3B601s2HPfMxQUnup8I3JOJVMHPPeFBK8tr8
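
When repeating this check across many devices, the suspended ones can be picked out of `dmsetup info` output mechanically. The following is a minimal Python sketch, assuming the `Key: value` layout shown above; the function names `parse_dmsetup_info` and `suspended_devices` are illustrative only and are not part of lvm2 or device-mapper.

```python
def parse_dmsetup_info(text):
    """Split `dmsetup info` style text into one dict per device record.

    Assumption: each record is a run of "Key: value" lines that starts
    with a "Name:" line; blank lines between records are tolerated.
    """
    devices, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "Key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Name" and current:
            devices.append(current)  # a new record begins
            current = {}
        current[key] = value
    if current:
        devices.append(current)
    return devices


def suspended_devices(text):
    """Return the names of all devices whose State is SUSPENDED."""
    return [d["Name"] for d in parse_dmsetup_info(text)
            if d.get("State") == "SUSPENDED"]


# The record quoted in this comment, used as sample input.
SAMPLE = """\
Name:              helter_skelter-syncd_multiple_legs_4legs_1
State:             SUSPENDED
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
"""

print(suspended_devices(SAMPLE))
```

On a live system the same functions could be fed the captured output of `dmsetup info` (run as root) instead of `SAMPLE`.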

[root@taft-02 ~]# dmsetup ls
helter_skelter-syncd_multiple_legs_4legs_1_mimage_2     (253, 6)
helter_skelter-syncd_multiple_legs_4legs_1      (253, 8)
helter_skelter-syncd_multiple_legs_4legs_1_mimage_1     (253, 5)
helter_skelter-syncd_multiple_legs_4legs_1_mlog (253, 2)
helter_skelter-syncd_multiple_legs_4legs_1-real (253, 10)
helter_skelter-hs_snap1-cow     (253, 11)
helter_skelter-hs_snap1 (253, 9)


Mar 11 14:05:00 taft-02 qarshd[1225]: Running cmdline: echo offline > /sys/block/sdf/device/state &
Mar 11 14:05:00 taft-02 qarshd[1228]: Running cmdline: echo offline > /sys/block/sdb/device/state &
[...]
Mar 11 14:05:05 taft-02 kernel: sd 3:0:0:5: rejecting I/O to offline device
Mar 11 14:05:05 taft-02 kernel: __ratelimit: 264 callbacks suppressed
Mar 11 14:05:05 taft-02 kernel: device-mapper: raid1: Read failure on mirror device 253:4.  Trying alternative device.
Mar 11 14:05:05 taft-02 lvm[3085]: Primary mirror device 253:4 read failed.
Mar 11 14:05:05 taft-02 lvm[3085]: helter_skelter-syncd_multiple_legs_4legs_1-real is now in-sync.
[...]
Mar 11 14:05:06 taft-02 qarshd[1234]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_multiple_legs_4legs_1/ddfile count=10 bs=4M
[...]
Mar 11 14:05:07 taft-02 lvm[3085]: Primary mirror device 253:4 has failed (D).
Mar 11 14:05:07 taft-02 lvm[3085]: Secondary mirror device 253:7 has failed (D).
Mar 11 14:05:07 taft-02 lvm[3085]: Device failure in helter_skelter-syncd_multiple_legs_4legs_1-real.
Mar 11 14:05:08 taft-02 lvm[3085]: Couldn't find device with uuid f8oLNc-7pQZ-16h3-qjrt-pFJA-VRNp-xB2stB.
Mar 11 14:05:08 taft-02 lvm[3085]: Couldn't find device with uuid sUUk1o-5939-F18h-D8Jx-tNdY-f4NO-dHvYsd.
Mar 11 14:05:12 taft-02 lvm[3085]: Mirror status: 2 of 4 images failed.
Mar 11 14:05:14 taft-02 lvm[3085]: device-mapper: waitevent ioctl failed: Interrupted system call
Mar 11 14:05:14 taft-02 lvm[3085]: Another thread is handling an event. Waiting...
Mar 11 14:05:20 taft-02 lvm[3085]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1-real for events.
Mar 11 14:05:20 taft-02 lvm[3085]: Another thread is handling an event. Waiting...
Mar 11 14:07:25 taft-02 lvm[3085]: Error locking on node taft-02: Command timed out
Mar 11 14:07:25 taft-02 lvm[3085]: Problem reactivating syncd_multiple_legs_4legs_1
Mar 11 14:09:18 taft-02 kernel: INFO: task clvmd:31012 blocked for more than 120 seconds.
Mar 11 14:09:18 taft-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 11 14:09:18 taft-02 kernel: clvmd         D ffff880200c05880     0 31012      1 0x00000080
Mar 11 14:09:18 taft-02 kernel: ffff880215065c28 0000000000000086 ffff880200b96a70 ffff880028216a00
Mar 11 14:09:18 taft-02 kernel: ffff880200000000 0000000000000003 ffff880215065bc8 ffff880200b96a70
Mar 11 14:09:18 taft-02 kernel: ffff8802134df0e8 ffff880215065fd8 0000000000010558 ffff8802134df0e8
Mar 11 14:09:18 taft-02 kernel: Call Trace:
Mar 11 14:09:18 taft-02 kernel: [<ffffffff814c66e5>] rwsem_down_failed_common+0x95/0x1d0
Mar 11 14:09:18 taft-02 kernel: [<ffffffff8108a106>] ? autoremove_wake_function+0x16/0x40
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa0007e90>] ? dev_suspend+0x0/0x240 [dm_mod]
Mar 11 14:09:18 taft-02 kernel: [<ffffffff814c6843>] rwsem_down_write_failed+0x23/0x30
Mar 11 14:09:18 taft-02 kernel: [<ffffffff8125f3f3>] call_rwsem_down_write_failed+0x13/0x20
Mar 11 14:09:18 taft-02 kernel: [<ffffffff814c5d42>] ? down_write+0x32/0x40
Mar 11 14:09:18 taft-02 kernel: [<ffffffff8119d7c9>] thaw_bdev+0x99/0x130
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa0000b5a>] unlock_fs+0x2a/0x50 [dm_mod]
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa0002596>] dm_resume+0xd6/0xf0 [dm_mod]
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa000804c>] dev_suspend+0x1bc/0x240 [dm_mod]
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa0007e90>] ? dev_suspend+0x0/0x240 [dm_mod]
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa0008923>] ctl_ioctl+0x1a3/0x240 [dm_mod]
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa00089d3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Mar 11 14:09:18 taft-02 kernel: [<ffffffff81178922>] vfs_ioctl+0x22/0xa0
Mar 11 14:09:18 taft-02 kernel: [<ffffffff81178ac4>] do_vfs_ioctl+0x84/0x580
Mar 11 14:09:18 taft-02 kernel: [<ffffffff81179041>] sys_ioctl+0x81/0xa0
Mar 11 14:09:18 taft-02 kernel: [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
Mar 11 14:09:18 taft-02 kernel: INFO: task xdoio:1221 blocked for more than 120 seconds.
[...]
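
The clvmd hung-task trace above can be condensed into a call chain to see where the task is blocked. This is a minimal Python sketch, assuming the frame format seen in this log, `[<address>] symbol+0xoffset/0xsize [module]`; frames prefixed with `?` are ones the kernel could not verify, so they are treated as unreliable here. The helper names `parse_frames` and `reliable_chain` are hypothetical.

```python
import re

# One stack frame as printed in the syslog lines of this report.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (symbol, module, reliable) tuples, innermost frame first."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append((m.group("symbol"),
                           m.group("module"),
                           m.group("unreliable") is None))
    return frames


def reliable_chain(log_text):
    """Return only the symbols of frames not marked with '?'."""
    return [sym for sym, _mod, reliable in parse_frames(log_text) if reliable]


# A few frames copied from the clvmd trace in this comment.
SAMPLE_TRACE = """\
Mar 11 14:09:18 taft-02 kernel: [<ffffffff814c5d42>] ? down_write+0x32/0x40
Mar 11 14:09:18 taft-02 kernel: [<ffffffff8119d7c9>] thaw_bdev+0x99/0x130
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa0000b5a>] unlock_fs+0x2a/0x50 [dm_mod]
Mar 11 14:09:18 taft-02 kernel: [<ffffffffa0002596>] dm_resume+0xd6/0xf0 [dm_mod]
"""

# Print the chain outermost caller first.
print(" -> ".join(reversed(reliable_chain(SAMPLE_TRACE))))
```

Applied to the full clvmd trace, the reliable frames run from sys_ioctl through dm_ctl_ioctl, ctl_ioctl, dev_suspend, dm_resume and unlock_fs down to thaw_bdev, which is where the task is waiting on a write semaphore.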

Comment 2 Corey Marthaler 2011-03-11 21:46:26 UTC
This issue is easily reproducible.

./helter_skelter -l /home/msp/cmarthal/work/rhel6/sts-root -r /usr/tests/sts-rhel6.1 -R ../../var/share/resource_files/taft.xml -e kill_multiple_legs_synced_4_legs -E taft-02

Comment 6 Jonathan Earl Brassow 2012-01-04 16:22:28 UTC
Snapshots of mirrors (or exclusively activated cmirrors) will not be supported.  Snapshots of "raid1" segment type will be.  I will most likely close this bug WONTFIX, but I'll try to take a look and see what's going on first - capacity permitting.

Comment 8 Jonathan Earl Brassow 2012-04-27 18:51:27 UTC
We do not support snapshots of 'mirror' segment type.

