Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 751135

Summary: second leg or log device failure iteration will cause mirror I/O to hang
Product: Red Hat Enterprise Linux 5 Reporter: Corey Marthaler <cmarthal>
Component: lvm2-clusterAssignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED WORKSFORME QA Contact: Cluster QE <mspqa-list>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 5.8CC: agk, ccaulfie, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, thornber, zkabelac
Target Milestone: rcKeywords: Regression, TestBlocker
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-12-14 16:51:52 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
log from taft-01
none
log from taft-02
none
log from taft-03
none
log from taft-04 none

Description Corey Marthaler 2011-11-03 16:08:56 UTC
Description of problem:
Scenario: Kill non primary leg

********* Mirror info for this scenario *********
* mirrors:            mirror_1
* leg devices:        /dev/sdc1 /dev/sdd1 /dev/sde1
* log devices:        /dev/sdg1
* failpv(s):          /dev/sdd1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   remove
* log fault policy:   allocate
*************************************************

PV=/dev/sdd1
        mirror_1_mimage_1: 6
PV=/dev/sdd1
        mirror_1_mimage_1: 6

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdd on taft-01
Disabling device sdd on taft-02
Disabling device sdd on taft-03
Disabling device sdd on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]


[root@taft-01 ~]#  /usr/tests/sts-rhel5.7/lvm2/bin/lvs
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 145669554176: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 145669664768: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 4096: Input/output error
  Couldn't find device with uuid LbiiMw-a6tb-KehW-cuXA-A4mD-dk0q-CQvWei.
  LV                  Attr   LSize  Copy%  Devices
  mirror_1            mwi-ao  2.00G  98.05 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0)
  [mirror_1_mimage_0] Iwi-ao  2.00G        /dev/sdc1(0)
  [mirror_1_mimage_1] Iwi-ao  2.00G        unknown device(0)
  [mirror_1_mimage_2] Iwi-ao  2.00G        /dev/sde1(0)
  [mirror_1_mlog]     lwi-ao  4.00M        /dev/sdg1(0)

[root@taft-02 ~]#  /usr/tests/sts-rhel5.7/lvm2/bin/lvs
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 145669554176: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 145669664768: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 4096: Input/output error
  Couldn't find device with uuid LbiiMw-a6tb-KehW-cuXA-A4mD-dk0q-CQvWei.
  LV                  Attr   LSize  Copy%  Devices
  mirror_1            mwi-ao  2.00G  98.05 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0)
  [mirror_1_mimage_0] Iwi-ao  2.00G        /dev/sdc1(0)
  [mirror_1_mimage_1] Iwi-ao  2.00G        unknown device(0)
  [mirror_1_mimage_2] Iwi-ao  2.00G        /dev/sde1(0)
  [mirror_1_mlog]     lwi-ao  4.00M        /dev/sdg1(0)


Nov  2 18:07:59 taft-03 qarshd[9068]: Running cmdline: echo offline > /sys/block/sdd/device/state &
Nov  2 18:07:59 taft-03 xinetd[6431]: EXIT: qarsh status=0 pid=9068 duration=0(sec)
Nov  2 18:08:01 taft-03 kernel: sd 1:0:0:3: rejecting I/O to offline device
Nov  2 18:08:04 taft-03 last message repeated 108 times
Nov  2 18:10:59 taft-03 kernel: INFO: task pdflush:303 blocked for more than 120 seconds.
Nov  2 18:10:59 taft-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  2 18:10:59 taft-03 kernel: pdflush       D ffffffff801546d1     0   303     87           304   302 (L-TLB)
Nov  2 18:10:59 taft-03 kernel:  ffff81021f45ba70 0000000000000046 00000000000000c0 ffffffff880e7415
Nov  2 18:10:59 taft-03 kernel:  0000000000041340 000000000000000a ffff81021f85c7e0 ffff8101fff15100
Nov  2 18:10:59 taft-03 kernel:  00000567ce8327c3 0000000000006787 ffff81021f85c9c8 00000001880e8179
Nov  2 18:10:59 taft-03 kernel: Call Trace:
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff880e7415>] :dm_mod:__map_bio+0x4a/0x123
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff8006ec8f>] do_gettimeofday+0x40/0x90
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800155d3>] sync_buffer+0x0/0x3f
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800637ce>] io_schedule+0x3f/0x67
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff8001560e>] sync_buffer+0x3b/0x3f
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff80063912>] __wait_on_bit_lock+0x36/0x66
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800155d3>] sync_buffer+0x0/0x3f
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800639ae>] out_of_line_wait_on_bit_lock+0x6c/0x78
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800a2e8b>] wake_bit_function+0x0/0x23
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff88803405>] :gfs2:databuf_lo_add+0x4f/0x105
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff888046c6>] :gfs2:gfs2_page_add_databufs+0x72/0x97
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff88804ecb>] :gfs2:gfs2_ordered_writepage+0x52/0x6d
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff8001d1d1>] mpage_writepages+0x1bf/0x37d
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff88804e79>] :gfs2:gfs2_ordered_writepage+0x0/0x6d
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff8005a8af>] do_writepages+0x29/0x2f
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff8002fa24>] __writeback_single_inode+0x1a2/0x31c
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff80021143>] sync_sb_inodes+0x1b7/0x271
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800a2c45>] keventd_create_kthread+0x0/0xc4
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff80050ce2>] writeback_inodes+0x82/0xd8
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800cc364>] wb_kupdate+0xd4/0x14e
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800562a9>] pdflush+0x0/0x1fb
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800563fa>] pdflush+0x151/0x1fb
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800cc290>] wb_kupdate+0x0/0x14e
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff80032722>] kthread+0xfe/0x132
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff8009f868>] request_module+0x0/0x14d
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff800a2c45>] keventd_create_kthread+0x0/0xc4
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff80032624>] kthread+0x0/0x132
Nov  2 18:10:59 taft-03 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11



Nov  2 18:08:00 taft-04 qarshd[8889]: Running cmdline: echo offline > /sys/block/sdd/device/state &
Nov  2 18:08:00 taft-04 xinetd[6404]: EXIT: qarsh status=0 pid=8889 duration=0(sec)
Nov  2 18:08:03 taft-04 kernel: sd 1:0:0:3: rejecting I/O to offline device
Nov  2 18:08:03 taft-04 last message repeated 3 times
Nov  2 18:10:25 taft-04 kernel: INFO: task pdflush:303 blocked for more than 120 seconds.
Nov  2 18:10:25 taft-04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  2 18:10:25 taft-04 kernel: pdflush       D ffffffff801546d1     0   303     87           304   302 (L-TLB)
Nov  2 18:10:25 taft-04 kernel:  ffff81021ed37bc0 0000000000000046 ffff81021ed37b40 ffffffff8002dfd7
Nov  2 18:10:25 taft-04 kernel:  0000000000000001 0000000000000009 ffff81021f85c7e0 ffff8101fff24080
Nov  2 18:10:25 taft-04 kernel:  00000565c096c2e1 00000000002dddd3 ffff81021f85c9c8 00000002880e8498
Nov  2 18:10:25 taft-04 kernel: Call Trace:
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff8002dfd7>] __wake_up+0x38/0x4f
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff887fe05f>] :gfs2:just_schedule+0x0/0xd
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff887fe068>] :gfs2:just_schedule+0x9/0xd
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff887fe05f>] :gfs2:just_schedule+0x0/0xd
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff800a2e8b>] wake_bit_function+0x0/0x23
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff887ff2d3>] :gfs2:run_queue+0xca/0x1bb
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff887fe01e>] :gfs2:gfs2_glock_wait+0x37/0x78
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff8880efee>] :gfs2:gfs2_write_inode+0x5f/0x151
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff8880efe6>] :gfs2:gfs2_write_inode+0x57/0x151
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff8002fa5f>] __writeback_single_inode+0x1dd/0x31c
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff8002dfd7>] __wake_up+0x38/0x4f
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff80021143>] sync_sb_inodes+0x1b7/0x271
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff800a2c45>] keventd_create_kthread+0x0/0xc4
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff80050ce2>] writeback_inodes+0x82/0xd8
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff800cc364>] wb_kupdate+0xd4/0x14e
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff800562a9>] pdflush+0x0/0x1fb
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff800563fa>] pdflush+0x151/0x1fb
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff800cc290>] wb_kupdate+0x0/0x14e
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff80032722>] kthread+0xfe/0x132
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff800a2c45>] keventd_create_kthread+0x0/0xc4
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff80032624>] kthread+0x0/0x132
Nov  2 18:10:25 taft-04 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11


Version-Release number of selected component (if applicable):
2.6.18-274.el5

lvm2-2.02.88-2.el5    BUILT: Fri Oct 21 09:48:50 CDT 2011
lvm2-cluster-2.02.88-2.el5    BUILT: Fri Oct 21 09:49:24 CDT 2011
device-mapper-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009


How reproducible:
Every time so far when doing leg device failure

Comment 1 Jonathan Earl Brassow 2011-11-09 18:02:53 UTC
test and scenario?

Comment 2 Corey Marthaler 2011-11-10 21:23:21 UTC
./revolution_9 -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.7 -R ../../var/share/resource_files/taft.xml -F -e kill_random_devices

Scenario kill_random_devices: Kill random devices (log and legs)

********* Mirror info for this scenario *********
* mirrors:            mirror_1
* leg devices:        /dev/sdd1 /dev/sde1 /dev/sdh1
* log devices:        /dev/sdf1
* failpv(s):          /dev/sdf1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   allocate
* log fault policy:   allocate
*************************************************

PV=/dev/sdf1
        mirror_1_mlog: 1.2
PV=/dev/sdf1
        mirror_1_mlog: 1.2

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdf on taft-01
Disabling device sdf on taft-02
Disabling device sdf on taft-03
Disabling device sdf on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]


[root@taft-01 ~]# lvs -a -o +devices
/dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
/dev/sdf1: read failed after 0 of 512 at 145669554176: Input/output error
/dev/sdf1: read failed after 0 of 512 at 145669664768: Input/output error
/dev/sdf1: read failed after 0 of 512 at 0: Input/output error
/dev/sdf1: read failed after 0 of 512 at 4096: Input/output error
Couldn't find device with uuid fFtBO8-39s7-pxW8-qMtb-ZNU6-wQY3-C5W6l9.
LV                  Attr   LSize  Copy%  Devices
mirror_1            mwi-ao  2.00G 100.00 mirror_1_mimage_1(0),mirror_1_mimage_2(0),mirror_1_mimage_3(0)
[mirror_1_mimage_1] iwi-ao  2.00G        /dev/sdd1(0)
[mirror_1_mimage_2] iwi-ao  2.00G        /dev/sde1(0)
[mirror_1_mimage_3] iwi-ao  2.00G        /dev/sdh1(0)
[mirror_1_mlog]     lwi-ao  4.00M        unknown device(0)


2.6.18-274.el5

lvm2-2.02.88-2.el5    BUILT: Fri Oct 21 09:48:50 CDT 2011
lvm2-cluster-2.02.88-2.el5    BUILT: Fri Oct 21 09:49:24 CDT 2011
device-mapper-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-10.el5    BUILT: Wed Sep  8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

Comment 3 Corey Marthaler 2011-11-10 21:27:08 UTC
Created attachment 532932 [details]
log from taft-01

Comment 4 Corey Marthaler 2011-11-10 21:28:13 UTC
Created attachment 532933 [details]
log from taft-02

Comment 5 Corey Marthaler 2011-11-10 21:28:49 UTC
Created attachment 532934 [details]
log from taft-03

Comment 6 Corey Marthaler 2011-11-10 21:29:45 UTC
Created attachment 532935 [details]
log from taft-04

Comment 7 Corey Marthaler 2011-11-14 22:38:20 UTC
This appears to happen 100% of the time on the 2nd iteration of revolver_9. This test (unlike helter_skelter) continues to use the same mirror when doing device failure testing. It does not create a new one every iteration. When the second device failure scenario is attempted, nothing from LVM is attempted to repair the mirror. This causes the running I/O to block.

./revolution_9 -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.8 -R ../../var/share/resource_files/taft.xml -F 

================================================================================
Iteration 0.2 started at Mon Nov 14 16:14:07 CST 2011
================================================================================
Scenario kill_non_primary_leg: Kill non primary leg

********* Mirror info for this scenario *********
* mirrors:            mirror_1
* leg devices:        /dev/sdc1 /dev/sdd1 /dev/sde1
* log devices:        /dev/sdg1
* failpv(s):          /dev/sde1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   allocate
* log fault policy:   allocate
*************************************************

Mirror Structure(s):
  LV                  Attr   LSize  Copy%  Devices
  mirror_1            mwi-ao  2.00G 100.00 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0)
  [mirror_1_mimage_0] iwi-ao  2.00G        /dev/sdc1(0)
  [mirror_1_mimage_2] iwi-ao  2.00G        /dev/sde1(0)
  [mirror_1_mlog]     lwi-ao  4.00M        /dev/sdg1(0)

PV=/dev/sde1
        mirror_1_mimage_2: 3.1
PV=/dev/sde1
        mirror_1_mimage_2: 3.1

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sde on taft-01
Disabling device sde on taft-02
Disabling device sde on taft-03
Disabling device sde on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]

Comment 8 Corey Marthaler 2011-11-29 21:39:38 UTC
Hit this again while attempting to verify 523324. This should be fixed for 5.8. Basically it appears that no mirror can suffer two separate device failures.

Comment 9 Corey Marthaler 2011-12-14 16:51:52 UTC
I'm no longer seeing this issue with the latest 5.8 tree/kernel. Closing and will reopen if seen again.

2.6.18-300.el5
lvm2-2.02.88-5.el5    BUILT: Fri Dec  2 12:25:45 CST 2011
lvm2-cluster-2.02.88-5.el5    BUILT: Fri Dec  2 12:48:37 CST 2011
device-mapper-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-13.el5    BUILT: Wed Oct 26 14:50:40 CDT 2011
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009