Bug 751135
| Summary: | second leg or log device failure iteration will cause mirror I/O to hang | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Corey Marthaler <cmarthal> | ||||||||||
| Component: | lvm2-cluster | Assignee: | Jonathan Earl Brassow <jbrassow> | ||||||||||
| Status: | CLOSED WORKSFORME | QA Contact: | Cluster QE <mspqa-list> | ||||||||||
| Severity: | urgent | Docs Contact: | |||||||||||
| Priority: | urgent | ||||||||||||
| Version: | 5.8 | CC: | agk, ccaulfie, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, thornber, zkabelac | ||||||||||
| Target Milestone: | rc | Keywords: | Regression, TestBlocker | ||||||||||
| Target Release: | --- | ||||||||||||
| Hardware: | x86_64 | ||||||||||||
| OS: | Linux | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | Environment: | ||||||||||||
| Last Closed: | 2011-12-14 16:51:52 UTC | Type: | --- | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Embargoed: | |||||||||||||
| Attachments: |
|
||||||||||||
test and scenario? ./revolution_9 -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.7 -R ../../var/share/resource_files/taft.xml -F -e kill_random_devices
Scenario kill_random_devices: Kill random devices (log and legs)
********* Mirror info for this scenario *********
* mirrors: mirror_1
* leg devices: /dev/sdd1 /dev/sde1 /dev/sdh1
* log devices: /dev/sdf1
* failpv(s): /dev/sdf1
* failnode(s): taft-01 taft-02 taft-03 taft-04
* leg fault policy: allocate
* log fault policy: allocate
*************************************************
PV=/dev/sdf1
mirror_1_mlog: 1.2
PV=/dev/sdf1
mirror_1_mlog: 1.2
Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----
Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----
Disabling device sdf on taft-01
Disabling device sdf on taft-02
Disabling device sdf on taft-03
Disabling device sdf on taft-04
Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]
[root@taft-01 ~]# lvs -a -o +devices
/dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
/dev/sdf1: read failed after 0 of 512 at 145669554176: Input/output error
/dev/sdf1: read failed after 0 of 512 at 145669664768: Input/output error
/dev/sdf1: read failed after 0 of 512 at 0: Input/output error
/dev/sdf1: read failed after 0 of 512 at 4096: Input/output error
Couldn't find device with uuid fFtBO8-39s7-pxW8-qMtb-ZNU6-wQY3-C5W6l9.
LV Attr LSize Copy% Devices
mirror_1 mwi-ao 2.00G 100.00 mirror_1_mimage_1(0),mirror_1_mimage_2(0),mirror_1_mimage_3(0)
[mirror_1_mimage_1] iwi-ao 2.00G /dev/sdd1(0)
[mirror_1_mimage_2] iwi-ao 2.00G /dev/sde1(0)
[mirror_1_mimage_3] iwi-ao 2.00G /dev/sdh1(0)
[mirror_1_mlog] lwi-ao 4.00M unknown device(0)
2.6.18-274.el5
lvm2-2.02.88-2.el5 BUILT: Fri Oct 21 09:48:50 CDT 2011
lvm2-cluster-2.02.88-2.el5 BUILT: Fri Oct 21 09:49:24 CDT 2011
device-mapper-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010
kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009
Created attachment 532932 [details]
log from taft-01
Created attachment 532933 [details]
log from taft-02
Created attachment 532934 [details]
log from taft-03
Created attachment 532935 [details]
log from taft-04
This appears to happen 100% of the time on the 2nd iteration of revolver_9. This test (unlike helter_skelter) continues to use the same mirror when doing device failure testing. It does not create a new one every iteration. When the second device failure scenario is attempted, nothing from LVM is attempted to repair the mirror. This causes the running I/O to block.
./revolution_9 -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.8 -R ../../var/share/resource_files/taft.xml -F
================================================================================
Iteration 0.2 started at Mon Nov 14 16:14:07 CST 2011
================================================================================
Scenario kill_non_primary_leg: Kill non primary leg
********* Mirror info for this scenario *********
* mirrors: mirror_1
* leg devices: /dev/sdc1 /dev/sdd1 /dev/sde1
* log devices: /dev/sdg1
* failpv(s): /dev/sde1
* failnode(s): taft-01 taft-02 taft-03 taft-04
* leg fault policy: allocate
* log fault policy: allocate
*************************************************
Mirror Structure(s):
LV Attr LSize Copy% Devices
mirror_1 mwi-ao 2.00G 100.00 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0)
[mirror_1_mimage_0] iwi-ao 2.00G /dev/sdc1(0)
[mirror_1_mimage_2] iwi-ao 2.00G /dev/sde1(0)
[mirror_1_mlog] lwi-ao 4.00M /dev/sdg1(0)
PV=/dev/sde1
mirror_1_mimage_2: 3.1
PV=/dev/sde1
mirror_1_mimage_2: 3.1
Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----
Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- taft-01 ----
---- taft-02 ----
---- taft-03 ----
---- taft-04 ----
Disabling device sde on taft-01
Disabling device sde on taft-02
Disabling device sde on taft-03
Disabling device sde on taft-04
Attempting I/O to cause mirror down conversion(s) on taft-01
[DEADLOCK]
Hit this again while attempting to verify 523324. This should be fixed for 5.8. Basically it appears that no mirror can suffer two separate device failures. I'm no longer seeing this issue with the latest 5.8 tree/kernel. Closing and will reopen if seen again. 2.6.18-300.el5 lvm2-2.02.88-5.el5 BUILT: Fri Dec 2 12:25:45 CST 2011 lvm2-cluster-2.02.88-5.el5 BUILT: Fri Dec 2 12:48:37 CST 2011 device-mapper-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011 device-mapper-event-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011 cmirror-1.1.39-13.el5 BUILT: Wed Oct 26 14:50:40 CDT 2011 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 |
Description of problem: Scenario: Kill non primary leg ********* Mirror info for this scenario ********* * mirrors: mirror_1 * leg devices: /dev/sdc1 /dev/sdd1 /dev/sde1 * log devices: /dev/sdg1 * failpv(s): /dev/sdd1 * failnode(s): taft-01 taft-02 taft-03 taft-04 * leg fault policy: remove * log fault policy: allocate ************************************************* PV=/dev/sdd1 mirror_1_mimage_1: 6 PV=/dev/sdd1 mirror_1_mimage_1: 6 Writing verification files (checkit) to mirror(s) on... ---- taft-01 ---- ---- taft-02 ---- ---- taft-03 ---- ---- taft-04 ---- Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure Verifying files (checkit) on mirror(s) on... ---- taft-01 ---- ---- taft-02 ---- ---- taft-03 ---- ---- taft-04 ---- Disabling device sdd on taft-01 Disabling device sdd on taft-02 Disabling device sdd on taft-03 Disabling device sdd on taft-04 Attempting I/O to cause mirror down conversion(s) on taft-01 [DEADLOCK] [root@taft-01 ~]# /usr/tests/sts-rhel5.7/lvm2/bin/lvs /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error /dev/sdd1: read failed after 0 of 512 at 145669554176: Input/output error /dev/sdd1: read failed after 0 of 512 at 145669664768: Input/output error /dev/sdd1: read failed after 0 of 512 at 0: Input/output error /dev/sdd1: read failed after 0 of 512 at 4096: Input/output error Couldn't find device with uuid LbiiMw-a6tb-KehW-cuXA-A4mD-dk0q-CQvWei. LV Attr LSize Copy% Devices mirror_1 mwi-ao 2.00G 98.05 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0) [mirror_1_mimage_0] Iwi-ao 2.00G /dev/sdc1(0) [mirror_1_mimage_1] Iwi-ao 2.00G unknown device(0) [mirror_1_mimage_2] Iwi-ao 2.00G /dev/sde1(0) [mirror_1_mlog] lwi-ao 4.00M /dev/sdg1(0) [root@taft-02 ~]# /usr/tests/sts-rhel5.7/lvm2/bin/lvs /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error /dev/sdd1: read failed after 0 of 512 at 145669554176: Input/output error /dev/sdd1: read failed after 0 of 512 at 145669664768: Input/output error /dev/sdd1: read failed after 0 of 512 at 0: Input/output error /dev/sdd1: read failed after 0 of 512 at 4096: Input/output error Couldn't find device with uuid LbiiMw-a6tb-KehW-cuXA-A4mD-dk0q-CQvWei. LV Attr LSize Copy% Devices mirror_1 mwi-ao 2.00G 98.05 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0) [mirror_1_mimage_0] Iwi-ao 2.00G /dev/sdc1(0) [mirror_1_mimage_1] Iwi-ao 2.00G unknown device(0) [mirror_1_mimage_2] Iwi-ao 2.00G /dev/sde1(0) [mirror_1_mlog] lwi-ao 4.00M /dev/sdg1(0) Nov 2 18:07:59 taft-03 qarshd[9068]: Running cmdline: echo offline > /sys/block/sdd/device/state & Nov 2 18:07:59 taft-03 xinetd[6431]: EXIT: qarsh status=0 pid=9068 duration=0(sec) Nov 2 18:08:01 taft-03 kernel: sd 1:0:0:3: rejecting I/O to offline device Nov 2 18:08:04 taft-03 last message repeated 108 times Nov 2 18:10:59 taft-03 kernel: INFO: task pdflush:303 blocked for more than 120 seconds. Nov 2 18:10:59 taft-03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 2 18:10:59 taft-03 kernel: pdflush D ffffffff801546d1 0 303 87 304 302 (L-TLB) Nov 2 18:10:59 taft-03 kernel: ffff81021f45ba70 0000000000000046 00000000000000c0 ffffffff880e7415 Nov 2 18:10:59 taft-03 kernel: 0000000000041340 000000000000000a ffff81021f85c7e0 ffff8101fff15100 Nov 2 18:10:59 taft-03 kernel: 00000567ce8327c3 0000000000006787 ffff81021f85c9c8 00000001880e8179 Nov 2 18:10:59 taft-03 kernel: Call Trace: Nov 2 18:10:59 taft-03 kernel: [<ffffffff880e7415>] :dm_mod:__map_bio+0x4a/0x123 Nov 2 18:10:59 taft-03 kernel: [<ffffffff8006ec8f>] do_gettimeofday+0x40/0x90 Nov 2 18:10:59 taft-03 kernel: [<ffffffff800155d3>] sync_buffer+0x0/0x3f Nov 2 18:10:59 taft-03 kernel: [<ffffffff800637ce>] io_schedule+0x3f/0x67 Nov 2 18:10:59 taft-03 kernel: [<ffffffff8001560e>] sync_buffer+0x3b/0x3f Nov 2 18:10:59 taft-03 kernel: [<ffffffff80063912>] __wait_on_bit_lock+0x36/0x66 Nov 2 18:10:59 taft-03 kernel: [<ffffffff800155d3>] sync_buffer+0x0/0x3f Nov 2 18:10:59 taft-03 kernel: [<ffffffff800639ae>] out_of_line_wait_on_bit_lock+0x6c/0x78 Nov 2 18:10:59 taft-03 kernel: [<ffffffff800a2e8b>] wake_bit_function+0x0/0x23 Nov 2 18:10:59 taft-03 kernel: [<ffffffff88803405>] :gfs2:databuf_lo_add+0x4f/0x105 Nov 2 18:10:59 taft-03 kernel: [<ffffffff888046c6>] :gfs2:gfs2_page_add_databufs+0x72/0x97 Nov 2 18:10:59 taft-03 kernel: [<ffffffff88804ecb>] :gfs2:gfs2_ordered_writepage+0x52/0x6d Nov 2 18:10:59 taft-03 kernel: [<ffffffff8001d1d1>] mpage_writepages+0x1bf/0x37d Nov 2 18:10:59 taft-03 kernel: [<ffffffff88804e79>] :gfs2:gfs2_ordered_writepage+0x0/0x6d Nov 2 18:10:59 taft-03 kernel: [<ffffffff8005a8af>] do_writepages+0x29/0x2f Nov 2 18:10:59 taft-03 kernel: [<ffffffff8002fa24>] __writeback_single_inode+0x1a2/0x31c Nov 2 18:10:59 taft-03 kernel: [<ffffffff80021143>] sync_sb_inodes+0x1b7/0x271 Nov 2 18:10:59 taft-03 kernel: [<ffffffff800a2c45>] keventd_create_kthread+0x0/0xc4 Nov 2 18:10:59 taft-03 kernel: [<ffffffff80050ce2>] writeback_inodes+0x82/0xd8 Nov 2 18:10:59 taft-03 kernel: [<ffffffff800cc364>] wb_kupdate+0xd4/0x14e Nov 2 18:10:59 taft-03 kernel: [<ffffffff800562a9>] pdflush+0x0/0x1fb Nov 2 18:10:59 taft-03 kernel: [<ffffffff800563fa>] pdflush+0x151/0x1fb Nov 2 18:10:59 taft-03 kernel: [<ffffffff800cc290>] wb_kupdate+0x0/0x14e Nov 2 18:10:59 taft-03 kernel: [<ffffffff80032722>] kthread+0xfe/0x132 Nov 2 18:10:59 taft-03 kernel: [<ffffffff8009f868>] request_module+0x0/0x14d Nov 2 18:10:59 taft-03 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 Nov 2 18:10:59 taft-03 kernel: [<ffffffff800a2c45>] keventd_create_kthread+0x0/0xc4 Nov 2 18:10:59 taft-03 kernel: [<ffffffff80032624>] kthread+0x0/0x132 Nov 2 18:10:59 taft-03 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 Nov 2 18:08:00 taft-04 qarshd[8889]: Running cmdline: echo offline > /sys/block/sdd/device/state & Nov 2 18:08:00 taft-04 xinetd[6404]: EXIT: qarsh status=0 pid=8889 duration=0(sec) Nov 2 18:08:03 taft-04 kernel: sd 1:0:0:3: rejecting I/O to offline device Nov 2 18:08:03 taft-04 last message repeated 3 times Nov 2 18:10:25 taft-04 kernel: INFO: task pdflush:303 blocked for more than 120 seconds. Nov 2 18:10:25 taft-04 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Nov 2 18:10:25 taft-04 kernel: pdflush D ffffffff801546d1 0 303 87 304 302 (L-TLB) Nov 2 18:10:25 taft-04 kernel: ffff81021ed37bc0 0000000000000046 ffff81021ed37b40 ffffffff8002dfd7 Nov 2 18:10:25 taft-04 kernel: 0000000000000001 0000000000000009 ffff81021f85c7e0 ffff8101fff24080 Nov 2 18:10:25 taft-04 kernel: 00000565c096c2e1 00000000002dddd3 ffff81021f85c9c8 00000002880e8498 Nov 2 18:10:25 taft-04 kernel: Call Trace: Nov 2 18:10:25 taft-04 kernel: [<ffffffff8002dfd7>] __wake_up+0x38/0x4f Nov 2 18:10:25 taft-04 kernel: [<ffffffff887fe05f>] :gfs2:just_schedule+0x0/0xd Nov 2 18:10:25 taft-04 kernel: [<ffffffff887fe068>] :gfs2:just_schedule+0x9/0xd Nov 2 18:10:25 taft-04 kernel: [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e Nov 2 18:10:25 taft-04 kernel: [<ffffffff887fe05f>] :gfs2:just_schedule+0x0/0xd Nov 2 18:10:25 taft-04 kernel: [<ffffffff80063a94>] out_of_line_wait_on_bit+0x6c/0x78 Nov 2 18:10:25 taft-04 kernel: [<ffffffff800a2e8b>] wake_bit_function+0x0/0x23 Nov 2 18:10:25 taft-04 kernel: [<ffffffff887ff2d3>] :gfs2:run_queue+0xca/0x1bb Nov 2 18:10:25 taft-04 kernel: [<ffffffff887fe01e>] :gfs2:gfs2_glock_wait+0x37/0x78 Nov 2 18:10:25 taft-04 kernel: [<ffffffff8880efee>] :gfs2:gfs2_write_inode+0x5f/0x151 Nov 2 18:10:25 taft-04 kernel: [<ffffffff8880efe6>] :gfs2:gfs2_write_inode+0x57/0x151 Nov 2 18:10:25 taft-04 kernel: [<ffffffff8002fa5f>] __writeback_single_inode+0x1dd/0x31c Nov 2 18:10:25 taft-04 kernel: [<ffffffff8002dfd7>] __wake_up+0x38/0x4f Nov 2 18:10:25 taft-04 kernel: [<ffffffff80021143>] sync_sb_inodes+0x1b7/0x271 Nov 2 18:10:25 taft-04 kernel: [<ffffffff800a2c45>] keventd_create_kthread+0x0/0xc4 Nov 2 18:10:25 taft-04 kernel: [<ffffffff80050ce2>] writeback_inodes+0x82/0xd8 Nov 2 18:10:25 taft-04 kernel: [<ffffffff800cc364>] wb_kupdate+0xd4/0x14e Nov 2 18:10:25 taft-04 kernel: [<ffffffff800562a9>] pdflush+0x0/0x1fb Nov 2 18:10:25 taft-04 kernel: [<ffffffff800563fa>] pdflush+0x151/0x1fb Nov 2 18:10:25 taft-04 kernel: [<ffffffff800cc290>] wb_kupdate+0x0/0x14e Nov 2 18:10:25 taft-04 kernel: [<ffffffff80032722>] kthread+0xfe/0x132 Nov 2 18:10:25 taft-04 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 Nov 2 18:10:25 taft-04 kernel: [<ffffffff800a2c45>] keventd_create_kthread+0x0/0xc4 Nov 2 18:10:25 taft-04 kernel: [<ffffffff80032624>] kthread+0x0/0x132 Nov 2 18:10:25 taft-04 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 Version-Release number of selected component (if applicable): 2.6.18-274.el5 lvm2-2.02.88-2.el5 BUILT: Fri Oct 21 09:48:50 CDT 2011 lvm2-cluster-2.02.88-2.el5 BUILT: Fri Oct 21 09:49:24 CDT 2011 device-mapper-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011 device-mapper-event-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011 cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 How reproducible: Every time so far when doing leg device failure