Bug 1138452
Summary: | mirror device repair failure when losing all but one leg and the log device | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> | ||||
Component: | lvm2 | Assignee: | Heinz Mauelshagen <heinzm> | ||||
lvm2 sub component: | Mirroring and RAID (RHEL6) | QA Contact: | cluster-qe <cluster-qe> | ||||
Status: | CLOSED WONTFIX | Docs Contact: | |||||
Severity: | high | ||||||
Priority: | unspecified | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, zkabelac | ||||
Version: | 6.6 | ||||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | |||||||
: | 1215156 (view as bug list) | Environment: | |||||
Last Closed: | 2017-10-02 21:52:14 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1215156 | ||||||
Attachments: |
|
Description
Corey Marthaler
2014-09-04 20:09:57 UTC
This issue does appear to be reproducable. ================================================================================ Iteration 10.6 started at Fri Sep 5 01:34:53 CDT 2014 ================================================================================ ACTUAL LEG ORDER: /dev/sdg1 /dev/sdf1 /dev/sda1 Scenario kill_random_devices: Kill random devices (log and legs) ********* Mirror info for this scenario ********* * mirrors: mirror_1 * leg devices: /dev/sdg1 /dev/sdf1 /dev/sda1 * log devices: /dev/sdh1 * failpv(s): /dev/sdh1 /dev/sdg1 /dev/sda1 * failnode(s): host-074.virt.lab.msp.redhat.com * lvmetad: 0 * leg fault policy: allocate * log fault policy: allocate ************************************************* Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices mirror_1 mwi-aom--- 6.00g 100.00 mirror_1_mimage_1(0),mirror_1_mimage_3(0),mirror_1_mimage_4(0) [mirror_1_mimage_1] iwi-aom--- 6.00g /dev/sdg1(0) [mirror_1_mimage_3] iwi-aom--- 6.00g /dev/sdf1(0) [mirror_1_mimage_4] iwi-aom--- 6.00g /dev/sda1(0) [mirror_1_mlog] lwi-aom--- 4.00m /dev/sdh1(0) * NOTE: not enough available devices for allocation fault polices to fully work * (well technically, since we have 2, some allocation should work) PV=/dev/sda1 mirror_1_mimage_4: 5.2 PV=/dev/sdg1 mirror_1_mimage_1: 5.2 PV=/dev/sdh1 mirror_1_mlog: 1.2 PV=/dev/sda1 mirror_1_mimage_4: 5.2 PV=/dev/sdg1 mirror_1_mimage_1: 5.2 PV=/dev/sdh1 mirror_1_mlog: 1.2 Writing verification files (checkit) to mirror(s) on... ---- host-074.virt.lab.msp.redhat.com ---- checkit starting with: Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure Verifying files (checkit) on mirror(s) on... ---- host-074.virt.lab.msp.redhat.com ---- Disabling device sdh on host-074.virt.lab.msp.redhat.com Disabling device sdg on host-074.virt.lab.msp.redhat.com Disabling device sda on host-074.virt.lab.msp.redhat.com Getting recovery check start time from /var/log/messages: Sep 5 01:35 Attempting I/O to cause mirror down conversion(s) on host-074.virt.lab.msp.redhat.com 10+0 records in 10+0 records out 41943040 bytes (42 MB) copied, 0.0976763 s, 429 MB/s Verifying current sanity of lvm after the failure Current mirror/raid device structure(s): Couldn't find device with uuid P2DTMe-vQrr-ArL8-D3s8-jhzd-BLOy-vtM6Vp. Couldn't find device with uuid iHLb4R-mzpv-1zFj-HMI1-62tV-6ZR0-sulMkq. Couldn't find device with uuid eFdxXo-mV8d-e0T9-mmAT-lFzA-dgEI-ewsmbH. LV Attr LSize Cpy%Sync Devices mirror_1 mwi-aom--- 6.00g 2.60 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0) [mirror_1_mimage_0] Iwi-aom--- 6.00g /dev/sdf1(0) [mirror_1_mimage_1] Iwi-aom--- 6.00g /dev/sdd1(0) [mirror_1_mimage_2] Iwi-aom--- 6.00g /dev/sdb1(0) [mirror_1_mlog] lwi-aom--- 4.00m /dev/sdb1(1536) Verify that each of the mirror repairs finished successfully repair of mirrored LV mirror_1 failed on host-074.virt.lab.msp.redhat.com Created attachment 934870 [details]
log file
This appears to be easier to hit when running in exclusive activation mode while in a cluster. Sep 9 14:23:18 host-114 qarshd[9101]: Running cmdline: pvs -a Sep 9 14:23:18 host-114 kernel: md/raid1:mdX: active with 2 out of 4 mirrors Sep 9 14:23:18 host-114 kernel: created bitmap (1 pages) for device mdX Sep 9 14:23:18 host-114 kernel: md: mdX: recovery interrupted. Sep 9 14:23:20 host-114 lvm[7548]: device-mapper: waitevent ioctl on failed: Interrupted system call Sep 9 14:25:18 host-114 lvm[7548]: Error locking on node host-114: Command timed out Sep 9 14:25:18 host-114 xinetd[1932]: EXIT: qarsh status=0 pid=9101 duration=120(sec) Sep 9 14:26:17 host-114 kernel: INFO: task clvmd:6558 blocked for more than 120 seconds. Sep 9 14:26:17 host-114 kernel: Not tainted 2.6.32-500.el6.x86_64 #1 Sep 9 14:26:17 host-114 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 9 14:26:17 host-114 kernel: clvmd D 0000000000000000 0 6558 1 0x00000080 Sep 9 14:26:17 host-114 kernel: ffff88003dcdfa18 0000000000000082 ffff88003dcdf9a8 ffffffff81041e98 Sep 9 14:26:17 host-114 kernel: ffff880000000000 0003ac9bb260140c ffff88003dcdf9e8 ffff88003cf70590 Sep 9 14:26:17 host-114 kernel: 00000000000a4588 ffffffffabf0b056 ffff88003d35bab8 ffff88003dcdffd8 Sep 9 14:26:17 host-114 kernel: Call Trace: Sep 9 14:26:17 host-114 kernel: [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0 Sep 9 14:26:17 host-114 kernel: [<ffffffff810aaa21>] ? ktime_get_ts+0xb1/0xf0 Sep 9 14:26:17 host-114 kernel: [<ffffffff8152a1f3>] io_schedule+0x73/0xc0 Sep 9 14:26:17 host-114 kernel: [<ffffffff811ce71d>] __blockdev_direct_IO_newtrunc+0xb7d/0x1270 Sep 9 14:26:17 host-114 kernel: [<ffffffff812965e0>] ? vsnprintf+0x450/0x5e0 Sep 9 14:26:17 host-114 kernel: [<ffffffff811ca080>] ? blkdev_get_block+0x0/0x20 Sep 9 14:26:17 host-114 kernel: [<ffffffff8127a04d>] ? get_disk+0x7d/0xf0 Sep 9 14:26:17 host-114 kernel: [<ffffffff811cee87>] __blockdev_direct_IO+0x77/0xe0 Sep 9 14:26:17 host-114 kernel: [<ffffffff811ca080>] ? blkdev_get_block+0x0/0x20 Sep 9 14:26:17 host-114 kernel: [<ffffffff811cb107>] blkdev_direct_IO+0x57/0x60 Sep 9 14:26:17 host-114 kernel: [<ffffffff811ca080>] ? blkdev_get_block+0x0/0x20 Sep 9 14:26:17 host-114 kernel: [<ffffffff811261db>] generic_file_aio_read+0x6bb/0x700 Sep 9 14:26:17 host-114 kernel: [<ffffffff81234b51>] ? avc_has_perm+0x71/0x90 Sep 9 14:26:17 host-114 kernel: [<ffffffff81236a02>] ? selinux_inode_permission+0x72/0xb0 Sep 9 14:26:17 host-114 kernel: [<ffffffff811ca4c1>] blkdev_aio_read+0x51/0x80 Sep 9 14:26:17 host-114 kernel: [<ffffffff8118de2a>] do_sync_read+0xfa/0x140 Sep 9 14:26:17 host-114 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40 Sep 9 14:26:17 host-114 kernel: [<ffffffff811ca41c>] ? block_ioctl+0x3c/0x40 Sep 9 14:26:17 host-114 kernel: [<ffffffff811a36b2>] ? vfs_ioctl+0x22/0xa0 Sep 9 14:26:17 host-114 kernel: [<ffffffff8123a55b>] ? selinux_file_permission+0xfb/0x150 Sep 9 14:26:17 host-114 kernel: [<ffffffff8122d3b6>] ? security_file_permission+0x16/0x20 Sep 9 14:26:17 host-114 kernel: [<ffffffff8118e7e5>] vfs_read+0xb5/0x1a0 Sep 9 14:26:17 host-114 kernel: [<ffffffff8118e921>] sys_read+0x51/0x90 Sep 9 14:26:17 host-114 kernel: [<ffffffff810e5aee>] ? __audit_syscall_exit+0x25e/0x290 Sep 9 14:26:17 host-114 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Quick update that this is still reproducable on mirror volumes in the latest 6.7 rpms. That said, it takes many iterations in order to hit this problem. 2.6.32-554.el6.x86_64 lvm2-2.02.118-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 lvm2-libs-2.02.118-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 lvm2-cluster-2.02.118-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 udev-147-2.61.el6 BUILT: Mon Mar 2 05:08:11 CST 2015 device-mapper-1.02.95-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 device-mapper-libs-1.02.95-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 device-mapper-event-1.02.95-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 device-mapper-event-libs-1.02.95-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 device-mapper-persistent-data-0.3.2-1.el6 BUILT: Fri Apr 4 08:43:06 CDT 2014 cmirror-2.02.118-2.el6 BUILT: Wed Apr 15 06:34:08 CDT 2015 ./revolution_9 -e kill_random_devices ================================================================================ Iteration 50.1 started at Tue Apr 28 07:00:48 CDT 2015 ================================================================================ ACTUAL LEG ORDER: /dev/sdc1 /dev/sda1 /dev/sde1 Scenario kill_random_devices: Kill random devices (log and legs) ********* Mirror info for this scenario ********* * mirrors: mirror_1 * leg devices: /dev/sdc1 /dev/sda1 /dev/sde1 * log devices: /dev/sdf1 * failpv(s): /dev/sdf1 /dev/sdc1 /dev/sda1 * failnode(s): host-073.virt.lab.msp.redhat.com * lvmetad: 0 * leg fault policy: remove * log fault policy: allocate ************************************************* Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices mirror_1 mwi-aom--- 6.00g 100.00 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0) [mirror_1_mimage_0] iwi-aom--- 6.00g /dev/sdc1(0) [mirror_1_mimage_1] iwi-aom--- 6.00g /dev/sda1(0) [mirror_1_mimage_2] iwi-aom--- 6.00g /dev/sde1(0) [mirror_1_mlog] lwi-aom--- 4.00m /dev/sdf1(0) * NOTE: not enough available devices for allocation fault polices to fully work * (well technically, since we have 2, some allocation should work) PV=/dev/sda1 mirror_1_mimage_1: 6 PV=/dev/sdf1 mirror_1_mlog: 1.1 PV=/dev/sdc1 mirror_1_mimage_0: 6 PV=/dev/sda1 mirror_1_mimage_1: 6 PV=/dev/sdf1 mirror_1_mlog: 1.1 PV=/dev/sdc1 mirror_1_mimage_0: 6 Writing verification files (checkit) to mirror(s) on... ---- host-073.virt.lab.msp.redhat.com ---- Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure Verifying files (checkit) on mirror(s) on... ---- host-073.virt.lab.msp.redhat.com ---- Disabling device sdf on host-073.virt.lab.msp.redhat.com Disabling device sdc on host-073.virt.lab.msp.redhat.com Disabling device sda on host-073.virt.lab.msp.redhat.com Getting recovery check start time from /var/log/messages: Apr 28 07:01 Attempting I/O to cause mirror down conversion(s) on host-073.virt.lab.msp.redhat.com 10+0 records in 10+0 records out 41943040 bytes (42 MB) copied, 0.0790474 s, 531 MB/s [root@host-073 ~]# lvs -a -o +devices Couldn't find device with uuid y4sg5S-IFm9-pJ7q-iogm-juYP-J0FO-XVsXj3. Couldn't find device with uuid qT9tZd-hBwb-L8Mi-6nRb-d2eW-fQSa-c65wg6. Couldn't find device with uuid pGhXZr-RD0N-g21I-UZkw-m1X4-A0YU-P0ykAk. LV VG Attr LSize Log Cpy%Sync Devices mirror_1 revolution_9 mwi-aom-p- 6.00g 0.00 mirror_1_mimage_1(0),mirror_1_mimage_2(0) [mirror_1_mimage_1] revolution_9 Iwi-aom-p- 6.00g unknown device(0) [mirror_1_mimage_2] revolution_9 Iwi-aom--- 6.00g /dev/sde1(0) mirror_1_mlog revolution_9 -wi-----p- 4.00m unknown device(0) Apr 28 07:01:27 host-073 qarshd[22827]: Running cmdline: echo offline > /sys/block/sdf/device/state Apr 28 07:01:27 host-073 qarshd[22829]: Running cmdline: echo offline > /sys/block/sdc/device/state Apr 28 07:01:27 host-073 lvm[22580]: Log device 253:2 has failed (D). Apr 28 07:01:27 host-073 lvm[22580]: Device failure in revolution_9-mirror_1. Apr 28 07:01:27 host-073 lvm[22580]: /dev/sdc1: read failed after 0 of 4096 at 0: Input/output error Apr 28 07:01:27 host-073 lvm[22580]: /dev/sdf1: read failed after 0 of 4096 at 0: Input/output error Apr 28 07:01:27 host-073 qarshd[22831]: Running cmdline: echo offline > /sys/block/sda/device/state [...] Apr 28 07:01:27 host-073 lvm[22580]: WARNING: Inconsistent metadata found for VG revolution_9 - updating to use version 504 Apr 28 07:01:27 host-073 lvm[22580]: WARNING: Failed to write an MDA of VG revolution_9. Apr 28 07:01:27 host-073 lvm[22580]: WARNING: Failed to write an MDA of VG revolution_9. Apr 28 07:01:32 host-073 qarshd[22857]: Running cmdline: dd if=/dev/zero of=/mnt/mirror_1/ddfile count=10 bs=4M Apr 28 07:01:32 host-073 qarshd[22859]: Running cmdline: sync [...] Apr 28 07:01:35 host-073 kernel: device-mapper: raid1: log postsuspend failed Apr 28 07:01:35 host-073 lvm[22580]: Internal error: Performing unsafe table load while 5 device(s) are known to be suspended: (253:9) Apr 28 07:01:35 host-073 lvm[22580]: Internal error: Performing unsafe table load while 5 device(s) are known to be suspended: (253:4) Apr 28 07:01:35 host-073 kernel: device-mapper: raid1: Unable to read primary mirror during recovery Apr 28 07:01:35 host-073 kernel: device-mapper: raid1: Primary mirror (253:4) failed while out-of-sync: Reads may fail. Apr 28 07:01:35 host-073 lvm[22580]: Monitoring mirror device revolution_9-mirror_1 for events. Apr 28 07:01:35 host-073 lvm[22580]: /dev/sda1: open failed: No such device or address Apr 28 07:01:35 host-073 lvm[22580]: WARNING: Failed to write an MDA of VG revolution_9. Apr 28 07:01:35 host-073 kernel: device-mapper: raid1: Unable to read primary mirror during recovery Apr 28 07:01:35 host-073 kernel: device-mapper: raid1: Unable to read primary mirror during recovery Apr 28 07:01:35 host-073 kernel: device-mapper: raid1: Unable to read primary mirror during recovery Apr 28 07:01:35 host-073 kernel: device-mapper: raid1: Unable to read primary mirror during recovery Apr 28 07:01:35 host-073 lvm[22580]: WARNING: Failed to write an MDA of VG revolution_9. Apr 28 07:01:35 host-073 lvm[22580]: WARNING: Failed to write an MDA of VG revolution_9. Apr 28 07:01:35 host-073 kernel: device-mapper: table: 253:4: linear: dm-linear: Device lookup failed Apr 28 07:01:35 host-073 kernel: device-mapper: ioctl: error adding target to table Apr 28 07:01:35 host-073 lvm[22580]: Failed to lock logical volume revolution_9/mirror_1. Apr 28 07:01:35 host-073 lvm[22580]: /dev/sda1: open failed: No such device or address Apr 28 07:01:35 host-073 lvm[22580]: Attempt to close device '/dev/sda1' which is not open. Apr 28 07:01:35 host-073 lvm[22580]: Repair of mirrored device revolution_9-mirror_1 failed. Apr 28 07:01:35 host-073 lvm[22580]: Failed to remove faulty devices in revolution_9-mirror_1. Apr 28 07:01:35 host-073 lvm[22580]: Primary mirror device 253:4 sync failed. Apr 28 07:01:37 host-073 lvm[22580]: No longer monitoring mirror device revolution_9-mirror_1 for events. Apr 28 07:04:40 host-073 kernel: INFO: task flush-253:6:22909 blocked for more than 120 seconds. Apr 28 07:04:40 host-073 kernel: Not tainted 2.6.32-554.el6.x86_64 #1 Apr 28 07:04:40 host-073 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 28 07:04:40 host-073 kernel: flush-253:6 D 0000000000000000 0 22909 2 0x00000080 Apr 28 07:04:40 host-073 kernel: ffff88002eeb3840 0000000000000046 ffff880000000000 0000830b5203cba2 Apr 28 07:04:40 host-073 kernel: ffff880000822f60 ffff88003a3069f0 000000000673ac22 ffffffffaac19ea0 Apr 28 07:04:40 host-073 kernel: 000000001e4188ab 0000000104366706 ffff88003b4cc5f8 ffff88002eeb3fd8 Apr 28 07:04:40 host-073 kernel: Call Trace: Apr 28 07:04:40 host-073 kernel: [<ffffffff811c8580>] ? sync_buffer+0x0/0x50 Apr 28 07:04:40 host-073 kernel: [<ffffffff81537bc3>] io_schedule+0x73/0xc0 Apr 28 07:04:40 host-073 kernel: [<ffffffff811c85c0>] sync_buffer+0x40/0x50 Apr 28 07:04:40 host-073 kernel: [<ffffffff8153845a>] __wait_on_bit_lock+0x5a/0xc0 Apr 28 07:04:40 host-073 kernel: [<ffffffff811c8580>] ? sync_buffer+0x0/0x50 Apr 28 07:04:40 host-073 kernel: [<ffffffff81538538>] out_of_line_wait_on_bit_lock+0x78/0x90 Apr 28 07:04:40 host-073 kernel: [<ffffffff810a1520>] ? wake_bit_function+0x0/0x50 Apr 28 07:04:40 host-073 kernel: [<ffffffff811c88f0>] ? end_buffer_async_write+0x0/0x190 Apr 28 07:04:40 host-073 kernel: [<ffffffff811c8766>] __lock_buffer+0x36/0x40 Apr 28 07:04:40 host-073 kernel: [<ffffffff811c9a65>] __block_write_full_page+0x305/0x330 Apr 28 07:04:40 host-073 kernel: [<ffffffff811c88f0>] ? end_buffer_async_write+0x0/0x190 Apr 28 07:04:40 host-073 kernel: [<ffffffff811c9b70>] block_write_full_page_endio+0xe0/0x120 Apr 28 07:04:40 host-073 kernel: [<ffffffffa0480d10>] ? buffer_unmapped+0x0/0x20 [ext3] Apr 28 07:04:40 host-073 kernel: [<ffffffff811c9bc5>] block_write_full_page+0x15/0x20 Apr 28 07:04:40 host-073 kernel: [<ffffffffa04818bd>] ext3_ordered_writepage+0x1ed/0x240 [ext3] Apr 28 07:04:40 host-073 kernel: [<ffffffff8113b087>] __writepage+0x17/0x40 Apr 28 07:04:40 host-073 kernel: [<ffffffff8113c34d>] write_cache_pages+0x1fd/0x4c0 Apr 28 07:04:40 host-073 kernel: [<ffffffff8113b070>] ? __writepage+0x0/0x40 Apr 28 07:04:40 host-073 kernel: [<ffffffff8113c634>] generic_writepages+0x24/0x30 Apr 28 07:04:40 host-073 kernel: [<ffffffff8113c675>] do_writepages+0x35/0x40 Apr 28 07:04:40 host-073 kernel: [<ffffffff811be40d>] writeback_single_inode+0xdd/0x290 Apr 28 07:04:40 host-073 kernel: [<ffffffff811be80d>] writeback_sb_inodes+0xbd/0x170 Apr 28 07:04:40 host-073 kernel: [<ffffffff811be96b>] writeback_inodes_wb+0xab/0x1b0 Apr 28 07:04:40 host-073 kernel: [<ffffffff811bed63>] wb_writeback+0x2f3/0x410 Apr 28 07:04:40 host-073 kernel: [<ffffffff8108a772>] ? del_timer_sync+0x22/0x30 Apr 28 07:04:40 host-073 kernel: [<ffffffff811bef3b>] wb_do_writeback+0xbb/0x240 Apr 28 07:04:40 host-073 kernel: [<ffffffff811bf123>] bdi_writeback_task+0x63/0x1b0 Apr 28 07:04:40 host-073 kernel: [<ffffffff810a1327>] ? bit_waitqueue+0x17/0xd0 Apr 28 07:04:40 host-073 kernel: [<ffffffff8114b380>] ? bdi_start_fn+0x0/0x100 Apr 28 07:04:40 host-073 kernel: [<ffffffff8114b406>] bdi_start_fn+0x86/0x100 Apr 28 07:04:40 host-073 kernel: [<ffffffff8114b380>] ? bdi_start_fn+0x0/0x100 Apr 28 07:04:40 host-073 kernel: [<ffffffff810a100e>] kthread+0x9e/0xc0 Apr 28 07:04:40 host-073 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20 Apr 28 07:04:40 host-073 kernel: [<ffffffff810a0f70>] ? kthread+0x0/0xc0 Apr 28 07:04:40 host-073 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 |