Bug 807069
| Summary: | raid volumes containing snapshot devices fail to repair after device failure | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.3 | CC: | agk, dwysocha, heinzm, jbrassow, mbroz, msnitzer, prajnoha, prockai, thornber, zkabelac |
| Target Milestone: | rc | Keywords: | Reopened, TestBlocker |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.95-5.el6 | Doc Type: | Bug Fix |
| Doc Text: | New feature to rhel6.3. No documentation necessary. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-06-20 15:03:15 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 739162 | | |
Description (Corey Marthaler, 2012-03-26 22:30:10 UTC)
This fails for raid1 volumes as well.

taft-01 qarshd[7461]: Running cmdline: echo offline > /sys/block/sdh/device/state &
taft-01 kernel: md: super_written gets error=-5, uptodate=0
taft-01 kernel: md/raid1:mdX: Disk failure on dm-3, disabling device.
taft-01 kernel: md/raid1:mdX: Operation continuing on 2 devices.
taft-01 lvm[1217]: Device #0 of raid1 array, black_bird-synced_primary_raid1_2legs_1-real, has failed.
taft-01 lvm[1217]: device-mapper: waitevent ioctl on  failed: Interrupted system call
taft-01 kernel: device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
taft-01 kernel: md/raid1:mdX: active with 2 out of 3 mirrors
taft-01 kernel: created bitmap (1 pages) for device mdX
taft-01 kernel: mdX: bitmap initialized from disk: read 1/1 pages, set 4 of 1000 bits
taft-01 kernel: md: recovery of RAID array mdX
taft-01 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
taft-01 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
taft-01 kernel: md: using 128k window, over a total of 512000k.
taft-01 kernel: md: md_do_sync() got signal ... exiting
taft-01 kernel: INFO: task dmeventd:7418 blocked for more than 120 seconds.
taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
taft-01 kernel: dmeventd D 0000000000000000 0 7418 1 0x00000080
taft-01 kernel: ffff880217e55aa8 0000000000000082 0000000000000000 0000000000000286
taft-01 kernel: ffffe8ffffc02480 0000000000000282 ffff880218ba4cd8 0000000000000282
taft-01 kernel: ffff880215e4d058 ffff880217e55fd8 000000000000fb48 ffff880215e4d058
taft-01 kernel: Call Trace:
taft-01 kernel: [<ffffffff814f8895>] schedule_timeout+0x215/0x2e0
taft-01 kernel: [<ffffffff8125331e>] ? generic_make_request+0x25e/0x530
taft-01 kernel: [<ffffffffa000261b>] ? dm_get_live_table+0x4b/0x60 [dm_mod]
taft-01 kernel: [<ffffffff814f8513>] wait_for_common+0x123/0x180
taft-01 kernel: [<ffffffff8105ebc0>] ? default_wake_function+0x0/0x20
taft-01 kernel: [<ffffffff8125367d>] ? submit_bio+0x8d/0x120
taft-01 kernel: [<ffffffff814f862d>] wait_for_completion+0x1d/0x20
taft-01 kernel: [<ffffffff813f3b0e>] sync_page_io+0xbe/0x110
taft-01 kernel: [<ffffffffa039edd9>] raid_ctr+0xd69/0xfe4 [dm_raid]
taft-01 kernel: [<ffffffffa0005f2f>] dm_table_add_target+0x13f/0x3b0 [dm_mod]
taft-01 kernel: [<ffffffffa0008559>] table_load+0xc9/0x340 [dm_mod]
taft-01 kernel: [<ffffffffa00094b4>] ctl_ioctl+0x1b4/0x270 [dm_mod]
taft-01 kernel: [<ffffffffa0008490>] ? table_load+0x0/0x340 [dm_mod]
taft-01 kernel: [<ffffffffa0009583>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
taft-01 kernel: [<ffffffff8118bfc2>] vfs_ioctl+0x22/0xa0
taft-01 kernel: [<ffffffff8118c164>] do_vfs_ioctl+0x84/0x580
taft-01 kernel: [<ffffffff8118c6e1>] sys_ioctl+0x81/0xa0
taft-01 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Fails for raid4 and raid6 as well.

I've been hunting for this one for a while. I just noticed this from above: "Unable to extract RAID image while RAID array is not in-sync". At first I didn't think that was right because I thought this was RAID1, but for RAID5 it makes perfect sense: if a device is removed before the array is in-sync, the array contents will be undefined. It is better to abort the device replacement and have the admin try to fix the failed device. I'm going to close this as "not a bug". If the RAID5 device is in-sync when the failure happens, then we should re-open this bug.

Missed the last comment. This means the bug can be hit even when the array is in-sync (or under RAID1).
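For reference, the failure scenario can be reproduced along roughly these lines. This is only a sketch: the device paths, sizes, and the snapshot LV name are hypothetical stand-ins modelled on the log above, and automatic repair assumes dmeventd is monitoring the LV with raid_fault_policy set to "allocate" in lvm.conf.

# Test VG on three devices (hypothetical paths)
pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1
vgcreate black_bird /dev/sdb1 /dev/sdc1 /dev/sdd1

# 3-leg raid1 LV plus a snapshot; wait for the initial sync to reach 100%
lvcreate --type raid1 -m 2 -L 500M -n synced_primary_raid1_2legs_1 black_bird
lvcreate -s -L 100M -n bb_snap black_bird/synced_primary_raid1_2legs_1

# Fail one leg the same way the test harness does, then generate I/O
echo offline > /sys/block/sdb/device/state
dd if=/dev/zero of=/dev/black_bird/synced_primary_raid1_2legs_1 bs=4M count=10 oflag=direct

# dmeventd should notice the failure and rebuild onto a spare PV; watch progress
lvs -a -o +devices black_bird
dmsetup status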
Patch seems to have taken care of the problem:

[root@bp-01 LVM2]# dmsetup status vg-lv-real; devices vg
0 10485760 raid raid5_ls 5 AAAAA 2621440/2621440
  LV            Copy%  Devices
  lv                   lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0),lv_rimage_3(0),lv_rimage_4(0)
  [lv_rimage_0]        /dev/sdb1(1)
  [lv_rimage_1]        /dev/sde1(1)
  [lv_rimage_2]        /dev/sdc1(1)
  [lv_rimage_3]        /dev/sdf1(1)
  [lv_rimage_4]        /dev/sdh1(1)
  [lv_rmeta_0]         /dev/sdb1(0)
  [lv_rmeta_1]         /dev/sde1(0)
  [lv_rmeta_2]         /dev/sdc1(0)
  [lv_rmeta_3]         /dev/sdf1(0)
  [lv_rmeta_4]         /dev/sdh1(0)
  snap                 /dev/sde1(321)

[root@bp-01 LVM2]# off.sh sdb
Turning off sdb

[root@bp-01 LVM2]# !dd
dd if=/dev/zero of=/dev/vg/lv bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 12.2158 s, 3.4 MB/s

[root@bp-01 LVM2]# dmsetup status vg-lv-real; devices vg
0 10485760 raid raid5_ls 5 aAAAA 931920/2621440
  Couldn't find device with uuid iHklfm-tiU5-jLv0-fwHB-3j2Q-o5h1-tUZ21O.
  LV            Copy%  Devices
  lv                   lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0),lv_rimage_3(0),lv_rimage_4(0)
  [lv_rimage_0]        /dev/sdi1(1)
  [lv_rimage_1]        /dev/sde1(1)
  [lv_rimage_2]        /dev/sdc1(1)
  [lv_rimage_3]        /dev/sdf1(1)
  [lv_rimage_4]        /dev/sdh1(1)
  [lv_rmeta_0]         /dev/sdi1(0)
  [lv_rmeta_1]         /dev/sde1(0)
  [lv_rmeta_2]         /dev/sdc1(0)
  [lv_rmeta_3]         /dev/sdf1(0)
  [lv_rmeta_4]         /dev/sdh1(0)
  snap                 /dev/sde1(321)

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
New feature to rhel6.3. No documentation necessary.

Fix verified with the latest rpms and upstream kernel containing a fix for bug 811669.

2.6.32-265.el6.brassow.1.x86_64
lvm2-2.02.95-5.el6                        BUILT: Thu Apr 19 10:29:01 CDT 2012
lvm2-libs-2.02.95-5.el6                   BUILT: Thu Apr 19 10:29:01 CDT 2012
lvm2-cluster-2.02.95-5.el6                BUILT: Thu Apr 19 10:29:01 CDT 2012
udev-147-2.41.el6                         BUILT: Thu Mar 1 13:01:08 CST 2012
device-mapper-1.02.74-5.el6               BUILT: Thu Apr 19 10:29:01 CDT 2012
device-mapper-libs-1.02.74-5.el6          BUILT: Thu Apr 19 10:29:01 CDT 2012
device-mapper-event-1.02.74-5.el6         BUILT: Thu Apr 19 10:29:01 CDT 2012
device-mapper-event-libs-1.02.74-5.el6    BUILT: Thu Apr 19 10:29:01 CDT 2012
cmirror-2.02.95-5.el6                     BUILT: Thu Apr 19 10:29:01 CDT 2012

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html
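The verification above relies on dmeventd rebuilding the array automatically (governed by the raid_fault_policy setting in lvm.conf). If that policy is left at "warn", or the daemon cannot complete the rebuild, the same replacement can be driven by hand. A rough sketch with hypothetical VG, LV, and PV names, not taken from this report:

# Replace the failed leg; LVM allocates space from a free PV in the VG
lvconvert --repair vg/lv

# Or restrict the replacement to a specific spare PV (hypothetical /dev/sdi1)
lvconvert --repair vg/lv /dev/sdi1

# Confirm the new image is resynchronizing and eventually reaches 100%
lvs -a -o name,copy_percent,devices vg
dmsetup status vg-lv-real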