| Summary: | RAID volumes containing snapshot devices fail to repair after device failure | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 6.3 | CC: | agk, dwysocha, heinzm, jbrassow, mbroz, msnitzer, prajnoha, prockai, thornber, zkabelac |
| Target Milestone: | rc | Keywords: | Reopened, TestBlocker |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | lvm2-2.02.95-5.el6 | Doc Type: | Bug Fix |
| Doc Text: | New feature to rhel6.3. No documentation necessary. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-06-20 15:03:15 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 739162 | ||
This fails for raid1 volumes as well.

taft-01 qarshd[7461]: Running cmdline: echo offline > /sys/block/sdh/device/state &
taft-01 kernel: md: super_written gets error=-5, uptodate=0
taft-01 kernel: md/raid1:mdX: Disk failure on dm-3, disabling device.
taft-01 kernel: md/raid1:mdX: Operation continuing on 2 devices.
taft-01 lvm[1217]: Device #0 of raid1 array, black_bird-synced_primary_raid1_2legs_1-real, has failed.
taft-01 lvm[1217]: device-mapper: waitevent ioctl on failed: Interrupted system call
taft-01 kernel: device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
taft-01 kernel: md/raid1:mdX: active with 2 out of 3 mirrors
taft-01 kernel: created bitmap (1 pages) for device mdX
taft-01 kernel: mdX: bitmap initialized from disk: read 1/1 pages, set 4 of 1000 bits
taft-01 kernel: md: recovery of RAID array mdX
taft-01 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
taft-01 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
taft-01 kernel: md: using 128k window, over a total of 512000k.
taft-01 kernel: md: md_do_sync() got signal ... exiting
taft-01 kernel: INFO: task dmeventd:7418 blocked for more than 120 seconds.
taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
taft-01 kernel: dmeventd      D 0000000000000000     0  7418      1 0x00000080
taft-01 kernel: ffff880217e55aa8 0000000000000082 0000000000000000 0000000000000286
taft-01 kernel: ffffe8ffffc02480 0000000000000282 ffff880218ba4cd8 0000000000000282
taft-01 kernel: ffff880215e4d058 ffff880217e55fd8 000000000000fb48 ffff880215e4d058
taft-01 kernel: Call Trace:
taft-01 kernel: [<ffffffff814f8895>] schedule_timeout+0x215/0x2e0
taft-01 kernel: [<ffffffff8125331e>] ? generic_make_request+0x25e/0x530
taft-01 kernel: [<ffffffffa000261b>] ? dm_get_live_table+0x4b/0x60 [dm_mod]
taft-01 kernel: [<ffffffff814f8513>] wait_for_common+0x123/0x180
taft-01 kernel: [<ffffffff8105ebc0>] ? default_wake_function+0x0/0x20
taft-01 kernel: [<ffffffff8125367d>] ? submit_bio+0x8d/0x120
taft-01 kernel: [<ffffffff814f862d>] wait_for_completion+0x1d/0x20
taft-01 kernel: [<ffffffff813f3b0e>] sync_page_io+0xbe/0x110
taft-01 kernel: [<ffffffffa039edd9>] raid_ctr+0xd69/0xfe4 [dm_raid]
taft-01 kernel: [<ffffffffa0005f2f>] dm_table_add_target+0x13f/0x3b0 [dm_mod]
taft-01 kernel: [<ffffffffa0008559>] table_load+0xc9/0x340 [dm_mod]
taft-01 kernel: [<ffffffffa00094b4>] ctl_ioctl+0x1b4/0x270 [dm_mod]
taft-01 kernel: [<ffffffffa0008490>] ? table_load+0x0/0x340 [dm_mod]
taft-01 kernel: [<ffffffffa0009583>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
taft-01 kernel: [<ffffffff8118bfc2>] vfs_ioctl+0x22/0xa0
taft-01 kernel: [<ffffffff8118c164>] do_vfs_ioctl+0x84/0x580
taft-01 kernel: [<ffffffff8118c6e1>] sys_ioctl+0x81/0xa0
taft-01 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Fails for raid4 and raid6 as well.

I've been hunting for this one for a while. I just noticed this from above:

"Unable to extract RAID image while RAID array is not in-sync"

At first I didn't think that was right, because I thought this was RAID1. For RAID5 it makes perfect sense: if a device is removed before the array is in-sync, the array contents will be undefined. Better to abort the device replacement and have the admin try to fix the failed device... I'm going to close this as "not a bug". If the RAID5 device is in-sync when the failure happens, then we should re-open this bug.

Missed the last comment. That means the bug can be triggered even when the array is in-sync (or when it is RAID1).
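For anyone trying to hit this path by hand, the sequence below is a minimal sketch of the failure injection used in the log above (the device name sdh, the VG black_bird, and the LV name are taken from that log; adjust them for your own setup):

    echo offline > /sys/block/sdh/device/state                   # simulate the leg failure
    dd if=/dev/zero of=/dev/black_bird/synced_primary_raid1_2legs_1 bs=512 count=1 oflag=direct   # force I/O so the kernel notices the failure
    dmsetup status black_bird-synced_primary_raid1_2legs_1-real  # per-device health chars and sync ratio of the dm-raid target
    lvs -a -o name,attr,copy_percent,devices black_bird          # check whether dmeventd managed to replace the failed leg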
Patch seems to have taken care of the problem:

[root@bp-01 LVM2]# dmsetup status vg-lv-real; devices vg
0 10485760 raid raid5_ls 5 AAAAA 2621440/2621440
  LV            Copy%  Devices
  lv                   lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0),lv_rimage_3(0),lv_rimage_4(0)
  [lv_rimage_0]        /dev/sdb1(1)
  [lv_rimage_1]        /dev/sde1(1)
  [lv_rimage_2]        /dev/sdc1(1)
  [lv_rimage_3]        /dev/sdf1(1)
  [lv_rimage_4]        /dev/sdh1(1)
  [lv_rmeta_0]         /dev/sdb1(0)
  [lv_rmeta_1]         /dev/sde1(0)
  [lv_rmeta_2]         /dev/sdc1(0)
  [lv_rmeta_3]         /dev/sdf1(0)
  [lv_rmeta_4]         /dev/sdh1(0)
  snap                 /dev/sde1(321)

[root@bp-01 LVM2]# off.sh sdb
Turning off sdb

[root@bp-01 LVM2]# !dd
dd if=/dev/zero of=/dev/vg/lv bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 12.2158 s, 3.4 MB/s

[root@bp-01 LVM2]# dmsetup status vg-lv-real; devices vg
0 10485760 raid raid5_ls 5 aAAAA 931920/2621440
  Couldn't find device with uuid iHklfm-tiU5-jLv0-fwHB-3j2Q-o5h1-tUZ21O.
  LV            Copy%  Devices
  lv                   lv_rimage_0(0),lv_rimage_1(0),lv_rimage_2(0),lv_rimage_3(0),lv_rimage_4(0)
  [lv_rimage_0]        /dev/sdi1(1)
  [lv_rimage_1]        /dev/sde1(1)
  [lv_rimage_2]        /dev/sdc1(1)
  [lv_rimage_3]        /dev/sdf1(1)
  [lv_rimage_4]        /dev/sdh1(1)
  [lv_rmeta_0]         /dev/sdi1(0)
  [lv_rmeta_1]         /dev/sde1(0)
  [lv_rmeta_2]         /dev/sdc1(0)
  [lv_rmeta_3]         /dev/sdf1(0)
  [lv_rmeta_4]         /dev/sdh1(0)
  snap                 /dev/sde1(321)
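A quick way to confirm that the rebuild onto the replacement PV (sdi1 above) actually finishes is to poll the -real device until every health character is 'A' and the sync ratio reaches its total. This is only a sketch; the device name vg-lv-real and the field positions match the dm-raid status lines shown in the session above (start, length, "raid", raid type, device count, health string, synced/total sectors):

    until dmsetup status vg-lv-real | awk '{ split($7, s, "/"); exit !($6 !~ /[aD]/ && s[1] == s[2]) }'; do
        sleep 5        # 'a' = alive but not yet in-sync, 'D' = failed
    done
    echo "rebuild complete"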
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
New feature to rhel6.3. No documentation necessary.
Fix verified with the latest rpms and upstream kernel containing a fix for bug 811669.

2.6.32-265.el6.brassow.1.x86_64
lvm2-2.02.95-5.el6                       BUILT: Thu Apr 19 10:29:01 CDT 2012
lvm2-libs-2.02.95-5.el6                  BUILT: Thu Apr 19 10:29:01 CDT 2012
lvm2-cluster-2.02.95-5.el6               BUILT: Thu Apr 19 10:29:01 CDT 2012
udev-147-2.41.el6                        BUILT: Thu Mar 1 13:01:08 CST 2012
device-mapper-1.02.74-5.el6              BUILT: Thu Apr 19 10:29:01 CDT 2012
device-mapper-libs-1.02.74-5.el6         BUILT: Thu Apr 19 10:29:01 CDT 2012
device-mapper-event-1.02.74-5.el6        BUILT: Thu Apr 19 10:29:01 CDT 2012
device-mapper-event-libs-1.02.74-5.el6   BUILT: Thu Apr 19 10:29:01 CDT 2012
cmirror-2.02.95-5.el6                    BUILT: Thu Apr 19 10:29:01 CDT 2012

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html
Description of problem:
./black_bird -l /home/msp/cmarthal/work/sts/sts-root -r /usr/tests/sts-rhel6.3 -o taft-01 -F -w NONE -i 2 -e kill_random_synced_raid5_2legs

Scenario kill_random_synced_raid5_2legs: Kill random leg of synced 2 leg raid5 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_random_raid5_2legs_1
* sync:               1
* type:               raid5
* -m |-i value:       2
* leg devices:        /dev/sdc1 /dev/sdd1 /dev/sdb1
* failpv(s):          /dev/sdd1
* failnode(s):        taft-01
* additional snap:    /dev/sdc1
* raid fault policy:  allocate
******************************************************

Creating raids(s) on taft-01...
taft-01: lvcreate --type raid5 -i 2 -n synced_random_raid5_2legs_1 -L 500M black_bird /dev/sdc1:0-1000 /dev/sdd1:0-1000 /dev/sdb1:0-1000

Creating a snapshot volume of each of the raids

RAID Structure(s):
  LV                                      Attr     LSize   Copy%  Devices
  bb_snap1                                swi-a-s- 252.00m        /dev/sdc1(64)
  synced_random_raid5_2legs_1             owi-a-r- 504.00m        synced_random_raid5_2legs_1_rimage_0(0),synced_random_raid5_2legs_1_rimage_1(0),synced_random_raid5_2legs_1_rimage_2(0)
  [synced_random_raid5_2legs_1_rimage_0]  Iwi-aor- 252.00m        /dev/sdc1(1)
  [synced_random_raid5_2legs_1_rimage_1]  Iwi-aor- 252.00m        /dev/sdd1(1)
  [synced_random_raid5_2legs_1_rimage_2]  Iwi-aor- 252.00m        /dev/sdb1(1)
  [synced_random_raid5_2legs_1_rmeta_0]   ewi-aor-   4.00m        /dev/sdc1(0)
  [synced_random_raid5_2legs_1_rmeta_1]   ewi-aor-   4.00m        /dev/sdd1(0)
  [synced_random_raid5_2legs_1_rmeta_2]   ewi-aor-   4.00m        /dev/sdb1(0)

PV=/dev/sdd1
        synced_random_raid5_2legs_1_rimage_1: 1
        synced_random_raid5_2legs_1_rmeta_1: 1

Disabling device sdd on taft-01

Attempting I/O to cause mirror down conversion(s) on taft-01
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0317989 s, 16.1 kB/s

Verifying current sanity of lvm after the failure

RAID Structure(s):
  /dev/sdd1: read failed after 0 of 512 at 145669554176: Input/output error
  [...]
  Couldn't find device with uuid Ii9puH-3AfI-h9yO-fyhJ-A2GZ-abFe-nhToKu.
  LV                                      Attr     LSize   Copy%  Devices
  bb_snap1                                swi-a-s- 252.00m        /dev/sdc1(64)
  synced_random_raid5_2legs_1             owi-a-r- 504.00m        synced_random_raid5_2legs_1_rimage_0(0),synced_random_raid5_2legs_1_rimage_1(0),synced_random_raid5_2legs_1_rimage_2(0)
  [synced_random_raid5_2legs_1_rimage_0]  Iwi-aor- 252.00m        /dev/sdc1(1)
  [synced_random_raid5_2legs_1_rimage_1]  Iwi-aor- 252.00m        unknown device(1)
  [synced_random_raid5_2legs_1_rimage_2]  Iwi-aor- 252.00m        /dev/sdb1(1)
  [synced_random_raid5_2legs_1_rmeta_0]   ewi-aor-   4.00m        /dev/sdc1(0)
  [synced_random_raid5_2legs_1_rmeta_1]   ewi-aor-   4.00m        unknown device(0)
  [synced_random_raid5_2legs_1_rmeta_2]   ewi-aor-   4.00m        /dev/sdb1(0)

Verifying FAILED device /dev/sdd1 is *NOT* in the volume(s)
Verifying IMAGE device /dev/sdc1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdb1 *IS* in the volume(s)
verify the rimage/rmeta dm devices remain after the failures
Checking EXISTENCE and STATE of synced_random_raid5_2legs_1_rimage_1 on: taft-01

there should not be an 'unknown' device associated with synced_random_raid5_2legs_1_rimage_1 on taft-01

Mar 26 17:16:26 taft-01 lvm[3123]: Unable to extract RAID image while RAID array is not in-sync
Mar 26 17:16:26 taft-01 lvm[3123]: Failed to remove the specified images from black_bird/synced_random_raid5_2legs_1
Mar 26 17:16:26 taft-01 lvm[3123]: Failed to replace faulty devices in black_bird/synced_random_raid5_2legs_1.
Mar 26 17:16:26 taft-01 lvm[3123]: Repair of RAID device black_bird-synced_random_raid5_2legs_1-real failed.
Mar 26 17:16:26 taft-01 lvm[3123]: Failed to process event for black_bird-synced_random_raid5_2legs_1-real

Version-Release number of selected component (if applicable):
2.6.32-251.el6.x86_64
lvm2-2.02.95-2.el6                       BUILT: Fri Mar 16 08:39:54 CDT 2012
lvm2-libs-2.02.95-2.el6                  BUILT: Fri Mar 16 08:39:54 CDT 2012
lvm2-cluster-2.02.95-2.el6               BUILT: Fri Mar 16 08:39:54 CDT 2012
udev-147-2.40.el6                        BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.74-2.el6              BUILT: Fri Mar 16 08:39:54 CDT 2012
device-mapper-libs-1.02.74-2.el6         BUILT: Fri Mar 16 08:39:54 CDT 2012
device-mapper-event-1.02.74-2.el6        BUILT: Fri Mar 16 08:39:54 CDT 2012
device-mapper-event-libs-1.02.74-2.el6   BUILT: Fri Mar 16 08:39:54 CDT 2012
cmirror-2.02.95-2.el6                    BUILT: Fri Mar 16 08:39:54 CDT 2012

How reproducible:
Every time
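For reference, the scenario can be approximated without the black_bird harness roughly as follows. This is only a sketch: the lvcreate line is the one quoted in the description, the snapshot size matches bb_snap1 above, and raid_fault_policy = "allocate" must be set in the activation section of /etc/lvm/lvm.conf so that dmeventd attempts the replacement.

    lvcreate --type raid5 -i 2 -n synced_random_raid5_2legs_1 -L 500M black_bird /dev/sdc1:0-1000 /dev/sdd1:0-1000 /dev/sdb1:0-1000
    lvcreate -s -n bb_snap1 -L 252M black_bird/synced_random_raid5_2legs_1 /dev/sdc1   # snapshot forces the -real layer on the origin
    # wait until 'lvs black_bird' shows Copy% at 100.00, then fail the chosen leg and force I/O
    echo offline > /sys/block/sdd/device/state
    dd if=/dev/zero of=/dev/black_bird/synced_random_raid5_2legs_1 bs=512 count=1 oflag=direct
    lvs -a -o name,attr,copy_percent,devices black_bird   # rimage_1/rmeta_1 should end up on a replacement PV, not "unknown device"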