Bug 889368
Summary: | LVM RAID: I/O can hang if entire stripe (mirror group) of RAID10 LV is killed while under snapshot | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Jonathan Earl Brassow <jbrassow> | ||||
Component: | kernel | Assignee: | Mikuláš Patočka <mpatocka> | ||||
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 6.4 | CC: | agk, cmarthal, coughlan, dwysocha, heinzm, jbrassow, mpatocka, msnitzer, prajnoha, prockai, thornber, tlavigne, wgomerin, zkabelac | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | kernel-2.6.32-609.el6 | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | 886658 | Environment: | |||||
Last Closed: | 2016-05-10 21:47:36 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 886658 | ||||||
Bug Blocks: | 1217621, 1268411 | ||||||
Attachments: |
|
Comment 2
Jonathan Earl Brassow
2012-12-21 17:49:46 UTC
Making this a 6.7 discussion.

I don't remember the specifics of the snapshot code, but from comment 2 it is related to 'retry_origin_bios' in dm-snap.c. You don't need RAID to reproduce this either; any device will do. In this case, I used a striped LV as the origin.

[root@bp-01 ~]# lvs vg
LV     VG Attr       LSize   Pool Origin Data%  Meta% Move Log Cpy%Sync Convert
snap   vg swi-I-s---  50.00g      stripe 100.00
stripe vg owi-aos--- 100.00g

Steps to reproduce:
1) create LV
2) create a snapshot of it.
3) start I/O to the origin (dd)
4) kill a device in the origin
**) the 'dd' will never complete due to indefinite retry of bios. It should emit errors.

I get a lot of the following messages also:
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3
Oct 20 14:41:10 bp-01 kernel: Buffer I/O error on device dm-3, logical block 572824
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3
Oct 20 14:41:10 bp-01 kernel: Buffer I/O error on device dm-3, logical block 572825
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3

[root@bp-01 ~]# ls -l /dev/vg
total 0
lrwxrwxrwx. 1 root root 7 Oct 20 14:33 snap -> ../dm-6
lrwxrwxrwx. 1 root root 7 Oct 20 14:33 stripe -> ../dm-3

Created attachment 1113054 [details]
RHEL 6 patch
The patch, backported to RHEL 6
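The reproduction steps in this report can be sketched as a small script. This is only an illustration under stated assumptions: the variable names (VG, ORIGIN, SNAP, FAIL_DISK, DRY_RUN, WAIT_SECS), the `run` helper, the 5-second delay, and the polling timeout are hypothetical and not part of the report. The lvcreate, dd, and sysfs commands are the ones that appear in the comments of this bug. By default the sketch only prints the commands (dry run), because a real run needs root and takes a disk offline.

```shell
#!/bin/bash
# Hypothetical reproducer sketch. All names below are illustrative defaults.
VG=${VG:-vg}                 # volume group that holds the origin
ORIGIN=${ORIGIN:-stripe}     # origin LV name
SNAP=${SNAP:-snap}           # snapshot LV name
FAIL_DISK=${FAIL_DISK:-sdb}  # disk under the origin to take offline
DRY_RUN=${DRY_RUN:-1}        # 1 = print commands only, 0 = execute them
WAIT_SECS=${WAIT_SECS:-60}   # how long to wait for dd before reporting a hang

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # 1) create the origin LV, 2) create a snapshot of it
    run lvcreate -L 20G -i 2 -n "$ORIGIN" "$VG"
    run lvcreate -s "$VG/$ORIGIN" -n "$SNAP" -L 5G

    # 3) start I/O to the origin in the background
    run dd if=/dev/urandom of="/dev/$VG/$ORIGIN" bs=1M count=2000 &
    dd_pid=$!

    # 4) after an arbitrary delay, kill one device under the origin
    run sleep 5
    run sh -c "echo offline > /sys/block/$FAIL_DISK/device/state"

    # Poll instead of blocking: a dd stuck on endlessly retried bios
    # would otherwise hang this script as well.
    waited=0
    while kill -0 "$dd_pid" 2>/dev/null; do
        if [ "$waited" -ge "$WAIT_SECS" ]; then
            echo "dd still running after ${WAIT_SECS}s: possible hang"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$dd_pid"
    echo "dd exited with status $?"
}
```

To use it, source the file and call `reproduce`. With DRY_RUN=0 on an unfixed kernel the expectation from this report is the "possible hang" message; on a fixed kernel dd should eventually exit.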
Patch(es) available on kernel-2.6.32-609.el6

Marking verified based on the test case given in comment #10.

2.6.32-639.el6.x86_64
lvm2-2.02.143-7.el6    BUILT: Wed Apr 6 10:08:33 CDT 2016
lvm2-libs-2.02.143-7.el6    BUILT: Wed Apr 6 10:08:33 CDT 2016
lvm2-cluster-2.02.143-7.el6    BUILT: Wed Apr 6 10:08:33 CDT 2016
udev-147-2.72.el6    BUILT: Tue Mar 1 06:14:05 CST 2016
device-mapper-1.02.117-7.el6    BUILT: Wed Apr 6 10:08:33 CDT 2016
device-mapper-libs-1.02.117-7.el6    BUILT: Wed Apr 6 10:08:33 CDT 2016
device-mapper-event-1.02.117-7.el6    BUILT: Wed Apr 6 10:08:33 CDT 2016
device-mapper-event-libs-1.02.117-7.el6    BUILT: Wed Apr 6 10:08:33 CDT 2016
device-mapper-persistent-data-0.6.2-0.1.rc7.el6    BUILT: Tue Mar 22 08:58:09 CDT 2016
cmirror-2.02.143-7.el6    BUILT: Wed Apr 6 10:08:33 CDT 2016

[root@host-113 ~]# lvcreate -L 20G -i 2 -n stripe vg
Using default stripesize 64.00 KiB.
Logical volume "stripe" created.
[root@host-113 ~]# lvcreate -s vg/stripe -n snap -L 5G
Logical volume "snap" created.
[root@host-113 ~]# lvs -a -o +devices
LV     VG Attr       LSize  Pool Origin Data%  Devices
snap   vg swi-a-s---  5.00g      stripe 0.00   /dev/sda1(2560)
stripe vg owi-a-s--- 20.00g                    /dev/sda1(0),/dev/sdb1(0)

[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe bs=1M count=2000
# Takes a while, but this does eventually finish well after the device failure
2000+0 records in
2000+0 records out

[root@host-113 ~]# echo offline > /sys/block/sdb/device/state

# Additional writes now either work or fail with an I/O error depending on the size
[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe count=10
10+0 records in
10+0 records out
5120 bytes (5.1 kB) copied, 0.00153896 s, 3.3 MB/s
[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe count=1000
dd: writing to `/dev/vg/stripe': Input/output error
137+0 records in
136+0 records out
69632 bytes (70 kB) copied, 0.0250647 s, 2.8 MB/s

[root@host-113 ~]# lvs -a -o +devices
/dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
/dev/vg/stripe: read failed after 0 of 4096 at 0: Input/output error
/dev/vg/stripe: read failed after 0 of 4096 at 21474770944: Input/output error
/dev/vg/stripe: read failed after 0 of 4096 at 21474828288: Input/output error
/dev/vg/snap: read failed after 0 of 4096 at 0: Input/output error
/dev/vg/snap: read failed after 0 of 4096 at 21474770944: Input/output error
/dev/vg/snap: read failed after 0 of 4096 at 21474828288: Input/output error
/dev/vg/snap: read failed after 0 of 4096 at 4096: Input/output error
/dev/sdb1: read failed after 0 of 4096 at 26838958080: Input/output error
/dev/sdb1: read failed after 0 of 4096 at 26839048192: Input/output error
/dev/sdb1: read failed after 0 of 4096 at 4096: Input/output error
Couldn't find device with uuid q6iRyy-YT4M-kqdr-2qZW-oR4u-f19u-lAX6gQ.
Couldn't find device for segment belonging to vg/stripe while checking used and assumed devices.
LV     VG Attr       LSize  Pool Origin Data%  Meta% Move Log Cpy%Sync Convert Devices
snap   vg swi-I-s---  5.00g      stripe 100.00                                 /dev/sda1(2560)
stripe vg owi-aos-p- 20.00g                                                    /dev/sda1(0),unknown device(0)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0855.html
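The Attr strings in the lvs output of the verification (for example swi-I-s--- and owi-aos-p-) carry the result of the test: per the lv_attr description in lvs(8), the fifth character is the state, where I marks an invalid snapshot, and the ninth is the volume health, where p marks a partial LV with a missing device. A minimal sketch of reading those two positions follows. The function name and the output labels are hypothetical, and only the flags that appear in this report are decoded.

```shell
#!/bin/sh
# Hypothetical helper: decode the state (5th) and health (9th) characters
# of an lvs Attr string. Other flags are printed raw rather than interpreted.
lv_attr_flags() {
    attr=$1
    state=$(printf '%s' "$attr" | cut -c5)
    health=$(printf '%s' "$attr" | cut -c9)

    case $state in
        I) echo "snapshot-invalid" ;;
        a) echo "active" ;;
        *) echo "state=$state" ;;
    esac

    case $health in
        p) echo "partial" ;;
        -) echo "no-health-flag" ;;
        *) echo "health=$health" ;;
    esac
}
```

For instance, `lv_attr_flags 'owi-aos-p-'` reports the origin as active and partial, which matches the "unknown device" entry in the Devices column after sdb was taken offline.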