Red Hat Bugzilla – Bug 889368
LVM RAID: I/O can hang if entire stripe (mirror group) of RAID10 LV is killed while under snapshot
Last modified: 2016-05-10 17:47:36 EDT
This bug is a direct result of the way snapshots handle failed writes to the origin. Specifically, 'retry_origin_bios' does not allow the failures to propagate, causing I/O to hang indefinitely.
making this a 6.7 discussion.
I don't remember the specifics of the snapshot code, but from comment 2 it is related to 'retry_origin_bios' in dm-snap.c. You don't need RAID to reproduce this either; any device will do. In this case, I used stripe as the origin.

[root@bp-01 ~]# lvs vg
  LV     VG Attr       LSize   Pool Origin Data%  Meta% Move Log Cpy%Sync Convert
  snap   vg swi-I-s---  50.00g      stripe 100.00
  stripe vg owi-aos--- 100.00g

Steps to reproduce:
1) create LV
2) create a snapshot of it.
3) start I/O to the origin (dd)
4) kill a device in the origin
**) the 'dd' will never complete due to indefinite retry of bios. It should emit errors.

I get a lot of the following messages also:
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3
Oct 20 14:41:10 bp-01 kernel: Buffer I/O error on device dm-3, logical block 572824
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3
Oct 20 14:41:10 bp-01 kernel: Buffer I/O error on device dm-3, logical block 572825
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3

[root@bp-01 ~]# ls -l /dev/vg
total 0
lrwxrwxrwx. 1 root root 7 Oct 20 14:33 snap -> ../dm-6
lrwxrwxrwx. 1 root root 7 Oct 20 14:33 stripe -> ../dm-3
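The four reproduction steps above could be scripted roughly as follows. This is only a sketch under assumptions: the LV sizes and the sysfs "offline" write are borrowed from the verification transcript later in this bug, the names VG, FAIL_DEV, DRY_RUN, run, run_bg and repro are made up for illustration, and the 5-second delay is arbitrary. It defaults to a dry run that only prints the commands, because a real run is destructive and needs root plus a scratch volume group.

```shell
#!/bin/sh
# Hypothetical knobs (not from the original report): override via environment.
VG=${VG:-vg}              # scratch volume group
FAIL_DEV=${FAIL_DEV:-sdb} # disk backing one leg of the origin LV
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the commands

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Same as run, but start the command in the background.
run_bg() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $* &"
    else
        "$@" &
    fi
}

repro() {
    # 1) create LV (a two-way stripe, as in the verification transcript)
    run lvcreate -L 20G -i 2 -n stripe "$VG"
    # 2) create a snapshot of it
    run lvcreate -s "$VG/stripe" -n snap -L 5G
    # 3) start I/O to the origin
    run_bg dd if=/dev/urandom of="/dev/$VG/stripe" bs=1M count=2000
    # 4) kill a device in the origin while the writes are in flight
    run sleep 5
    run sh -c "echo offline > /sys/block/$FAIL_DEV/device/state"
    # On an affected kernel this wait is expected never to return.
    wait
}

repro
```

With DRY_RUN=0 the final wait is where the hang described above would show up: on an affected kernel the background dd should never exit.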
Upstream patch: https://www.redhat.com/archives/dm-devel/2016-January/msg00090.html
Created attachment 1113054
RHEL 6 patch

The patch, backported to RHEL 6
Patch(es) available on kernel-2.6.32-609.el6
Marking verified based on the test case given in comment #10.

2.6.32-639.el6.x86_64
lvm2-2.02.143-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
lvm2-libs-2.02.143-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
lvm2-cluster-2.02.143-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
udev-147-2.72.el6    BUILT: Tue Mar  1 06:14:05 CST 2016
device-mapper-1.02.117-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-libs-1.02.117-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-event-1.02.117-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-event-libs-1.02.117-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-persistent-data-0.6.2-0.1.rc7.el6    BUILT: Tue Mar 22 08:58:09 CDT 2016
cmirror-2.02.143-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016

[root@host-113 ~]# lvcreate -L 20G -i 2 -n stripe vg
  Using default stripesize 64.00 KiB.
  Logical volume "stripe" created.
[root@host-113 ~]# lvcreate -s vg/stripe -n snap -L 5G
  Logical volume "snap" created.
[root@host-113 ~]# lvs -a -o +devices
  LV     VG Attr       LSize  Pool Origin Data% Devices
  snap   vg swi-a-s---  5.00g      stripe 0.00  /dev/sda1(2560)
  stripe vg owi-a-s--- 20.00g                   /dev/sda1(0),/dev/sdb1(0)

[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe bs=1M count=2000
# Takes a while, but this does eventually finish well after the device failure
2000+0 records in
2000+0 records out

[root@host-113 ~]# echo offline > /sys/block/sdb/device/state

# Additional writes now either work or fail with an I/O error depending on the size
[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe count=10
10+0 records in
10+0 records out
5120 bytes (5.1 kB) copied, 0.00153896 s, 3.3 MB/s
[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe count=1000
dd: writing to `/dev/vg/stripe': Input/output error
137+0 records in
136+0 records out
69632 bytes (70 kB) copied, 0.0250647 s, 2.8 MB/s

[root@host-113 ~]# lvs -a -o +devices
  /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/stripe: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/stripe: read failed after 0 of 4096 at 21474770944: Input/output error
  /dev/vg/stripe: read failed after 0 of 4096 at 21474828288: Input/output error
  /dev/vg/snap: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/snap: read failed after 0 of 4096 at 21474770944: Input/output error
  /dev/vg/snap: read failed after 0 of 4096 at 21474828288: Input/output error
  /dev/vg/snap: read failed after 0 of 4096 at 4096: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 26838958080: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 26839048192: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 4096: Input/output error
  Couldn't find device with uuid q6iRyy-YT4M-kqdr-2qZW-oR4u-f19u-lAX6gQ.
  Couldn't find device for segment belonging to vg/stripe while checking used and assumed devices.
  LV     VG Attr       LSize  Pool Origin Data%  Meta% Move Log Cpy%Sync Convert Devices
  snap   vg swi-I-s---  5.00g      stripe 100.00                                /dev/sda1(2560)
  stripe vg owi-aos-p- 20.00g                                                   /dev/sda1(0),unknown device(0)
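The pass criterion used above is that a write after the device failure terminates, either cleanly or with an I/O error, instead of hanging. A small helper along these lines could automate that check. It is a sketch, not part of the original test case: the name classify_cmd and the deadline value are hypothetical. It polls with kill -0 rather than waiting on the child, on the assumption that a process blocked in uninterruptible I/O would ignore signals and so would defeat a kill-based timeout.

```shell
#!/bin/sh
# classify_cmd DEADLINE_SECONDS CMD [ARGS...]   (hypothetical helper)
# Prints "completed" if CMD exits 0, "failed" if it exits non-zero,
# and "hung" if it is still running once the deadline has passed.
classify_cmd() {
    deadline=$1
    shift
    # Discard the command's output so the result word is the only stdout.
    "$@" >/dev/null 2>&1 &
    pid=$!
    elapsed=0
    # Poll instead of blocking in wait, so a stuck child cannot stall us.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            echo "hung"
            return 2
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # The child has exited; collect its exit status.
    if wait "$pid"; then
        echo "completed"
        return 0
    fi
    echo "failed"
    return 1
}
```

For example, `classify_cmd 600 dd if=/dev/urandom of=/dev/vg/stripe count=1000` run after the device is taken offline should report "completed" or "failed" on a fixed kernel. A result of "hung" would suggest the behavior originally reported, although a deadline that is too short for the workload would produce the same word.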
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-0855.html