Bug 889368 - LVM RAID: I/O can hang if entire stripe (mirror group) of RAID10 LV is killed while under snapshot
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.4
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Mikuláš Patočka
QA Contact: Cluster QE
Depends On: 886658
Blocks: 1217621 1268411
 
Reported: 2012-12-20 18:48 EST by Jonathan Earl Brassow
Modified: 2016-05-10 17:47 EDT

Fixed In Version: kernel-2.6.32-609.el6
Doc Type: Bug Fix
Clone Of: 886658
Last Closed: 2016-05-10 17:47:36 EDT
Type: Bug


Attachments
RHEL 6 patch (4.47 KB, patch)
2016-01-08 19:11 EST, Mikuláš Patočka


External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2016:0855
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: kernel security, bug fix, and enhancement update
Last Updated: 2016-05-10 18:43:57 EDT

Comment 2 Jonathan Earl Brassow 2012-12-21 12:49:46 EST
This bug is a direct result of the way snapshots handle failed writes to the origin.  Specifically, 'retry_origin_bios' does not allow the failures to propagate, causing I/O to hang indefinitely.
Comment 6 Jonathan Earl Brassow 2014-08-26 23:52:09 EDT
Making this a 6.7 discussion.
Comment 10 Jonathan Earl Brassow 2015-10-20 15:52:15 EDT
I don't remember the specifics of the snapshot code, but from comment 2 it is related to 'retry_origin_bios' in dm-snap.c.

You don't need RAID to reproduce this either; any device will do.  In this case, I used stripe as the origin.
[root@bp-01 ~]# lvs vg
  LV     VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  snap   vg   swi-I-s---  50.00g      stripe 100.00
  stripe vg   owi-aos--- 100.00g

Steps to reproduce:
1) Create an LV.
2) Create a snapshot of it.
3) Start I/O to the origin (dd).
4) Kill a device in the origin.
**) The 'dd' never completes because the bios are retried indefinitely.  It should emit errors instead.
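
The hang in the last step is awkward to detect from a script, because waiting directly on a stuck 'dd' blocks the caller as well. A minimal sketch, assuming a POSIX shell with mktemp available: the hypothetical helper run_with_deadline (not part of LVM or of this report) runs a command in the background and polls for its completion, so the caller gets an answer even if the command never returns.

```shell
# run_with_deadline SECONDS CMD [ARGS...]
# Hypothetical helper: runs CMD in the background and polls once per second.
# Prints "exited with status N" if CMD ends within SECONDS, otherwise
# "still running after SECONDSs". It never waits on CMD directly, so a
# process stuck in the kernel cannot block the caller.
# CMD's own stdout and stderr are discarded.
run_with_deadline() {
    limit=$1
    shift
    status_file=$(mktemp) || return 2
    # The subshell records CMD's exit status in the file once CMD returns.
    ( rc=0; "$@" || rc=$?; echo "$rc" > "$status_file" ) >/dev/null 2>&1 &
    elapsed=0
    while :; do
        if [ -s "$status_file" ]; then
            echo "exited with status $(cat "$status_file")"
            rm -f "$status_file"
            return 0
        fi
        [ "$elapsed" -ge "$limit" ] && break
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # CMD is left running; its status file stays behind for later inspection.
    echo "still running after ${limit}s"
    return 1
}
```

For step 3 this could be invoked as `run_with_deadline 600 dd if=/dev/urandom of=/dev/vg/stripe bs=1M count=2000`. The 600-second deadline is an arbitrary assumption and should be generous, since a large write may legitimately take a long time.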

I also get a lot of the following messages:
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3
Oct 20 14:41:10 bp-01 kernel: Buffer I/O error on device dm-3, logical block 572824
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3
Oct 20 14:41:10 bp-01 kernel: Buffer I/O error on device dm-3, logical block 572825
Oct 20 14:41:10 bp-01 kernel: lost page write due to I/O error on dm-3

[root@bp-01 ~]# ls -l /dev/vg
total 0
lrwxrwxrwx. 1 root root 7 Oct 20 14:33 snap -> ../dm-6
lrwxrwxrwx. 1 root root 7 Oct 20 14:33 stripe -> ../dm-3
Comment 13 Mikuláš Patočka 2016-01-08 19:09:48 EST
Upstream patch: https://www.redhat.com/archives/dm-devel/2016-January/msg00090.html
Comment 14 Mikuláš Patočka 2016-01-08 19:11 EST
Created attachment 1113054
RHEL 6 patch

The patch, backported to RHEL 6
Comment 16 Aristeu Rozanski 2016-01-28 17:16:32 EST
Patch(es) available on kernel-2.6.32-609.el6
Comment 20 Corey Marthaler 2016-04-13 13:17:58 EDT
Marking verified based on the test case given in comment #10.

2.6.32-639.el6.x86_64
lvm2-2.02.143-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
lvm2-libs-2.02.143-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
lvm2-cluster-2.02.143-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
udev-147-2.72.el6    BUILT: Tue Mar  1 06:14:05 CST 2016
device-mapper-1.02.117-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-libs-1.02.117-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-event-1.02.117-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-event-libs-1.02.117-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016
device-mapper-persistent-data-0.6.2-0.1.rc7.el6    BUILT: Tue Mar 22 08:58:09 CDT 2016
cmirror-2.02.143-7.el6    BUILT: Wed Apr  6 10:08:33 CDT 2016



[root@host-113 ~]# lvcreate -L 20G -i 2 -n stripe vg
  Using default stripesize 64.00 KiB.
  Logical volume "stripe" created.
[root@host-113 ~]# lvcreate -s vg/stripe -n snap -L 5G 
  Logical volume "snap" created.
[root@host-113 ~]# lvs -a -o +devices
  LV      VG    Attr       LSize   Pool Origin Data%  Devices                  
  snap    vg    swi-a-s---   5.00g      stripe 0.00   /dev/sda1(2560)          
  stripe  vg    owi-a-s---  20.00g                    /dev/sda1(0),/dev/sdb1(0)

[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe  bs=1M count=2000

# Takes a while, but this does eventually finish, well after the device failure
2000+0 records in
2000+0 records out


[root@host-113 ~]# echo offline > /sys/block/sdb/device/state


# Additional writes now either work or fail with an I/O error depending on the size
[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe count=10
10+0 records in
10+0 records out
5120 bytes (5.1 kB) copied, 0.00153896 s, 3.3 MB/s

[root@host-113 ~]# dd if=/dev/urandom of=/dev/vg/stripe count=1000
dd: writing to `/dev/vg/stripe': Input/output error
137+0 records in
136+0 records out
69632 bytes (70 kB) copied, 0.0250647 s, 2.8 MB/s
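
The size dependence is consistent with the striping layout shown earlier in this comment: with two stripes and the default 64.00 KiB stripe size, a 5120-byte write stays within the first chunk on /dev/sda1, while a 512000-byte write must cross into chunks on the offline /dev/sdb1. This reading is an inference, not something the report states, and buffered writes report errors asynchronously, so the observed failure offset (69632 bytes) need not fall exactly on a chunk boundary. A small sketch of the arithmetic, assuming plain round-robin striping; stripe_index is a hypothetical helper, not an LVM command:

```shell
# stripe_index OFFSET_BYTES STRIPE_SIZE_BYTES NUM_STRIPES
# Hypothetical helper: prints the 0-based index of the stripe device that
# holds byte OFFSET_BYTES, assuming plain round-robin striping.
stripe_index() {
    echo $(( ($1 / $2) % $3 ))
}

stripe_index 5119 65536 2     # last byte of the 10-sector write, prints 0
stripe_index 65536 65536 2    # first byte of the second 64 KiB chunk, prints 1
```

Index 0 corresponds to /dev/sda1 and index 1 to /dev/sdb1, following the device order in the 'lvs -a -o +devices' output.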


[root@host-113 ~]# lvs -a -o +devices
  /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/stripe: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/stripe: read failed after 0 of 4096 at 21474770944: Input/output error
  /dev/vg/stripe: read failed after 0 of 4096 at 21474828288: Input/output error
  /dev/vg/snap: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/snap: read failed after 0 of 4096 at 21474770944: Input/output error
  /dev/vg/snap: read failed after 0 of 4096 at 21474828288: Input/output error
  /dev/vg/snap: read failed after 0 of 4096 at 4096: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 26838958080: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 26839048192: Input/output error
  /dev/sdb1: read failed after 0 of 4096 at 4096: Input/output error
  Couldn't find device with uuid q6iRyy-YT4M-kqdr-2qZW-oR4u-f19u-lAX6gQ.
  Couldn't find device for segment belonging to vg/stripe while checking used and assumed devices.
  LV      VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                       
  snap    vg         swi-I-s---   5.00g      stripe 100.00                                  /dev/sda1(2560)               
  stripe  vg         owi-aos-p-  20.00g                                                     /dev/sda1(0),unknown device(0)
Comment 22 errata-xmlrpc 2016-05-10 17:47:36 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-0855.html
