Bug 1155196
| Summary: | Snapshot issue causes original LV to become suspended | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Joe Grasse <joe.grasse> | ||||
| Component: | lvm2 | Assignee: | Zdenek Kabelac <zkabelac> | ||||
| lvm2 sub component: | Snapshots | QA Contact: | cluster-qe <cluster-qe> | ||||
| Status: | CLOSED ERRATA | Docs Contact: | |||||
| Severity: | high | ||||||
| Priority: | unspecified | CC: | agk, cmarthal, heinzm, jbrassow, joe.grasse, jszczype, lmiksik, mcsontos, msnitzer, prajnoha, prockai, rbednar, zkabelac | ||||
| Version: | 7.0 | ||||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | lvm2-2.02.176-1.el7 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-04-10 15:16:02 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1505394 | ||||||
| Bug Blocks: | 1469559 | ||||||
| Attachments: |
|
||||||
|
Description
Joe Grasse
2014-10-21 14:31:17 UTC
Is it possible to get a verbose trace (add -vvvv to the lvcreate command) for step #7?

Created attachment 951766
lvcreate -vvvv output
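
The verbose trace requested above could be captured with a small wrapper along these lines. This is only a sketch: the helper name `capture_trace` and the log file name are hypothetical, and the snapshot command in the usage note is borrowed from the reproduction later in this report, not from step #7 itself.

```shell
#!/bin/sh
# Sketch only: capture_trace is a hypothetical helper, not part of lvm2.
# It runs a command with -vvvv inserted right after the command name and
# stores both stdout and stderr in a log file suitable for attaching.
# The exit status is that of the traced command.
capture_trace() {
    log=$1
    cmd=$2
    shift 2
    "$cmd" -vvvv "$@" >"$log" 2>&1
}
```

For example, `capture_trace lvcreate-vvvv.log lvcreate -s -L 20M -n snap2 vg/lv` would leave the full trace in `lvcreate-vvvv.log`. Redirecting both streams avoids having to know which one the debug output goes to.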
Just curious if there are any updates on this bug. Getting ready for OnBoarding first week of January.

Please reproduce with the latest available lvm2 for RHEL7.3. Also please attach the ioctl error reported in 'dmesg' that relates to this user-space error message:

"device-mapper: suspend ioctl on failed: Input/output error"

There is likely some reason explaining why the suspend failed. There were also a number of snapshot fixes in both kernel and user space between RHEL7.1 and RHEL7.3.

As a workaround for your existing system state, try using 'ext4', since XFS may have trouble finishing an unmount on an 'error' device (improved in more recent RHEL kernels). Remove any invalid snapshot before creating a new one.

I will have to see if I have resources available to recreate this issue. The hardware I ran this on has long since changed. This was a long time ago.

The suspend didn't fail. The suspend of the original LV is what this bug report is about. If memory serves, this was back in 2014, and it happened with both ext4 and xfs. Also, yes, removing the invalid snapshot would prevent this problem, but this bug report is to highlight a bug that happens when you don't.

yeah, something is still not right:

[root@bp-01 snap1]# lvm version
  LVM version:     2.02.172(2)-git (2017-05-03)
  Library version: 1.02.141-git (2017-05-03)
  Driver version:  4.35.0
  Configuration:   ./configure --enable-lvm1_fallback --enable-fsadm --with-pool=internal --with-user= --with-group= --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --enable-pkgconfig --enable-units-compat --with-optimisation=-g --enable-cmdlib --enable-dmeventd --libdir=/usr/lib64 --with-usrlibdir=/usr/lib64 --with-pool=internal --enable-applib --enable-python2-bindings --enable-udev_sync --with-thin=internal --enable-lvmetad --with-cache=internal
[root@bp-01 ~]# lvcreate -L 1G -n lv vg
  Logical volume "lv" created.
[root@bp-01 ~]# mkfs.ext4 /dev/vg/lv
<snip/>
[root@bp-01 ~]# mkdir /mnt/fs
[root@bp-01 ~]# mount /dev/vg/lv /mnt/fs
[root@bp-01 ~]# mkdir /mnt/snap1
[root@bp-01 ~]# lvcreate -s -L 20M -n snap1 vg/lv
  Using default stripesize 64.00 KiB.
  Logical volume "snap1" created.
[root@bp-01 ~]# mount /dev/vg/snap1 /mnt/snap1/
[root@bp-01 ~]# cd /mnt/snap1/
[root@bp-01 snap1]# dd if=/dev/zero of=/mnt/fs/file bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.0579186 s, 724 MB/s
[root@bp-01 snap1]# lvdisplay /dev/vg/snap1
  --- Logical volume ---
  LV Path                /dev/vg/snap1
  LV Name                snap1
  VG Name                vg
  LV UUID                vMNcqN-DNt4-gQj1-UnzV-W2kp-iIPM-dXjAke
  LV Write Access        read/write
  LV Creation host, time bp-01.lab.msp.redhat.com, 2017-07-27 15:48:20 -0500
  LV snapshot status     active destination for lv
  LV Status              available
  # open                 1
  LV Size                1.00 GiB
  Current LE             256
  COW-table size         20.00 MiB
  COW-table LE           5
  Allocated to snapshot  0.23%
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:6
[root@bp-01 snap1]# lvs -o name,segtype,attr,raidsyncaction,syncpercent,devices -a vg
  /dev/vg/snap1: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073676288: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073733632: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 4096: Input/output error
  LV    Type   Attr       SyncAction Cpy%Sync Devices
  lv    linear owi-aos---                     /dev/sdb1(0)
  snap1 linear swi-Ios---                     /dev/sdb1(256)
[root@bp-01 snap1]# lvdisplay /dev/vg/snap1
  /dev/vg/snap1: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073676288: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073733632: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 4096: Input/output error
  --- Logical volume ---
  LV Path                /dev/vg/snap1
  LV Name                snap1
  VG Name                vg
  LV UUID                vMNcqN-DNt4-gQj1-UnzV-W2kp-iIPM-dXjAke
  LV Write Access        read/write
  LV Creation host, time bp-01.lab.msp.redhat.com, 2017-07-27 15:48:20 -0500
  LV snapshot status     INACTIVE destination for lv
  LV Status              available
  # open                 1
  LV Size                1.00 GiB
  Current LE             256
  COW-table size         20.00 MiB
  COW-table LE           5
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:6
[root@bp-01 snap1]# lvcreate -s -L 20M -n snap2 vg/lv
  Using default stripesize 64.00 KiB.
  /dev/vg/snap1: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073676288: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073733632: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 4096: Input/output error
  device-mapper: suspend ioctl on (253:6) failed: Input/output error
  Unable to suspend vg-snap1 (253:6)
  Failed to lock logical volume vg/lv.
  Aborting. Manual intervention required.

The original issue (a device left suspended on the error path) has been fixed with commit:

a84d0d0c7b843168516a940a8bc4debafc5f980c
https://www.redhat.com/archives/lvm-devel/2014-September/msg00231.html

After a longer discussion, it seems we should close this bug as fixed, since the reported problem has already been fixed in lvm2 2.02.112. The actual problem in comment 9 is worth a separate bugzilla, since it is not yet clear which kind of fix we want there.

Sorry, I am missing how the problem in comment 9 is different from the original problem.

(In reply to Joe Grasse from comment #13)
> Sorry, I am missing how problem in comment 9 is different than the original
> problem.

lvm2 from version 2.02.112 'correctly' handles the error path: it restores/resumes devices on the error path and informs the user that there is a problem it could not resolve (yes, lvm2 is not almighty and some cases need a human brain...).

So ATM it's the user's responsibility to remove the invalid snapshot and continue.
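
The manual step described above (finding invalid snapshots so they can be removed) might be scripted roughly as follows. This is a sketch under stated assumptions: `invalid_snapshots` is a hypothetical helper, and it relies on the fifth `lv_attr` character being `I` (invalid snapshot) or `S` (invalid suspended snapshot), as documented for `lvs`. It only lists candidates and removes nothing.

```shell
#!/bin/sh
# Sketch only: invalid_snapshots is a hypothetical helper, not part of lvm2.
# It reads "vg_name lv_name lv_attr" rows on stdin and prints vg/lv for each
# snapshot (1st attr character "s" or "S") whose state character (5th in
# lv_attr) is "I" (invalid) or "S" (invalid and suspended).
invalid_snapshots() {
    while read -r vg lv attr; do
        case $attr in
            [sS]???[IS]*) printf '%s/%s\n' "$vg" "$lv" ;;
        esac
    done
}
```

A possible invocation is `lvs --noheadings -o vg_name,lv_name,lv_attr | invalid_snapshots`. With the attributes shown in comment 9 (`owi-aos---` for the origin, `swi-Ios---` for snap1) it would print only `vg/snap1`, which could then be reviewed and removed with `lvremove`.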
The new BZ could be about possibly better automation, but here we are not yet sure whether this is better fixed in the kernel or in user space; in practice we need to evaluate both directions in detail.

The original BZ was about leaving devices in a suspended state, which was a clear internal error bug and has already been fixed.

New bug 1505394 has been opened to track progress on this snapshot enhancement.

(In reply to Marian Csontos from comment #16)
> (In reply to Zdenek Kabelac from comment #14)
> > Lvm2 fo(In reply to Joe Grasse from comment #13)
> > > Sorry, I am missing how problem in comment 9 is different than the original
> > > problem.
> >
> > The new BZ could be about possibly better automation - but here we are not
> > yet sure if this is better fixed in kernel or in user-space - in practice we
> > need to evaluate in details both directs.
>
> But if even *we* do not know how to fix it, how can user fix it?

We do know 2 variants of possible steps, but neither is trivial: one needs fixes on the kernel side, the other would provide a very sophisticated user-space solution. Both need a design doc first.

> > Original BZ was about leaving devices in suspended state - which was clear
> > internal error bug and got already fixed.
>
> IMHO from users point of view it is still broken.

Yes, that's why the new BZ was born to solve the associated issue.

(In reply to Zdenek Kabelac from comment #14)
> lvm2 from version 2.02.112 'correctly' handles error path and
> restores/resumes devices on error path and informs user there is a
> problem it could not have resolved (yes lvm2 is not almighty and some cases
> needs a human brain...)
> So ATM it's users responsibility to remove invalid snapshot and continue.

(In reply to Zdenek Kabelac from comment #17)
> We do know 2 variants of possible steps - but both are not so trivial -
> one needs fixes on kernel side - other would provide very sophisticated
> user-space solution.
>
> Both needs design doc first.
OK, so this is only partially fixed and depends on Bug 1505394. We do not yet know how to fix it, so I suggest keeping this open.

(In reply to Marian Csontos from comment #18)
> (In reply to Zdenek Kabelac from comment #17)
> > We do know 2 variants of possible steps - but both are not so trivial -
> > one needs fixes on kernel side - other would provide very sophisticated
> > user-space solution.
> >
> > Both needs design doc first.
>
> OK, so this is only partially fixed, depends on Bug 1505394, we do not know
> yet how to fix it, so I suggest to keep this open.

Nope - the original MAIN problem was fixed (leaking a device in suspended state). So the primary expectation in the BZ description is met: no suspended devices are left.

However, there is room for an RFE to enhance the error behavior for an old snapshot. This will require quite some development effort and is unrelated to the original leak of suspended devices, so for the RFE we have a new BZ for upstream (since the fix will certainly not land as a bugfix), and it will require a major testing effort as well.

Marking verified with latest rpms. I was not able to hit the issue from the initial comment, that is, an origin being suspended after filling its snapshot to 100% and creating a second one. Input/output errors as shown in Comment 9 are tracked in separate bug 1505394.

# mkfs.ext4 /dev/vg/origin
...
# mount /dev/vg/origin /mnt/origin
# lvs
  LV     VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   rhel_virt-362 -wi-ao----  <6.20g
  swap   rhel_virt-362 -wi-ao---- 820.00m
  origin vg            -wi-ao----   1.00g
# mkdir /mnt/snap
# lvcreate -L2M -s -n snap /dev/vg/origin
  Using default stripesize 64.00 KiB.
  Rounding up size to full physical extent 4.00 MiB
  Logical volume "snap" created.
# mount /dev/vg/snap /mnt/snap
# dd if=/dev/urandom of=/mnt/origin/file bs=4 count=10
10+0 records in
10+0 records out
40 bytes (40 B) copied, 0.000239129 s, 167 kB/s
# lvs vg/snap
  LV   VG Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  snap vg swi-aos--- 4.00m      origin 1.07
# lvs
  LV     VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   rhel_virt-362 -wi-ao----  <6.20g
  swap   rhel_virt-362 -wi-ao---- 820.00m
  origin vg            owi-aos---   1.00g
  snap   vg            swi-aos---   4.00m      origin 1.07
# dd if=/dev/urandom of=/mnt/origin/file bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.277909 s, 151 MB/s

## Snapshot is filled and deactivated
# lvs vg/snap
  LV   VG Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  snap vg swi-I-s--- 4.00m      origin 100.00

## Creating another snapshot of same origin
# lvcreate -L2M -s -n snap2 /dev/vg/origin
  Using default stripesize 64.00 KiB.
  Rounding up size to full physical extent 4.00 MiB
  Logical volume "snap2" created.

## Origin remains active
# touch /mnt/origin/file2
# lvs -a
  LV     VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   rhel_virt-362 -wi-ao----  <6.20g
  swap   rhel_virt-362 -wi-ao---- 820.00m
  origin vg            owi-aos---   1.00g
  snap   vg            swi-I-s---   4.00m      origin 100.00
  snap2  vg            swi-a-s---   4.00m      origin 0.29

3.10.0-755.el7.x86_64
lvm2-2.02.176-2.el7    BUILT: Fri Nov 3 13:46:53 CET 2017
lvm2-libs-2.02.176-2.el7    BUILT: Fri Nov 3 13:46:53 CET 2017
lvm2-cluster-2.02.176-2.el7    BUILT: Fri Nov 3 13:46:53 CET 2017
device-mapper-1.02.145-2.el7    BUILT: Fri Nov 3 13:46:53 CET 2017
device-mapper-libs-1.02.145-2.el7    BUILT: Fri Nov 3 13:46:53 CET 2017
device-mapper-event-1.02.145-2.el7    BUILT: Fri Nov 3 13:46:53 CET 2017
device-mapper-event-libs-1.02.145-2.el7    BUILT: Fri Nov 3 13:46:53 CET 2017
device-mapper-persistent-data-0.7.3-2.el7    BUILT: Tue Oct 10 11:00:07 CEST 2017
cmirror-2.02.176-2.el7    BUILT: Fri Nov 3 13:46:53 CET 2017

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0853 |
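
The check used in the verification above (the origin stays active and the full snapshot is flagged invalid) can be read directly from the Attr column. The following is a minimal sketch: `lv_state` is a hypothetical helper, and it assumes the fifth `lv_attr` character encodes the state as documented for `lvs` (`a` active, `s` suspended, `I` invalid snapshot, `S` invalid suspended snapshot).

```shell
#!/bin/sh
# Sketch only: lv_state is a hypothetical helper, not part of lvm2.
# It maps the 5th character of an lv_attr string to a readable state name.
lv_state() {
    case $1 in
        ????a*) echo active ;;
        ????s*) echo suspended ;;
        ????I*) echo invalid-snapshot ;;
        ????S*) echo invalid-suspended-snapshot ;;
        *)      echo other ;;
    esac
}
```

One way to use it is `lv_state "$(lvs --noheadings -o lv_attr vg/origin | tr -d ' ')"`. For the attribute strings in the verification transcript, `owi-aos---` maps to `active` and `swi-I-s---` maps to `invalid-snapshot`; a result of `suspended` for the origin would indicate the problem this report was originally about.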