Bug 1155196 - Snapshot issue causes original LV to become suspended
Summary: Snapshot issue causes original LV to become suspended
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Zdenek Kabelac
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1505394
Blocks: 1469559
 
Reported: 2014-10-21 14:31 UTC by Joe Grasse
Modified: 2021-09-03 12:54 UTC (History)
CC: 13 users

Fixed In Version: lvm2-2.02.176-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 15:16:02 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
lvcreate -vvvv output (72.70 KB, text/plain)
2014-10-29 12:41 UTC, Joe Grasse


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1505394 0 medium NEW [RFE] Better support for suspend of out-of-space old snapshots 2024-04-27 05:52:45 UTC
Red Hat Product Errata RHEA-2018:0853 0 None None None 2018-04-10 15:17:48 UTC

Internal Links: 1505394

Description Joe Grasse 2014-10-21 14:31:17 UTC
Description of problem:
The original LV can become suspended after a first snapshot fills up and is deactivated and a second snapshot is then created.

After walking through the steps to reproduce, I understand why /dev/vg_0/dbbackup becomes deactivated (it fills up). What I don't understand, and would never want in this case, is /dev/vg_0/lv_mysql becoming suspended.

How reproducible:
Always

Steps to Reproduce:
# Starting relevant partitions
Filesystem			Mounted on
/dev/mapper/vg_0-lv_root	/
/dev/mapper/vg_0-lv_mysql	/usr/mysql
/dev/mapper/vg_0-lv_tmp		/tmp

1. # Create snapshot
lvcreate -L2M -s -n dbbackup /dev/vg_0/lv_mysql

2. # Create dir to mount snapshot
mkdir /tmp/mounted_snapshot

3. # Mount snapshot
mount -o nouuid /dev/vg_0/dbbackup /tmp/mounted_snapshot

4. # cd to mounted snapshot dir
cd /tmp/mounted_snapshot

5. # Fill snapshot causing it to be disabled
dd if=/dev/urandom of=/usr/mysql/random_file bs=32M count=4

6. # Confirm snapshot disabled
lvdisplay /dev/vg_0/dbbackup

7. # Create second snapshot
lvcreate -L2M -s -n dbbackup2 /dev/vg_0/lv_mysql

8. # Original LV is now suspended
# This command will halt until /dev/vg_0/lv_mysql is unsuspended
touch /usr/mysql/new_file


Expected results:
I would expect the original LV (/dev/vg_0/lv_mysql in this case) not to become suspended.

I would also expect step #7 not to create a snapshot LV.
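
For convenience, the numbered steps above can be collected into one script. This is a sketch that assumes the same vg_0/lv_mysql layout as this report; the `lv_path` helper is hypothetical, and the sequence must only be run as root on a disposable test system, since the final step hangs while the origin is suspended.

```shell
#!/bin/sh
# Sketch of the reproduction steps above. Assumes the vg_0/lv_mysql layout
# from this report; run 'reproduce' only as root on a scratch machine.

# Hypothetical helper: build the device path for an LV in a VG.
lv_path() {
    printf '/dev/%s/%s\n' "$1" "$2"
}

reproduce() {
    origin=$(lv_path vg_0 lv_mysql)
    lvcreate -L2M -s -n dbbackup "$origin"                            # step 1
    mkdir -p /tmp/mounted_snapshot                                    # step 2
    mount -o nouuid "$(lv_path vg_0 dbbackup)" /tmp/mounted_snapshot  # step 3
    cd /tmp/mounted_snapshot || return 1                              # step 4
    dd if=/dev/urandom of=/usr/mysql/random_file bs=32M count=4       # step 5: fill snapshot
    lvdisplay "$(lv_path vg_0 dbbackup)"                              # step 6: snapshot now invalid
    lvcreate -L2M -s -n dbbackup2 "$origin"                           # step 7: second snapshot
    touch /usr/mysql/new_file                                         # step 8: hangs if origin suspended
}
```

On an affected system, calling `reproduce` is expected to stall at the final `touch` while the origin is suspended.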

Comment 2 Jonathan Earl Brassow 2014-10-28 21:29:08 UTC
Is it possible to get a verbose trace (add -vvvv to the lvcreate command) for step #7?

Comment 3 Joe Grasse 2014-10-29 12:41:52 UTC
Created attachment 951766 [details]
lvcreate -vvvv output

Comment 5 Joe Grasse 2015-08-14 12:38:21 UTC
Just curious if there are any updates on this bug.

Comment 6 George Beshers 2016-12-22 16:15:31 UTC
Getting ready for OnBoarding first week of January.

Comment 7 Zdenek Kabelac 2016-12-22 19:36:32 UTC
Please reproduce with the latest available lvm2 for RHEL 7.3.

Also, please attach the ioctl error reported in 'dmesg' that corresponds to this
user-space error message:

"device-mapper: suspend ioctl on  failed: Input/output error"

There is likely going to be some reason explaining why suspend failed.

There were also a number of fixes for snapshots, both in the kernel
and in user space, between RHEL 7.1 and RHEL 7.3.


-

As a workaround for your existing system state, try using 'ext4',
since XFS may have trouble finishing the unmount of an 'error'
device (improved in more recent RHEL kernels).

Remove any invalid snapshot before creating a new one.
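
The suggested workaround can be scripted. A minimal sketch, assuming the lv_attr format printed by `lvs -o lv_attr`, whose fifth character is the LV state ('I' = invalid snapshot, 'S' = invalid suspended snapshot); the function name and the vg_0/dbbackup names are illustrative:

```shell
#!/bin/sh
# Return success if an lv_attr string (from 'lvs --noheadings -o lv_attr')
# describes an invalid snapshot; the 5th character is the state field.
is_invalid_snapshot() {
    case $(printf '%s' "$1" | cut -c5) in
        I|S) return 0 ;;   # (I)nvalid snapshot / invalid (S)uspended snapshot
        *)   return 1 ;;
    esac
}

# Example use before creating a new snapshot (requires root; names assumed):
#   attr=$(lvs --noheadings -o lv_attr vg_0/dbbackup | tr -d ' ')
#   is_invalid_snapshot "$attr" && lvremove -f vg_0/dbbackup
```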

Comment 8 Joe Grasse 2017-01-05 15:31:55 UTC
I will have to see if I have resources available to recreate this issue. The hardware I ran this on has long since changed. This was a long time ago.

The suspend didn't fail. The suspend of the original lv is what this bug report is about.

If memory serves (this was back in 2014), this happened with both ext4 and XFS. Also, yes, removing the invalid snapshot would prevent this problem, but this bug report is meant to highlight the bug that happens when you don't.

Comment 9 Jonathan Earl Brassow 2017-07-27 15:54:58 UTC
yeah, something is still not right:


[root@bp-01 snap1]# lvm version
  LVM version:     2.02.172(2)-git (2017-05-03)
  Library version: 1.02.141-git (2017-05-03)
  Driver version:  4.35.0
  Configuration:   ./configure --enable-lvm1_fallback --enable-fsadm --with-pool=internal --with-user= --with-group= --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --enable-pkgconfig --enable-units-compat --with-optimisation=-g --enable-cmdlib --enable-dmeventd --libdir=/usr/lib64 --with-usrlibdir=/usr/lib64 --with-pool=internal --enable-applib --enable-python2-bindings --enable-udev_sync --with-thin=internal --enable-lvmetad --with-cache=internal


[root@bp-01 ~]# lvcreate -L 1G -n lv vg
  Logical volume "lv" created.
[root@bp-01 ~]# mkfs.ext4 /dev/vg/lv
<snip/>
[root@bp-01 ~]# mkdir /mnt/fs
[root@bp-01 ~]# mount /dev/vg/lv /mnt/fs
[root@bp-01 ~]# mkdir /mnt/snap1
[root@bp-01 ~]# lvcreate -s -L 20M -n snap1 vg/lv
  Using default stripesize 64.00 KiB.
  Logical volume "snap1" created.
[root@bp-01 ~]# mount /dev/vg/snap1 /mnt/snap1/
[root@bp-01 ~]# cd /mnt/snap1/
[root@bp-01 snap1]# dd if=/dev/zero of=/mnt/fs/file bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.0579186 s, 724 MB/s
[root@bp-01 snap1]# lvdisplay /dev/vg/snap1
  --- Logical volume ---
  LV Path                /dev/vg/snap1
  LV Name                snap1
  VG Name                vg
  LV UUID                vMNcqN-DNt4-gQj1-UnzV-W2kp-iIPM-dXjAke
  LV Write Access        read/write
  LV Creation host, time bp-01.lab.msp.redhat.com, 2017-07-27 15:48:20 -0500
  LV snapshot status     active destination for lv
  LV Status              available
  # open                 1
  LV Size                1.00 GiB
  Current LE             256
  COW-table size         20.00 MiB
  COW-table LE           5
  Allocated to snapshot  0.23%
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:6

[root@bp-01 snap1]# lvs -o name,segtype,attr,raidsyncaction,syncpercent,devices -a vg
  /dev/vg/snap1: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073676288: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073733632: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 4096: Input/output error
  LV    Type   Attr       SyncAction Cpy%Sync Devices
  lv    linear owi-aos---                     /dev/sdb1(0)
  snap1 linear swi-Ios---                     /dev/sdb1(256)
[root@bp-01 snap1]# lvdisplay /dev/vg/snap1
  /dev/vg/snap1: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073676288: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073733632: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 4096: Input/output error
  --- Logical volume ---
  LV Path                /dev/vg/snap1
  LV Name                snap1
  VG Name                vg
  LV UUID                vMNcqN-DNt4-gQj1-UnzV-W2kp-iIPM-dXjAke
  LV Write Access        read/write
  LV Creation host, time bp-01.lab.msp.redhat.com, 2017-07-27 15:48:20 -0500
  LV snapshot status     INACTIVE destination for lv
  LV Status              available
  # open                 1
  LV Size                1.00 GiB
  Current LE             256
  COW-table size         20.00 MiB
  COW-table LE           5
  Snapshot chunk size    4.00 KiB
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:6

[root@bp-01 snap1]# lvcreate -s -L 20M -n snap2 vg/lv
  Using default stripesize 64.00 KiB.
  /dev/vg/snap1: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073676288: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 1073733632: Input/output error
  /dev/vg/snap1: read failed after 0 of 4096 at 4096: Input/output error
  device-mapper: suspend ioctl on  (253:6) failed: Input/output error
  Unable to suspend vg-snap1 (253:6)
  Failed to lock logical volume vg/lv.
  Aborting. Manual intervention required.

Comment 11 Zdenek Kabelac 2017-10-23 12:26:14 UTC
The original issue (left suspended device on error path) has been fixed with commit: a84d0d0c7b843168516a940a8bc4debafc5f980c

https://www.redhat.com/archives/lvm-devel/2014-September/msg00231.html

Comment 12 Zdenek Kabelac 2017-10-23 12:51:03 UTC
After a longer discussion, it seems we should close this bug as fixed, since the reported problem was already fixed in lvm2 2.02.112.


The actual problem in comment 9 is worth a separate bugzilla; it is not yet clear which kind of fix we want there.

Comment 13 Joe Grasse 2017-10-23 13:04:08 UTC
Sorry, I am missing how problem in comment 9 is different than the original problem.

Comment 14 Zdenek Kabelac 2017-10-23 13:09:31 UTC
(In reply to Joe Grasse from comment #13)
> Sorry, I am missing how problem in comment 9 is different than the original
> problem.

lvm2 from version 2.02.112 'correctly' handles the error path: it restores/resumes devices and informs the user there is a problem it could not resolve (yes, lvm2 is not almighty, and some cases need a human brain...).
So at the moment it is the user's responsibility to remove the invalid snapshot and continue.

The new BZ could be about possibly better automation - but we are not yet sure whether this is better fixed in the kernel or in user space; in practice we need to evaluate both directions in detail.


The original BZ was about leaving devices in a suspended state, which was a clear internal error bug and has already been fixed.
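
The fixed symptom can be checked for mechanically by scanning device-mapper state. A sketch: `dmsetup info` prints a `State:` line reading ACTIVE or SUSPENDED, and the helper names here are illustrative:

```shell
#!/bin/sh
# Pure text parsing: pull the State field out of 'dmsetup info' output.
dm_state() {
    awk '/^State:/ {print $2; exit}'
}

# List dm devices currently left SUSPENDED (running dmsetup requires root).
suspended_devs() {
    dmsetup ls 2>/dev/null | awk '{print $1}' | while read -r dev; do
        state=$(dmsetup info "$dev" 2>/dev/null | dm_state)
        [ "$state" = "SUSPENDED" ] && echo "$dev"
    done
}
```

On a fixed lvm2 (>= 2.02.112), `suspended_devs` should print nothing after the failed lvcreate; on affected versions it would be expected to show the leaked device.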

Comment 15 Zdenek Kabelac 2017-10-23 13:54:17 UTC
New bug  1505394  has been opened to track progress for fixing this snapshot enhancement.

Comment 17 Zdenek Kabelac 2017-10-25 08:17:55 UTC
(In reply to Marian Csontos from comment #16)
> (In reply to Zdenek Kabelac from comment #14)
> > Lvm2  fo(In reply to Joe Grasse from comment #13)
> > > Sorry, I am missing how problem in comment 9 is different than the original
> > > problem.
> > 
> > The new BZ could be about possibly better automation - but here we are not
> > yet sure if this is better fixed in kernel or in user-space - in practice we
> > need to evaluate in details both directs.
> 
> But if even *we* do not know how to fix it, how can user fix it?

We do know of two variants of possible steps, but neither is trivial:
one needs fixes on the kernel side; the other would provide a very
sophisticated user-space solution.

Both need a design doc first.


> 
> > Original BZ was about leaving devices in suspended state - which was clear
> > internal error bug and got already fixed.
> 
> IMHO from users point of view it is still broken.


Yes - that's why the new BZ was created to solve the associated issue.

Comment 18 Marian Csontos 2017-10-25 08:23:47 UTC
(In reply to Zdenek Kabelac from comment #14)
> lvm2 from version 2.02.112  'correctly' handles error path and
> restores/resumes devices on error path  and  informs  user there is a
> problem it could not have resolved (yes  lvm2 is not almighty and some cases
> needs a human brain...)
> So ATM it's users responsibility to remove invalid snapshot and continue.

(In reply to Zdenek Kabelac from comment #17)
> We do know 2 variants of possible steps - but both are not so trivial -
> one needs fixes on kernel side - other would provide very sophisticated
> user-space solution.
> 
> Both needs design doc first.

OK, so this is only partially fixed and depends on Bug 1505394; we do not know yet how to fix it, so I suggest keeping this open.

Comment 19 Zdenek Kabelac 2017-10-25 10:39:22 UTC
(In reply to Marian Csontos from comment #18)

> (In reply to Zdenek Kabelac from comment #17)
> > We do know 2 variants of possible steps - but both are not so trivial -
> > one needs fixes on kernel side - other would provide very sophisticated
> > user-space solution.
> > 
> > Both needs design doc first.
> 
> OK, so this is only partially fixed, depends on Bug 1505394, we do not know
> yet how to fix it, so I suggest to keep this open.


Nope - the original MAIN problem (leaking a device in suspended state) was fixed,
so the primary expectation in the BZ description is met - no suspended devices are left.

However, there is an RFE here to enhance the error behavior for old snapshots.

This will require quite some development effort and is unrelated to the original suspended-device leak, so for the RFE we have a new BZ for upstream (since the fix will certainly not land as a bugfix), and it will require a major testing effort as well.

Comment 21 Roman Bednář 2017-11-06 10:29:38 UTC
Marking verified with the latest rpms. I was not able to hit the issue from the initial comment, that is, an origin being suspended after filling its snapshot to 100% and creating a second one.

Input/output errors as shown in comment 9 are tracked in separate bug 1505394.



# mkfs.ext4 /dev/vg/origin
...

# mount /dev/vg/origin /mnt/origin

# lvs
  LV     VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   rhel_virt-362 -wi-ao----  <6.20g                                                    
  swap   rhel_virt-362 -wi-ao---- 820.00m                                                    
  origin vg            -wi-ao----   1.00g    
                                                
# mkdir /mnt/snap
                                                    
# lvcreate -L2M -s -n snap /dev/vg/origin
  Using default stripesize 64.00 KiB.
  Rounding up size to full physical extent 4.00 MiB
  Logical volume "snap" created.

# mount /dev/vg/snap /mnt/snap

# dd if=/dev/urandom of=/mnt/origin/file bs=4 count=10
10+0 records in
10+0 records out
40 bytes (40 B) copied, 0.000239129 s, 167 kB/s

# lvs vg/snap
  LV   VG Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  snap vg swi-aos--- 4.00m      origin 1.07      
                             
# lvs 
  LV     VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   rhel_virt-362 -wi-ao----  <6.20g                                                    
  swap   rhel_virt-362 -wi-ao---- 820.00m                                                    
  origin vg            owi-aos---   1.00g                                                    
  snap   vg            swi-aos---   4.00m      origin 1.07       
                            
# dd if=/dev/urandom of=/mnt/origin/file bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.277909 s, 151 MB/s

## Snapshot is filled and deactivated 
# lvs vg/snap
  LV   VG Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  snap vg swi-I-s--- 4.00m      origin 100.00    

## Creating another snapshot of same origin                             
# lvcreate -L2M -s -n snap2 /dev/vg/origin
  Using default stripesize 64.00 KiB.
  Rounding up size to full physical extent 4.00 MiB
  Logical volume "snap2" created.

## Origin remains active
# touch /mnt/origin/file2 

# lvs -a 
  LV     VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   rhel_virt-362 -wi-ao----  <6.20g                                                    
  swap   rhel_virt-362 -wi-ao---- 820.00m                                                    
  origin vg            owi-aos---   1.00g                                                    
  snap   vg            swi-I-s---   4.00m      origin 100.00                                 
  snap2  vg            swi-a-s---   4.00m      origin 0.29       


3.10.0-755.el7.x86_64

lvm2-2.02.176-2.el7    BUILT: Fri Nov  3 13:46:53 CET 2017
lvm2-libs-2.02.176-2.el7    BUILT: Fri Nov  3 13:46:53 CET 2017
lvm2-cluster-2.02.176-2.el7    BUILT: Fri Nov  3 13:46:53 CET 2017
device-mapper-1.02.145-2.el7    BUILT: Fri Nov  3 13:46:53 CET 2017
device-mapper-libs-1.02.145-2.el7    BUILT: Fri Nov  3 13:46:53 CET 2017
device-mapper-event-1.02.145-2.el7    BUILT: Fri Nov  3 13:46:53 CET 2017
device-mapper-event-libs-1.02.145-2.el7    BUILT: Fri Nov  3 13:46:53 CET 2017
device-mapper-persistent-data-0.7.3-2.el7    BUILT: Tue Oct 10 11:00:07 CEST 2017
cmirror-2.02.176-2.el7    BUILT: Fri Nov  3 13:46:53 CET 2017
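
The verification above can also be re-checked from the lv_attr string: after the second snapshot is created, the origin's state character (the fifth) should be 'a' (active), not 's' (suspended). A sketch, with the attr-field convention as the only assumption and the vg/origin names taken from the transcript:

```shell
#!/bin/sh
# Return success if an lv_attr string shows a suspended LV: the 5th
# character is 's' (suspended) or 'S' (invalid suspended snapshot).
is_suspended() {
    case $(printf '%s' "$1" | cut -c5) in
        s|S) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example re-check of the result above (requires root; names assumed):
#   attr=$(lvs --noheadings -o lv_attr vg/origin | tr -d ' ')
#   is_suspended "$attr" && echo "FAIL: origin suspended" || echo "OK: origin active"
```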

Comment 24 errata-xmlrpc 2018-04-10 15:16:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0853

