Bug 1049296 - Avoid incorrectly dropping locks when lvm forks off a sub-process such as fsadm
Summary: Avoid incorrectly dropping locks when lvm forks off a sub-process such as fsadm
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Zdenek Kabelac
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 1056252
 
Reported: 2014-01-07 11:20 UTC by Cedric Buissart
Modified: 2019-10-10 09:13 UTC
CC List: 12 users

Fixed In Version: lvm2-2.02.107-1.el6
Doc Type: Bug Fix
Doc Text:
LVM uses lock files to prevent incompatible operations from running simultaneously. Prior to this release, LVM could incorrectly drop these locks when forking a sub-process, such as fsadm during 'lvresize -r', and thereby incorrectly permit commands to run in parallel.
Clone Of:
Environment:
Last Closed: 2014-10-14 08:24:59 UTC
Target Upstream Version:
Embargoed:




Links:
  Red Hat Knowledge Base (Solution) 642843 (last updated: never)
  Red Hat Product Errata RHBA-2014:1387: SHIPPED_LIVE, "lvm2 bug fix and enhancement update" (last updated: 2014-10-14 01:39:47 UTC)

Description Cedric Buissart 2014-01-07 11:20:47 UTC
>> Description of problem:

Running `lvresize -r` ends up deleting the /var/lock/lvm/[VG] lock file.
As a result, running `lvresize -r` in parallel with other LVM commands may lead to unexpected behaviour, including metadata inconsistency.


>> Version-Release number of selected component (if applicable):
- Reproducible on RHEL5 and RHEL6


>> How reproducible:
- about 1 time in 3 on RHEL6
- much more easily on RHEL5

>> Steps to Reproduce:
1. get a VG with several LVs formatted with file systems
2. for LV in /dev/[VG]/* ; do lvresize -r -L-400M $LV & sleep 1 ; done


>> Actual results:
The kernel's device-mapper maps and the file systems are resized correctly. However, the VG metadata may be incorrect (some resizes are "forgotten" from a metadata point of view).

>> Expected results:
All LVs correctly resized

>> Additional info:

The issue is in exec_cmd() (when fsadm is executed): reset_locking() in the child unlinks /var/lock/lvm/V_[VG]. Once the file is deleted, the lock no longer protects anything, and any LVM command can recreate a new lock.

int exec_cmd(struct cmd_context *cmd, const char *const argv[],
             int *rstatus, int sync_needed)
{
[...]

        if (!pid) {
                /* Child */
                reset_locking();          /* <= here we should probably not delete the file */
                dev_close_all();
                /* FIXME Fix effect of reset_locking on cache then include this */
                /* destroy_toolcontext(cmd); */
                /* FIXME Use execve directly */
                execvp(argv[0], (char **) argv);           /* <= runs `fsadm check [LV]` */
                log_sys_error("execvp", argv[0]);
                _exit(errno);
        }


When we follow the child from gdb :

(gdb) bt
#0  unlink () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000004b8a54 in _undo_flock (file=0x976e20 "/var/lock/lvm/V_vgtests", fd=4) at locking/file_locking.c:56
#2  0x00000000004b8b7c in _release_lock (file=0x0, unlock=0) at locking/file_locking.c:79
#3  0x0000000000469909 in reset_locking () at locking/locking.c:193
#4  0x00000000004986d5 in exec_cmd (cmd=<value optimized out>, argv=0x7fffffffcb80, rstatus=0x7fffffffdd0c, sync_needed=<value optimized out>) at misc/lvm-exec.c:73

Once the unlink is done, the lock held by the parent is voided. Any LVM command is then free to run and create its own flock ... but lvresize has not committed its metadata yet.
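
To make this concrete, here is a minimal standalone sketch (illustrative only, not LVM code; the lock path is a stand-in for /var/lock/lvm/V_[VG]). flock() locks an inode, not a path: once the lock file is unlinked, a newly started process creates a fresh inode at the same path and acquires its own lock immediately, even though the original holder still holds a lock on the now-anonymous inode.

/* Minimal demonstration (not LVM code) of why the unlink voids the lock. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

#define LOCKFILE "/tmp/demo_vg_lock"   /* stand-in for /var/lock/lvm/V_[VG] */

int main(void)
{
        /* "lvresize": create the lock file and take an exclusive lock. */
        int holder = open(LOCKFILE, O_CREAT | O_RDWR, 0600);
        flock(holder, LOCK_EX);

        /* What reset_locking() in the forked child effectively does:
         * unlink the path while the parent still holds its flock. */
        unlink(LOCKFILE);

        /* A concurrent LVM command now re-creates the path (a new inode)
         * and tries a non-blocking exclusive lock -- it succeeds. */
        int intruder = open(LOCKFILE, O_CREAT | O_RDWR, 0600);
        if (flock(intruder, LOCK_EX | LOCK_NB) == 0)
                printf("second lock acquired: mutual exclusion is gone\n");

        close(intruder);
        close(holder);
        return 0;
}

This is exactly the window exploited in the gdb session in comment 8 below: once the child has unlinked the file, an unrelated lvremove can take its "own" lock and modify the metadata.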

Comment 1 Cedric Buissart 2014-01-07 11:29:16 UTC
For a better explanation of the behaviour:

[Note : extract below is from RHEL5, which is a lot more impacted than RHEL6]

== before :

[root@vm-199 ~]# df -h /mnt/?/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/tests-a   733M  285M  411M  41% /mnt/a
/dev/mapper/tests-b   733M  285M  411M  41% /mnt/b
/dev/mapper/tests-c   590M  184M  376M  33% /mnt/c
/dev/mapper/tests-d   637M  185M  420M  31% /mnt/d
/dev/mapper/tests-e   637M  185M  420M  31% /mnt/e
/dev/mapper/tests-f    90M  5.6M   79M   7% /mnt/f
[root@vm-199 ~]# lvs tests
  LV   VG    Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  a    tests -wi-ao 756.00M
  b    tests -wi-ao 756.00M
  c    tests -wi-ao 608.00M
  d    tests -wi-ao 656.00M
  e    tests -wi-ao 656.00M
  f    tests -wi-ao  92.00M

== unmount & resize concurrently & mount:

[root@vm-199 ~]# for LV in a b c d e ; do lvresize -r -L-400M /dev/tests/$LV & sleep 1 ; done

== after :

[root@vm-199 ~]# lvs tests
  LV   VG    Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  a    tests -wi-ao 756.00M
  b    tests -wi-ao 756.00M
  c    tests -wi-ao 608.00M
  d    tests -wi-ao 256.00M
  e    tests -wi-ao 656.00M
  f    tests -wi-ao  92.00M
[root@vm-199 ~]# df -h /mnt/?/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/tests-a   346M  285M   44M  87% /mnt/a
/dev/mapper/tests-b   346M  285M   44M  87% /mnt/b
/dev/mapper/tests-c   202M  184M  7.8M  96% /mnt/c
/dev/mapper/tests-d   249M  184M   52M  79% /mnt/d
/dev/mapper/tests-e   249M  184M   52M  79% /mnt/e
/dev/mapper/tests-f    90M  5.6M   79M   7% /mnt/f


From the above, we can see that the file systems have the correct sizes. However, only 1 LV has been resized in the metadata: `tests/d`. For the others, the modification has not been taken into account.

Comment 3 Zdenek Kabelac 2014-01-07 12:21:52 UTC
It should be noted here - that  lvm2 support for resize of metadata is some kind of 'bonus' feature - so far lvm2 does not care about any data living on the block device.

So it cannot be blamed for the non-atomicity of the filesystem resize operation, even though we have tried to make it at least 'hard' to break.

There are many other paths where you could trigger a similar error (practically any error in the middle of the resize operation is ignored).

We also have Bug 644612 and Bug 651102, which are in a similar class of problems. We are considering various ideas for how to deal with them,
but at this point there is absolutely no guarantee that the filesystem resize will happen together with the metadata change for the related block device.

Comment 4 Zdenek Kabelac 2014-01-07 12:24:13 UTC
(In reply to Zdenek Kabelac from comment #3)
> It should be noted here - that  lvm2 support for resize of metadata is some

Correction: that should read "resize of filesystem with metadata".

Comment 5 Cedric Buissart 2014-01-07 15:25:25 UTC
Thank you for the explanation. 

I believe that the incident, although currently visible through resize/fsadm only, is tied to exec_cmd() rather than to resize.

As such, there may be other potentially problematic code paths too (after a quick check: _thin_pool_callback, and activation, where a modprobe command may be run).

Comment 6 Zdenek Kabelac 2014-01-07 15:32:32 UTC
The thin pool callback is only meant to be called while the thin pool device is being activated, so the lvm2 lock is kept during this callback; it happens directly during activation of the device tree.

The collision you hit is that you start a command which expects the LV to be in some state, and then the lvm command gives up the lock in favour of fsadm, which does the filesystem resize without the lvm lock being held.

I believe (or at least hope) there is no bug where something in the middle of an lvm command would remove the lock and the lvm command would continue without it, so that two binaries would be writing to the lvm metadata area at once.

Comment 7 Cedric Buissart 2014-01-09 10:34:52 UTC
Let me show the extent of the issue:

In the case of lvreduce, this is my understanding of the overall algorithm:
1- LVM takes an exclusive lock on the VG and reads the VG metadata
vg_read_for_update()

2- LVM attempts an fsck by forking and executing fsadm.
exec_cmd("fsadm", "check")

3a- Parent: wait4()s the child.

3b- Child: right after the fork, exec_cmd() resets locking. This has the side effect of deleting the VG lock file, so the parent loses its lock as well.
At this moment, any new LVM command can take a new lock (including an exclusive lock)
reset_locking() -> undo_flock(VG file) -> unlink(VG file)
execvp("fsadm", "check");

The child exits, and the parent resumes.

4- similar to steps 2 & 3 : fsadm runs resize2fs
exec_cmd("fsadm", "resize")

5- LV is resized
lv_reduce()

6- metadata are written
vg_write()

7- locks are released
unlock_and_release_vg()


==> Note: the only difference between lvreduce and lvextend is that resize2fs is run after lv_extend().

==> The issue: from step 3b onwards, the parent no longer holds a lock on the VG, so any command can write metadata. These writes will be annihilated when lvresize reaches step 6 and writes metadata based on "step 1 + LV resize".
This is why we end up in the state described in comment 1 (all file systems were resized, but in the metadata, only 1 LV got resized).

==> Another example follows in my next comment, with gdb for better control over the program.

Comment 8 Cedric Buissart 2014-01-09 10:35:54 UTC
Below is lvm run under gdb for full control.
The plan is to reduce vgtests/a down to 100M, but I will catch the fork and follow the child to show that unlinking the VG lock file is not a good idea.
The result: while the LV is being resized, I will run an lvremove. This lvremove will be annihilated when lvresize writes its metadata.


==== Status of LVM prior to running :

[root@dhcp-26-101 build-tests]# lvs
  LV   VG      Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  Root VG01    -wi-ao----   5.47g
  Swap VG01    -wi-ao---- 288.00m
  a    vgtests -wi-a----- 500.00m
  b    vgtests -wi-a----- 600.00m
  c    vgtests -wi-a----- 900.00m
  d    vgtests -wi-a----- 600.00m

==== I will reduce vgtests/a down to 100M, but I will catch the fork, so that we have plenty of time to check the LVM status. During fork, parent will peacefully wait4 child.

[root@dhcp-26-101 build-tests]# gdb /sbin/lvm
[...]
(gdb) set follow-fork-mode child
(gdb) set args lvresize -r -L100M vgtests/a
(gdb) catch fork
Catchpoint 1 (fork)
(gdb) r
Starting program: /sbin/lvm lvresize -r -L100M vgtests/a
File descriptor 3 (pipe:[19305]) leaked on lvm invocation. Parent PID 5405: gdb
File descriptor 4 (pipe:[19305]) leaked on lvm invocation. Parent PID 5405: gdb
File descriptor 5 (pipe:[19306]) leaked on lvm invocation. Parent PID 5405: gdb
File descriptor 6 (pipe:[19306]) leaked on lvm invocation. Parent PID 5405: gdb

Catchpoint 1 (forked process 5442), 0x00000033576acc6d in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/fork.c:131
131       pid = ARCH_FORK ();
(gdb) break unlink
Breakpoint 2 at 0x33576dc7d0: file ../sysdeps/unix/syscall-template.S, line 82.
(gdb) continue
Continuing.
[New process 5442]

Breakpoint 2, unlink () at ../sysdeps/unix/syscall-template.S:82
82      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
#0  unlink () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000004b8a54 in _undo_flock (file=0x788e20 "/var/lock/lvm/V_vgtests", fd=4) at locking/file_locking.c:56
#2  0x00000000004b8b7c in _release_lock (file=0x0, unlock=0) at locking/file_locking.c:79
#3  0x0000000000469909 in reset_locking () at locking/locking.c:193
#4  0x00000000004986d5 in exec_cmd (cmd=<value optimized out>, argv=0x7fffffffcb50, rstatus=0x7fffffffdcdc, sync_needed=<value optimized out>) at misc/lvm-exec.c:73
#5  0x000000000046ee96 in _fsadm_cmd (cmd=0x7240f0, vg=0x794b50, lp=<value optimized out>, fcmd=FSADM_CMD_CHECK, status=0x7fffffffdcdc) at metadata/lv_manip.c:3442
#6  0x000000000047ad4d in _lvresize_volume (cmd=0x7240f0, lv=0x794e70, lp=0x7fffffffdd30, pvh=0x794bf0) at metadata/lv_manip.c:4072
#7  lv_resize (cmd=0x7240f0, lv=0x794e70, lp=0x7fffffffdd30, pvh=0x794bf0) at metadata/lv_manip.c:4166
#8  0x000000000042c487 in lvresize (cmd=0x7240f0, argc=<value optimized out>, argv=<value optimized out>) at lvresize.c:201
#9  0x000000000042738f in lvm_run_command (cmd=0x7240f0, argc=1, argv=0x7fffffffe128) at lvmcmdline.c:1178
#10 0x000000000042a625 in lvm2_main (argc=4, argv=0x7fffffffe110) at lvmcmdline.c:1613
#11 0x000000335761ecdd in __libc_start_main (main=0x43f0a0 <main>, argc=5, ubp_av=0x7fffffffe108, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe0f8) at libc-start.c:226
#12 0x0000000000415d99 in _start ()


==== We are paused in the child, just before the execution of fsadm. We are about to delete the VG lock file in reset_locking(), called from exec_cmd().

At this point, lvresize has an exclusive lock, no LVM command can be run against vgtests VG :
# lvs
^C  CTRL-c detected: giving up waiting for lock
  /var/lock/lvm/V_vgtests: flock failed: Interrupted system call
  Can't get lock for vgtests
  Skipping volume group vgtests
  LV   VG   Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  Root VG01 -wi-ao----   5.47g                                             
  Swap VG01 -wi-ao---- 288.00m

== ^^ This is normal and expected since the lvresize command holds an exclusive lock.

==== We step, and delete the VG lock file from the child context
(gdb) next
unlink () at ../sysdeps/unix/syscall-template.S:83
83              ret

[root@dhcp-26-101 build-tests]# ll /var/lock/lvm/
total 0
[root@dhcp-26-101 build-tests]# lvs
  LV   VG      Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  Root VG01    -wi-ao----   5.47g                                             
  Swap VG01    -wi-ao---- 288.00m                                             
  a    vgtests -wi-a----- 500.00m                                             
  b    vgtests -wi-a----- 600.00m                                             
  c    vgtests -wi-a----- 900.00m                                            
  d    vgtests -wi-a----- 600.00m
[root@dhcp-26-101 build-tests]# lvremove vgtests/d
Do you really want to remove active logical volume d? [y/n]: y
  Logical volume "d" successfully removed
[root@dhcp-26-101 build-tests]# lvs
  LV   VG      Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  Root VG01    -wi-ao----   5.47g                                             
  Swap VG01    -wi-ao---- 288.00m                                             
  a    vgtests -wi-a----- 500.00m                                             
  b    vgtests -wi-a----- 600.00m                                             
  c    vgtests -wi-a----- 900.00m

== ^^ The lvresize command hasn't finished, but *now* we have full access to the VG: vgtests/d has been successfully removed

==== We continue the execution
(gdb) delete 1 2
(gdb) info breakpoints 
No breakpoints or watchpoints.
(gdb) continue 
Continuing.
process 5442 is executing new program: /bin/bash
[New process 5490]
process 5490 is executing new program: /sbin/lvm
Missing separate debuginfos, use: debuginfo-install bash-4.1.2-9.el6_2.x86_64

Program exited normally.
(gdb) fsck from util-linux-ng 2.17.2
/dev/mapper/vgtests-a: clean, 11/127512 files, 26636/512000 blocks
resize2fs 1.41.12 (17-May-2010)
Resizing the filesystem on /dev/mapper/vgtests-a to 102400 (1k) blocks.
The filesystem on /dev/mapper/vgtests-a is now 102400 blocks long.

  Reducing logical volume a to 100.00 MiB
  Logical volume a successfully resized

===== Now let's look at the LVM status :

[root@dhcp-26-101 build-tests]# lvs
  LV   VG      Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
  Root VG01    -wi-ao----   5.47g                                             
  Swap VG01    -wi-ao---- 288.00m                                             
  a    vgtests -wi-a----- 100.00m                                             
  b    vgtests -wi-a----- 600.00m                                             
  c    vgtests -wi-a----- 900.00m                                             
  d    vgtests -wi------- 600.00m


== ^^ vgtests/d has reappeared. That's because lvresize wasn't aware of the LV removal and overwrote the metadata based on the state it read at step 1 plus its own modification.

Comment 9 Zdenek Kabelac 2014-01-09 11:33:15 UTC
Ahhhh  - I see the problem - there were also some major code shifts between command and library (for liblvm  lvresize support).

We should probably drop the VG lock before the exec and reopen the VG afterwards (keeping the VG lock across the fsresize operation is probably not the best idea), but it should be thought through together with the BZs mentioned in comment 3.
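
As a rough, self-contained illustration of that idea (generic code, not lvm2 internals; the lock path and the helper command are placeholders): release the lock before forking the external helper, then re-acquire it and re-read the on-disk state before writing anything, instead of trusting state read before the helper ran.

/* Generic sketch of "drop the lock around the exec" (not lvm2 code). */
#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int take_lock(const char *path)
{
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        flock(fd, LOCK_EX);
        return fd;
}

int main(void)
{
        const char *lockfile = "/tmp/demo_vg_lock";   /* stand-in path */

        int fd = take_lock(lockfile);
        /* ... read the VG metadata here ... */

        /* Do NOT hold the lock across the long-running helper. */
        flock(fd, LOCK_UN);
        close(fd);

        pid_t pid = fork();
        if (pid == 0) {
                execlp("true", "true", (char *)NULL); /* stand-in for fsadm */
                _exit(127);
        }
        waitpid(pid, NULL, 0);

        /* Re-acquire the lock and re-read the metadata: it may have been
         * changed by another command while the helper was running. */
        fd = take_lock(lockfile);
        /* ... re-read, modify, and only then commit the metadata ... */
        flock(fd, LOCK_UN);
        close(fd);
        return 0;
}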

Comment 10 Cedric Buissart 2014-01-09 12:07:39 UTC
I would have thought that keeping the VG lock across the fsresize would be the only clean way around this. However, I understand the major drawback of preventing any LVM operation (read or write) on the VG during the whole resize. That can be annoying for a big resize (we could print a warning when `-r` is used?).

Are we sure that dropping the VG lock, running the fsresize, then taking the lock again will not bring its own dangers? E.g., what if something else takes the lock and does not release it? What if other modifications are going on in the background?

Comment 11 Zdenek Kabelac 2014-01-09 12:33:45 UTC
The solution here is to mark the LV being resized as locked, so that lvm commands will disallow manipulation of such an LV.

But it has some design consequences which have not yet been decided.
As mentioned in my first comment, so far lvm has basically cared only about the block layer, and this change will open a whole new can of worms.

We would now need to handle all the error states along the new paths, e.g. the machine crashes during the fs-resize: after the reboot, is the lvm2 activation code supposed to be responsible for doing a safe recovery, and how should such a recovery be handled?

There are a lot of cases, and it gets even more complex when various types of fs get into the game.

A possible way could be to introduce a new LV lock flag like ONLY_DEACTIVATE, which would prevent further manipulation while the LV is active but could be safely dropped with the next activation; the admin would be responsible for handling recovery.
An LV with this flag could only be deactivated, and maybe lvconvert --repair'ed.
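
A hypothetical sketch of that gate (all names, including the LV_ONLY_DEACTIVATE bit and the lv_stub structure, are invented here for illustration and are not lvm2 code): while the flag is set, every operation except deactivation (and possibly repair) is refused.

/* Hypothetical illustration of the proposed flag -- not lvm2 code. */
#include <stdbool.h>
#include <stdio.h>

#define LV_ONLY_DEACTIVATE  (1u << 0)   /* invented status bit */

struct lv_stub {
        const char *name;
        unsigned status;
};

enum lv_op { LV_OP_RESIZE, LV_OP_REMOVE, LV_OP_DEACTIVATE, LV_OP_REPAIR };

/* Refuse everything except deactivation (and maybe repair) while the
 * marker is set; the flag would be dropped on the next activation. */
static bool lv_op_allowed(const struct lv_stub *lv, enum lv_op op)
{
        if (!(lv->status & LV_ONLY_DEACTIVATE))
                return true;
        return op == LV_OP_DEACTIVATE || op == LV_OP_REPAIR;
}

int main(void)
{
        struct lv_stub lv = { "vgtests/a", LV_ONLY_DEACTIVATE };

        printf("lvresize allowed:   %d\n", lv_op_allowed(&lv, LV_OP_RESIZE));
        printf("deactivate allowed: %d\n", lv_op_allowed(&lv, LV_OP_DEACTIVATE));
        return 0;
}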

Comment 12 Alasdair Kergon 2014-01-27 11:06:58 UTC
There is indeed a bug where a lock file is incorrectly deleted after a fork - thanks for spotting this.  I'm testing a fix for that.  But as Zdenek says, another change is required because these locks should not be held across long operations like this.

Comment 14 Zdenek Kabelac 2014-01-27 11:21:08 UTC
Fixed in upstream commit:

https://www.redhat.com/archives/lvm-devel/2014-January/msg00056.html

In fact this bug has also caused problems during thin pool activation, which forks to execute thin_check. So lvm2 was dropping the lock during volume activation, which may have led to other kinds of problems.


For testing, parallel execution of the lvresize and lvremove commands as described in the BZ can be used; lvm2 now includes an internal test for this case using a slowed-down device being resized, so it is easy to hit this type of race.
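
The exact change is in the linked commit; as a hedged illustration of the corrected behaviour only (a guess at the general shape, not the actual patch), the forked child should drop its own copy of the lock state, for example by closing the inherited descriptor, without unlinking the file the parent still holds, so concurrent commands remain excluded.

/* Illustration of the intended behaviour (not the upstream patch): the
 * child closes its inherited lock descriptor but leaves the lock file in
 * place, so the parent's flock keeps excluding concurrent commands. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/wait.h>
#include <unistd.h>

#define LOCKFILE "/tmp/demo_vg_lock"   /* stand-in for /var/lock/lvm/V_[VG] */

int main(void)
{
        int holder = open(LOCKFILE, O_CREAT | O_RDWR, 0600);
        flock(holder, LOCK_EX);                 /* parent takes the VG lock */

        pid_t pid = fork();
        if (pid == 0) {
                /* Child: forget the lock WITHOUT unlinking the file. */
                close(holder);
                execlp("true", "true", (char *)NULL); /* stand-in for fsadm */
                _exit(127);
        }
        waitpid(pid, NULL, 0);

        /* A concurrent command trying to lock the same path still blocks. */
        int other = open(LOCKFILE, O_CREAT | O_RDWR, 0600);
        if (flock(other, LOCK_EX | LOCK_NB) != 0)
                printf("concurrent lock refused: the VG is still protected\n");

        close(other);
        close(holder);
        return 0;
}

Contrast this with the sketch after comment 0: with the unlink in place the second flock succeeds; with the child merely closing its descriptor, it is refused.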

Comment 20 Zdenek Kabelac 2014-03-31 12:07:50 UTC
Lvm2 tests this case with this test script:
test/shell/lock-parallel.sh

https://git.fedorahosted.org/cgit/lvm2.git/tree/test/shell/lock-parallel.sh

Comment 22 Nenad Peric 2014-06-30 11:23:22 UTC
Ran the test by resizing a bigger FS rather than slowing the device down.
The lvremove waits until after the resize has finished, so it cannot remove the LV in the middle of the resizing process.

[root@virt-015 ~]# lvcreate -L8G -n $LV1 $VG
  Logical volume "lvol_1" created
[root@virt-015 ~]# mkfs.ext4 /dev/my_vg/lvol_1 
.
.
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 32 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

[root@virt-015 ~]# (lvresize -L1G -r $VG/$LV1 &) ; sleep 1; lvremove -ff $VG/$LV1
fsck from util-linux-ng 2.17.2
/dev/mapper/my_vg-lvol_1: clean, 11/524288 files, 70287/2097152 blocks
resize2fs 1.41.12 (17-May-2010)
Resizing the filesystem on /dev/mapper/my_vg-lvol_1 to 262144 (4k) blocks.
The filesystem on /dev/mapper/my_vg-lvol_1 is now 262144 blocks long.

  Reducing logical volume lvol_1 to 1,00 GiB
  Logical volume lvol_1 successfully resized
  Logical volume "lvol_1" successfully removed


marking VERIFIED with:

2.6.32-488.el6.x86_64

lvm2-2.02.107-1.el6    BUILT: Mon Jun 23 16:44:45 CEST 2014
lvm2-libs-2.02.107-1.el6    BUILT: Mon Jun 23 16:44:45 CEST 2014
lvm2-cluster-2.02.107-1.el6    BUILT: Mon Jun 23 16:44:45 CEST 2014
udev-147-2.55.el6    BUILT: Wed Jun 18 13:30:21 CEST 2014
device-mapper-1.02.86-1.el6    BUILT: Mon Jun 23 16:44:45 CEST 2014
device-mapper-libs-1.02.86-1.el6    BUILT: Mon Jun 23 16:44:45 CEST 2014
device-mapper-event-1.02.86-1.el6    BUILT: Mon Jun 23 16:44:45 CEST 2014
device-mapper-event-libs-1.02.86-1.el6    BUILT: Mon Jun 23 16:44:45 CEST 2014
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 15:43:06 CEST 2014
cmirror-2.02.107-1.el6    BUILT: Mon Jun 23 16:44:45 CEST 2014

Comment 23 errata-xmlrpc 2014-10-14 08:24:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1387.html

