Bug 1278920 - deadlock when removing snapshot of root LV from lvremove failing to mlock() itself into memory
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: LVM and device-mapper development team
QA Contact: cluster-qe@redhat.com
Depends On:
Blocks: 1279983
Reported: 2015-11-06 18:32 UTC by David Jeffery
Modified: 2019-09-12 09:14 UTC (History)

Fixed In Version: lvm2-2.02.140-3.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1279983 (view as bug list)
Last Closed: 2016-05-11 01:18:55 UTC
Target Upstream Version:


System                             ID              Priority  Status        Summary                              Last Updated
Red Hat Knowledge Base (Solution)  2044513         None      None          None                                 Never
Red Hat Product Errata             RHBA-2016:0964  normal    SHIPPED_LIVE  lvm2 bug fix and enhancement update  2016-05-10 22:57:40 UTC

Description David Jeffery 2015-11-06 18:32:38 UTC
A customer periodically takes a snapshot of the logical volume containing the root filesystem.  Several RHEL 6.7 systems have hung and had all applications become unresponsive when using lvremove to remove the snapshot.

A vmcore was captured, and lvremove was found to be stuck waiting for a page fault.  At the time lvremove triggered the page fault, it had already suspended the DM devices for the root logical volume and snapshot.  lvremove was deadlocked because the page fault needed data from the root filesystem, but the root filesystem couldn't be read until lvremove finished its operations and resumed the root logical volume.

The issue appears to be a regression starting with the 2.02.118 releases of lvm2 for RHEL 6.7.  lvremove did not have any of its memory mlocked into physical memory.  Under a test with strace running, lvremove was found to be passing a length of 0 to all calls to mlock():

mlock(0x7fc06c125000, 0)                = 0
mlock(0x7fc06c33b000, 0)                = 0
mlock(0x7fc06c33c000, 0)                = 0

The older versions tested, including lvm2-2.02.111-2.el6_6.6, did not show this behavior.  With RHEL 6.6 and older versions of lvm2, mlock() was passed proper lengths for the regions to lock into memory:

mlock(0x400000, 1044480)                = 0
mlock(0x6fe000, 49152)                  = 0
mlock(0x70a000, 98304)                  = 0

The bug appears to be from a change to _maps_line() in lib/mm/memlock.c related to valgrind defines, specifically the code shortly before mlock() is called:

         * Valgrind is continually eating memory while executing code
         * so we need to deactivate check of locked memory size
                sz -= sz; /* = 0, but avoids getting warning about dead assigment */


With HAVE_VALGRIND defined and VALGRIND_POOL now defined from an option passed to ./configure in lvm2.spec, the "sz -= sz;" line is always executed and sets the size to 0.  This 0 size is then passed to mlock(), breaking the use of mlock().  With lvremove not locked into memory, it can page fault in the middle of its critical section, deadlock itself, and hang anything else needing the root filesystem.
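
The effect described above can be modelled with a small sketch. This is not the lvm2 source: the function demo_lock_size() is hypothetical, and it takes the two build-time defines and the runtime valgrind check as plain parameters so that every combination can be exercised in one program. It assumes, as this report states, that defining VALGRIND_POOL makes the zeroing unconditional.

```c
#include <stddef.h>

/* Hypothetical model of the size handed to mlock().
 *
 * have_valgrind       stands in for HAVE_VALGRIND being defined
 * valgrind_pool       stands in for VALGRIND_POOL being defined
 * running_on_valgrind stands in for a runtime "am I under valgrind?" check
 */
size_t demo_lock_size(size_t sz, int have_valgrind, int valgrind_pool,
                      int running_on_valgrind)
{
    /* Assumption: with the pool option enabled, the runtime check is
     * skipped and the size is always zeroed. */
    if (have_valgrind && (valgrind_pool || running_on_valgrind))
        sz -= sz;   /* becomes 0, so mlock() locks nothing */

    return sz;
}
```

In this model, demo_lock_size(1044480, 1, 1, 0) yields 0 even though the process is not running under valgrind, while demo_lock_size(1044480, 1, 0, 0) leaves the size untouched.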

Version-Release number of selected component (if applicable):

How reproducible:
The deadlock is intermittent, as it requires a page fault at a critical time.

Steps to Reproduce:
1.  Create a snapshot of a logical volume
2.  Run "lvremove" under strace to remove the snapshot.
3.  strace data will show mlock() calls with a length parameter of 0 when the bug occurs.
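
For checking a saved strace log against step 3, a small helper along these lines might be used. It is only a sketch: the function name classify_mlock_line() is hypothetical, and it assumes lines in the plain format shown in this report, without the PID or timestamp prefixes that some strace options add.

```c
#include <stdio.h>

/* Classify one line of strace output.
 *
 * Returns  1 if the line is an mlock() call with a length of 0,
 *          0 if it is an mlock() call with a non-zero length,
 *         -1 if it is not an mlock() call in the expected format.
 */
int classify_mlock_line(const char *line)
{
    unsigned long long addr, len;

    /* %llx accepts the leading "0x" that strace prints for addresses. */
    if (sscanf(line, " mlock(%llx , %llu", &addr, &len) != 2)
        return -1;

    return len == 0 ? 1 : 0;
}
```

Feeding each line of the log to classify_mlock_line() and counting the results that equal 1 gives the number of zero-length calls; a fixed package would be expected to produce none.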

Actual results:
lvremove can deadlock when removing a snapshot for a logical volume containing the root filesystem.

Expected results:
lvremove should remove the snapshot without risk of a deadlock.

Comment 2 Zdenek Kabelac 2015-11-06 19:07:53 UTC
I'm quite confused about what this BZ is about.

Running 'lvm2' code within 'valgrind' MUST NOT mlock any memory.

Thus the locking size is set to 0 - this is 'expected' and 'wanted'.
Using 0 is not 'breaking' mlock - it disables mlock.

So passing 0 is not a problem - it's the intended behaviour for an lvm2 binary executed under valgrind.

Comment 3 Alasdair Kergon 2015-11-06 19:48:33 UTC
The spec file should not have set that option.

Comment 4 Zdenek Kabelac 2015-11-06 20:01:50 UTC
--enable-valgrind-pool somehow slipped into the build.

This option must not appear in a final build, as its current implementation eats memory (even in the critical section) and is not protected by runtime detection.

Comment 6 Alasdair Kergon 2015-11-10 03:22:06 UTC
To be clear, this is a straightforward rebuild with a corrected spec file.  No code change.  "Steps to reproduce" in the original description no longer showing zero will be sufficient to show the problem has gone away.

Comment 7 Alasdair Kergon 2015-11-10 03:24:35 UTC
A temporary workaround of setting activation/use_mlockall=1 in lvm.conf has been provided.  It is not ideal, as it uses more memory and can be slower, and it should be reverted once the fixed package is available.
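
As a rough illustration of why the workaround costs more memory, the sketch below pins the whole process instead of selected regions. It assumes, as the option name suggests, that use_mlockall results in an mlockall() call; the wrapper names are hypothetical and the flags shown are an illustration, not a statement of what lvm2 passes.

```c
#include <sys/mman.h>

/* Pin every page the process has mapped now (MCL_CURRENT) and every
 * page it maps later (MCL_FUTURE).  Returns 0 on success, or -1 with
 * errno set, for example when RLIMIT_MEMLOCK is too low. */
int demo_pin_whole_process(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Release all locks taken by demo_pin_whole_process(). */
int demo_unpin_whole_process(void)
{
    return munlockall();
}
```

Because everything mapped into the process is pinned, including parts that the critical section never touches, this approach holds more physical memory than locking individual regions with mlock().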

Comment 15 Roman Bednář 2015-11-24 07:50:48 UTC
Marking as verified.

lvremove strace output:

old version: lvm2-2.02.118-3.el6_7.3

mlock(0x7fc530929000, 0)                = 0
mlock(0x7fc53092a000, 0)                = 0
mlock(0x7fc530b41000, 0)                = 0

new version: lvm2-2.02.118-3.el6_7.4

mlock(0x7f0a7a024000, 4096)             = 0
mlock(0x7f0a7a025000, 94208)            = 0
mlock(0x7f0a7a23c000, 4096)             = 0

Comment 17 Roman Bednář 2016-02-04 14:43:55 UTC
Marking as verified.

Tested on:


lvm2-2.02.140-3.el6    BUILT: Thu Jan 21 12:40:10 CET 2016
lvm2-libs-2.02.140-3.el6    BUILT: Thu Jan 21 12:40:10 CET 2016
lvm2-cluster-2.02.140-3.el6    BUILT: Thu Jan 21 12:40:10 CET 2016
udev-147-2.69.el6    BUILT: Thu Jan 28 15:41:45 CET 2016
device-mapper-1.02.114-3.el6    BUILT: Thu Jan 21 12:40:10 CET 2016
device-mapper-libs-1.02.114-3.el6    BUILT: Thu Jan 21 12:40:10 CET 2016
device-mapper-event-1.02.114-3.el6    BUILT: Thu Jan 21 12:40:10 CET 2016
device-mapper-event-libs-1.02.114-3.el6    BUILT: Thu Jan 21 12:40:10 CET 2016
device-mapper-persistent-data-0.6.0-2.el6    BUILT: Thu Jan 21 09:40:25 CET 2016
cmirror-2.02.140-3.el6    BUILT: Thu Jan 21 12:40:10 CET 2016

Test result:

lvremove strace output:

mlock(0x7f81a38fb000, 4096)             = 0
mlock(0x7f81a38fc000, 536576)           = 0
mlock(0x7f81a3b7e000, 4096)             = 0

Comment 19 errata-xmlrpc 2016-05-11 01:18:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

