A customer periodically takes a snapshot of the logical volume containing the root filesystem. Several RHEL 6.7 systems have hung, with all applications becoming unresponsive, when lvremove was used to remove the snapshot.
A vmcore was captured, and lvremove was found to be blocked on a page fault. At the time lvremove triggered the page fault, it had already suspended the DM devices for the root logical volume and its snapshot. lvremove was deadlocked: the page fault needed data from the root filesystem, but the root filesystem couldn't be read until lvremove finished its operations and resumed the root logical volume.
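For illustration only (a minimal sketch of the general pattern, not lvm2's actual code, which locks individual regions parsed from /proc/self/maps), the idea is to pin the process's pages before the critical section in which the root LV is suspended, so no page fault can require a read from the suspended device:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* Pin current and future pages before suspending the devices.
	 * This may fail without CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return 1;
	}

	/* ... critical section: suspend DM devices, commit metadata, resume ... */

	munlockall();	/* release the pins once the devices are resumed */
	return 0;
}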
The issue appears to be a regression starting with the 2.02.118 releases of lvm2 for RHEL 6.7. lvremove did not have any of its memory mlocked into physical memory. When run under strace, lvremove was observed passing a length of 0 to every call to mlock():
...
mlock(0x7fc06c125000, 0) = 0
mlock(0x7fc06c33b000, 0) = 0
mlock(0x7fc06c33c000, 0) = 0
...
Older versions that were tested, including lvm2-2.02.111-2.el6_6.6, did not show this behavior. With RHEL 6.6 and earlier versions of lvm2, mlock() was passed the proper lengths for the regions to lock into memory:
...
mlock(0x400000, 1044480) = 0
mlock(0x6fe000, 49152) = 0
mlock(0x70a000, 98304) = 0
...
The bug appears to stem from a change to _maps_line() in lib/mm/memlock.c related to the valgrind defines, specifically the code shortly before mlock() is called:
#ifdef HAVE_VALGRIND
	/*
	 * Valgrind is continually eating memory while executing code
	 * so we need to deactivate check of locked memory size
	 */
#ifndef VALGRIND_POOL
	if (RUNNING_ON_VALGRIND)
#endif
		sz -= sz; /* = 0, but avoids getting warning about dead assigment */
#endif
With HAVE_VALGRIND defined, and VALGRIND_POOL now also defined because of an option passed to ./configure in lvm2.spec, the "sz -= sz;" line is always executed and sets the size to 0. This 0 size is then passed to mlock(), effectively disabling the locking. With lvremove not locked into memory, it can page fault in the middle of its critical section, deadlocking itself and hanging anything else that needs the root filesystem.
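The zero length is easy to miss because mlock() still reports success: on Linux, mlock(addr, 0) returns 0 without pinning any pages, which matches the "= 0" results in the strace output above. A small stand-alone demonstration (hypothetical code, not taken from lvm2):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	static char region[4096];

	/* What the broken build effectively does: the length has collapsed
	 * to 0, so the call "succeeds" but pins nothing. */
	if (mlock(region, 0) == 0)
		printf("mlock(region, 0) returned 0 but locked no pages\n");

	/* What the older builds did: pass the real size of the region.
	 * This can fail with ENOMEM or EPERM if RLIMIT_MEMLOCK is too small. */
	if (mlock(region, sizeof(region)) == 0)
		printf("mlock(region, %zu) pinned the region\n", sizeof(region));

	munlock(region, sizeof(region));
	return 0;
}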
Version-Release number of selected component (if applicable):
lvm2-2.02.118-3.el6_7.2.x86_64
How reproducible:
The deadlock is highly intermittent, since it requires a page fault at a critical point in the operation.
Steps to Reproduce:
1. Create a snapshot of a logical volume.
2. Run "lvremove" under strace to remove the snapshot (an example invocation is sketched after these steps).
3. The strace output will show mlock() calls with a length parameter of 0 when the bug occurs.
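One possible way to capture just the relevant calls (the volume group and snapshot names below are placeholders):

strace -f -e trace=mlock lvremove -y vg00/rootsnap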
Actual results:
lvremove can deadlock when removing a snapshot of a logical volume containing the root filesystem.
Expected results:
lvremove should remove the snapshot without risk of a deadlock.
I'm quite confused about what this BZ is about.
Running 'lvm2' code within 'valgrind' MUST not mlock any memory.
Thus the locking size is reduced to 0 - this is 'expected' and 'wanted'.
Using 0 is not 'breaking' mlock - it disables mlock.
So passing 0 is not a problem - it's the intended behaviour for an lvm2 binary executed under valgrind.
--enable-valgrind-pool somehow slipped into the build.
This option shall not appear in the final build, as its current implementation eats memory (even in the critical section) and it is not protected by runtime detection.
To be clear, this is a straightforward rebuild with a corrected spec file; there is no code change. The "Steps to Reproduce" in the original description no longer showing zero-length mlock() calls will be sufficient to show the problem has gone away.
A temporary workaround of setting the lvm.conf option activation/use_mlockall=1 has been provided, but this is not ideal as it uses more memory and can be slower; it should be reverted once the fixed package is available.
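For reference, that workaround corresponds to a setting along these lines in /etc/lvm/lvm.conf (shown as a sketch; see the comments in the shipped lvm.conf for the exact section):

activation {
	use_mlockall = 1
}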
Marking as verified.
Tested on:
2.6.32-610.el6.x86_64
lvm2-2.02.140-3.el6 BUILT: Thu Jan 21 12:40:10 CET 2016
lvm2-libs-2.02.140-3.el6 BUILT: Thu Jan 21 12:40:10 CET 2016
lvm2-cluster-2.02.140-3.el6 BUILT: Thu Jan 21 12:40:10 CET 2016
udev-147-2.69.el6 BUILT: Thu Jan 28 15:41:45 CET 2016
device-mapper-1.02.114-3.el6 BUILT: Thu Jan 21 12:40:10 CET 2016
device-mapper-libs-1.02.114-3.el6 BUILT: Thu Jan 21 12:40:10 CET 2016
device-mapper-event-1.02.114-3.el6 BUILT: Thu Jan 21 12:40:10 CET 2016
device-mapper-event-libs-1.02.114-3.el6 BUILT: Thu Jan 21 12:40:10 CET 2016
device-mapper-persistent-data-0.6.0-2.el6 BUILT: Thu Jan 21 09:40:25 CET 2016
cmirror-2.02.140-3.el6 BUILT: Thu Jan 21 12:40:10 CET 2016
==========================
Test result:
lvremove strace output:
...
mlock(0x7f81a38fb000, 4096) = 0
mlock(0x7f81a38fc000, 536576) = 0
mlock(0x7f81a3b7e000, 4096) = 0
...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2016-0964.html