Bug 1260194

Summary: Reproducible LVM errors on a few systems when .cache file contains a device that is no longer on the system
Product: Red Hat Enterprise Linux 6 Reporter: Alex Wang <alex.wang>
Component: lvm2 Assignee: Peter Rajnoha <prajnoha>
lvm2 sub component: Devices, Filtering and Stacking (RHEL6) QA Contact: cluster-qe <cluster-qe>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: agk, cmarthal, heinzm, jbrassow, jsvarova, msnitzer, prajnoha, prockai, rbednar, tlavigne, zkabelac
Version: 6.6 Keywords: ZStream
Target Milestone: pre-dev-freeze   
Target Release: 6.8   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.140-1.el6 Doc Type: Bug Fix
Doc Text:
When the /etc/lvm/cache/.cache file contained an entry that did not exist, the code processed an uninitialized structure which led to unreliable behavior. As a consequence, an error message referencing an undefined (major, minor) pair was returned. With this update, non-existent devices are handled correctly while processing the .cache file, and the aforementioned error message is no longer generated.
Story Points: ---
Clone Of:
: 1261070 1261071 Environment:
Last Closed: 2016-05-11 01:18:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1261070, 1261071, 1268411    

Description Alex Wang 2015-09-04 17:18:02 UTC
Created attachment 1070364 [details]
Verbose console output of vgchange from rc.sysinit

Description of problem:

After installation of the lvm2 package lvm2-libs-2.02.118-3.el6_7.2.s390x, the following errors show up on the console during boot:

Setting up Logical Volume Management:   /dev/VolGroup00/TmpVol00: stat failed: No such file or directory
  Path /dev/VolGroup00/TmpVol00 no longer valid for device(0,1)

[Continues..]

In the customer's case, the package was installed during a system update using YUM. The customer downgraded the following packages and the error messages disappeared:

lvm2-2.02.118-3.el6_7.2.s390x
device-mapper-1.02.95-3.el6_7.2.s390x
lvm2-libs-2.02.118-3.el6_7.2.s390x
device-mapper-persistent-data-0.3.2-1.el6.s390x
device-mapper-libs-1.02.95-3.el6_7.2.s390x
device-mapper-event-1.02.95-3.el6_7.2.s390x
kpartx-0.4.9-87.el6.s390x
device-mapper-multipath-0.4.9-87.el6.s390x
device-mapper-event-libs-1.02.95-3.el6_7.2.s390x
device-mapper-multipath-libs-0.4.9-87.el6.s390x

Version-Release number of selected component (if applicable):

lvm2-libs-2.02.118-3.el6_7.2.s390x

How reproducible:

Every time

Steps to Reproduce:
1. Run yum update to update the lvm2 package
2. Reboot server
3. 

Actual results:

A large number of "stat failed" errors

Expected results:

No errors should be displayed

Comment 3 Alasdair Kergon 2015-09-04 23:58:07 UTC
When the .cache file contains an entry that no longer exists, the code reads an uninitialised structure which leads to unreliable behaviour and the message with undefined major and minor numbers like "(0,1)".

Fixed in 2.02.130 upstream.
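
To check whether a system has the precondition described above, a minimal Python sketch follows. It lists cached device paths that no longer exist. It assumes the cache file is /etc/lvm/cache/.cache and that each device is stored as a double-quoted /dev path (the layout shown elsewhere in this report); the function names are hypothetical and not part of LVM.

```python
import os
import re

# Assumption: each cached device appears as a double-quoted /dev path,
# for example:  "/dev/vg/missing",
ENTRY_RE = re.compile(r'"(/dev/[^"]+)"')


def cached_paths(cache_text):
    """Return every quoted /dev path found in the cache text, in order."""
    return ENTRY_RE.findall(cache_text)


def stale_entries(cache_text, exists=os.path.exists):
    """Return cached paths for which `exists` reports no such node."""
    return [p for p in cached_paths(cache_text) if not exists(p)]


if __name__ == "__main__":
    cache_file = "/etc/lvm/cache/.cache"  # assumed default location
    try:
        with open(cache_file) as fh:
            text = fh.read()
    except OSError as err:
        print(f"cannot read {cache_file}: {err}")
    else:
        for path in stale_entries(text):
            print(f"stale entry: {path}")
```

A stale entry is only the precondition for the bug; it does not guarantee that the error message appears on a given system.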

Comment 4 Alasdair Kergon 2015-09-05 00:01:52 UTC
This is seen on all architectures and can happen at other times, not just at boot.

To reproduce, add an entry for a device that does not exist on the system to the .cache file then run any normal LVM command.  If it doesn't fail (it often works OK), then try adding different entries in different places in the file, till you hit a case that is reproducible on the specific system you are running on.
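
The "try different places in the file" step can be sketched in Python as below. This is an illustration only: it assumes each cached device sits on its own line as a double-quoted /dev path, and `add_entry` is a hypothetical helper, not an LVM tool. It works on the file's text, so it can be tried on a copy before touching the real .cache file.

```python
def add_entry(cache_text, device_path, position=0):
    """Return cache_text with a quoted entry for device_path inserted.

    `position` counts existing quoted entries: 0 places the new entry
    before the first one, 1 before the second, and so on.
    """
    lines = cache_text.splitlines(keepends=True)
    # Assumption: entry lines start (after indentation) with "/dev/
    entry_idx = [i for i, line in enumerate(lines)
                 if line.strip().startswith('"/dev/')]
    if not entry_idx:
        raise ValueError("no device entries found in cache text")
    # Clamp so that an out-of-range position lands before the last entry.
    target = entry_idx[min(position, len(entry_idx) - 1)]
    # The new entry always precedes an existing one, so its trailing
    # comma keeps the list well-formed.
    lines.insert(target, f'\t\t"{device_path}",\n')
    return "".join(lines)
```

Calling it with increasing `position` values produces one candidate file per insertion point; each candidate can then be tested with any normal LVM command.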

Comment 5 Peter Rajnoha 2015-09-07 08:04:08 UTC
This is needed for both 6.6.EUS (to complete bug #1248030) and 6.7.z (to complete bug #1248032).

Comment 8 Peter Rajnoha 2015-09-09 13:25:54 UTC
One needs to be quite lucky to reproduce this in exactly the way that prints the message. If you don't hit the error after trying several places in the .cache file for the nonexistent device, you can use valgrind instead.

Valgrind displays the error all the time:

$ valgrind pvs
...
==2450== Conditional jump or move depends on uninitialised value(s)
==2450==    at 0x185AFA: _insert (dev-cache.c:617)
==2450==    by 0x186AC1: dev_cache_get (dev-cache.c:955)
==2450==    by 0x18FA6A: _read_array (filter-persistent.c:85)
==2450==    by 0x18FD16: persistent_filter_load (filter-persistent.c:122)
==2450==    by 0x17B550: _init_filters (toolcontext.c:1220)
==2450==    by 0x17CE66: create_toolcontext (toolcontext.c:1769)
==2450==    by 0x14CBD3: init_lvm (lvmcmdline.c:1770)
==2450==    by 0x14D556: lvm2_main (lvmcmdline.c:1938)
==2450==    by 0x16BB37: main (lvm.c:21)

This is exactly the place: the dev_cache_get/_insert function, where the uninitialised value is detected. Install lvm2-debuginfo as well so that the function names are printed.

Comment 11 Roman Bednář 2016-02-22 11:56:09 UTC
Verified. Reproduced as suggested in Comment #8

Before fix:
lvm2-2.02.118-3.el6_7.2.x86_64

# grep missing /etc/lvm/cache/.cache
		"/dev/vg/missing",

# valgrind pvs
...
==1739== Conditional jump or move depends on uninitialised value(s)
==1739==    at 0x171AE2: ??? (in /sbin/lvm)
==1739==    by 0x17296F: dev_cache_get (in /sbin/lvm)
==1739==    by 0x179BDB: persistent_filter_load (in /sbin/lvm)
==1739==    by 0x168C2B: ??? (in /sbin/lvm)
==1739==    by 0x16B456: create_toolcontext (in /sbin/lvm)
==1739==    by 0x13E5F0: init_lvm (in /sbin/lvm)
==1739==    by 0x143E5C: lvm2_main (in /sbin/lvm)
==1739==    by 0x56DFD5C: (below main) (in /lib64/libc-2.12.so)
...
==1739== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 10 from 6)

==========================================================================

After fix:
# grep missing /etc/lvm/cache/.cache
		"/dev/vg/missing",

# valgrind pvs
...
==24830== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 10 from 6)
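
The before/after comparison above can be automated by reading the ERROR SUMMARY line. A minimal Python sketch, assuming the summary format shown in the two runs above; `error_summary` is a hypothetical helper name.

```python
import re

# Matches e.g. "==1739== ERROR SUMMARY: 1 errors from 1 contexts (...)"
SUMMARY_RE = re.compile(r"ERROR SUMMARY: (\d+) errors from (\d+) contexts")


def error_summary(valgrind_output):
    """Return (errors, contexts) from the last ERROR SUMMARY line, or None."""
    matches = SUMMARY_RE.findall(valgrind_output)
    if not matches:
        return None
    errors, contexts = matches[-1]
    return int(errors), int(contexts)
```

valgrind writes its report to stderr by default, so the text to parse could come from `subprocess.run(["valgrind", "pvs"], capture_output=True, text=True).stderr`.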


Tested on:
2.6.32-615.el6.x86_64

lvm2-2.02.141-2.el6    BUILT: Wed Feb 10 14:49:03 CET 2016
lvm2-libs-2.02.141-2.el6    BUILT: Wed Feb 10 14:49:03 CET 2016
lvm2-cluster-2.02.141-2.el6    BUILT: Wed Feb 10 14:49:03 CET 2016
udev-147-2.71.el6    BUILT: Wed Feb 10 14:07:17 CET 2016
device-mapper-1.02.115-2.el6    BUILT: Wed Feb 10 14:49:03 CET 2016
device-mapper-libs-1.02.115-2.el6    BUILT: Wed Feb 10 14:49:03 CET 2016
device-mapper-event-1.02.115-2.el6    BUILT: Wed Feb 10 14:49:03 CET 2016
device-mapper-event-libs-1.02.115-2.el6    BUILT: Wed Feb 10 14:49:03 CET 2016
device-mapper-persistent-data-0.6.2-0.1.rc1.el6    BUILT: Wed Feb 10 16:52:15 CET 2016
cmirror-2.02.141-2.el6    BUILT: Wed Feb 10 14:49:03 CET 2016

Comment 13 errata-xmlrpc 2016-05-11 01:18:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0964.html