Bug 857554 - segfault in dmeventd, lvremove, vgreduce when removing part of an LV on a missing PV with issue_discards enabled
segfault in dmeventd, lvremove, vgreduce when removing part of an LV on a mis...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2 (Show other bugs)
6.3
x86_64 Linux
high Severity high
: rc
: ---
Assigned To: Alasdair Kergon
Cluster QE
: ZStream
Depends On:
Blocks: 867009
  Show dependency treegraph
 
Reported: 2012-09-14 15:56 EDT by Corey Marthaler
Modified: 2013-02-21 03:13 EST (History)
12 users (show)

See Also:
Fixed In Version: lvm2-2.02.98-1.el6
Doc Type: Bug Fix
Doc Text:
If issue_discards is enabled in the configuration and lvremove is run against a Logical Volume that partially uses a Physical Volume that is missing, lvremove would crash. This is now fixed. Also, because issuing discards is irreversible, the user is now prompted to confirm their intention. A new attribute 'p' in 'lvs' is set when the Logical Volume is partially missing.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 03:13:46 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)

  None (edit)
Description Corey Marthaler 2012-09-14 15:56:47 EDT
Description of problem:
Not a whole lot to go on here since the core dump save failed.

================================================================================
Iteration 0.7 started at Thu Sep 13 16:32:45 CDT 2012
================================================================================
Scenario kill_secondary_synced_4_legs: Kill secondary leg of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_secondary_4legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdb1 /dev/sdg1 /dev/sde1 /dev/sdf1
* log devices:        /dev/sdc1
* no MDA devices:     
* failpv(s):          /dev/sdg1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 3 -n syncd_secondary_4legs_1 -L 500M helter_skelter /dev/sdb1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-150

Mirror Structure(s):
 LV                                 Attr     LSize   Copy%  Devices
 syncd_secondary_4legs_1            mwi-a-m- 500.00m  11.20 syncd_secondary_4legs_1_mimage_0(0),syncd_secondary_4legs_1_mimage_1(0),syncd_secondary_4legs_1_mimage_2(0),syncd_secondary_4legs_1_mimage_3(0)
 [syncd_secondary_4legs_1_mimage_0] Iwi-aom- 500.00m        /dev/sdb1(0)
 [syncd_secondary_4legs_1_mimage_1] Iwi-aom- 500.00m        /dev/sdg1(0)
 [syncd_secondary_4legs_1_mimage_2] Iwi-aom- 500.00m        /dev/sde1(0)
 [syncd_secondary_4legs_1_mimage_3] Iwi-aom- 500.00m        /dev/sdf1(0)
 [syncd_secondary_4legs_1_mlog]     lwi-aom-   4.00m        /dev/sdc1(0)

PV=/dev/sdg1
     syncd_secondary_4legs_1_mimage_1: 4.1
PV=/dev/sdg1
     syncd_secondary_4legs_1_mimage_1: 4.1

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 70.90% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
     ---- taft-01 ----
     ---- taft-02 ----
     ---- taft-03 ----
     ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
     ---- taft-01 ----
     ---- taft-02 ----
     ---- taft-03 ----
     ---- taft-04 ----

Disabling device sdg on taft-01
Disabling device sdg on taft-02
Disabling device sdg on taft-03
Disabling device sdg on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.291228 s, 144 MB/s
Verifying current sanity of lvm after the failure

Mirror Structure(s):
  Couldn't find device with uuid wqzFLr-zMsn-d0qv-A1KU-14p2-cqnb-SxhZcw.
  LV                                 Attr     LSize   Copy%  Devices
  syncd_secondary_4legs_1            mwi-aom- 500.00m 100.00 syncd_secondary_4legs_1_mimage_0(0),syncd_secondary_4legs_1_mimage_2(0),syncd_secondary_4legs_1_mimage_3(0)
  [syncd_secondary_4legs_1_mimage_0] iwi-aom- 500.00m        /dev/sdb1(0)
  syncd_secondary_4legs_1_mimage_1   -wi----- 500.00m        unknown device(0)
  [syncd_secondary_4legs_1_mimage_2] iwi-aom- 500.00m        /dev/sde1(0)
  [syncd_secondary_4legs_1_mimage_3] iwi-aom- 500.00m        /dev/sdf1(0)
  [syncd_secondary_4legs_1_mlog]     lwi-aom-   4.00m        /dev/sdc1(0)

Verifying FAILED device /dev/sdg1 is *NOT* in the volume(s)
  Couldn't find device with uuid wqzFLr-zMsn-d0qv-A1KU-14p2-cqnb-SxhZcw.
olog: 1
Verifying LOG device(s) /dev/sdc1 *ARE* in the mirror(s)
  Couldn't find device with uuid wqzFLr-zMsn-d0qv-A1KU-14p2-cqnb-SxhZcw.
Verifying LEG device /dev/sdb1 *IS* in the volume(s)
  Couldn't find device with uuid wqzFLr-zMsn-d0qv-A1KU-14p2-cqnb-SxhZcw.
Verifying LEG device /dev/sde1 *IS* in the volume(s)
  Couldn't find device with uuid wqzFLr-zMsn-d0qv-A1KU-14p2-cqnb-SxhZcw.
Verifying LEG device /dev/sdf1 *IS* in the volume(s)
  Couldn't find device with uuid wqzFLr-zMsn-d0qv-A1KU-14p2-cqnb-SxhZcw.
verify the dm devices associated with /dev/sdg1 have been removed as expected
Checking REMOVAL of syncd_secondary_4legs_1_mimage_1 on:  taft-01 taft-02 taft-03 taft-04
verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of syncd_secondary_4legs_1_mimage_4 on:  taft-01
syncd_secondary_4legs_1_mimage_4 on taft-01 should now exist
** ARE THERE ENOUGH SPARE PVS TO ALLOCATE? **


Sep 13 16:34:22 taft-01 kernel: sd 3:0:0:6: rejecting I/O to offline device
Sep 13 16:34:22 taft-01 kernel: sd 3:0:0:6: rejecting I/O to offline device
Sep 13 16:34:22 taft-01 kernel: sd 3:0:0:6: rejecting I/O to offline device
Sep 13 16:34:22 taft-01 kernel: dmeventd[6123]: segfault at 10 ip 00007f3827c3b9d0 sp 00007f38283712c0 error 4 in liblvm2cmd.so.2.02[7f3827c1d000+da000]
Sep 13 16:34:23 taft-01 abrtd: Directory 'ccpp-2012-09-13-16:34:22-4226' creation detected
Sep 13 16:34:23 taft-01 abrt[6303]: Saved core dump of pid 4226 (/sbin/dmeventd) to /var/spool/abrt/ccpp-2012-09-13-16:34:22-4226 (13709312 bytes)
Sep 13 16:34:23 taft-01 abrtd: Package 'device-mapper-event' isn't signed with proper key
Sep 13 16:34:23 taft-01 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2012-09-13-16:34:22-4226' exited with 1
Sep 13 16:34:23 taft-01 abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2012-09-13-16:34:22-4226, deleting


Version-Release number of selected component (if applicable):
2.6.32-279.el6.x86_64
lvm2-2.02.95-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
lvm2-libs-2.02.95-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
lvm2-cluster-2.02.95-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.74-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-libs-1.02.74-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-event-1.02.74-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
device-mapper-event-libs-1.02.74-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
cmirror-2.02.95-10.el6_3.1    BUILT: Tue Aug 28 09:31:04 CDT 2012
Comment 1 Corey Marthaler 2012-09-17 14:25:41 EDT
This appears to be 100% reproducible if you run enough iterations of helter_skelter.

The most it's taken for me to hit this bug was 11.

./helter_skelter -R ../../var/share/resource_files/taft.xml -l /usr/tests/sts-rhel6.3 -r /usr/tests/sts-rhel6.3 

================================================================================
Iteration 0.11 started at Mon Sep 17 12:06:45 CDT 2012
================================================================================
Scenario kill_primary_and_log_synced_2_legs: Kill primary leg and disk log of synced 2 leg mirror(s)

[...]

Sep 17 12:08:17 taft-01 kernel: dmeventd[7243]: segfault at 10 ip 00007f533d8c79d0 sp 00007f533d7112c0 error 4 in liblvm2cmd.so.2.02[7f533d8a9000+da000]
Comment 2 Corey Marthaler 2012-09-17 16:21:26 EDT
This exists in the latest rhel6.4 rpms as well.

lvm2-2.02.97-3.el6    BUILT: Tue Sep 11 05:06:56 CDT 2012
lvm2-libs-2.02.97-3.el6    BUILT: Tue Sep 11 05:06:56 CDT 2012
lvm2-cluster-2.02.97-3.el6    BUILT: Tue Sep 11 05:06:56 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.76-3.el6    BUILT: Tue Sep 11 05:06:56 CDT 2012
device-mapper-libs-1.02.76-3.el6    BUILT: Tue Sep 11 05:06:56 CDT 2012
device-mapper-event-1.02.76-3.el6    BUILT: Tue Sep 11 05:06:56 CDT 2012
device-mapper-event-libs-1.02.76-3.el6    BUILT: Tue Sep 11 05:06:56 CDT 2012
cmirror-2.02.97-3.el6    BUILT: Tue Sep 11 05:06:56 CDT 2012
Comment 3 Corey Marthaler 2012-09-17 18:17:46 EDT
Core was generated by `/sbin/dmeventd'.
Program terminated with signal 11, Segmentation fault.
#0  _dev_topology_attribute (attribute=0x7f5488136019 "queue/discard_max_bytes", sysfs_dir=0x1544254 "/sys", dev=0x0) at device/device.c:377
377     if (dm_snprintf(path, PATH_MAX, sysfs_fmt_str, sysfs_dir,
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libsepol-2.0.41-4.el6.x86_64 libudev-147-2.41.el6.x86_64
(gdb) bt
#0  _dev_topology_attribute (attribute=0x7f5488136019 "queue/discard_max_bytes", sysfs_dir=0x1544254 "/sys", dev=0x0) at device/device.c:377
#1  0x00007f54880e520c in discard_pv_segment (peg=0x7f548006da10, discard_area_reduction=125) at metadata/pv_manip.c:209
#2  0x00007f54880cbf5b in _release_and_discard_lv_segment_area (seg=0x7f548006e850, s=<value optimized out>, area_reduction=<value optimized out>, with_discard=<value optimized out>) at metadata/lv_manip.c:309
#3  0x00007f54880cc849 in _lv_segment_reduce (lv=0x7f548006e110, extents=125, delete=1) at metadata/lv_manip.c:519
#4  _lv_reduce (lv=0x7f548006e110, extents=125, delete=1) at metadata/lv_manip.c:560
#5  0x00007f54880cca5c in lv_remove (lv=<value optimized out>) at metadata/lv_manip.c:632
#6  0x00007f54880e0100 in _delete_lv (mirror_lv=0x7f548006de98, lv=0x7f548006e110) at metadata/mirror.c:445
#7  0x00007f54880e304c in _remove_mirror_images (lv=<value optimized out>, num_removed=1, is_removable=0x7f548810f230 <_is_partial_lv>, removable_baton=0x0, remove_log=1, collapse=0, removed=0x7f54887e95fc, 
    preferred_only=1) at metadata/mirror.c:1077
#8  0x00007f54880e37eb in remove_mirror_images (lv=0x7f548006de98, num_mirrors=<value optimized out>, is_removable=0x7f548810f230 <_is_partial_lv>, removable_baton=0x0, remove_log=1) at metadata/mirror.c:1131
#9  0x00007f5488110517 in mirror_remove_missing (cmd=0x15410a0, lv=0x7f548006de98, force=<value optimized out>) at lvconvert.c:1362
#10 0x00007f54881116f6 in _lvconvert_mirrors_repair (cmd=0x15410a0, lv=0x7f548006de98, lp=0x7f54887e9840) at lvconvert.c:1414
#11 _lvconvert_mirrors (cmd=0x15410a0, lv=0x7f548006de98, lp=0x7f54887e9840) at lvconvert.c:1519
#12 0x00007f5488112636 in _lvconvert_single (cmd=0x15410a0, lv=0x7f548006de98, handle=0x7f54887e9840) at lvconvert.c:2033
#13 0x00007f5488112fb4 in lvconvert_single (cmd=0x15410a0, lp=0x7f54887e9840) at lvconvert.c:2135
#14 0x00007f5488113ac6 in lvconvert (cmd=0x15410a0, argc=1, argv=0x7f54887e99c8) at lvconvert.c:2216
#15 0x00007f548811984c in lvm_run_command (cmd=0x15410a0, argc=1, argv=0x7f54887e99c8) at lvmcmdline.c:1116
#16 0x00007f5488131ead in lvm2_run (handle=0x15410a0, cmdline=<value optimized out>) at lvmcmdlib.c:89
#17 0x00007f548859815a in _remove_failed_devices (dmt=<value optimized out>, event=<value optimized out>, unused=<value optimized out>) at dmeventd_mirror.c:145
#18 process_event (dmt=<value optimized out>, event=<value optimized out>, unused=<value optimized out>) at dmeventd_mirror.c:189
#19 0x000000000040408d in _do_process_event (arg=0x153dff0) at dmeventd.c:710
#20 _monitor_thread (arg=0x153dff0) at dmeventd.c:833
#21 0x0000003997c07851 in ?? ()
#22 0x00007f54887ea700 in ?? ()
#23 0x0000000000000000 in ?? ()
Comment 4 Alasdair Kergon 2012-09-17 19:14:14 EDT
pv->dev is NULL if a PV is missing.
Should discard_pv_segment just return 1 silently if (!peg->pv->dev) ?
Comment 5 Corey Marthaler 2012-09-18 16:40:05 EDT
Here's the easiest way I've found to reproduce this so far:

# With discards turned on in lvm.conf

[root@taft-04 ~]# pvscan
  PV /dev/sdb1   VG test  [135.66 GiB / 135.66 GiB free]
  PV /dev/sdc1   VG test  [135.66 GiB / 135.66 GiB free]
  PV /dev/sdd1   VG test  [135.66 GiB / 135.66 GiB free]

[root@taft-04 ~]# lvcreate -m 1 -n mirror -L 100M test
  Logical volume "mirror" created

[root@taft-04 ~]# lvs -a -o +devices
  LV                VG   Attr     LSize   Log         Copy%  Devices
  mirror            test mwi-a-m- 100.00m mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] test iwi-aom- 100.00m                    /dev/sdb1(0)
  [mirror_mimage_1] test iwi-aom- 100.00m                    /dev/sdc1(0)
  [mirror_mlog]     test lwi-aom-   4.00m                    /dev/sdd1(0)

[root@taft-04 ~]# echo offline > /sys/block/sdc/device/state

[root@taft-04 ~]# lvs -a -o +devices
  /dev/sdc1: read failed after 0 of 512 at 145669554176: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 145669664768: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid rh2PwK-WkZ1-pEWB-glsX-SudD-BlpK-B2xh11.
  LV                VG   Attr     LSize   Log         Copy%  Devices
  mirror            test mwi-a-m- 100.00m mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] test iwi-aom- 100.00m                    /dev/sdb1(0)
  [mirror_mimage_1] test iwi-aom- 100.00m                    unknown device(0)
  [mirror_mlog]     test lwi-aom-   4.00m                    /dev/sdd1(0)

[root@taft-04 ~]#  vgreduce --removemissing --force test
  /dev/sdc1: read failed after 0 of 512 at 145669554176: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 145669664768: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid rh2PwK-WkZ1-pEWB-glsX-SudD-BlpK-B2xh11.
Segmentation fault (core dumped)
Comment 6 Corey Marthaler 2012-09-18 18:05:29 EDT
Core was generated by `vgreduce --removemissing --force test'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004a13a0 in ?? ()
Missing separate debuginfos, use: debuginfo-install lvm2-2.02.97-3.el6.x86_64
(gdb) bt
#0  0x00000000004a13a0 in ?? ()
#1  0x000000000047bebc in discard_pv_segment ()
#2  0x0000000000462c0b in ?? ()
#3  0x00000000004634f9 in ?? ()
#4  0x000000000046370c in lv_remove ()
#5  0x0000000000476db0 in ?? ()
#6  0x0000000000479cfc in ?? ()
#7  0x000000000047a49b in remove_mirror_images ()
#8  0x0000000000417e27 in mirror_remove_missing ()
#9  0x0000000000437589 in vgreduce ()
#10 0x000000000042115c in lvm_run_command ()
#11 0x0000000000423b92 in lvm2_main ()
#12 0x0000003a4101ecdd in __libc_start_main () from /lib64/libc.so.6
#13 0x0000000000413df9 in _start ()
Comment 7 Alasdair Kergon 2012-09-19 07:04:30 EDT
I've reproduced this with a simple lvremove too under the same circumstances.  

lvremove should not be removing LVs where the underlying PVs are missing as it does not yet offer all the protections that vgreduce offers.

A regression introduced by 8c5bcdabab5b40498f2051715639cf07f14d877b I think.
Comment 8 Alasdair Kergon 2012-09-19 07:36:01 EDT
Also partial LVs are not highlighted in lvs.
Comment 10 Petr Rockai 2012-09-19 19:25:25 EDT
Well, the fix is not very good, since lvremove is actually required to work with missing PVs. If it's not safe, that's a bug and should be fixed, but making it impossible to remove broken LVs manually is a bad idea, in my opinion. I'm quite sure this is a bug in the discard code (which seems to be much newer). Do you still see problems in lvremove after fixing the discard code?
Comment 11 Zdenek Kabelac 2012-09-20 03:54:51 EDT
I guess there might be multiple problems together - where it's not quite clear what is the proper fix.

i.e.:

I've 3 PVs -  and multiple LVs spread over them -  
LV1 on PV1 and PV2
LV2 on PV2 and PV3
LV3 on PV3

Now - if PV1 and PV2 is missing - and I run  'lvremove LV1' - 
should we remove only LV1 - or also LV2 ?
And the situation gets probably more complex with mirrors in this game.
Comment 14 Petr Rockai 2012-10-08 11:33:10 EDT
Zdenek, you need the latter, but also the changes to discard by Alasdair. First of the patches is behaviour-changing and shouldn't go into .z I think.
Comment 17 Nenad Peric 2012-11-02 06:17:14 EDT
Did manual tests and ran a full helter_skelter group of tests.

Marking this verified with:

lvm2-2.02.98-2.el6.x86_64
device-mapper-1.02.77-2.el6.x86_64
kernel-2.6.32-336.el6.x86_64
Comment 18 errata-xmlrpc 2013-02-21 03:13:46 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0501.html

Note You need to log in before you can comment on or make changes to this bug.