Bug 1038870

Summary: second device failure to a mirror syncing from an upconvert from a previous failure causes lvm to deadlock
Product: Red Hat Enterprise Linux 7 Reporter: Corey Marthaler <cmarthal>
Component: lvm2Assignee: Jonathan Earl Brassow <jbrassow>
lvm2 sub component: Default / Unclassified QA Contact: Cluster QE <mspqa-list>
Status: CLOSED NOTABUG Docs Contact:
Severity: high    
Priority: unspecified CC: agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, thornber, zkabelac
Version: 7.0Keywords: TestBlocker
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-02-20 22:29:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Corey Marthaler 2013-12-06 01:34:19 UTC
Description of problem:
Create a mirror, fail a leg, watch the down convert, bring the failed leg back online, upconvert it back to the original state, fail another leg while it's in process of converting, deadlock.

There is also another bug here I'm tracking down as I've found times when although the convert finishes to 100%, the tmp image never gets dropped. If this test waits for the syncing to finish (assuming it doesn't hit the other bug I just mentioned) this deadlock will not happen, if it doesn't wait, the deadlock always happens.


# ./revolution_9 

Creating mirror(s) on host-064.virt.lab.msp.redhat.com...
host-064.virt.lab.msp.redhat.com: lvcreate --type mirror --mirrorlog disk -m 2 -n mirror_1 -L 2G revolution_9  /dev/sdc1:0-2557 /dev/sdd1:0-2557 /dev/sde1:0-2557 /dev/sdf1:0-150
Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 7.30% )
   [...]
   1/1 mirror(s) are fully synced: ( 100.00% )
Creating ext on top of mirror(s) on host-064.virt.lab.msp.redhat.com...
mke2fs 1.42.8 (20-Jun-2013)
Mounting mirrored ext filesystems on host-064.virt.lab.msp.redhat.com...

================================================================================
Iteration 0.1 started at Thu Dec  5 17:16:16 CST 2013
================================================================================
ACTUAL LEG ORDER: /dev/sdc1 /dev/sdd1 /dev/sde1
Scenario kill_non_primary_leg: Kill non primary leg

********* Mirror info for this scenario *********
* mirrors:            mirror_1
* leg devices:        /dev/sdc1 /dev/sdd1 /dev/sde1
* log devices:        /dev/sdf1
* failpv(s):          /dev/sde1
* failnode(s):        host-064.virt.lab.msp.redhat.com
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   allocate
*************************************************

Current mirror/raid device structure(s):
  LV                  Attr       LSize   Cpy%Sync Devices
   mirror_1            mwi-aom---   2.00g   100.00 mirror_1_mimage_0(0),mirror_1_mimage_1(0),mirror_1_mimage_2(0)
   [mirror_1_mimage_0] iwi-aom---   2.00g          /dev/sdc1(0)
   [mirror_1_mimage_1] iwi-aom---   2.00g          /dev/sdd1(0)
   [mirror_1_mimage_2] iwi-aom---   2.00g          /dev/sde1(0)
   [mirror_1_mlog]     lwi-aom---   4.00m          /dev/sdf1(0)

AVAIL:2 - NEEDED:1
will_alloc_work=1

PV=/dev/sde1
        mirror_1_mimage_2: 6
PV=/dev/sde1
        mirror_1_mimage_2: 6

Writing verification files (checkit) to mirror(s) on...
        ---- host-064.virt.lab.msp.redhat.com ----

Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-064.virt.lab.msp.redhat.com ----

Disabling device sde on host-064.virt.lab.msp.redhat.com

Getting recovery check start time from /var/log/messages: Dec  5 17:16
Attempting I/O to cause mirror down conversion(s) on host-064.virt.lab.msp.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.113929 s, 368 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid M1Rpc0-G8Sr-fpoo-kSe9-hszw-KrBS-mPNydP.
  LV                  Attr       LSize   Cpy%Sync Devices
   mirror_1            mwi-aom---   2.00g   100.00 mirror_1_mimage_0(0),mirror_1_mimage_1(0)
   [mirror_1_mimage_0] iwi-aom---   2.00g          /dev/sdc1(0)
   [mirror_1_mimage_1] iwi-aom---   2.00g          /dev/sdd1(0)
   [mirror_1_mlog]     lwi-aom---   4.00m          /dev/sdf1(0)

Verify that each of the mirror repairs finished successfully

Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
olog: 2.2
Verifying LOG device(s) /dev/sdf1 *ARE* in the mirror(s)
Verifying LEG device /dev/sdc1 *IS* in the volume(s)
Verifying LEG device /dev/sdd1 *IS* in the volume(s)
verify the dm devices associated with /dev/sde1 have been removed as expected
Checking REMOVAL of mirror_1_mimage_2 on:  host-064.virt.lab.msp.redhat.com

Verify that the mirror image order remains the same after the down conversion
EXPECTED LEG ORDER: /dev/sdc1 /dev/sdd1
ACTUAL LEG ORDER: /dev/sdc1 /dev/sdd1
Verifying files (checkit) on mirror(s) on...
        ---- host-064.virt.lab.msp.redhat.com ----

Enabling device sde on host-064.virt.lab.msp.redhat.com


HACK FOR BUG 799048: vgextend --restoremissing revolution_9 /dev/sde1
  WARNING: Inconsistent metadata found for VG revolution_9 - updating to use version 7
HACK FOR BUG 799048: vgextend revolution_9 /dev/sde1
  Physical volume '/dev/sde1' is already in volume group 'revolution_9'
  Unable to add physical volume '/dev/sde1' to volume group 'revolution_9'.
Getting PE for leg: /dev/sdc1
Getting PE for leg: /dev/sdd1
Getting PE for leg: /dev/sde1
Up converting linear(s) back to mirror(s) on host-064.virt.lab.msp.redhat.com...
host-064.virt.lab.msp.redhat.com: lvconvert --mirrorlog disk -m 2 -b revolution_9/mirror_1 /dev/sdc1:0-2557 /dev/sdd1:0-2557 /dev/sde1:0-2557 /dev/sdf1:0-150
Verifying the up conversions by checking if all original leg devices are back in the mirror(s)
Verifying device /dev/sdc1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdd1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 23.75% )
   0/1 mirror(s) are fully synced: ( 34.11% )
   0/1 mirror(s) are fully synced: ( 53.27% )
   0/1 mirror(s) are fully synced: ( 65.94% )
   0/1 mirror(s) are fully synced: ( 95.14% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Verifying files (checkit) on mirror(s) on...
        ---- host-064.virt.lab.msp.redhat.com ----

Stopping the io load (collie/xdoio) on mirror(s)


================================================================================
Iteration 0.2 started at Thu Dec  5 17:18:57 CST 2013
================================================================================
ACTUAL LEG ORDER: /dev/sdc1 /dev/sdd1 /dev/sde1
Scenario kill_log: Kill log

********* Mirror info for this scenario *********
* mirrors:            mirror_1
* leg devices:        /dev/sdc1 /dev/sdd1 /dev/sde1
* log devices:        /dev/sdf1
* failpv(s):          /dev/sdf1
* failnode(s):        host-064.virt.lab.msp.redhat.com
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   allocate
*************************************************




#### NOTE THE TMP IMAGE STILL EXISTS FROM PREVIOUS FAILURE ITERATION: ####

Current mirror/raid device structure(s):
  LV                     Attr       LSize   Cpy%Sync Devices
  mirror_1               cwi-aom---   2.00g   100.00 mirror_1_mimagetmp_2(0),mirror_1_mimage_2(0)
  [mirror_1_mimage_0]    iwi-aom---   2.00g          /dev/sdc1(0)
  [mirror_1_mimage_1]    iwi-aom---   2.00g          /dev/sdd1(0)
  [mirror_1_mimage_2]    iwi-aom---   2.00g          /dev/sde1(0)
  [mirror_1_mimagetmp_2] mwi-aom---   2.00g   100.00 mirror_1_mimage_0(0),mirror_1_mimage_1(0)   
  [mirror_1_mlog]        lwi-aom---   4.00m          /dev/sdf1(0)

AVAIL:2 - NEEDED:1
will_alloc_work=1

PV=/dev/sdf1
        mirror_1_mlog: 1.2
PV=/dev/sdf1
        mirror_1_mlog: 1.2

Writing verification files (checkit) to mirror(s) on...
        ---- host-064.virt.lab.msp.redhat.com ----

Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-064.virt.lab.msp.redhat.com ----

Disabling device sdf on host-064.virt.lab.msp.redhat.com

Getting recovery check start time from /var/log/messages: Dec  5 17:19
Attempting I/O to cause mirror down conversion(s) on host-064.virt.lab.msp.redhat.com
[DEADLOCK]


Version-Release number of selected component (if applicable):
3.10.0-53.el7.x86_64
lvm2-2.02.103-6.el7    BUILT: Wed Nov 27 02:28:25 CST 2013
lvm2-libs-2.02.103-6.el7    BUILT: Wed Nov 27 02:28:25 CST 2013
lvm2-cluster-2.02.103-6.el7    BUILT: Wed Nov 27 02:28:25 CST 2013
device-mapper-1.02.82-6.el7    BUILT: Wed Nov 27 02:28:25 CST 2013
device-mapper-libs-1.02.82-6.el7    BUILT: Wed Nov 27 02:28:25 CST 2013
device-mapper-event-1.02.82-6.el7    BUILT: Wed Nov 27 02:28:25 CST 2013
device-mapper-event-libs-1.02.82-6.el7    BUILT: Wed Nov 27 02:28:25 CST 2013
device-mapper-persistent-data-0.2.8-2.el7    BUILT: Wed Oct 30 10:20:48 CDT 2013
cmirror-2.02.103-6.el7    BUILT: Wed Nov 27 02:28:25 CST 2013


How reproducible:
Often

Comment 1 Jonathan Earl Brassow 2014-02-20 22:29:51 UTC
same problem as bug 1029218

Comment 2 Jonathan Earl Brassow 2014-02-20 22:59:50 UTC
(In reply to Jonathan Earl Brassow from comment #1)
> same problem as bug 1029218

D'oh!

Same as bug 1039218