Bug 794904

Summary: Redundant log leg failure fails to repair and causes kernel hang
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2 Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.3 CC: agk, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.95-4.el6 Doc Type: Bug Fix
Doc Text:
This bug is a regression from the previous release. No release notes are necessary.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-20 15:01:31 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Corey Marthaler 2012-02-17 21:33:13 UTC
Description of problem:
Scenario kill_secondary_log_2_legs_2_logs: Kill secondary log of synced 2 leg redundant log mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_secondary_log_2legs_2logs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdb1 /dev/sdg1
* log devices:        /dev/sdd1 /dev/sdh1
* no MDA devices:     
* failpv(s):          /dev/sdh1
* failnode(s):        taft-01
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate --mirrorlog mirrored -m 1 -n syncd_secondary_log_2legs_2logs_1 -L 500M helter_skelter /dev/sdb1:0-1000 /dev/sdg1:0-1000 /dev/sdd1:0-150 /dev/sdh1:0-150

Mirror Structure(s):
  LV                                                Attr     LSize   Copy%  Devices
  syncd_secondary_log_2legs_2logs_1                 mwi-a-m- 500.00m   4.00 syncd_secondary_log_2legs_2logs_1_mimage_0(0),syncd_secondary_log_2legs_2logs_1_mimage_1(0)
  [syncd_secondary_log_2legs_2logs_1_mimage_0]      Iwi-aom- 500.00m        /dev/sdb1(0)
  [syncd_secondary_log_2legs_2logs_1_mimage_1]      Iwi-aom- 500.00m        /dev/sdg1(0)
  [syncd_secondary_log_2legs_2logs_1_mlog]          mwi-aom-   4.00m 100.00 syncd_secondary_log_2legs_2logs_1_mlog_mimage_0(0),syncd_secondary_log_2legs_2logs_1_mlog_mimage_1(0)
  [syncd_secondary_log_2legs_2logs_1_mlog_mimage_0] iwi-aom-   4.00m        /dev/sdd1(0)
  [syncd_secondary_log_2legs_2logs_1_mlog_mimage_1] iwi-aom-   4.00m        /dev/sdh1(0)

PV=/dev/sdh1
        syncd_secondary_log_2legs_2logs_1_mlog_mimage_1: 1.3
PV=/dev/sdh1
        syncd_secondary_log_2legs_2logs_1_mlog_mimage_1: 1.3

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on taft-01...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on taft-01...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----

<start name="taft-01_syncd_secondary_log_2legs_2logs_1" pid="23835" time="Fri Feb 17 14:38:35 2012" type="cmd" />
Sleeping 10 seconds to get some outsanding EXT I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----

Disabling device sdh on taft-01
[DEADLOCK]


taft-01 qarshd[15258]: Running cmdline: echo offline > /sys/block/sdh/device/state &
taft-01 kernel: sd 3:0:0:7: rejecting I/O to offline device
taft-01 lvm[2997]: Secondary mirror device 253:4 has failed (D).
taft-01 lvm[2997]: Device failure in helter_skelter-syncd_secondary_log_2legs_2logs_1_mlog.
taft-01 lvm[2997]: Names including "_mlog" are reserved. Please choose a different LV name.
taft-01 lvm[2997]: Run `lvconvert --help' for more information.
taft-01 lvm[2997]: Repair of mirrored device helter_skelter-syncd_secondary_log_2legs_2logs_1_mlog failed.
taft-01 lvm[2997]: Failed to remove faulty devices in helter_skelter-syncd_secondary_log_2legs_2logs_1_mlog.
taft-01 qarshd[15261]: Running cmdline: pvs -a
taft-01 kernel: INFO: task kmirrord:15165 blocked for more than 120 seconds.
taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
taft-01 kernel: kmirrord      D 0000000000000002     0 15165      2 0x00000080
taft-01 kernel: ffff8801fac51bc0 0000000000000046 0000000000000000 ffff880216a5b280
taft-01 kernel: ffff8801fac51be0 ffffffffa0009e03 00000000fac51b40 ffffffffa0009a30
taft-01 kernel: ffff880215d665f8 ffff8801fac51fd8 000000000000f4e8 ffff880215d665f8
taft-01 kernel: Call Trace:
taft-01 kernel: [<ffffffffa0009e03>] ? dispatch_io+0x233/0x260 [dm_mod]
taft-01 kernel: [<ffffffffa0009a30>] ? vm_get_page+0x0/0x70 [dm_mod]
taft-01 kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
taft-01 kernel: [<ffffffff814ed1e3>] io_schedule+0x73/0xc0
taft-01 kernel: [<ffffffffa0009ec5>] sync_io+0x95/0x110 [dm_mod]
taft-01 kernel: [<ffffffffa0002770>] ? dm_unplug_all+0x50/0x70 [dm_mod]
taft-01 kernel: [<ffffffff811136c5>] ? mempool_kmalloc+0x15/0x20
taft-01 kernel: [<ffffffff81113273>] ? mempool_alloc+0x63/0x140
taft-01 kernel: [<ffffffffa000a167>] dm_io+0x1b7/0x1c0 [dm_mod]
taft-01 kernel: [<ffffffffa0009a30>] ? vm_get_page+0x0/0x70 [dm_mod]
taft-01 kernel: [<ffffffffa00099a0>] ? vm_next_page+0x0/0x30 [dm_mod]
taft-01 kernel: [<ffffffffa00207e1>] disk_flush+0x91/0x170 [dm_log]
taft-01 kernel: [<ffffffffa0029722>] ? dm_rh_inc+0x42/0xd0 [dm_region_hash]
taft-01 kernel: [<ffffffffa00290d3>] dm_rh_flush+0x13/0x20 [dm_region_hash]
taft-01 kernel: [<ffffffffa0033b4f>] do_mirror+0x27f/0x6e0 [dm_mirror]
taft-01 kernel: [<ffffffffa00338d0>] ? do_mirror+0x0/0x6e0 [dm_mirror]
taft-01 kernel: [<ffffffff8108b2b0>] worker_thread+0x170/0x2a0
taft-01 kernel: [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
taft-01 kernel: [<ffffffff8108b140>] ? worker_thread+0x0/0x2a0
taft-01 kernel: [<ffffffff81090886>] kthread+0x96/0xa0
taft-01 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
taft-01 kernel: [<ffffffff810907f0>] ? kthread+0x0/0xa0
taft-01 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
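
The hung-task report above follows a fixed line format, so it can be picked out of a saved log mechanically. A minimal sketch, assuming syslog lines shaped like the ones quoted in this report; the function name `hung_tasks` is hypothetical and not part of any tool:

```shell
#!/bin/sh
# hung_tasks (hypothetical helper): read syslog-style lines on stdin and
# print "<task> <pid> <seconds>" for every kernel hung-task warning found.
# Lines that do not match the warning format are silently skipped.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9]*\) blocked for more than \([0-9]*\) seconds\..*/\1 \2 \3/p'
}

# Example input: the warning line quoted in this report.
printf '%s\n' \
    'taft-01 kernel: INFO: task kmirrord:15165 blocked for more than 120 seconds.' \
    | hung_tasks
```

In practice the function could be fed a saved log instead, for example `hung_tasks < /var/log/messages`; the log path is an assumption about the local syslog setup.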



Version-Release number of selected component (if applicable):
2.6.32-220.el6.x86_64

lvm2-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
lvm2-libs-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
lvm2-cluster-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-libs-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-event-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
device-mapper-event-libs-1.02.71-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012
cmirror-2.02.92-0.40.el6    BUILT: Thu Feb 16 18:12:38 CST 2012

Comment 1 Corey Marthaler 2012-02-17 22:01:36 UTC
This is reproducible and also occurs when it's the primary redundant log being
failed.

Comment 2 Jonathan Earl Brassow 2012-03-12 19:02:11 UTC
You get redundant logs through the "raid1" segment type, and there is no need to do the extra layering required with the 'mirror' segment type. I will conditionally NACK this bug because there are better solutions available to do what is desired.
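
As an illustration of the alternative mentioned in this comment, a two-way "raid1" LV of the same size as the test volume might be created as sketched below. The volume group and size are taken from the lvcreate command in the description; the LV name `raid1_example`, the `run` helper, and the `DO_RUN` variable are hypothetical. The script only prints the command unless `DO_RUN=1` is set, since actually running it needs root and an existing volume group.

```shell
#!/bin/sh
# Sketch only: build an lvcreate command for a 2-way raid1 LV.
VG=helter_skelter      # volume group used by the test in this report
LV=raid1_example       # hypothetical LV name, chosen for illustration
SIZE=500M              # same size as the mirror in the description

# run (hypothetical helper): execute the command only when DO_RUN=1,
# otherwise just print it (dry run).
run() {
    if [ "${DO_RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}

run lvcreate --type raid1 -m 1 -n "$LV" -L "$SIZE" "$VG"
```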

Comment 5 Jonathan Earl Brassow 2012-04-10 23:46:17 UTC
Regression was introduced in 2.02.89 from changes to the dmeventd code.
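
Read together with the Fixed In Version field (lvm2-2.02.95-4.el6), this suggests that lvm2 builds from 2.02.89 up to, but not including, 2.02.95-4 are affected. A minimal sketch of such a check, assuming GNU `sort -V` is available and treating that range as an inference from this report rather than an official statement; the function names are hypothetical, and versions are passed without the `.el6` suffix:

```shell
#!/bin/sh
FIRST_BAD=2.02.89    # regression introduced here, per this comment
FIXED=2.02.95-4      # from the "Fixed In Version" field, dist tag removed

# version_lt A B: succeed if version A sorts strictly before version B.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# lvm2_affected V: succeed if FIRST_BAD <= V < FIXED (assumed range).
lvm2_affected() {
    ! version_lt "$1" "$FIRST_BAD" && version_lt "$1" "$FIXED"
}

# Example inputs: one older build, the build from the description,
# and the build in which the fix was verified.
for v in 2.02.88 2.02.92-0.40 2.02.95-4; do
    if lvm2_affected "$v"; then
        echo "$v: affected"
    else
        echo "$v: not affected"
    fi
done
```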

Comment 8 Jonathan Earl Brassow 2012-04-11 01:18:09 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
This bug is a regression from the previous release. No release notes are necessary.

Comment 11 Corey Marthaler 2012-04-11 15:56:29 UTC
Fix verified in the latest rpms.

2.6.32-251.el6.x86_64
lvm2-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
lvm2-libs-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
lvm2-cluster-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-libs-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-event-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
device-mapper-event-libs-1.02.74-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012
cmirror-2.02.95-4.el6    BUILT: Wed Apr 11 09:03:19 CDT 2012


The following test case now passes:
./helter_skelter -o taft-01 -l /home/msp/cmarthal/work/sts/sts-root -r /usr/tests/sts-rhel6.3 -e kill_secondary_log_2_legs_2_logs -i

Comment 13 errata-xmlrpc 2012-06-20 15:01:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html