Bug 431755

Summary: RHEL5 cmirror tracker: server can't handle log device failure
Product: Red Hat Enterprise Linux 5 Reporter: Corey Marthaler <cmarthal>
Component: cmirror Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED CURRENTRELEASE QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: high    
Version: 5.2 CC: agk, ccaulfie, dwysocha, heinzm, iannis, mbroz
Target Milestone: rc Keywords: TestBlocker
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-04-27 15:03:03 UTC Type: ---
Bug Blocks: 430797    

Description Corey Marthaler 2008-02-06 19:41:39 UTC
Description of problem:
Scenario: Kill the disk log of synced 2-leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_log_2legs
* sync:      1
* mirrors:   1
* disklog:   1
* failpv:    /dev/sde1
* legs:      2
* pvs:       /dev/sdd1 /dev/sdh1 /dev/sde1
************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 1 -n syncd_log_2legs_1 -L 800M helter_skelter /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sde1:0-150

Waiting until all mirrors become fully syncd...
        0/1 mirror(s) are fully synced: ( 1=0.00% )
        0/1 mirror(s) are fully synced: ( 1=32.50% )
        0/1 mirror(s) are fully synced: ( 1=64.50% )
        0/1 mirror(s) are fully synced: ( 1=96.50% )
        1/1 mirror(s) are fully synced: ( 1=100.00% )

Creating gfs on top of mirror(s) on taft-01...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        11588
Verify XIOR Stream: /tmp/checkit_syncd_log_2legs_1
Working dir:        /mnt/syncd_log_2legs_1/checkit

        ---- taft-02 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        10811
Verify XIOR Stream: /tmp/checkit_syncd_log_2legs_1
Working dir:        /mnt/syncd_log_2legs_1/checkit

        ---- taft-03 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        11268
Verify XIOR Stream: /tmp/checkit_syncd_log_2legs_1
Working dir:        /mnt/syncd_log_2legs_1/checkit

        ---- taft-04 ----
checkit starting with:
CREATE
Num files:          100
Random Seed:        11254
Verify XIOR Stream: /tmp/checkit_syncd_log_2legs_1
Working dir:        /mnt/syncd_log_2legs_1/checkit


<start name="taft-01_1" pid="4798" time="Wed Feb  6 11:48:24 2008" type="cmd" />
<start name="taft-02_1" pid="4800" time="Wed Feb  6 11:48:24 2008" type="cmd" />
<start name="taft-03_1" pid="4802" time="Wed Feb  6 11:48:24 2008" type="cmd" />
<start name="taft-04_1" pid="4804" time="Wed Feb  6 11:48:24 2008" type="cmd" />

Disabling device sde on taft-01
Disabling device sde on taft-02
Disabling device sde on taft-03
Disabling device sde on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
[DEADLOCK]


As a result, the log server (clogd) on taft-01 fails every write to the disk log and keeps logging the same error:
Feb  6 11:48:08 taft-01 qarshd[11594]: Running cmdline: echo offline > /sys/block/sde/device/state
Feb  6 11:48:08 taft-01 xinetd[6233]: EXIT: qarsh status=0 pid=11594 duration=0(sec)
Feb  6 11:48:08 taft-01 kernel: sd 1:0:0:4: rejecting I/O to offline device
Feb  6 11:48:08 taft-01 clogd[6761]: rw_log:  write failure: Input/output error
Feb  6 11:48:08 taft-01 clogd[6761]: Error writing to disk log
Feb  6 11:48:08 taft-01 clogd[6761]: rw_log:  write failure: Input/output error
Feb  6 11:48:08 taft-01 kernel: sd 1:0:0:4: rejecting I/O to offline device
Feb  6 11:48:08 taft-01 clogd[6761]: Error writing to disk log
Feb  6 11:48:08 taft-01 kernel: sd 1:0:0:4: rejecting I/O to offline device
Feb  6 11:48:08 taft-01 clogd[6761]: rw_log:  write failure: Input/output error
Feb  6 11:48:08 taft-01 clogd[6761]: Error writing to disk log
[...]
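
For anyone reproducing the failure by hand, the fault injection in the qarshd line above amounts to writing "offline" to the SCSI device's sysfs state file. A minimal Python sketch, assuming root privileges on the node; the function name set_device_state and the sysfs_root parameter are hypothetical, introduced only for illustration and so the helper can be tried against a scratch directory first:

```python
import os

# SCSI device states used here: "offline" is the fault injected in this
# report; "running" is assumed as the way to bring the device back afterwards.
VALID_STATES = ("offline", "running")


def set_device_state(device, state, sysfs_root="/sys/block"):
    """Write `state` to <sysfs_root>/<device>/device/state and return the path."""
    if state not in VALID_STATES:
        raise ValueError("unsupported state: %r" % (state,))
    path = os.path.join(sysfs_root, device, "device", "state")
    with open(path, "w") as handle:
        handle.write(state + "\n")
    return path
```

Calling set_device_state("sde", "offline") corresponds to one "Disabling device sde" step in the test output; the test ran that step on all four nodes.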


And the clients can no longer communicate with the log server; every request comes back with a server error:
Feb  6 11:48:15 taft-03 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_MARK_REGION]: -5
Feb  6 11:48:15 taft-03 kernel: device-mapper: dm-log-clustered: Server error while processing request [DM_CLOG_GET_RESYNC_WORK]: -5
Feb  6 11:48:46 taft-03 last message repeated 2865 times
Feb  6 11:49:47 taft-03 last message repeated 6254 times
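
To get a rough sense of how fast the clients are spinning, the excerpt above can be tallied per request type. The sketch below is only an approximation: it assumes each syslog "last message repeated N times" line refers to the most recent server-error line before it, and the helper name tally_server_errors is hypothetical. The -5 in these messages is -EIO, which matches the "Input/output error" reported by clogd on the server.

```python
import re
from collections import Counter

# Matches either a dm-log-clustered server error (tolerating a line wrap
# between "Server error" and "while") or a syslog repeat marker.
_EVENT = re.compile(
    r"Server error\s+while processing request \[(?P<req>\w+)\]: (?P<err>-?\d+)"
    r"|last message repeated (?P<n>\d+) times"
)


def tally_server_errors(log_text):
    """Count server errors per request type, folding in syslog repeat markers."""
    counts = Counter()
    last_request = None
    for match in _EVENT.finditer(log_text):
        if match.group("req"):
            last_request = match.group("req")
            counts[last_request] += 1
        elif last_request is not None:
            # Attribute the repeats to the most recent error line (assumption).
            counts[last_request] += int(match.group("n"))
    return counts


# The client-side excerpt from this report, as originally wrapped.
SAMPLE = """\
Feb  6 11:48:15 taft-03 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_MARK_REGION]: -5
Feb  6 11:48:15 taft-03 kernel: device-mapper: dm-log-clustered: Server error
while processing request [DM_CLOG_GET_RESYNC_WORK]: -5
Feb  6 11:48:46 taft-03 last message repeated 2865 times
Feb  6 11:49:47 taft-03 last message repeated 6254 times
"""

if __name__ == "__main__":
    for request, count in sorted(tally_server_errors(SAMPLE).items()):
        print(request, count)
```

Under that assumption, the excerpt accounts for 9120 failed DM_CLOG_GET_RESYNC_WORK requests on taft-03 in roughly a minute and a half, which points to the client retrying immediately after each error rather than backing off.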


Version-Release number of selected component (if applicable):
cmirror-1.1.11-1.el5
kmod-cmirror-0.1.5-2.el5
lvm2-2.02.32-1.el5
lvm2-cluster-2.02.32-1.el5

Comment 1 RHEL Program Management 2008-02-06 19:47:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Corey Marthaler 2008-02-06 21:22:57 UTC
This is reproducible every time. Marking as a TestBlocker.

Comment 3 Corey Marthaler 2008-02-08 16:59:29 UTC
This bug is verified fixed in cmirror-1.1.13-1.el5/kmod-cmirror-0.1.6-1.el5.

Comment 4 RHEL Program Management 2008-03-11 19:36:37 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 6 Alasdair Kergon 2010-04-27 15:03:03 UTC
Assuming this VERIFIED fix got released.  Closing.
Reopen if it's not yet resolved.