Bug 381081 - cmirror write path appears deadlocked after recovery is successful
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cmirror-kernel
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jonathan Earl Brassow
QA Contact: Corey Marthaler
Keywords: ZStream
Depends On: 290821
Blocks: 399661 461297
Reported: 2007-11-13 16:41 EST by RHEL Product and Program Management
Modified: 2011-01-14 13:30 EST

Doc Type: Bug Fix
Last Closed: 2010-10-28 11:05:57 EDT

Attachments
dmsetup from taft-01 (4.06 KB, text/plain)
2009-05-06 17:57 EDT, Corey Marthaler
dmsetup from taft-02 (4.06 KB, text/plain)
2009-05-06 17:58 EDT, Corey Marthaler
dmsetup from taft-03 (4.06 KB, text/plain)
2009-05-06 18:00 EDT, Corey Marthaler
dmsetup from taft-04 (4.06 KB, text/plain)
2009-05-06 18:01 EDT, Corey Marthaler

Description RHEL Product and Program Management 2007-11-13 16:41:21 EST
This bug has been copied from bug #290821 and has been proposed
to be backported to 4.6 z-stream (EUS).
Comment 6 Jason Baron 2007-11-28 17:25:22 EST
committed in stream U7 build 68.1. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 7 Corey Marthaler 2008-04-04 12:39:21 EDT
This bug still exists. I hit it after only my second failure attempt of the
primary legs of 4 mirrors. Again, everything appears to have worked just fine
except for the deadlocked sync.

Scenario: Kill primary leg of synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_primary_3legs
* sync:      1
* mirrors:   4
* disklog:   1
* failpv:    /dev/sdh1
* legs:      3
* pvs:       /dev/sdh1 /dev/sdg1 /dev/sde1 /dev/sdf1

Comment 8 Corey Marthaler 2008-04-07 10:17:28 EDT
Hit this again over the weekend.
Comment 9 Nate Straz 2008-04-09 09:56:28 EDT
I hit this on x86 using the same set of packages Corey was using in comment #7,
the same test case, and sync was the stuck process.
Comment 11 Corey Marthaler 2008-05-23 10:03:01 EDT
Hit this again during 4.7 regression testing.
Comment 12 Corey Marthaler 2008-06-11 14:59:05 EDT
This is still reproducible; I'm not sure why it is in ON_QA. I saw it hang for
well over an hour (though it did eventually complete).

Build Date: Wed 11 Jun 2008 07:03:46 AM CDT

Build Date: Wed 11 Jun 2008 08:56:04 AM CDT

Build Date: Mon 09 Jun 2008 09:28:41 AM CDT

Build Date: Tue 30 Jan 2007 05:28:02 PM CST

Build Date: Tue 03 Jun 2008 01:54:29 PM CDT
Comment 13 Corey Marthaler 2008-06-13 09:43:47 EDT
FWIW, I saw the sync cmd take 8 hours to finish last night. Also, I wonder if
this issue can lead to 450939, though I'm not sure how that would be? 

After this test case passed (though only after taking 8 hours), the cmirrors were
cleaned up and rebuilt to start the next test case, which ended up hitting 450939.
Comment 14 Jonathan Earl Brassow 2008-06-18 16:57:38 EDT
Mikulas's patch for bug 432566 is an additional improvement on what was done for
bug 290821.

Original code causing problems looked like:
static void do_work(void *data)
	while (do_mirror(data)) {

The fix for bug 290821 took out the schedule_timeout call and replaced it with a
schedule call.

Further performance refinements by Mikulas changed this section of code again
(bug 432566).

If we are patching the 4.6.z kernel, we should do it with Mikulas' patch.
Comment 15 Jonathan Earl Brassow 2008-06-18 17:19:16 EDT
Sorry, I got mixed up.

This bug is a 4.7 bug.  This means that Mikulas' patch for 432566 should satisfy
this bug.

I would close this bug as a duplicate of 432566, but I'd like to see the test
rerun.  432566 addresses general performance.  This bug addresses the performance
of one process after a failure.  So, although Mikulas' patch includes the fix
made for 290821 (the predecessor of this bug), I think it makes sense to test
this again.
Comment 17 Nate Straz 2008-07-08 15:27:35 EDT
Corey says I hit this during cluster regression runs on RHEL4-U7-re20080702.0.


lvm2-2.02.37-3.el4    BUILT: Thu Jun 12 10:09:14 CDT 2008
lvm2-cluster-2.02.37-3.el4    BUILT: Thu Jun 12 10:26:38 CDT 2008
device-mapper-1.02.25-2.el4    BUILT: Mon Jun  9 09:26:33 CDT 2008
cmirror-1.0.1-1    BUILT: Tue Jan 30 17:22:43 CST 2007
cmirror-kernel-2.6.9-43.2    BUILT: Wed Jul  2 15:10:37 CDT 2008
Comment 18 Corey Marthaler 2008-07-08 15:41:41 EDT
Nate's cluster had the telltale sign of the "stuck" sync process. Usually that
process eventually finishes within an hour or two, but Nate didn't have time
to wait for it with the other 4.7 testing that needs to be finished.
Comment 20 RHEL Product and Program Management 2008-07-09 15:21:48 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update release.
Comment 24 RHEL Product and Program Management 2008-09-03 09:05:21 EDT
Updating PM score.
Comment 26 Nate Straz 2009-04-14 12:22:09 EDT
I'm still hitting this during 4.8 testing.


lvm2-2.02.42-5.el4    BUILT: Tue Mar 24 16:46:52 CDT 2009
lvm2-cluster-2.02.42-5.el4    BUILT: Tue Mar 24 16:53:04 CDT 2009
device-mapper-1.02.28-2.el4    BUILT: Fri Feb 20 05:08:22 CST 2009
cmirror-1.0.2-1.el4    BUILT: Thu Feb 26 15:29:22 CST 2009
cmirror-kernel-2.6.9-43.10.el4    BUILT: Mon Apr 13 11:07:09 CDT 2009

[root@tank-01 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    6   M   tank-01
   2    1    6   M   tank-03
   3    1    6   M   tank-04
   4    1    6   M   morph-01
   6    1    6   M   morph-03
   7    1    6   M   morph-04

/dev/sdj was disabled on all nodes, then I/O was issued to /mnt/nonsyncd_log_3legs_1 to force the down convert on tank-01.

I noticed this message on morph-01
dm-cmirror: Clear requests remain at postsuspend!
dm-cmirror:  - Ignoring clear request: 644
Comment 27 Corey Marthaler 2009-05-06 12:56:41 EDT
Just a note that I hit this again last night. The process had been stuck for some 16 hours, but as soon as I attempted to debug it, it became un-wedged. I'm guessing it was the additional write attempts to the gfs that caused the original sync process to finally continue again.
Comment 28 Corey Marthaler 2009-05-06 17:56:13 EDT
Attaching dmsetup info from each of the 4 nodes. None of the devices are in the SUSPEND state.
Comment 29 Corey Marthaler 2009-05-06 17:57:50 EDT
Created attachment 342737 [details]
dmsetup from taft-01
Comment 30 Corey Marthaler 2009-05-06 17:58:16 EDT
Created attachment 342738 [details]
dmsetup from taft-02
Comment 31 Corey Marthaler 2009-05-06 18:00:46 EDT
Created attachment 342739 [details]
dmsetup from taft-03
Comment 32 Corey Marthaler 2009-05-06 18:01:29 EDT
Created attachment 342740 [details]
dmsetup from taft-04
Comment 34 Chris Feist 2010-10-27 18:33:07 EDT
Removing cluster-4.9 since this won't make 4.9.
Comment 36 RHEL Product and Program Management 2010-10-28 11:05:57 EDT
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.
Comment 37 Corey Marthaler 2011-01-14 13:30:38 EST
FWIW, this issue still exists and continues to cause basic cmirror regression tests to fail.


lvm2-2.02.42-9.el4    BUILT: Thu Oct 21 15:49:57 CDT 2010
lvm2-cluster-2.02.42-9.el4    BUILT: Thu Oct 21 15:46:55 CDT 2010
device-mapper-1.02.28-3.el4    BUILT: Thu Mar  4 14:48:16 CST 2010
cmirror-1.0.2-1.el4    BUILT: Thu Feb 26 15:29:27 CST 2009
cmirror-kernel-2.6.9-43.14.el4    BUILT: Wed Dec 22 16:24:19 CST 2010
