381081 – cmirror write path appears deadlocked after recovery is successful

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	cmirror-kernel
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Assignee:	Jonathan Earl Brassow
QA Contact:	Corey Marthaler
Docs Contact:
URL:
Whiteboard:
Depends On:	290821
Blocks:	399661 461297

Reported:	2007-11-13 21:41 UTC by RHEL Program Management
Modified:	2011-01-14 18:30 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-10-28 15:05:57 UTC
Embargoed:

Attachments
dmsetup from taft-01 (4.06 KB, text/plain) 2009-05-06 21:57 UTC, Corey Marthaler
dmsetup from taft-02 (4.06 KB, text/plain) 2009-05-06 21:58 UTC, Corey Marthaler
dmsetup from taft-03 (4.06 KB, text/plain) 2009-05-06 22:00 UTC, Corey Marthaler
dmsetup from taft-04 (4.06 KB, text/plain) 2009-05-06 22:01 UTC, Corey Marthaler

Description RHEL Program Management 2007-11-13 21:41:21 UTC

This bug has been copied from bug #290821 and has been proposed
to be backported to 4.6 z-stream (EUS).

Comment 6 Jason Baron 2007-11-28 22:25:22 UTC

committed in stream U7 build 68.1. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/

Comment 7 Corey Marthaler 2008-04-04 16:39:21 UTC

This bug still exists. I hit it after only my second failure attempt of the
primary legs of 4 mirrors. Again, everything appears to have worked just fine
except for the deadlocked sync.

Scenario: Kill primary leg of synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_primary_3legs
* sync:      1
* mirrors:   4
* disklog:   1
* failpv:    /dev/sdh1
* legs:      3
* pvs:       /dev/sdh1 /dev/sdg1 /dev/sde1 /dev/sdf1
************************************************

2.6.9-68.26.ELsmp
lvm2-2.02.27-2.el4_6.2
lvm2-cluster-2.02.27-6.el4
cmirror-1.0.1-1
cmirror-kernel-2.6.9-39.4

Comment 8 Corey Marthaler 2008-04-07 14:17:28 UTC

Hit this again over the weekend.

Comment 9 Nate Straz 2008-04-09 13:56:28 UTC

I hit this on x86 using the same set of packages Corey was using in comment #7
and the same test case; sync was again the stuck process.

Comment 11 Corey Marthaler 2008-05-23 14:03:01 UTC

Hit this again during 4.7 regression testing.

Comment 12 Corey Marthaler 2008-06-11 18:59:05 UTC

This is still reproducible; I'm not sure why it is in ON_QA. I saw it hang for
well over an hour (though it did eventually complete).

lvm2-2.02.37-2.el4
Build Date: Wed 11 Jun 2008 07:03:46 AM CDT

lvm2-cluster-2.02.37-2.el4
Build Date: Wed 11 Jun 2008 08:56:04 AM CDT

device-mapper-1.02.25-2.el4
Build Date: Mon 09 Jun 2008 09:28:41 AM CDT

cmirror-1.0.1-1
Build Date: Tue 30 Jan 2007 05:28:02 PM CST

cmirror-kernel-2.6.9-41.4
Build Date: Tue 03 Jun 2008 01:54:29 PM CDT

Comment 13 Corey Marthaler 2008-06-13 13:43:47 UTC

FWIW, I saw the sync cmd take 8 hours to finish last night. Also, I wonder if
this issue can lead to 450939, though I'm not sure how that would happen.

After this test case passed (though only after taking 8 hours), the cmirrors were
cleaned up and rebuilt to start the next test case, which ended up hitting 450939.

Comment 14 Jonathan Earl Brassow 2008-06-18 20:57:38 UTC

Mikulas's patch for bug 432566 is an additional improvement on what was done for
bug 290821.

Original code causing problems looked like:
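static void do_work(void *data)
{
	/* Keep calling do_mirror() for as long as it returns nonzero. */
	while (do_mirror(data)) {
		/* Sleep for up to HZ/5 jiffies (1/5 second) between passes. */
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(HZ/5);
	}
}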

Bug 290821 removed the schedule_timeout() call and replaced it with schedule().

Further performance refinements by Mikulas changed this section of code again
(bug 432566).

If we are patching the 4.6.z kernel, we should do it with Mikulas' patch.
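
For a rough sense of what the HZ/5 sleep in the original loop costs, here is a small userspace sketch (not kernel code). The helper names jiffies_to_ms and sleep_bound_ms are made up for illustration, and the sketch assumes nothing wakes the thread before each timeout expires. It only does the arithmetic; it does not claim this sleep is the cause of the hang reported here.

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a jiffies count to milliseconds for a
 * given tick rate (hz ticks per second).
 */
unsigned long jiffies_to_ms(unsigned long jiffies, unsigned long hz)
{
	assert(hz > 0);
	return jiffies * 1000UL / hz;
}

/*
 * Hypothetical helper: total time spent sleeping if the loop body runs
 * `passes` times and every schedule_timeout(HZ/5) runs to its full
 * timeout. Assumption: no early wakeup occurs.
 */
unsigned long sleep_bound_ms(unsigned long passes, unsigned long hz)
{
	return passes * jiffies_to_ms(hz / 5, hz);
}
```

HZ/5 jiffies comes out to 200 ms whether the kernel is built with HZ=100, 250, or 1000, so sleep_bound_ms(5, 1000) returns 1000: five consecutive passes through the loop could spend a full second asleep.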

Comment 15 Jonathan Earl Brassow 2008-06-18 21:19:16 UTC

Sorry, I got mixed up.

This bug is a 4.7 bug.  This means that Mikulas's patch for 432566 should satisfy
this bug.

I would close this bug as a duplicate of 432566, but I'd like to see the test
rerun.  432566 addresses general performance.  This bug addresses the performance
of one process after a failure.  So, although Mikulas's patch includes the fix
made for 290821 (the predecessor of this bug), I think it makes sense to test
this again.

Comment 17 Nate Straz 2008-07-08 19:27:35 UTC

Corey says I hit this during cluster regression runs on RHEL4-U7-re20080702.0.

2.6.9-76.ELhugemem

lvm2-2.02.37-3.el4    BUILT: Thu Jun 12 10:09:14 CDT 2008
lvm2-cluster-2.02.37-3.el4    BUILT: Thu Jun 12 10:26:38 CDT 2008
device-mapper-1.02.25-2.el4    BUILT: Mon Jun  9 09:26:33 CDT 2008
cmirror-1.0.1-1    BUILT: Tue Jan 30 17:22:43 CST 2007
cmirror-kernel-2.6.9-43.2    BUILT: Wed Jul  2 15:10:37 CDT 2008

Comment 18 Corey Marthaler 2008-07-08 19:41:41 UTC

Nate's cluster had the telltale sign of the "stuck" sync process. Usually that
process eventually finishes within an hour or two; however, Nate didn't have time
to wait for it with the other 4.7 testing that needs to be finished.

Comment 20 RHEL Program Management 2008-07-09 19:21:48 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 24 RHEL Program Management 2008-09-03 13:05:21 UTC

Updating PM score.

Comment 26 Nate Straz 2009-04-14 16:22:09 UTC

I'm still hitting this during 4.8 testing.

2.6.9-87.ELhugemem

lvm2-2.02.42-5.el4    BUILT: Tue Mar 24 16:46:52 CDT 2009
lvm2-cluster-2.02.42-5.el4    BUILT: Tue Mar 24 16:53:04 CDT 2009
device-mapper-1.02.28-2.el4    BUILT: Fri Feb 20 05:08:22 CST 2009
cmirror-1.0.2-1.el4    BUILT: Thu Feb 26 15:29:22 CST 2009
cmirror-kernel-2.6.9-43.10.el4    BUILT: Mon Apr 13 11:07:09 CDT 2009

[root@tank-01 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    6   M   tank-01
   2    1    6   M   tank-03
   3    1    6   M   tank-04
   4    1    6   M   morph-01
   6    1    6   M   morph-03
   7    1    6   M   morph-04

/dev/sdj was disabled on all nodes, then I/O was issued to /mnt/nonsyncd_log_3legs_1 to force the down convert on tank-01.

I noticed this message on morph-01
dm-cmirror: Clear requests remain at postsuspend!
dm-cmirror:  - Ignoring clear request: 644

Comment 27 Corey Marthaler 2009-05-06 16:56:41 UTC

Just a note that I hit this again last night. The process had been stuck for some 16 hours, but as soon as I attempted to debug it, it became un-wedged. I'm guessing it was the additional write attempts to the gfs that caused the original sync process to finally continue again.

Comment 28 Corey Marthaler 2009-05-06 21:56:13 UTC

Attaching dmsetup info from each of the 4 nodes. None of the devices are in the SUSPEND state.

Comment 29 Corey Marthaler 2009-05-06 21:57:50 UTC

Created attachment 342737
dmsetup from taft-01

Comment 30 Corey Marthaler 2009-05-06 21:58:16 UTC

Created attachment 342738
dmsetup from taft-02

Comment 31 Corey Marthaler 2009-05-06 22:00:46 UTC

Created attachment 342739
dmsetup from taft-03

Comment 32 Corey Marthaler 2009-05-06 22:01:29 UTC

Created attachment 342740
dmsetup from taft-04

Comment 34 Chris Feist 2010-10-27 22:33:07 UTC

Removing cluster-4.9 since this won't make 4.9.

Comment 36 RHEL Program Management 2010-10-28 15:05:57 UTC

Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

Comment 37 Corey Marthaler 2011-01-14 18:30:38 UTC

FWIW, this issue still exists and continues to cause basic cmirror regression tests to fail.

2.6.9-94.ELsmp

lvm2-2.02.42-9.el4    BUILT: Thu Oct 21 15:49:57 CDT 2010
lvm2-cluster-2.02.42-9.el4    BUILT: Thu Oct 21 15:46:55 CDT 2010
device-mapper-1.02.28-3.el4    BUILT: Thu Mar  4 14:48:16 CST 2010
cmirror-1.0.2-1.el4    BUILT: Thu Feb 26 15:29:27 CST 2009
cmirror-kernel-2.6.9-43.14.el4    BUILT: Wed Dec 22 16:24:19 CST 2010
