Bug 381081
Summary: | cmirror write path appears deadlocked after recovery is successful
---|---
Product: | [Retired] Red Hat Cluster Suite
Component: | cmirror-kernel
Version: | 4
Hardware: | All
OS: | Linux
Status: | CLOSED WONTFIX
Severity: | high
Priority: | urgent
Target Milestone: | rc
Keywords: | ZStream
Doc Type: | Bug Fix
Reporter: | RHEL Program Management <pm-rhel>
Assignee: | Jonathan Earl Brassow <jbrassow>
QA Contact: | Corey Marthaler <cmarthal>
CC: | cfeist, iannis, jbaron, lhh, lwang, mpatocka, nstraz, riek, syeghiay
Last Closed: | 2010-10-28 15:05:57 UTC
Bug Depends On: | 290821
Bug Blocks: | 399661, 461297

Description
RHEL Program Management
2007-11-13 21:41:21 UTC
Committed in stream U7 build 68.1. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

This bug still exists. I hit it after only my second failure attempt of the primary legs of 4 mirrors. Again, everything appears to have worked just fine except for the deadlocked sync.

Senario: Kill primary leg of synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name: syncd_primary_3legs
* sync: 1
* mirrors: 4
* disklog: 1
* failpv: /dev/sdh1
* legs: 3
* pvs: /dev/sdh1 /dev/sdg1 /dev/sde1 /dev/sdf1
************************************************

2.6.9-68.26.ELsmp
lvm2-2.02.27-2.el4_6.2
lvm2-cluster-2.02.27-6.el4
cmirror-1.0.1-1
cmirror-kernel-2.6.9-39.4

Hit this again over the weekend.

I hit this on x86 using the same set of packages Corey was using in comment #7, the same test case, and sync was the stuck process.

Hit this again during 4.7 regression testing. This is still reproducible; not sure why this is in ON_QA. I saw it hang for well over an hour (though it did eventually complete).

lvm2-2.02.37-2.el4 Build Date: Wed 11 Jun 2008 07:03:46 AM CDT
lvm2-cluster-2.02.37-2.el4 Build Date: Wed 11 Jun 2008 08:56:04 AM CDT
device-mapper-1.02.25-2.el4 Build Date: Mon 09 Jun 2008 09:28:41 AM CDT
cmirror-1.0.1-1 Build Date: Tue 30 Jan 2007 05:28:02 PM CST
cmirror-kernel-2.6.9-41.4 Build Date: Tue 03 Jun 2008 01:54:29 PM CDT

FWIW, I saw the sync cmd take 8 hours to finish last night. Also, I wonder if this issue can lead to 450939, though I'm not sure how that would be? After this testcase passed (though after taking 8 hours), the cmirrors were cleaned up and rebuilt to start the next test case, which ended up hitting 450939.

Mikulas's patch for bug 432566 is an additional improvement on what was done for bug 290821.
Original code causing problems looked like:

    static void do_work(void *data)
    {
            while (do_mirror(data)) {
                    set_current_state(TASK_INTERRUPTIBLE);
                    schedule_timeout(HZ/5);
            }
    }

Bug 290821 took out the schedule_timeout and replaced it with a schedule. Further performance refinements by Mikulas changed this section of code again (bug 432566). If we are patching the 4.6.z kernel, we should do it with Mikulas' patch.

Sorry, I got mixed up. This bug is a 4.7 bug. This means that Mikulas' patch for 432566 should satisfy this bug.

I would close this bug as a duplicate of 432566, but I'd like to see the test rerun. 432566 addresses general performance. This bug addresses the performance of one process after a failure. So, although Mikulas' patch includes the fix made for 290821 (the predecessor of this bug), I think it makes sense to test this again.

Corey says I hit this during cluster regression runs on RHEL4-U7-re20080702.0.

2.6.9-76.ELhugemem
lvm2-2.02.37-3.el4 BUILT: Thu Jun 12 10:09:14 CDT 2008
lvm2-cluster-2.02.37-3.el4 BUILT: Thu Jun 12 10:26:38 CDT 2008
device-mapper-1.02.25-2.el4 BUILT: Mon Jun 9 09:26:33 CDT 2008
cmirror-1.0.1-1 BUILT: Tue Jan 30 17:22:43 CST 2007
cmirror-kernel-2.6.9-43.2 BUILT: Wed Jul 2 15:10:37 CDT 2008

Nate's cluster had the telltale sign of the "stuck" sync process. Usually that process eventually finishes within an hour or two; however, Nate didn't have time to wait for it with the other 4.7 testing that needs to be finished.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Updating PM score.

I'm still hitting this during 4.8 testing.
2.6.9-87.ELhugemem
lvm2-2.02.42-5.el4 BUILT: Tue Mar 24 16:46:52 CDT 2009
lvm2-cluster-2.02.42-5.el4 BUILT: Tue Mar 24 16:53:04 CDT 2009
device-mapper-1.02.28-2.el4 BUILT: Fri Feb 20 05:08:22 CST 2009
cmirror-1.0.2-1.el4 BUILT: Thu Feb 26 15:29:22 CST 2009
cmirror-kernel-2.6.9-43.10.el4 BUILT: Mon Apr 13 11:07:09 CDT 2009

[root@tank-01 ~]# cman_tool nodes
Node  Votes  Exp  Sts  Name
   1      1    6    M  tank-01
   2      1    6    M  tank-03
   3      1    6    M  tank-04
   4      1    6    M  morph-01
   6      1    6    M  morph-03
   7      1    6    M  morph-04

/dev/sdj was disabled on all nodes, then I/O was issued to /mnt/nonsyncd_log_3legs_1 to force the down convert on tank-01.

I noticed this message on morph-01:

dm-cmirror: Clear requests remain at postsuspend!
dm-cmirror: - Ignoring clear request: 644

Just a note that I hit this again last night. The process had been stuck for some 16 hours, but as soon as I attempted to debug it, it became un-wedged. I'm guessing it was the additional write attempts to the gfs that caused the original sync process to finally continue again.

Attaching dmsetup info from each of the 4 nodes. None of the devices are in the SUSPEND state.

Created attachment 342737: dmsetup from taft-01
Created attachment 342738: dmsetup from taft-02
Created attachment 342739: dmsetup from taft-03
Created attachment 342740: dmsetup from taft-04
Removing cluster-4.9 since this won't make 4.9.

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

FWIW, this issue still exists and continues to cause basic cmirror regression tests to fail.

2.6.9-94.ELsmp
lvm2-2.02.42-9.el4 BUILT: Thu Oct 21 15:49:57 CDT 2010
lvm2-cluster-2.02.42-9.el4 BUILT: Thu Oct 21 15:46:55 CDT 2010
device-mapper-1.02.28-3.el4 BUILT: Thu Mar 4 14:48:16 CST 2010
cmirror-1.0.2-1.el4 BUILT: Thu Feb 26 15:29:27 CST 2009
cmirror-kernel-2.6.9-43.14.el4 BUILT: Wed Dec 22 16:24:19 CST 2010
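
Side note on the do_work() loop quoted earlier in this thread: schedule_timeout(HZ/5) requests a sleep of HZ/5 jiffies, which is at most one fifth of a second, after every pass in which do_mirror() returns nonzero. The userspace sketch below only works through that arithmetic. The helper names and the HZ value of 1000 are assumptions made for illustration and are not taken from the cmirror source; the thread does not say what a nonzero return from do_mirror() means or how many passes occur in practice, so this is not offered as the cause of the stuck sync.

```c
#include <assert.h>

/* Assumed tick rate for illustration only; in the kernel HZ is a
 * build-time constant and its real value is not given in this thread. */
#define HZ 1000

/* Jiffies requested per extra pass by schedule_timeout(HZ/5). */
long sleep_jiffies_per_pass(void)
{
    return HZ / 5;
}

/* Upper bound, in milliseconds, on the time spent sleeping if
 * do_mirror() returns nonzero for `passes` consecutive iterations.
 * Hypothetical helper, not part of any kernel API. */
long max_sleep_ms(long passes)
{
    assert(passes >= 0);
    return passes * sleep_jiffies_per_pass() * 1000 / HZ;
}
```

With these assumptions, max_sleep_ms(5) is 1000, meaning five consecutive passes could sleep for up to one second in total. Because the task sleeps in TASK_INTERRUPTIBLE, a wakeup can end any individual sleep early, so the figure is an upper bound rather than a measured delay.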