Bug 153321
Summary: node locks can deadlock cluster
Product: [Retired] Red Hat Cluster Suite
Component: gulm
Version: 3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Adam "mantis" Manthei <amanthei>
Assignee: michael conrad tadpol tilstra <mtilstra>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, tao
Doc Type: Bug Fix
Last Closed: 2005-05-25 16:41:13 UTC
Bug Blocks: 154397, 160494
Description
Adam "mantis" Manthei
2005-04-04 19:24:38 UTC
Created attachment 112680 [details]
logs of test run
I believe that this is the lock that is blocking all further activity:
#=================
key : 'R0ZTIE4EZ2ZzMQAadHJpbi0wOC5sYWIubXNwLnJlZGhhdC5jb20A'
ExK : GFS , N, 4, gfs1, 26, trin-08.lab.msp.redhat.com
state : gio_lck_st_Unlock
LVBlen : 0
LVB :
HolderCount : 0
Holders :
LVBHolderCount : 0
LVBHolders :
ExpiredCount : 1
ExpiredHolders : [ trin-08.lab.msp.redhat.com ]
reply_waiter :
Waiters :
- key : 'R0ZTIE4EZ2ZzMQAadHJpbi0wOC5sYWIubXNwLnJlZGhhdC5jb20A'
ExK : GFS , N, 4, gfs1, 26, trin-08.lab.msp.redhat.com
name : trin-08.lab.msp.redhat.com
state : gio_lck_st_Exclusive
flags : Cachable
LVB :
Slave_rply : 0x0
Slave_sent : 0x0
idx : 4
High_Waiters :
Action_Waiters :
State_Waiters :
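To spell out why that dump is a stall: the only waiter on this node lock is trin-08 itself, and trin-08 also appears as the lock's one expired holder. Below is a minimal sketch of the grant rule this implies; the class and function names are hypothetical and are not the actual gulm lock-server code.

from dataclasses import dataclass, field

@dataclass
class Waiter:
    node: str
    requested_state: str  # e.g. "Exclusive"

@dataclass
class NodeLock:
    key: str
    state: str = "Unlock"
    holders: list = field(default_factory=list)          # current holders
    expired_holders: list = field(default_factory=list)  # nodes that died while holding
    waiters: list = field(default_factory=list)

def can_grant(lock: NodeLock, waiter: Waiter) -> bool:
    # Assumed rule: nothing is granted while an expired holder remains on
    # the lock, since the failed holder's state must be recovered first.
    return not lock.expired_holders

# The state shown in the dump: trin-08 expired while holding the node lock,
# and the only waiter is trin-08 itself, asking for Exclusive.
node_lock = NodeLock(
    key="R0ZTIE4EZ2ZzMQAadHJpbi0wOC5sYWIubXNwLnJlZGhhdC5jb20A",
    expired_holders=["trin-08.lab.msp.redhat.com"],
    waiters=[Waiter("trin-08.lab.msp.redhat.com", "Exclusive")],
)

for w in node_lock.waiters:
    print(w.node, "granted" if can_grant(node_lock, w) else "blocked")
    # -> trin-08.lab.msp.redhat.com blocked

Presumably the expired entry is only dropped once recovery for trin-08 completes, and that recovery is exactly what is waiting on the lock, so nothing ever unblocks; this matches the journal-replay deadlock discussed in the comments below.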
One thing that I did that was a little non-standard in reproducing this bug was that I set the allowed misses to 100 and used manual fencing, so that I could better control the timing of things. I'm still not sure why the customer was seeing this issue, especially given the higher node count. (I would think that more nodes would make this case less likely to pop up!)

I think that there might be another bug hiding in here too. I have seen on my larger setups (~15 nodes) cases where the clients aren't being informed that there are expired nodes when dealing with large numbers of clients and client failure testing. That issue belongs in another bug, but I have not been able to figure out what the frell is going on (I was also doing some rather unsupported and risky things at the time, so it may have merely been an artifact of the test I was running).

The nodelocks were added to work around a deadlock when a node remounted after failure and tried to replay its own journal (#1206). They also helped deal with the version of the jid mapping code that was present at the time. Knowing that I've fixed the jid mapper in later versions, and guessing that the previous workaround is no longer needed, I back-ported the jid mapping code from 6.1. This code has no nodelocks (or listlocks). Given your test above, this seems to have fixed things. I am running some other tests to see whether this is a workable solution.

*** Bug 154397 has been marked as a duplicate of this bug. ***

Fix committed into the RHEL3 branch. Nodelocks have been removed, and the steps above now work. Also ran a bunch of basic recovery iterations.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-466.html