Bug 193128

Summary: clustat and clusvcadm hang
Product: [Retired] Red Hat Cluster Suite Reporter: Lenny Maiorani <lenny>
Component: dlmAssignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 4CC: ccaulfie, cluster-maint, kupcevic, rkenna, sghosh, teigland
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2006-0551 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-08-10 21:14:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 180185    
Attachments:
Description Flags
dlm_debug from each node during the problem occurrence
none
dlm_locks from each node during the problem occurrence
none
Changes magma's service manager module to use NL,Convert->EX instead of EX(noqueue)
none
srpm (tiny) of patched magma-plugins
none
i386 test only RPM
none
x86_64 test only RPM
none
Another recreation possibly
none
Patch 1 of 3 - should fix behavior.
none
Patch 2 of 3: Adds NULL lock support to sm plugin
none
Patch 3 of 3: Make clu_lock_verbose use the new NULL / promote scheme
none
Patch 2 of 3: Adds NULL lock support to SM plugin, with a bugfix
none
Patch from 1.0.6 -> current RHEL4
none
Magma patch from 1.0.4 -> current RHEL4 none

Description Lenny Maiorani 2006-05-25 15:24:14 UTC
Description of problem:
All 3 nodes do not show any VIPs when running Clustat.

/var/log/messages from node 3 indicates:

May 23 23:55:01 igrid03 crond(pam_unix)[3311]: session closed for user root
May 24 00:00:01 igrid03 crond(pam_unix)[13985]: session opened for user root by
(uid=0)
May 24 00:00:01 igrid03 crond(pam_unix)[13988]: session opened for user root by
(uid=0)
May 24 00:00:01 igrid03 crond(pam_unix)[13986]: session opened for user root by
(uid=0)
May 24 00:00:01 igrid03 crond(pam_unix)[13985]: session closed for user root
May 24 00:00:02 igrid03 crond(pam_unix)[13986]: session closed for user root
May 24 00:00:02 igrid03 crond(pam_unix)[29586]: session closed for user root
May 24 00:00:37 igrid03 clurgmgrd[12481]: <notice> Stopping service
172.17.10.62
May 24 00:00:37 igrid03 clurgmgrd: [12481]: <info> Removing IPv4 address
172.17.10.62/24 from eth3
May 24 00:00:48 igrid03 clurgmgrd[12481]: <notice> Service 172.17.10.62 is
stopped
May 24 00:01:01 igrid03 crond(pam_unix)[22551]: session opened for user root by
(uid=0)
May 24 00:01:01 igrid03 crond(pam_unix)[22551]: session closed for user root
May 24 00:01:02 igrid03 crond(pam_unix)[13988]: session closed for user root
May 24 00:01:08 igrid03 clurgmgrd[12481]: <notice> Starting stopped service
172.17.10.50
May 24 00:01:08 igrid03 clurgmgrd: [12481]: <info> Adding IPv4 address
172.17.10.50/24 to eth3
May 24 00:01:09 igrid03 clurgmgrd[12481]: <notice> Service 172.17.10.50 started
May 24 00:02:07 igrid03 clurgmgrd[12481]: <notice> Starting stopped service
172.17.10.52
May 24 00:02:07 igrid03 clurgmgrd: [12481]: <info> Adding IPv4 address
172.17.10.52/24 to eth5
May 24 00:02:08 igrid03 clurgmgrd[12481]: <notice> Service 172.17.10.52 started
May 24 00:02:32 igrid03 crond(pam_unix)[32417]: session closed for user root
May 24 00:02:36 igrid03 clurgmgrd[12481]: <notice> Stopping service
172.17.10.50
May 24 00:02:36 igrid03 clurgmgrd: [12481]: <info> Removing IPv4 address
172.17.10.50/24 from eth3
May 24 00:02:46 igrid03 clurgmgrd[12481]: <notice> Service 172.17.10.50 is
stopped
May 24 00:03:59 igrid03 kernel: dlm: Magma: reply from 1 no lock
May 24 00:03:59 igrid03 kernel: dlm: reply
May 24 00:03:59 igrid03 kernel: rh_cmd 5
May 24 00:03:59 igrid03 kernel: rh_lkid e7a00a1
May 24 00:03:59 igrid03 kernel: lockstate 0
May 24 00:03:59 igrid03 kernel: nodeid 2
May 24 00:03:59 igrid03 kernel: status 4294967274
May 24 00:03:59 igrid03 kernel: lkid eb3021c
May 24 00:04:16 igrid03 clurgmgrd[12481]: <warning> Node ID:0000000000000003
stuck with lock usrm::rg="172.17.10.73"
May 24 00:04:16 igrid03 clurgmgrd[12481]: <warning> Node ID:0000000000000003
stuck with lock usrm::rg="172.17.10.73"
May 24 00:05:01 igrid03 crond(pam_unix)[27525]: session opened for user root by
(uid=0)
May 24 00:05:01 igrid03 crond(pam_unix)[27525]: session closed for user root
May 24 00:08:51 igrid03 clurgmgrd[12481]: <warning> Node ID:0000000000000003
stuck with lock usrm::rg="172.17.10.73" 

I am not sure at this point if the cluster is accepting incoming connections. 
On nodes 1 & 2, running top, I see that clustat is taking up 90% + of processor
usage.

While clustat was hung...it looked as though no activity was occuring on the
nfsd daemons (according to `top`).  With guidance from Dennis I killed the hung
clustat processes and saw that nfs daemons became active.

Version-Release number of selected component (if applicable):
rgmanager-1.9.46-1.3speed

How reproducible:
Has occurred twice, but haven't been able to reproduce in lab environment

Additional Info:
Tried to move VIPs off of one node so that we could restart rgmanager.  The
clusvcadm command to move the VIPs hung.  Tried to stop rgmanager anyway but
the kill -TERM was unsuccessful.  Used kill -9 to kill rgmanager, but this
created a defunct clurgmgrd and rgmanager could not be restarted.

Before killing clurgmgrd an strace shows:

select(8, [7], NULL, NULL, {10, 0})     = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0})     = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0})     = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0})     = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0})     = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0})     = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0})     = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0})     = 1 (in [7], left {10, 0})
etc...

Is this related to bz #177467?

Comment 1 Lon Hohberger 2006-05-25 19:03:33 UTC
adding Patrick -- looks like there is a DLM error here?

Comment 2 Lenny Maiorani 2006-05-26 21:22:27 UTC
Could fix Lon had for bz#177467 be a similar fix for this issue just for DLM
instead of GULM?

Comment 3 Christine Caulfield 2006-05-30 10:17:10 UTC
It's not clear whether that DLM error is implicated or not really. The message
there indicates that a reply was received from the master for a lock that
doesn't exist on the client.

Is it possible to get the contents of /proc/cluster/dlm_debug on the failing
(and ideally, master) node and /proc/cluster/dlm_locks (for Magma) on the others?

To get a dump of locks you need to echo the lockspace name into
/proc/cluster/dlm_locks before reading it. In this case:

echo "Magma" >/proc/cluster/dlm_locks
cp /proc/cluster/dlm_locks locks.txt

Comment 4 Lenny Maiorani 2006-05-30 15:09:31 UTC
So far we have been unable to reproduce the situation on demand. When this
happens again we will get the info and send it along.

Comment 5 Lon Hohberger 2006-06-01 16:33:18 UTC
Ok, here's some additional information:

The "Stuck with Lock" messages come from querying the DLM about who has the
lock, after several seconds of not being granted the exclusive lock.

So, the query function says that node 1 has the lock, but the kernel message
says that there is no lock...

Comment 6 Christine Caulfield 2006-06-02 07:56:21 UTC
Looking at the code it's not doing a query as such, but a NOQUEUE attempt on the
lock (a subtle but very important difference to the DLM!).

The remote node is not saying there is no lock, it's the local node that's
denying there is an lkb to receive the answer that there /is/ a lock on the
remote node.

I still can't figure out why though.

Comment 7 Lenny Maiorani 2006-06-06 14:55:00 UTC
Created attachment 130611 [details]
dlm_debug from each node during the problem occurrence

Just recreated this. I am attaching the dlm_debug and dlm_locks.

Comment 8 Lenny Maiorani 2006-06-06 14:56:27 UTC
Created attachment 130612 [details]
dlm_locks from each node during the problem occurrence

Comment 9 Lenny Maiorani 2006-06-06 15:08:02 UTC
So... it looks like nodes 1 and 2 have think that node 3 is maintaining a lock
even though node 3 has requested numerous times to mark invalid. Is that a
correct interpretation?

Comment 10 Christine Caulfield 2006-06-07 09:53:55 UTC
It seems to be that all the nodes are repeatedly requesting the lock with
nowait, and it's exposing a race in the DLM that prevents it assigning a master.

I suspect the quickest workaround would be for rgmanager to get the lock NL then
try to convert it to the required mode afterwards. That would allow the
mastership of the lock to settle.

Of course the race needs fixing, but it looks like a hard one!.


Comment 11 Lon Hohberger 2006-06-14 14:28:07 UTC
Created attachment 130854 [details]
Changes magma's service manager module to use NL,Convert->EX instead of EX(noqueue)

Comment 12 Lon Hohberger 2006-06-14 14:29:08 UTC
Created attachment 130855 [details]
srpm (tiny) of patched magma-plugins

Comment 13 Lon Hohberger 2006-06-14 14:30:23 UTC
Created attachment 130857 [details]
i386 test only RPM

Comment 14 Lon Hohberger 2006-06-14 14:31:32 UTC
Created attachment 130859 [details]
x86_64 test only RPM

Comment 15 Lon Hohberger 2006-06-14 14:38:40 UTC
Ok, the patched RPMs change the way magma takes locks from the DLM.  Basically,
instead of attempting to take a lock with the LKF_NOQUEUE flag, it first takes a
NL (null) lock on a given resource.  Then, it attempts to convert the lock to
the desired lock type (read/write/exclusive in magma-terms).

If it fails to convert the lock to the desired type, the NL lock is released,
and the operation returns -1 / EAGAIN directly.

As I understand it, this should eliminate the race condition because it ensures
that the lock always has a master.

Let me know if this works for you.

Comment 16 Lenny Maiorani 2006-06-14 21:42:04 UTC
Created attachment 130929 [details]
Another recreation possibly

Attached is another possible recreation of this incident. I say possible
because the locks were in a different state and this cluster had some other
errors in configuration happening as well. This cluster was NOT using the new
magma-plugins from Lon.

dlm_debug and dlm_locks from 8 node cluster
 
Reproduced again, this time on an 8 node cluster without any traffic running.
Booted 5 of the 8 nodes to establish quorum, then several minutes later booted
the other 3. The first 5 started rgmanager, but the remaining 3 failed to
complete the rgmanger start. 
 
'ps ax | grep clurgmgrd' showed it running, but 'clustat' hung on all three
nodes utilizing 100% of the CPU. On the other 5 nodes, clustat showed the 3
failing nodes online but without rgmanager.

Comment 18 Lon Hohberger 2006-06-15 16:40:40 UTC
Created attachment 130990 [details]
Patch 1 of 3 - should fix behavior.

The original patch does not actually solve the problem in theory, because it
still takes and releases locks in a very tight loop.

Comment 19 Lon Hohberger 2006-06-15 16:41:49 UTC
Created attachment 130991 [details]
Patch 2 of 3: Adds NULL lock support to sm plugin

Comment 20 Lon Hohberger 2006-06-15 16:43:44 UTC
Created attachment 130992 [details]
Patch 3 of 3: Make clu_lock_verbose use the new NULL / promote scheme

Comment 21 Lon Hohberger 2006-06-15 23:03:50 UTC
Created attachment 131012 [details]
Patch 2 of 3: Adds NULL lock support to SM plugin, with a bugfix

Previous patch #130991 did not check the lksb->sb_status field.  This leads to
duplicate locks in some cases, since the call to dlm_ls_lock() may return
-1/EAGAIN as well as 0/SUCCESS/lksb->sb_status=EAGAIN, seemingly depending on
where the lock is mastered (but I have not researched more fully).  This patch
ensures both errno and lksb->sb_status are correctly checked.

Comment 22 Lon Hohberger 2006-06-16 20:09:51 UTC
Fixes for this are in CVS in RHEL4 and STABLE branches.

Comment 23 Lon Hohberger 2006-06-16 20:11:34 UTC
Created attachment 131081 [details]
Patch from 1.0.6 -> current RHEL4

Comment 24 Lon Hohberger 2006-06-16 20:38:24 UTC
Created attachment 131082 [details]
Magma patch from 1.0.4 -> current RHEL4

Comment 27 Red Hat Bugzilla 2006-08-10 21:14:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0551.html