Description of problem:

None of the 3 nodes show any VIPs when running clustat. /var/log/messages from node 3 shows:

May 23 23:55:01 igrid03 crond(pam_unix)[3311]: session closed for user root
May 24 00:00:01 igrid03 crond(pam_unix)[13985]: session opened for user root by (uid=0)
May 24 00:00:01 igrid03 crond(pam_unix)[13988]: session opened for user root by (uid=0)
May 24 00:00:01 igrid03 crond(pam_unix)[13986]: session opened for user root by (uid=0)
May 24 00:00:01 igrid03 crond(pam_unix)[13985]: session closed for user root
May 24 00:00:02 igrid03 crond(pam_unix)[13986]: session closed for user root
May 24 00:00:02 igrid03 crond(pam_unix)[29586]: session closed for user root
May 24 00:00:37 igrid03 clurgmgrd[12481]: <notice> Stopping service 172.17.10.62
May 24 00:00:37 igrid03 clurgmgrd: [12481]: <info> Removing IPv4 address 172.17.10.62/24 from eth3
May 24 00:00:48 igrid03 clurgmgrd[12481]: <notice> Service 172.17.10.62 is stopped
May 24 00:01:01 igrid03 crond(pam_unix)[22551]: session opened for user root by (uid=0)
May 24 00:01:01 igrid03 crond(pam_unix)[22551]: session closed for user root
May 24 00:01:02 igrid03 crond(pam_unix)[13988]: session closed for user root
May 24 00:01:08 igrid03 clurgmgrd[12481]: <notice> Starting stopped service 172.17.10.50
May 24 00:01:08 igrid03 clurgmgrd: [12481]: <info> Adding IPv4 address 172.17.10.50/24 to eth3
May 24 00:01:09 igrid03 clurgmgrd[12481]: <notice> Service 172.17.10.50 started
May 24 00:02:07 igrid03 clurgmgrd[12481]: <notice> Starting stopped service 172.17.10.52
May 24 00:02:07 igrid03 clurgmgrd: [12481]: <info> Adding IPv4 address 172.17.10.52/24 to eth5
May 24 00:02:08 igrid03 clurgmgrd[12481]: <notice> Service 172.17.10.52 started
May 24 00:02:32 igrid03 crond(pam_unix)[32417]: session closed for user root
May 24 00:02:36 igrid03 clurgmgrd[12481]: <notice> Stopping service 172.17.10.50
May 24 00:02:36 igrid03 clurgmgrd: [12481]: <info> Removing IPv4 address 172.17.10.50/24 from eth3
May 24 00:02:46 igrid03 clurgmgrd[12481]: <notice> Service 172.17.10.50 is stopped
May 24 00:03:59 igrid03 kernel: dlm: Magma: reply from 1 no lock
May 24 00:03:59 igrid03 kernel: dlm: reply
May 24 00:03:59 igrid03 kernel: rh_cmd 5
May 24 00:03:59 igrid03 kernel: rh_lkid e7a00a1
May 24 00:03:59 igrid03 kernel: lockstate 0
May 24 00:03:59 igrid03 kernel: nodeid 2
May 24 00:03:59 igrid03 kernel: status 4294967274
May 24 00:03:59 igrid03 kernel: lkid eb3021c
May 24 00:04:16 igrid03 clurgmgrd[12481]: <warning> Node ID:0000000000000003 stuck with lock usrm::rg="172.17.10.73"
May 24 00:04:16 igrid03 clurgmgrd[12481]: <warning> Node ID:0000000000000003 stuck with lock usrm::rg="172.17.10.73"
May 24 00:05:01 igrid03 crond(pam_unix)[27525]: session opened for user root by (uid=0)
May 24 00:05:01 igrid03 crond(pam_unix)[27525]: session closed for user root
May 24 00:08:51 igrid03 clurgmgrd[12481]: <warning> Node ID:0000000000000003 stuck with lock usrm::rg="172.17.10.73"

I am not sure at this point whether the cluster is accepting incoming connections. On nodes 1 & 2, running top, I see that clustat is taking up 90%+ of processor usage. While clustat was hung, it looked as though no activity was occurring on the nfsd daemons (according to `top`). With guidance from Dennis I killed the hung clustat processes and saw that the nfs daemons became active.
Version-Release number of selected component (if applicable):
rgmanager-1.9.46-1.3speed

How reproducible:
Has occurred twice, but we haven't been able to reproduce it in a lab environment.

Additional Info:
Tried to move the VIPs off of one node so that we could restart rgmanager. The clusvcadm command to move the VIPs hung. Tried to stop rgmanager anyway, but the kill -TERM was unsuccessful. Used kill -9 to kill rgmanager, but this created a defunct clurgmgrd and rgmanager could not be restarted. Before killing clurgmgrd, an strace shows:

select(8, [7], NULL, NULL, {10, 0}) = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0}) = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0}) = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0}) = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0}) = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0}) = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0}) = 1 (in [7], left {10, 0})
recvfrom(7, "", 20, MSG_PEEK, NULL, NULL) = 0
select(8, [7], NULL, NULL, {10, 0}) = 1 (in [7], left {10, 0})
etc...

Is this related to bz #177467?
adding Patrick -- looks like there is a DLM error here?
Could the fix Lon had for bz#177467 be a similar fix for this issue, just for DLM instead of GULM?
It's not really clear whether that DLM error is implicated. The message there indicates that a reply was received from the master for a lock that doesn't exist on the client. Is it possible to get the contents of /proc/cluster/dlm_debug on the failing (and ideally, master) node and /proc/cluster/dlm_locks (for Magma) on the others? To get a dump of locks you need to echo the lockspace name into /proc/cluster/dlm_locks before reading it. In this case:

echo "Magma" > /proc/cluster/dlm_locks
cp /proc/cluster/dlm_locks locks.txt
So far we have been unable to reproduce the situation on demand. When this happens again we will get the info and send it along.
Ok, here's some additional information: The "Stuck with Lock" messages come from querying the DLM about who has the lock, after several seconds of not being granted the exclusive lock. So, the query function says that node 1 has the lock, but the kernel message says that there is no lock...
Looking at the code, it's not doing a query as such, but a NOQUEUE attempt on the lock (a subtle but very important difference to the DLM!). The remote node is not saying there is no lock; it's the local node that's denying there is an lkb to receive the answer that there /is/ a lock on the remote node. I still can't figure out why, though.
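For illustration, here is roughly what that EX(noqueue) attempt looks like through libdlm. This is only a hedged sketch, not the actual magma code, and it assumes the asynchronous dlm_ls_lock() interface of that era; the wait_for_ast() helper is hypothetical and stands in for the plugin's real completion-AST handling (e.g. via dlm_dispatch()).

/* Hedged sketch only -- not the actual magma plugin code.  Assumes the
 * RHEL4-era libdlm interface; wait_for_ast() is a hypothetical stand-in
 * for the plugin's completion-AST wait. */
#include <string.h>
#include <errno.h>
#include <libdlm.h>

static void ast_done(void *arg)
{
    /* completion AST: the request for the lksb passed as astarg is finished */
}

extern void wait_for_ast(struct dlm_lksb *lksb);   /* hypothetical helper */

int take_ex_noqueue(dlm_lshandle_t ls, const char *res, struct dlm_lksb *lksb)
{
    int ret;

    memset(lksb, 0, sizeof(*lksb));
    ret = dlm_ls_lock(ls, LKM_EXMODE, lksb, LKF_NOQUEUE,
                      res, strlen(res), 0 /* no parent */,
                      ast_done, lksb, NULL /* no blocking AST */, NULL);
    if (ret < 0)
        return -1;                    /* refused outright, errno set (e.g. EAGAIN) */

    wait_for_ast(lksb);               /* wait for ast_done() to fire */
    if (lksb->sb_status != 0) {
        errno = lksb->sb_status;      /* EAGAIN: lock is held elsewhere */
        return -1;
    }
    return 0;                         /* EX granted; lock id in lksb->sb_lkid */
}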
Created attachment 130611 [details]
dlm_debug from each node during the problem occurrence

Just recreated this. I am attaching the dlm_debug and dlm_locks.
Created attachment 130612 [details] dlm_locks from each node during the problem occurrence
So... it looks like nodes 1 and 2 think that node 3 is maintaining a lock even though node 3 has requested numerous times to mark it invalid. Is that a correct interpretation?
It seems that all the nodes are repeatedly requesting the lock with nowait, and that is exposing a race in the DLM which prevents it from assigning a master. I suspect the quickest workaround would be for rgmanager to get the lock in NL mode and then try to convert it to the required mode afterwards. That would allow the mastership of the lock to settle. Of course the race needs fixing, but it looks like a hard one!
Created attachment 130854 [details] Changes magma's service manager module to use NL,Convert->EX instead of EX(noqueue)
Created attachment 130855 [details] srpm (tiny) of patched magma-plugins
Created attachment 130857 [details] i386 test only RPM
Created attachment 130859 [details] x86_64 test only RPM
Ok, the patched RPMs change the way magma takes locks from the DLM. Basically, instead of attempting to take a lock with the LKF_NOQUEUE flag, it first takes an NL (null) lock on the given resource. Then, it attempts to convert the lock to the desired lock type (read/write/exclusive in magma terms). If it fails to convert the lock to the desired type, the NL lock is released and the operation returns -1 / EAGAIN directly. As I understand it, this should eliminate the race condition because it ensures that the lock always has a master. Let me know if this works for you.
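In code terms, the NL-then-convert scheme looks roughly like the following. Again, this is a hedged sketch of the idea rather than the actual magma-plugins patch, using the same assumed libdlm interface and the same hypothetical ast_done()/wait_for_ast() helpers as the earlier sketch.

/* Hedged sketch of the NL -> convert-to-EX scheme, not the actual patch. */
int take_ex_via_null(dlm_lshandle_t ls, const char *res, struct dlm_lksb *lksb)
{
    int ret;

    memset(lksb, 0, sizeof(*lksb));

    /* Step 1: take an NL (null) lock.  This is always grantable and,
     * importantly, forces a master to be chosen for the resource. */
    ret = dlm_ls_lock(ls, LKM_NLMODE, lksb, 0,
                      res, strlen(res), 0, ast_done, lksb, NULL, NULL);
    if (ret < 0)
        return -1;
    wait_for_ast(lksb);
    if (lksb->sb_status != 0) {
        errno = lksb->sb_status;
        return -1;
    }

    /* Step 2: try to convert NL -> EX without queueing. */
    ret = dlm_ls_lock(ls, LKM_EXMODE, lksb, LKF_CONVERT | LKF_NOQUEUE,
                      res, strlen(res), 0, ast_done, lksb, NULL, NULL);
    if (ret == 0) {
        wait_for_ast(lksb);
        if (lksb->sb_status == 0)
            return 0;                 /* converted; EX lock now held */
    }

    /* Conversion refused: drop the NL lock and return -1/EAGAIN,
     * mirroring what the description above says the patch does. */
    dlm_ls_unlock(ls, lksb->sb_lkid, 0, lksb, lksb);
    /* (waiting for the unlock completion is omitted for brevity) */
    errno = EAGAIN;
    return -1;
}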
Created attachment 130929 [details]
Another recreation possibly

Attached is another possible recreation of this incident. I say possible because the locks were in a different state and this cluster had some other configuration errors happening as well. This cluster was NOT using the new magma-plugins from Lon.

dlm_debug and dlm_locks from 8-node cluster: reproduced again, this time on an 8-node cluster without any traffic running. Booted 5 of the 8 nodes to establish quorum, then several minutes later booted the other 3. The first 5 started rgmanager, but the remaining 3 failed to complete the rgmanager start. 'ps ax | grep clurgmgrd' showed it running, but 'clustat' hung on all three nodes, utilizing 100% of the CPU. On the other 5 nodes, clustat showed the 3 failing nodes online but without rgmanager.
Created attachment 130990 [details]
Patch 1 of 3 - should fix behavior

The original patch does not actually solve the problem in theory, because it still takes and releases locks in a very tight loop.
Created attachment 130991 [details] Patch 2 of 3: Adds NULL lock support to sm plugin
Created attachment 130992 [details] Patch 3 of 3: Make clu_lock_verbose use the new NULL / promote scheme
Created attachment 131012 [details]
Patch 2 of 3: Adds NULL lock support to SM plugin, with a bugfix

Previous patch #130991 did not check the lksb->sb_status field. This leads to duplicate locks in some cases, since the call to dlm_ls_lock() may return -1/EAGAIN as well as 0/SUCCESS with lksb->sb_status=EAGAIN, seemingly depending on where the lock is mastered (but I have not researched this more fully). This patch ensures both errno and lksb->sb_status are correctly checked.
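In other words, callers have to treat both failure paths as equivalent. A minimal sketch of the dual check, under the same libdlm assumptions and with the same hypothetical ast_done()/wait_for_ast() helpers as the earlier sketches:

/* Returns 1 if granted, 0 if not granted (EAGAIN signalled either way),
 * -1 on any other error. */
static int try_lock(dlm_lshandle_t ls, uint32_t mode, uint32_t flags,
                    const char *res, struct dlm_lksb *lksb)
{
    int ret = dlm_ls_lock(ls, mode, lksb, flags, res, strlen(res), 0,
                          ast_done, lksb, NULL, NULL);
    if (ret < 0)
        return (errno == EAGAIN) ? 0 : -1;   /* path 1: -1 with errno == EAGAIN */

    wait_for_ast(lksb);
    if (lksb->sb_status == EAGAIN)
        return 0;                            /* path 2: 0 from the call, EAGAIN in sb_status */
    return (lksb->sb_status == 0) ? 1 : -1;
}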
Fixes for this are in CVS in RHEL4 and STABLE branches.
Created attachment 131081 [details] Patch from 1.0.6 -> current RHEL4
Created attachment 131082 [details] Magma patch from 1.0.4 -> current RHEL4
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0551.html