Bug 210732 - ccsd doesn't spot cluster going quorate
Summary: ccsd doesn't spot cluster going quorate
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard: ReviewOct20
Depends On:
Blocks:
 
Reported: 2006-10-13 22:29 UTC by Robert Peterson
Modified: 2009-04-16 22:29 UTC (History)

Fixed In Version: 5.0.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-28 21:11:42 UTC
Target Upstream Version:
Embargoed:


Description Robert Peterson 2006-10-13 22:29:47 UTC
Description of problem:
If, for some reason, the cluster isn't quorate when ccsd first
checks, it goes into a never-ending loop putting these messages
into syslog:

ccsd[1916]: Cluster is not quorate.  Refusing connection. 
ccsd[1916]: Error while processing connect: Connection refused 

It never checks again.

Version-Release number of selected component (if applicable):
RHEL5 beta 1 with latest cluster tree from 13 Oct 2006.

How reproducible:
About once every 20 reboots of the cluster.

Steps to Reproduce:
1. Reboot your entire cluster
  
Actual results:
Hang starting cman init script

Expected results:
No hang starting cman init script

Additional info:
I don't know this code, but I suspect the problem is in
function handle_cluster_event of cluster_mgr.c.
Code looks something like this:
  if (cman_flag) {
    cman_flag = 0;
    if (cman_reason == CMAN_REASON_STATECHANGE) {
      quorate = cman_is_quorate(handle);
      free_member_list(members);
      members = get_member_list(handle);
    }
  }
We might possibly need an extra check, like this:
	  if (!quorate)
		  cman_flag = 1;
So the quorate check is done again.  Either that or fix whatever
timing problem is causing it to get to this code before the
cluster is in fact quorate.

Comment 1 Christine Caulfield 2006-10-16 10:38:07 UTC
It's not true to say "it never checks again", particularly as you say it works
about 95% of the time ;-)

ccsd gets events from cman when nodes join or leave the cluster and it re-reads
quorate when this happens. The code you mention looks fine to me. The cman_flag
is set in the callback function above.
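
The flag-based flow described here can be sketched roughly as follows. This is
a self-contained simulation, not the ccsd source: fake_cluster, event_state,
on_cluster_event and REASON_STATECHANGE are hypothetical stand-ins (the last
one for CMAN_REASON_STATECHANGE), and no libcman calls are made. It only shows
that the cached quorate value is refreshed when, and only when, a state-change
event has been delivered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT libcman definitions. */
enum { REASON_STATECHANGE = 1, REASON_OTHER = 2 };

struct fake_cluster {
    bool quorate;            /* simulated "real" cluster state */
};

struct event_state {
    int  flag;               /* set by the callback, cleared by the handler */
    int  reason;             /* why the callback fired */
    bool quorate;            /* cached copy used to accept/refuse connections */
};

/* Callback: only records that an event arrived and why. */
void on_cluster_event(struct event_state *st, int reason)
{
    assert(st != NULL);
    st->flag = 1;
    st->reason = reason;
}

/* Handler: consumes the flag and re-reads quorate on a state change. */
void handle_cluster_event(struct event_state *st, const struct fake_cluster *c)
{
    assert(st != NULL && c != NULL);
    if (!st->flag)
        return;              /* no event pending: cached value stays as-is */
    st->flag = 0;
    if (st->reason == REASON_STATECHANGE)
        st->quorate = c->quorate;
}
```

In this model, a cluster that becomes quorate without an event reaching the
callback leaves the cached value stale, which would match the symptom in the
description.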

I wonder if it's possible that something is blocking in the cluster_manager thread.

Comment 2 Christine Caulfield 2006-10-16 15:02:45 UTC
Ah, it looks like it might be a libcman bug. Can you try this patch?

diff -u -p -r1.28 libcman.c
--- libcman.c	5 Oct 2006 07:48:33 -0000	1.28
+++ libcman.c	16 Oct 2006 15:01:35 -0000
@@ -233,7 +233,7 @@ static int loopy_writev(int fd, struct i
 			return len;
 
 		byte_cnt += len;
-		while (len >= iovptr->iov_len)
+		if (len >= iovptr->iov_len)
 		{
 			len -= iovptr->iov_len;
 			iovptr++;




Comment 3 Robert Peterson 2006-10-16 18:23:54 UTC
This fix didn't break anything, but I was still able to recreate the
problem by using the revolver test on the smoke cluster.  Here is
output from the node (salem) in the failed state:

[root@salem ../cluster/cman/daemon]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  30716   2006-10-16 11:03:13  camel
   2   M  30716   2006-10-16 11:03:13  merit
   3   M  30716   2006-10-16 11:03:13  winston
   4   M  30716   2006-10-16 11:03:13  kool
   5   M  30688   2006-10-16 11:03:13  salem
[root@salem ../cluster/cman/daemon]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 30716
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0  
Node name: salem
Node ID: 5
Multicast addresses: 239.192.13.156 
Node addresses: 10.15.89.57 
[root@salem ../cluster/cman/daemon]# tail -2 /var/log/messages
Oct 16 13:19:20 salem ccsd[1991]: Cluster is not quorate.  Refusing connection. 
Oct 16 13:19:20 salem ccsd[1991]: Error while processing connect: Connection refused

In other words, cman_tool seems to indicate the cluster is quorate,
and yet these two messages continue to be dumped in the syslog at a 
rate of once every second.


Comment 4 Christine Caulfield 2006-10-17 07:19:05 UTC
The way to debug this is to strace the cluster manager thread of ccsd
(that's the second thread that shows up in "ps -efL"), then cause a cluster
event. I use "cman_tool expected -e1". You should see ccsd try (and fail) to
read the quorate status.

At least, that's what happened last time :)

I'll try to reproduce this myself, but it seems to take some time to make it happen.

Comment 5 Kiersten (Kerri) Anderson 2006-10-17 15:13:25 UTC
Revolver found the problem; it is blocking QE testing, so it is a beta2 blocker.  Devel ACK.

Comment 6 RHEL Program Management 2006-10-17 15:17:21 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.

Comment 7 Christine Caulfield 2006-10-17 16:48:55 UTC
Aha, there seems to be a startup race in ccsd where it gets the current state,
then enables notifications. This should be the other way round.

Checking in cluster_mgr.c;
/cvs/cluster/cluster/ccs/daemon/cluster_mgr.c,v  <--  cluster_mgr.c
new revision: 1.22; previous revision: 1.21
done
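
The ordering problem described in this comment can be illustrated with a small
simulation. It is only a sketch under stated assumptions: sim_cluster,
sim_daemon and every sim_* function are hypothetical names, nothing here is
taken from cluster_mgr.c revision 1.22, and no libcman calls are made. The
cluster becomes quorate in the gap between the two startup steps, and the only
thing that changes between the two runs is which step comes first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simulation types; not taken from ccsd or libcman. */
struct sim_cluster {
    bool  quorate;
    bool *subscriber_dirty;  /* NULL while nobody has asked for notifications */
};

struct sim_daemon {
    bool cached_quorate;     /* what the daemon believes */
    bool dirty;              /* set when a notification arrives */
};

/* The cluster changes state and notifies the subscriber, if there is one. */
void sim_become_quorate(struct sim_cluster *c)
{
    c->quorate = true;
    if (c->subscriber_dirty != NULL)
        *c->subscriber_dirty = true;
}

void sim_read_state(struct sim_daemon *d, const struct sim_cluster *c)
{
    d->cached_quorate = c->quorate;
}

void sim_subscribe(struct sim_daemon *d, struct sim_cluster *c)
{
    assert(d != NULL && c != NULL);
    c->subscriber_dirty = &d->dirty;
}

/* Re-read the state only if a notification was received. */
void sim_process_events(struct sim_daemon *d, const struct sim_cluster *c)
{
    if (d->dirty) {
        d->dirty = false;
        sim_read_state(d, c);
    }
}

/* Run one startup in which quorum arrives between the two startup steps. */
bool startup_sees_quorum(bool subscribe_first)
{
    struct sim_cluster c = { false, NULL };
    struct sim_daemon  d = { false, false };

    if (subscribe_first) {
        sim_subscribe(&d, &c);
        sim_become_quorate(&c);  /* change in the gap is recorded */
        sim_read_state(&d, &c);
    } else {
        sim_read_state(&d, &c);
        sim_become_quorate(&c);  /* change in the gap goes unnoticed */
        sim_subscribe(&d, &c);
    }
    sim_process_events(&d, &c);
    return d.cached_quorate;
}
```

In this model, startup_sees_quorum(false) leaves the daemon believing the
cluster is inquorate until some later event arrives, while
startup_sees_quorum(true) does not.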


Comment 8 Jay Turner 2006-10-17 18:56:32 UTC
QE ack for RHEL5B2 based on 4d of the release criteria.

Comment 9 Robert Peterson 2006-10-19 13:50:46 UTC
I was still able to recreate this problem with the latest code
and new instrumentation.  Changing back to assigned status.


Comment 10 Robert Peterson 2006-10-20 14:33:02 UTC
Armed with a new fix from Patrick Caulfield, I tested the failing
scenario.  It passed an all-night test of more than
100 iterations times 3 combinations, all successful.  Therefore,
because Patrick is out today, I committed the change to CVS.
Also marking this bugzilla as modified.


Comment 15 Nate Straz 2007-12-13 17:22:23 UTC
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.

