Bug 210732 - ccsd doesn't spot cluster going quorate
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
ReviewOct20
Reported: 2006-10-13 18:29 EDT by Robert Peterson
Modified: 2009-04-16 18:29 EDT (History)
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2006-11-28 16:11:42 EST
Description Robert Peterson 2006-10-13 18:29:47 EDT
Description of problem:
If, for some reason, the cluster isn't quorate when ccsd first
checks, it goes into a never-ending loop putting these messages
into syslog:

ccsd[1916]: Cluster is not quorate.  Refusing connection. 
ccsd[1916]: Error while processing connect: Connection refused 

It never checks again.

Version-Release number of selected component (if applicable):
RHEL5 beta 1 with latest cluster tree from 13 Oct 2006.

How reproducible:
About once every 20 reboots of the cluster.

Steps to Reproduce:
1. Reboot your entire cluster
  
Actual results:
Hang starting cman init script

Expected results:
No hang starting cman init script

Additional info:
I don't know this code, but I suspect the problem is in
function handle_cluster_event of cluster_mgr.c.
Code looks something like this:
  if (cman_flag) {
    cman_flag = 0;
    if (cman_reason == CMAN_REASON_STATECHANGE) {
      quorate = cman_is_quorate(handle);
      free_member_list(members);
      members = get_member_list(handle);
    }
  }
We might possibly need an extra check, like this:
	  if (!quorate)
		  cman_flag = 1;
So the quorate check is done again.  Either that or fix whatever
timing problem is causing it to get to this code before the
cluster is in fact quorate.
Comment 1 Christine Caulfield 2006-10-16 06:38:07 EDT
It's not true to say "it never checks again", particularly as you say it works
about 95% of the time ;-)

ccsd gets events from cman when nodes join or leave the cluster and it re-reads
quorate when this happens. The code you mention looks fine to me. The cman_flag
is set in the callback function above.
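
For illustration only, here is a minimal sketch of the flag pattern described
above: a callback sets a flag when the cluster reports a change, and the
handler re-reads the quorate state only when that flag is set. All names
(toy_cluster, toy_daemon, toy_*) are hypothetical stand-ins; this is not the
ccsd or libcman code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the cluster manager; not the libcman API. */
struct toy_cluster {
    bool quorate;          /* the real state held by the cluster manager */
};

/* Hypothetical daemon state, loosely modelled on the quoted snippet. */
struct toy_daemon {
    int  event_flag;       /* set by the callback, cleared by the handler */
    bool cached_quorate;   /* the daemon's last-read view of quorum */
};

/* Callback run when the cluster reports a state change: it only sets a flag. */
void toy_event_callback(struct toy_daemon *d)
{
    assert(d != NULL);
    d->event_flag = 1;
}

/* Handler run from the daemon's loop: re-reads quorum only if flagged. */
void toy_handle_cluster_event(struct toy_daemon *d, const struct toy_cluster *c)
{
    assert(d != NULL && c != NULL);
    if (d->event_flag) {
        d->event_flag = 0;
        d->cached_quorate = c->quorate;
    }
}

/* The cluster becomes quorate; returns the daemon's view afterwards,
 * depending on whether the state-change event reached the daemon. */
bool toy_view_after_quorum(bool event_delivered)
{
    struct toy_cluster c = { .quorate = false };
    struct toy_daemon d = { .event_flag = 0, .cached_quorate = false };

    c.quorate = true;
    if (event_delivered)
        toy_event_callback(&d);
    toy_handle_cluster_event(&d, &c);
    return d.cached_quorate;
}
```

In this toy model the cached view only catches up when an event arrives, so a
daemon that misses the event keeps refusing connections until the next one.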

I wonder if it's possible that something is blocking in the cluster_manager thread.
Comment 2 Christine Caulfield 2006-10-16 11:02:45 EDT
Ah, it looks like it might be a libcman bug. Can you try this patch?

diff -u -p -r1.28 libcman.c
--- libcman.c	5 Oct 2006 07:48:33 -0000	1.28
+++ libcman.c	16 Oct 2006 15:01:35 -0000
@@ -233,7 +233,7 @@ static int loopy_writev(int fd, struct i
 			return len;
 
 		byte_cnt += len;
-		while (len >= iovptr->iov_len)
+		if (len >= iovptr->iov_len)
 		{
 			len -= iovptr->iov_len;
 			iovptr++;


Comment 3 Robert Peterson 2006-10-16 14:23:54 EDT
This fix didn't break anything, but I was still able to recreate the
problem by using the revolver test on the smoke cluster.  Here is
output from the node (salem) in the failed state:

[root@salem ../cluster/cman/daemon]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  30716   2006-10-16 11:03:13  camel
   2   M  30716   2006-10-16 11:03:13  merit
   3   M  30716   2006-10-16 11:03:13  winston
   4   M  30716   2006-10-16 11:03:13  kool
   5   M  30688   2006-10-16 11:03:13  salem
[root@salem ../cluster/cman/daemon]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 30716
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3  
Active subsystems: 6
Flags: 
Ports Bound: 0  
Node name: salem
Node ID: 5
Multicast addresses: 239.192.13.156 
Node addresses: 10.15.89.57 
[root@salem ../cluster/cman/daemon]# tail -2 /var/log/messages
Oct 16 13:19:20 salem ccsd[1991]: Cluster is not quorate.  Refusing connection. 
Oct 16 13:19:20 salem ccsd[1991]: Error while processing connect: Connection refused

In other words, cman_tool seems to indicate the cluster is quorate,
and yet these two messages continue to be dumped in the syslog at a 
rate of once every second.
Comment 4 Christine Caulfield 2006-10-17 03:19:05 EDT
The way to try and debug this is to strace the cluster manager thread of ccsd
(that's the second thread that shows up on "ps -efL"), then cause a cluster
event. I use "cman_tool expected -e1". You should see ccsd try (and fail) to
read the quorate status.

At least, that's what happened last time :)

I'll try to reproduce this myself, but it seems to take some time to make it happen.
Comment 5 Kiersten (Kerri) Anderson 2006-10-17 11:13:25 EDT
Revolver found the problem; it is blocking QE testing, so it is a beta2 blocker.  Devel ACK.
Comment 6 RHEL Product and Program Management 2006-10-17 11:17:21 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.
Comment 7 Christine Caulfield 2006-10-17 12:48:55 EDT
Aha, there seems to be a startup race in ccsd where it gets the current state,
then enables notifications. This should be the other way round.

Checking in cluster_mgr.c;
/cvs/cluster/cluster/ccs/daemon/cluster_mgr.c,v  <--  cluster_mgr.c
new revision: 1.22; previous revision: 1.21
done
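
For illustration only, a minimal sketch of the startup race described above,
under the assumption that events are delivered only to a daemon that has
already enabled notifications. All names (race_cluster, race_*) are
hypothetical; this is a toy model, not the change committed to cluster_mgr.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cluster manager: events reach only an existing subscriber. */
struct race_cluster {
    bool quorate;          /* real quorum state */
    bool subscribed;       /* has the daemon enabled notifications? */
    int  pending_events;   /* events queued for the daemon */
};

/* The cluster becomes quorate; an event is queued only if someone listens. */
void race_become_quorate(struct race_cluster *c)
{
    assert(c != NULL);
    c->quorate = true;
    if (c->subscribed)
        c->pending_events++;
}

/* The daemon enables notifications. */
void race_subscribe(struct race_cluster *c)
{
    assert(c != NULL);
    c->subscribed = true;
}

enum race_order { READ_THEN_SUBSCRIBE, SUBSCRIBE_THEN_READ };

/* Runs daemon startup in the given order, with the cluster becoming quorate
 * in the gap between the two steps. Returns the daemon's cached view. */
bool race_startup(enum race_order order)
{
    struct race_cluster c = { false, false, 0 };
    bool cached;

    if (order == READ_THEN_SUBSCRIBE) {
        cached = c.quorate;        /* snapshot: not quorate yet */
        race_become_quorate(&c);   /* no subscriber yet, so the event is lost */
        race_subscribe(&c);        /* too late to hear about the change */
    } else {
        race_subscribe(&c);        /* listen first */
        race_become_quorate(&c);   /* the change is queued as an event */
        cached = c.quorate;        /* snapshot already sees the new state */
    }

    /* Process any queued event by re-reading the state. */
    if (c.pending_events > 0) {
        c.pending_events = 0;
        cached = c.quorate;
    }
    return cached;
}
```

In this model, reading first leaves the daemon with a stale "not quorate"
view and no event to correct it, while subscribing first cannot miss the
change.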
Comment 8 Jay Turner 2006-10-17 14:56:32 EDT
QE ack for RHEL5B2 based on 4d of the release criteria.
Comment 9 Robert Peterson 2006-10-19 09:50:46 EDT
I was still able to recreate this problem with the latest code
and new instrumentation.  Changing back to assigned status.
Comment 10 Robert Peterson 2006-10-20 10:33:02 EDT
Armed with a new fix from Patrick Caulfield, I tested the failing
scenario.  It passed an all-night test with more than
100 iterations times 3 combinations, all successful.  Therefore,
because Patrick is out today, I committed the change to CVS.
Also marking this bugzilla as modified.
Comment 15 Nate Straz 2007-12-13 12:22:23 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.
