Description of problem:
If, for some reason, the cluster isn't quorate when ccsd first checks, it
goes into a never-ending loop putting these messages into syslog:

ccsd[1916]: Cluster is not quorate.  Refusing connection.
ccsd[1916]: Error while processing connect: Connection refused

It never checks again.

Version-Release number of selected component (if applicable):
RHEL5 beta 1 with latest cluster tree from 13 Oct 2006.

How reproducible:
About once every 20 reboots of the cluster.

Steps to Reproduce:
1. Reboot your entire cluster

Actual results:
Hang starting cman init script

Expected results:
No hang starting cman init script

Additional info:
I don't know this code, but I suspect the problem is in function
handle_cluster_event of cluster_mgr.c.  The code looks something like this:

   if (cman_flag) {
      cman_flag = 0;
      if (cman_reason == CMAN_REASON_STATECHANGE) {
         quorate = cman_is_quorate(handle);
         free_member_list(members);
         members = get_member_list(handle);
      }
   }

We might possibly need an extra check, like this:

   if (!quorate)
      cman_flag = 1;

so the quorate check is done again. Either that, or fix whatever timing
problem is causing it to reach this code before the cluster is in fact
quorate.
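The proposed re-arm can be sketched in isolation. Below is a minimal simulation, not the real libcman API: fake_cman_is_quorate and the loop counters are stand-ins, and the cluster is hard-wired to become quorate only on the third poll, mimicking the startup timing. It shows how re-setting cman_flag makes the loop poll quorum again instead of giving up after the first "not quorate" answer.

```c
/* Hypothetical stand-in for cman_is_quorate(): the cluster becomes
 * quorate only on the third poll, mimicking the startup race. */
static int polls_until_quorate = 3;

static int fake_cman_is_quorate(void)
{
    return --polls_until_quorate <= 0;
}

/* Simplified handle_cluster_event loop with the proposed extra check:
 * if we saw a state change but the cluster is not yet quorate, re-arm
 * cman_flag so the quorate test is repeated on the next pass.
 * Returns the number of iterations needed to see quorum, or -1. */
static int run_event_loop(int max_iterations)
{
    int cman_flag = 1;   /* set by the cman callback on STATECHANGE */
    int quorate = 0;
    int i;

    for (i = 0; i < max_iterations; i++) {
        if (cman_flag) {
            cman_flag = 0;
            quorate = fake_cman_is_quorate();
            if (!quorate)
                cman_flag = 1;   /* the proposed fix: check again */
        }
        if (quorate)
            return i + 1;
    }
    return -1;           /* never saw quorum */
}
```

Without the `cman_flag = 1` re-arm, the simulated loop would read quorum exactly once, see "not quorate", and spin forever — the behaviour described above.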
It's not true to say "it never checks again", particularly as you say it works about 95% of the time ;-) ccsd gets events from cman when nodes join or leave the cluster, and it re-reads the quorate status when this happens. The code you mention looks fine to me; cman_flag is set in the callback function above. I wonder if it's possible that something is blocking in the cluster_manager thread.
Ah, it looks like it might be a libcman bug. Can you try this patch?

diff -u -p -r1.28 libcman.c
--- libcman.c	5 Oct 2006 07:48:33 -0000	1.28
+++ libcman.c	16 Oct 2006 15:01:35 -0000
@@ -233,7 +233,7 @@ static int loopy_writev(int fd, struct i
 		return len;

 	byte_cnt += len;
-	while (len >= iovptr->iov_len)
+	if (len >= iovptr->iov_len)
 	{
 		len -= iovptr->iov_len;
 		iovptr++;
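For context on why this hunk matters: a short writev(2) can complete zero or more whole iovecs and leave one partially written, and the caller's bookkeeping has to account for both. The sketch below is not the patched libcman code — writev_all and writev_all_demo are hypothetical names — just an illustrative full-write wrapper showing the iovec-advance logic a loopy_writev-style function has to get right.

```c
#include <sys/uio.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

/* Hypothetical full-write wrapper around writev(2). */
static ssize_t writev_all(int fd, struct iovec *iov, int iovcnt)
{
    ssize_t total = 0;

    while (iovcnt > 0) {
        ssize_t len = writev(fd, iov, iovcnt);
        if (len < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        total += len;

        /* Skip every iovec the kernel consumed completely... */
        while (iovcnt > 0 && (size_t)len >= iov->iov_len) {
            len -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* ...and trim the one it consumed partially. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + len;
            iov->iov_len -= len;
        }
    }
    return total;
}

/* Self-check: push two buffers through a pipe and read them back.
 * Returns 1 on success, 0 on any failure. */
static int writev_all_demo(void)
{
    int fds[2];
    char out[12] = {0};
    struct iovec iov[2] = {
        { .iov_base = "hello ", .iov_len = 6 },
        { .iov_base = "world",  .iov_len = 5 },
    };

    if (pipe(fds) != 0)
        return 0;
    if (writev_all(fds[1], iov, 2) != 11)
        return 0;
    if (read(fds[0], out, 11) != 11)
        return 0;
    close(fds[0]);
    close(fds[1]);
    return strcmp(out, "hello world") == 0;
}
```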
This fix didn't break anything, but I was still able to recreate the problem
by using the revolver test on the smoke cluster. Here is output from the
node (salem) in the failed state:

[root@salem ../cluster/cman/daemon]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  30716   2006-10-16 11:03:13  camel
   2   M  30716   2006-10-16 11:03:13  merit
   3   M  30716   2006-10-16 11:03:13  winston
   4   M  30716   2006-10-16 11:03:13  kool
   5   M  30688   2006-10-16 11:03:13  salem

[root@salem ../cluster/cman/daemon]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 30716
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: salem
Node ID: 5
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.57

[root@salem ../cluster/cman/daemon]# tail -2 /var/log/messages
Oct 16 13:19:20 salem ccsd[1991]: Cluster is not quorate.  Refusing connection.
Oct 16 13:19:20 salem ccsd[1991]: Error while processing connect: Connection refused

In other words, cman_tool indicates the cluster is quorate, and yet these
two messages continue to be dumped into syslog at a rate of about once per
second.
The way to try and debug this is to strace the cluster manager thread of ccsd (that's the second thread that shows up in "ps -efL"), then cause a cluster event - I use "cman_tool expected -e1". You should see ccsd try (and fail) to read the quorate status. At least, that's what happened last time :) I'll try to reproduce this myself, but it seems to take some time to make it happen.
Revolver found this problem and it is blocking QE testing, so it is a beta2 blocker. Devel ACK.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in release.
Aha, there seems to be a startup race in ccsd where it gets the current
state, then enables notifications. This should be the other way round.

Checking in cluster_mgr.c;
/cvs/cluster/cluster/ccs/daemon/cluster_mgr.c,v  <--  cluster_mgr.c
new revision: 1.22; previous revision: 1.21
done
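The race window can be made deterministic in a toy model. Everything below is a stand-in (the real code talks to cman; sim_cluster and the function names are invented for illustration): events that arrive before a client has enabled notifications are simply dropped, so a snapshot-then-subscribe startup can miss a state change forever, while subscribe-then-snapshot cannot.

```c
/* Deterministic model of the ccsd startup race. Hypothetical types and
 * names; the only rule is that events delivered before subscription
 * are lost. */
struct sim_cluster {
    int quorate;         /* current cluster state */
    int subscribed;      /* has the client enabled notifications? */
    int pending_event;   /* did the client receive a notification? */
};

static void sim_deliver_statechange(struct sim_cluster *c, int quorate)
{
    c->quorate = quorate;
    if (c->subscribed)
        c->pending_event = 1;   /* otherwise the event is dropped */
}

/* Buggy order: snapshot first, subscribe second. A state change that
 * lands in between is never seen, so the stale "not quorate" snapshot
 * persists - the endless "Refusing connection" loop. */
static int startup_snapshot_then_subscribe(struct sim_cluster *c)
{
    int quorate = c->quorate;          /* snapshot: not quorate yet */
    sim_deliver_statechange(c, 1);     /* cluster becomes quorate here */
    c->subscribed = 1;                 /* too late: event already lost */
    if (c->pending_event)
        quorate = c->quorate;
    return quorate;
}

/* Fixed order: subscribe first, then snapshot. A change racing with
 * startup is either already in the snapshot or delivered as an event. */
static int startup_subscribe_then_snapshot(struct sim_cluster *c)
{
    int quorate;
    c->subscribed = 1;
    sim_deliver_statechange(c, 1);
    quorate = c->quorate;              /* snapshot already sees it */
    if (c->pending_event)
        quorate = c->quorate;          /* and the event confirms it */
    return quorate;
}
```

This is the classic subscribe-before-read pattern: enabling notifications first closes the window in which a change can slip past both the snapshot and the event stream.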
QE ack for RHEL5B2 based on 4d of the release criteria.
I was still able to recreate this problem with the latest code and new instrumentation. Changing back to assigned status.
Armed with a new fix from Patrick Caulfield, I tested the failing scenario. It passed an all-night test of more than 100 iterations across 3 combinations, all successful. Because Patrick is out today, I committed the change to CVS myself. Also marking this bugzilla as MODIFIED.
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.