Description of problem:
If, for some reason, the cluster isn't quorate when ccsd first checks, it
goes into a never-ending loop putting these messages into syslog:

ccsd[1916]: Cluster is not quorate.  Refusing connection.
ccsd[1916]: Error while processing connect: Connection refused

It never checks again.

Version-Release number of selected component (if applicable):
RHEL5 beta 1 with latest cluster tree from 13 Oct 2006.

How reproducible:
About once every 20 reboots of the cluster.

Steps to Reproduce:
1. Reboot your entire cluster

Actual results:
Hang starting cman init script

Expected results:
No hang starting cman init script

Additional info:
I don't know this code, but I suspect the problem is in function
handle_cluster_event of cluster_mgr.c.  The code looks something like this:

   if (cman_flag) {
      cman_flag = 0;
      if (cman_reason == CMAN_REASON_STATECHANGE) {
         quorate = cman_is_quorate(handle);
         free_member_list(members);
         members = get_member_list(handle);
      }
   }

We might possibly need an extra check, like this:

   if (!quorate)
      cman_flag = 1;

so the quorate check is done again. Either that, or fix whatever timing
problem is causing it to reach this code before the cluster is in fact
quorate.
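The proposed re-arm can be sketched in isolation. Below is a minimal simulation, not the real libcman API: fake_cman_is_quorate and the loop counters are stand-ins, and the cluster is hard-wired to become quorate only on the third poll, mimicking the startup timing. It shows how re-setting cman_flag makes the loop poll quorum again instead of giving up after the first "not quorate" answer.

```c
/* Hypothetical stand-in for cman_is_quorate(): the cluster becomes
 * quorate only on the third poll, mimicking the startup race. */
static int polls_until_quorate = 3;

static int fake_cman_is_quorate(void)
{
    return --polls_until_quorate <= 0;
}

/* Simplified handle_cluster_event loop with the proposed extra check:
 * if we saw a state change but the cluster is not yet quorate, re-arm
 * cman_flag so the quorate test is repeated on the next pass.
 * Returns the number of iterations needed to see quorum, or -1. */
static int run_event_loop(int max_iterations)
{
    int cman_flag = 1;   /* set by the cman callback on STATECHANGE */
    int quorate = 0;
    int i;

    for (i = 0; i < max_iterations; i++) {
        if (cman_flag) {
            cman_flag = 0;
            quorate = fake_cman_is_quorate();
            if (!quorate)
                cman_flag = 1;   /* the proposed fix: check again */
        }
        if (quorate)
            return i + 1;
    }
    return -1;           /* never saw quorum */
}
```

Without the `cman_flag = 1` re-arm, the simulated loop would read quorum exactly once, see "not quorate", and spin forever — the behaviour described above.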
It's not true to say "it never checks again", particularly as you say it works about 95% of the time ;-) ccsd gets events from cman when nodes join or leave the cluster, and it re-reads the quorate status when this happens. The code you mention looks fine to me; cman_flag is set in the callback function above. I wonder if it's possible that something is blocking in the cluster_manager thread.
Ah, it looks like it might be a libcman bug. Can you try this patch?

diff -u -p -r1.28 libcman.c
--- libcman.c	5 Oct 2006 07:48:33 -0000	1.28
+++ libcman.c	16 Oct 2006 15:01:35 -0000
@@ -233,7 +233,7 @@ static int loopy_writev(int fd, struct i
 		return len;

 	byte_cnt += len;
-	while (len >= iovptr->iov_len)
+	if (len >= iovptr->iov_len)
 	{
 		len -= iovptr->iov_len;
 		iovptr++;
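For context on why this hunk matters: a short writev(2) can complete zero or more whole iovecs and leave one partially written, and the caller's bookkeeping has to account for both. The sketch below is not the patched libcman code — writev_all and writev_all_demo are hypothetical names — just an illustrative full-write wrapper showing the iovec-advance logic a loopy_writev-style function has to get right.

```c
#include <sys/uio.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

/* Hypothetical full-write wrapper around writev(2). */
static ssize_t writev_all(int fd, struct iovec *iov, int iovcnt)
{
    ssize_t total = 0;

    while (iovcnt > 0) {
        ssize_t len = writev(fd, iov, iovcnt);
        if (len < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        total += len;

        /* Skip every iovec the kernel consumed completely... */
        while (iovcnt > 0 && (size_t)len >= iov->iov_len) {
            len -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* ...and trim the one it consumed partially. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + len;
            iov->iov_len -= len;
        }
    }
    return total;
}

/* Self-check: push two buffers through a pipe and read them back.
 * Returns 1 on success, 0 on any failure. */
static int writev_all_demo(void)
{
    int fds[2];
    char out[12] = {0};
    struct iovec iov[2] = {
        { .iov_base = "hello ", .iov_len = 6 },
        { .iov_base = "world",  .iov_len = 5 },
    };

    if (pipe(fds) != 0)
        return 0;
    if (writev_all(fds[1], iov, 2) != 11)
        return 0;
    if (read(fds[0], out, 11) != 11)
        return 0;
    close(fds[0]);
    close(fds[1]);
    return strcmp(out, "hello world") == 0;
}
```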
This fix didn't break anything, but I was still able to recreate the problem
by using the revolver test on the smoke cluster. Here is output from the
node (salem) in the failed state:

[root@salem ../cluster/cman/daemon]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  30716   2006-10-16 11:03:13  camel
   2   M  30716   2006-10-16 11:03:13  merit
   3   M  30716   2006-10-16 11:03:13  winston
   4   M  30716   2006-10-16 11:03:13  kool
   5   M  30688   2006-10-16 11:03:13  salem

[root@salem ../cluster/cman/daemon]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 30716
Membership state: Cluster-Member
Nodes: 5
Expected votes: 5
Total votes: 5
Quorum: 3
Active subsystems: 6
Flags:
Ports Bound: 0
Node name: salem
Node ID: 5
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.57

[root@salem ../cluster/cman/daemon]# tail -2 /var/log/messages
Oct 16 13:19:20 salem ccsd[1991]: Cluster is not quorate.  Refusing connection.
Oct 16 13:19:20 salem ccsd[1991]: Error while processing connect: Connection refused

In other words, cman_tool indicates the cluster is quorate, and yet these
two messages continue to be dumped into syslog at a rate of about once per
second.
The way to try and debug this is to strace the cluster manager thread of ccsd (that's the second thread that shows up in "ps -efL"), then cause a cluster event - I use "cman_tool expected -e1". You should see ccsd try (and fail) to read the quorate status. At least, that's what happened last time :) I'll try to reproduce this myself, but it seems to take some time to make it happen.
Revolver found this problem and it is blocking QE testing, so it is a beta2 blocker. Devel ACK.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in release.
Aha, there seems to be a startup race in ccsd where it gets the current
state, then enables notifications. This should be the other way round.

Checking in cluster_mgr.c;
/cvs/cluster/cluster/ccs/daemon/cluster_mgr.c,v  <--  cluster_mgr.c
new revision: 1.22; previous revision: 1.21
done
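The race window can be made deterministic in a toy model. Everything below is a stand-in (the real code talks to cman; sim_cluster and the function names are invented for illustration): events that arrive before a client has enabled notifications are simply dropped, so a snapshot-then-subscribe startup can miss a state change forever, while subscribe-then-snapshot cannot.

```c
/* Deterministic model of the ccsd startup race. Hypothetical types and
 * names; the only rule is that events delivered before subscription
 * are lost. */
struct sim_cluster {
    int quorate;         /* current cluster state */
    int subscribed;      /* has the client enabled notifications? */
    int pending_event;   /* did the client receive a notification? */
};

static void sim_deliver_statechange(struct sim_cluster *c, int quorate)
{
    c->quorate = quorate;
    if (c->subscribed)
        c->pending_event = 1;   /* otherwise the event is dropped */
}

/* Buggy order: snapshot first, subscribe second. A state change that
 * lands in between is never seen, so the stale "not quorate" snapshot
 * persists - the endless "Refusing connection" loop. */
static int startup_snapshot_then_subscribe(struct sim_cluster *c)
{
    int quorate = c->quorate;          /* snapshot: not quorate yet */
    sim_deliver_statechange(c, 1);     /* cluster becomes quorate here */
    c->subscribed = 1;                 /* too late: event already lost */
    if (c->pending_event)
        quorate = c->quorate;
    return quorate;
}

/* Fixed order: subscribe first, then snapshot. A change racing with
 * startup is either already in the snapshot or delivered as an event. */
static int startup_subscribe_then_snapshot(struct sim_cluster *c)
{
    int quorate;
    c->subscribed = 1;
    sim_deliver_statechange(c, 1);
    quorate = c->quorate;              /* snapshot already sees it */
    if (c->pending_event)
        quorate = c->quorate;          /* and the event confirms it */
    return quorate;
}
```

This is the classic subscribe-before-read pattern: enabling notifications first closes the window in which a change can slip past both the snapshot and the event stream.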
QE ack for RHEL5B2 based on 4d of the release criteria.
I was still able to recreate this problem with the latest code and new instrumentation. Changing back to assigned status.
Armed with a new fix from Patrick Caulfield, I tested the failing scenario. It passed an all-night test of more than 100 iterations across 3 combinations, all successful. Because Patrick is out today, I committed the change to CVS myself. Also marking this bugzilla as MODIFIED.
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.