Description of problem:

Qdiskd allows node counts to drop below half. During this time, qdiskd may need
to call the cman_kill_node API, which can take up to 6 seconds to complete.
This causes problems for qdiskd, which expects somewhat realtime behavior. A
6-second interruption is a bit much:

stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
send(4, "<167>Jan 19 09:35:55 qdiskd[3916"..., 73, MSG_NOSIGNAL) = 73 <0.000009>
                                        vvvvvvvvvvvvv
ioctl(3, 0x4004780c, 0x3)               = 0 <5.997827>
                                        ^^^^^^^^^^^^^
gettid()                                = 3916 <0.000007>
time(NULL)                              = 1169170561 <0.000007>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000010>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
send(4, "<167>Jan 19 09:36:01 qdiskd[3916"..., 78, MSG_NOSIGNAL) = 78 <0.000011>
time(NULL)                              = 1169170561 <0.000006>
gettimeofday({1169170561, 438972}, NULL) = 0 <0.000006>
time(NULL)                              = 1169170561 <0.000007>
lseek(5, 16384, SEEK_SET)               = 16384 <0.000007>
write(5, "\36\273\336\0\352\232p b\371W\224P\0\0\0\0\0\0\0\0\0\0"..., 512) = 512 <0.000327>
gettimeofday({1169170561, 439470}, NULL) = 0 <0.000007>
gettid()                                = 3916 <0.000007>
time(NULL)                              = 1169170561 <0.000007>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000010>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
send(4, "<167>Jan 19 09:36:01 qdiskd[3916"..., 73, MSG_NOSIGNAL) = 73 <0.000010>
ioctl(3, 0x4004780c, 0x4)               = 0 <5.997240>
select(4, [3], NULL, NULL, {0, 0})      = 0 (Timeout) <0.000012>

Version-Release number of selected component (if applicable):
cman-1.0.11-0, cman-kernel

How reproducible:
Sometimes.

Steps to Reproduce:
1. Start 9 nodes
2. Start qdisk
3. Kill 5 nodes at the same time

Actual results:
Remaining nodes go crazy and reboot, because the qdisk master cycle takes 12+
seconds to complete, with a maximum of 10 seconds specified in cluster.conf.

Expected results:
Immediate or near-immediate return.
Note that in the above strace, the cman_kill_node() function is taking about 6
seconds per node to complete.
What's happening is that cman is busy trying to make sense of a cluster that's
just had 4 nodes (deliberately) killed on it, so it is sending messages for
state transitions. The way this works is that cman does an (internal) state
transition for each node that leaves. Now, the way a dead node gets detected
during a state transition is that it doesn't respond to messages, hence the
similar message in syslog, and that timeout is set to 5 messages (one per
second). Because cman's protocol is not windowed, each message sent has to wait
until the previous messages have been acknowledged or ignored. As mentioned
above, it takes 5 seconds for a dead node to be "ignored", so the sending
process (qdiskd) is waiting 5 seconds for each dead node to time out before it
can send its own message.
Fix by forcing all kill messages that need ACKs (basically, those that come
from userland) to queue rather than wait.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.26; previous revision: 1.44.2.25
done