Description of problem:

Qdiskd allows node counts to drop below half. During this time, qdiskd may need
to call the cman_kill_node API, which can take up to 6 seconds to complete.
This causes problems for qdiskd, which expects somewhat realtime behavior. A
6-second interruption is a bit much:

stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
send(4, "<167>Jan 19 09:35:55 qdiskd[3916"..., 73, MSG_NOSIGNAL) = 73 <0.000009>
                                        vvvvvvvvvvvvv
ioctl(3, 0x4004780c, 0x3)               = 0 <5.997827>
                                        ^^^^^^^^^^^^^
gettid()                                = 3916 <0.000007>
time(NULL)                              = 1169170561 <0.000007>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000010>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
send(4, "<167>Jan 19 09:36:01 qdiskd[3916"..., 78, MSG_NOSIGNAL) = 78 <0.000011>
time(NULL)                              = 1169170561 <0.000006>
gettimeofday({1169170561, 438972}, NULL) = 0 <0.000006>
time(NULL)                              = 1169170561 <0.000007>
lseek(5, 16384, SEEK_SET)               = 16384 <0.000007>
write(5, "\36\273\336\0\352\232p b\371W\224P\0\0\0\0\0\0\0\0\0\0"..., 512) = 512 <0.000327>
gettimeofday({1169170561, 439470}, NULL) = 0 <0.000007>
gettid()                                = 3916 <0.000007>
time(NULL)                              = 1169170561 <0.000007>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000010>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=165, ...}) = 0 <0.000009>
send(4, "<167>Jan 19 09:36:01 qdiskd[3916"..., 73, MSG_NOSIGNAL) = 73 <0.000010>
ioctl(3, 0x4004780c, 0x4)               = 0 <5.997240>
select(4, [3], NULL, NULL, {0, 0})      = 0 (Timeout) <0.000012>

Version-Release number of selected component (if applicable):
cman-1.0.11-0, cman-kernel

How reproducible:
Sometimes.

Steps to Reproduce:
1. Start 9 nodes
2. Start qdisk
3. Kill 5 nodes at the same time

Actual results:
Remaining nodes go crazy and reboot, because the qdisk master cycle takes 12+
seconds to complete, with a maximum of 10 seconds specified in cluster.conf.

Expected results:
Immediate or near-immediate return.
Note that in the above strace, the cman_kill_node() function is taking about 6
seconds per node to complete.
What's happening is that cman is busy trying to make sense of a cluster that's
just had 4 nodes (deliberately) killed on it, so it is sending messages for
state transitions. The way this works is that cman does an (internal) state
transition for each node that leaves. Now, the way a dead node gets detected
during a state transition is that it doesn't respond to messages, hence the
similar message in syslog, and that timeout is set to 5 messages (one per
second). Because cman's protocol is not windowed, each message sent has to wait
until the previous messages have been acknowledged or ignored. As mentioned
above, it takes 5 seconds for a dead node to be "ignored", so the sending
process (qdiskd) is waiting 5 seconds for each dead node to time out before it
can send its own message.
Fix by forcing all kill messages that need ACKs (basically, those that come
from userland) to queue rather than wait.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.26; previous revision: 1.44.2.25
done