Red Hat Bugzilla – Bug 25047
nanny incorrectly reports connect failures
Last modified: 2005-10-31 17:00:50 EST
In some cases, nanny will fail in a connect attempt to a real-server when
that real server is actually alive and responsive. As soon as the re-entry
timeout is reached, that server will be added right back into the LVS
tables. Syslog logs messages such as:
Jan 21 04:02:11 lvs nanny: shutting down 192.168.1.1:80 due to
Jan 21 04:02:11 lvs nanny: running command "/usr/sbin/ipvsadm" "-d"
"-t" "192.168.1.1:80" "-r" "192.168.1.1"
Jan 21 04:03:12 lvs nanny: making 192.168.1.1:80 available
Jan 21 04:03:12 lvs nanny: running command "/usr/sbin/ipvsadm" "-a"
"-t" "192.168.1.1:80" "-r" "192.168.1.1" "-m" "-w" "100"
The cause: nanny creates a socket descriptor that it uses for communicating
with the real-server it is supposed to be monitoring. An fcntl() call (on
line 357 of the 0.4.17-7 package source that I have) attempts to clear the
O_NONBLOCK flag set when the socket is created, which should wait until all
I/O has completed on that descriptor so it can be re-used. However, in at
least some cases this does not actually wait until that socket is finished
being used, and the connect() immediately following fails with EISCONN,
which nanny reports as a connection failure to the real-server, removing it
from the IPVS tables.
Turning on verbose logging with nanny seems to give the socket just enough
extra time to clear while the piranha_log() functions are being called so
that the failures do not happen. (Or at least happen an extremely small
fraction of the number of times they do without verbose logging enabled.)
However, nanny's verbose logging on an LVS server with more than a couple
of dozen nanny processes watching real-servers/services generates a
tremendous amount of logging data, which causes syslog to consume most of
the available CPU time and disk space.
I am going to look into actually fixing the problem rather than play games with
the timeout value.
Had a similar problem. I updated to the latest non-beta LVS set of RPMs and
the problem went away and has not returned. Also had to do the kernel patch
for RH 6.2.