Bug 25047

Summary: nanny incorrectly reports connect failures
Product: [Retired] Red Hat High Availability Server
Component: piranha
Version: 1.0
Hardware: i386
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Derek Glidden <dglidden>
Assignee: Phil Copeland <copeland>
QA Contact: Phil Copeland <copeland>
CC: christoph.hoeggerl
Doc Type: Bug Fix
Last Closed: 2001-07-31 18:40:23 UTC

Description Red Hat Bugzilla 2001-01-26 21:38:54 UTC
In some cases, nanny's connect attempt to a real-server fails even though that
real server is alive and responsive, so nanny removes it from the LVS tables.
As soon as the re-entry timeout is reached, the server is added right back in.
Syslog shows messages such as:

Jan 21 04:02:11 lvs nanny[30387]: shutting down 192.168.1.1:80 due to
connection failure
Jan 21 04:02:11 lvs nanny[30387]: running command  "/usr/sbin/ipvsadm" "-d"
"-t" "192.168.1.1:80" "-r" "192.168.1.1"
Jan 21 04:03:12 lvs nanny[30295]: making 192.168.1.1:80 available
Jan 21 04:03:12 lvs nanny[30295]: running command  "/usr/sbin/ipvsadm" "-a"
"-t" "192.168.1.1:80" "-r" "192.168.1.1" "-m" "-w" "100"

The cause: nanny creates a socket descriptor that it uses to communicate with
the real-server it is supposed to be monitoring.  An fcntl() call (on line 357
of the 0.4.17-7 package source that I have) attempts to clear the O_NONBLOCK
flag set when the socket was created, which should wait until all I/O has
completed on that descriptor so it can be re-used.  However, in at least some
cases this does not actually wait until the socket is finished being used, and
the connect() immediately following fails with EISCONN.  nanny reports this as
a connection failure to the real-server and removes it from the IPVS tables.
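The following is a minimal sketch of the failure mode described above, not the
actual nanny source; the helper name probe_real_server and the exact flag
handling are assumptions for illustration.  The point is that a second
connect() on a socket whose handshake has already completed returns EISCONN,
which actually means the real-server is reachable, not that it is down.

/* Sketch only: hypothetical helper, not code from piranha/nanny. */
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int probe_real_server(const struct sockaddr_in *addr)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    /* Start a non-blocking connect; it normally returns -1/EINPROGRESS. */
    fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);
    if (connect(sock, (const struct sockaddr *)addr, sizeof(*addr)) < 0 &&
        errno != EINPROGRESS) {
        close(sock);
        return -1;
    }

    /* Clear O_NONBLOCK again (the fcntl the report refers to).  Clearing
     * the flag does NOT wait for the pending connection to finish. */
    fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) & ~O_NONBLOCK);

    /* A second connect() on a socket whose handshake already completed
     * fails with EISCONN.  Treating that as a connection failure is the
     * bug: EISCONN here means the real-server answered. */
    if (connect(sock, (const struct sockaddr *)addr, sizeof(*addr)) < 0 &&
        errno != EISCONN) {
        close(sock);
        return -1;          /* genuine failure */
    }

    close(sock);
    return 0;               /* real-server is reachable */
}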

Turning on verbose logging in nanny seems to give the socket just enough extra
time to clear while the piranha_log() functions are being called, so the
failures do not happen (or at least happen far less often than without verbose
logging).  However, on an LVS server with more than a couple of dozen nanny
processes watching real-servers/services, nanny's verbose logging generates a
tremendous amount of log data, which causes syslog to consume most of the
available CPU time and disk space.
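For reference, the conventional way to detect completion of a non-blocking
connect() is to wait for the descriptor to become writable and then read
SO_ERROR, rather than re-calling connect() and interpreting EISCONN.  This is
not a patch against nanny, just a sketch of that standard technique; the
function name wait_for_connect is mine.

/* Wait for a pending non-blocking connect on 'sock' to complete.
 * Returns 0 on success, -1 on timeout or failure. */
#include <poll.h>
#include <sys/socket.h>

static int wait_for_connect(int sock, int timeout_ms)
{
    struct pollfd pfd = { .fd = sock, .events = POLLOUT };
    int err = 0;
    socklen_t len = sizeof(err);

    if (poll(&pfd, 1, timeout_ms) != 1)
        return -1;                      /* timed out or poll error */

    /* SO_ERROR holds the result of the asynchronous connect. */
    if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
        return -1;                      /* connect failed */

    return 0;                           /* connected */
}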

Comment 1 Red Hat Bugzilla 2001-05-07 20:30:48 UTC
I am going to look into actually fixing the problem rather than play games with
the timeout value.

Comment 2 Red Hat Bugzilla 2001-07-31 18:40:19 UTC
Had a similar problem.  I updated to the latest non-beta LVS set of RPMs and
the problem went away and has not returned.  I also had to apply the kernel
patch for RH 6.2.