Red Hat Bugzilla – Bug 25047
nanny incorrectly reports connect failures
Last modified: 2005-10-31 17:00:50 EST
In some cases, nanny will fail in a connect attempt to a real-server when
that real server is actually alive and responsive. As soon as the re-entry
timeout is reached, that server will be added right back into the LVS
tables. Syslog logs messages such as:
Jan 21 04:02:11 lvs nanny: shutting down 192.168.1.1:80 due to
Jan 21 04:02:11 lvs nanny: running command "/usr/sbin/ipvsadm" "-d"
"-t" "192.168.1.1:80" "-r" "192.168.1.1"
Jan 21 04:03:12 lvs nanny: making 192.168.1.1:80 available
Jan 21 04:03:12 lvs nanny: running command "/usr/sbin/ipvsadm" "-a"
"-t" "192.168.1.1:80" "-r" "192.168.1.1" "-m" "-w" "100"
The cause: nanny creates a socket descriptor that it uses for communicating
with the real-server it is supposed to be monitoring. An fcntl() call (on
line 357 of the 0.4.17-7 package source that I have) attempts to clear the
O_NONBLOCK flag set when the socket is created, which should wait until all
I/O has completed on that descriptor so it can be re-used. However, in at
least some cases this does not actually wait until that socket is finished
being used, and the connect() immediately following fails with EISCONN,
which nanny reports as a connection failure to the real-server, removing it
from the IPVS tables.
Turning on verbose logging with nanny seems to give the socket just enough
extra time to clear while the piranha_log() functions are being called so
that the failures do not happen. (Or at least happen an extremely small
fraction of the number of times they do without verbose logging enabled.)
However, nanny's verbose logging on an LVS server with more than a couple
of dozen nanny processes watching real-servers/services generates a
tremendous amount of logging data, which causes syslog to consume most of
the available CPU time and disk space.
I am going to look into actually fixing the problem rather than play games with
the timeout value.
Had a similar problem. I updated to the latest non-beta LVS set of RPMs and
the problem went away and has not returned. Also had to do the kernel patch
for RH 6.2.