17660 – file-handle leak in piranha/nanny

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat High Availability Server
Classification:	Retired
Component:	piranha
Sub Component:
Version:	1.0
Hardware:	i386
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Phil Copeland
QA Contact:	Wil Harris
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2000-09-18 20:49 UTC by Derek Glidden
Modified:	2005-10-31 22:00 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2000-09-21 23:01:19 UTC
Embargoed:

Description Red Hat Bugzilla 2000-09-18 20:49:21 UTC

The piranha-0.4.16-3 and 0.4.17-2 packages both leak file handles.
When pulse is started, the number of allocated/used file handles reported
in /proc/sys/fs/file-nr jumps to a huge value; the more services you have
in /etc/lvs.cf, the more file handles get allocated.  On our production LVS
systems the handles may slowly "drain out" over the course of a few days,
but they can crash the machine immediately if file-max is not set very
large.  In our LVS test setup, both LVS boxes crashed over the weekend
after hitting the file-max limit (the number of file handles used/allocated
_increased_ over the weekend during normal testing use).

Piranha 0.4.16-7 did not appear to have this leaky problem, but refused to
start up the nat-router alias/VIP automatically.
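
For anyone wanting to log these readings rather than eyeball them, here is a minimal sketch that parses the three numbers from /proc/sys/fs/file-nr. The names FileNr, parse_file_nr, allocated_fraction and read_file_nr are made up for illustration and are not part of piranha. The first field is treated as the allocated count and the third as file-max; the meaning of the middle field has been documented differently across kernel versions, so the sketch leaves it uninterpreted.

```python
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # first field: file handles the kernel has allocated
    second: int     # middle field: documented differently across kernel
                    # versions, so it is kept but not interpreted here
    file_max: int   # third field: the system-wide file-max limit


def parse_file_nr(text: str) -> FileNr:
    """Parse one line in the format printed by /proc/sys/fs/file-nr."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    return FileNr(*(int(f) for f in fields))


def allocated_fraction(reading: FileNr) -> float:
    """Fraction of file-max that is currently allocated."""
    return reading.allocated / reading.file_max


def read_file_nr(path: str = "/proc/sys/fs/file-nr") -> FileNr:
    """Read the live counters (Linux only)."""
    with open(path) as fh:
        return parse_file_nr(fh.read())


if __name__ == "__main__":
    # Readings copied from the comments in this report.
    for line in ("763     271     4096",
                 "18818   15088   262144",
                 "832     0       4096"):
        reading = parse_file_nr(line)
        print(reading, f"{allocated_fraction(reading):.1%} of file-max")
```

Running read_file_nr() from cron or a loop and logging the result would give a record of how close a box gets to file-max before it falls over.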

Comment 1 Red Hat Bugzilla 2000-09-18 22:37:39 UTC

I'm in the process of testing the updates to 0.4.17-3 and I hope to have these
up tonight in the experimental ftp area, as getting it into production takes
about a week of jumping up and down on the HA frame with various tests. The main
change is that you can now stipulate the netmasks used on the external VIP and
the internal NAT device.

You should be able to find this shortly at:

	ftp://people.redhat.com/kbarrett/HA/experimental/

I had a quick look at the file handles problem you mentioned but wasn't able to
get anything beyond the following:

[root@ha6 piranha]# cat /proc/sys/fs/file-nr
763     271     4096

when I placed the system under extreme load (basically saturated the full
100Mbit in duplex mode).

Phil
=--=

Comment 2 Red Hat Bugzilla 2000-09-18 22:52:48 UTC

Note that these symptoms only occur for us when there are active primary and
backup LVS machines set up using "piranha".

This site has 27 active "virtual" tags in its /etc/lvs.cf and was started fresh
this afternoon with RedHat 6.2 and HA packages applied with piranha-0.4.17-2
(and it's mostly idle):

# cat /proc/sys/fs/file-nr 
18818   15088   262144

Our testbed servers have identical software installs but only two active
services.  (We've just moved into new facilities, so we don't have
addressing/services set up quite yet for me to pound on our test servers the way
we are on the production servers.  I'd like to dump a bunch of services onto
them as well and see what the behaviour is.)  They were rebooted this afternoon
and show:

# cat /proc/sys/fs/file-nr
832     0       4096

I *am* a little confused why the two behave so differently.  Our production
server jumps into the tens of thousands of handles immediately upon starting
'pulse', and these slowly "drain away".  Our test servers slowly increase the
number of file handles they use at any given time until they feel like
crashing, but they seem to de-allocate them successfully when pulse/nanny is
done with them (as long as they don't crash from trying to allocate more than
file-max).

Both piranha-0.4.16-3 and 0.4.17-2 have exactly the same symptoms.  0.4.16-7
didn't seem to have any problems with file-handle leaks or inordinate usage, but
wouldn't bring up the nat-router VIP for us.
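
The two patterns described in this comment (a jump that drains away versus a slow climb) can be told apart from a series of allocated counts sampled over time. The following is only a rough sketch: classify_trend and its 5% tolerance are arbitrary choices for illustration, and it compares just the first and last samples.

```python
from typing import Sequence


def classify_trend(allocated: Sequence[int], tolerance: float = 0.05) -> str:
    """Label a series of allocated-handle counts by comparing its endpoints.

    Returns "growing", "draining" or "steady". `tolerance` is the relative
    change (as a fraction of the first sample) that is treated as noise.
    """
    if len(allocated) < 2:
        raise ValueError("need at least two samples")
    first, last = allocated[0], allocated[-1]
    # max(first, 1) avoids dividing by zero if the first sample is 0.
    change = (last - first) / max(first, 1)
    if change > tolerance:
        return "growing"
    if change < -tolerance:
        return "draining"
    return "steady"


if __name__ == "__main__":
    # Made-up sample series for illustration only; these are not
    # measurements from this report.
    print(classify_trend([800, 950, 1400, 2100]))
    print(classify_trend([18000, 12000, 7000]))
    print(classify_trend([850, 860, 845]))
```

Comparing only the endpoints is crude; a series that spikes and recovers between samples would be labelled "steady", so frequent sampling matters more than the classifier.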

Comment 3 Red Hat Bugzilla 2000-09-18 23:29:44 UTC

OK, that's the latest RPMs up for all architectures. Let's see how that behaves.

I've started putting extra servers on our HA rack system here to see if I can
duplicate your problem.

Let's see what happens.

Phil
=--=

Comment 4 Red Hat Bugzilla 2000-09-19 00:56:02 UTC

I forgot to mention that the previous version didn't bring up the other
interfaces because a NULL was being passed into execve(), which caused
ifconfig not to be called.

Apologies.

Please try the 0.4.17-3 rpm and let me know if this is good for you or if
further problems still exist.

	ftp://people.redhat.com/kbarrett/HA/experimental/

Thanks

Phil
=--=
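
The comment above does not say which execve() argument was NULL, and piranha itself is not written in Python, so the following is only a hypothetical sketch of the general safeguard: check every argument before handing the vector to exec, so a missing value fails loudly instead of silently skipping the ifconfig call. The function name build_ifconfig_argv and the example device and addresses are invented for illustration.

```python
from typing import List, Optional


def build_ifconfig_argv(device: Optional[str],
                        address: Optional[str],
                        netmask: Optional[str]) -> List[str]:
    """Build an argument vector for bringing up an interface alias.

    Raises ValueError if any argument is missing, rather than letting an
    unset value reach the exec call.
    """
    argv = ["ifconfig", device, address, "netmask", netmask, "up"]
    missing = [i for i, arg in enumerate(argv) if not arg]
    if missing:
        raise ValueError(f"missing argv entries at positions {missing}")
    # Every entry is a non-empty string at this point.
    return [str(arg) for arg in argv]


if __name__ == "__main__":
    # Hypothetical values; 192.0.2.0/24 is a documentation-only range.
    print(build_ifconfig_argv("eth0:1", "192.0.2.10", "255.255.255.0"))
    try:
        build_ifconfig_argv("eth0:1", None, "255.255.255.0")
    except ValueError as err:
        print("refused to exec:", err)
```

The returned list could then be passed to something like subprocess.run(argv, check=True), so that a failed ifconfig call is also reported instead of ignored.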

Comment 5 Red Hat Bugzilla 2000-09-20 15:24:45 UTC

0.4.17-3 seems to be much better behaved than 0.4.17-2 was, more like 0.4.16-7's
behaviour:

# cat /proc/sys/fs/file-nr 
855     24      262144

with plenty of lvs virtual services and about a hundred portfw rules installed.

However 0.4.17-2 had some sort of catastrophic meltdown last night that caused
the VM system to start randomly killing processes on both boxes, so both the
primary and backup machines were down when we came in this morning.  I'll have
to keep an eye on this latest code for the next couple of days to see if it has
any problems like that.

Comment 6 Red Hat Bugzilla 2000-09-21 23:01:17 UTC

Please let us know what happens. We have not seen this behavior.

Comment 7 Red Hat Bugzilla 2000-09-26 13:22:00 UTC

piranha-0.4.17-4 is available to play with.
I can't recreate this file handles problem with it, and you get a few extra
features you'd probably want.

ftp://people.redhat.com/kbarrett/HA/experimental/

Phil
=--=
