Description of Problem:

Customer with a 2000-node Beowulf cluster needs to define ~2000 NFS exports through cluadmin, but is prevented from doing so by a hard limit in the code.

--copied from tech-list--

A few things to note:

- The maximum size of /etc/cluster.conf is 1 MB (due to the fixed size of the space for configuration data in the shared partition). This can be increased, but that's not the problem...

- Consider the following unreadable TCL code:

    proc _nextexportID { serviceID deviceID } {
        # started w/ _nextclientID
        set ids [lsort -integer [_listexportIDs $serviceID $deviceID]]
        for {set i 0} {$i < 200} {incr i} {
            if { -1 == [lsearch -exact $ids $i] } {
                return $i
            }
        }
    }

    proc _nextclientID { serviceID deviceID exportID } {
        # started w/ _nextdeviceID
        set ids [lsort -integer [_listclientIDs $serviceID $deviceID $exportID]]
        for {set i 0} {$i < 200} {incr i} {
            if { -1 == [lsearch -exact $ids $i] } {
                return $i
            }
        }
    }

Looks like the original author coded in a hard limit of 200. We can, of course, change this, but we wouldn't be able to get an erratum out any time soon: QA is quite backed up doing security errata, and a bugfix erratum won't appear high on their radar.

Hand-editing /etc/cluster.conf is a _bad_ idea, but it _is_ possible, as long as the file size stays <= 1048576 bytes. That leaves about 500 bytes per export, which should be reasonable.

--- end ---

Please increase these limits (to at least 2000) or remove them entirely. Thanks.
Patch looks clean.
Oops - updated wrong bugzilla.
I tested cluadmin with the maximum raised to 10,000. It runs, but is *very* slow, especially when displaying a single mount with several thousand exports. Also be aware that service starts can take up to (n * 120) seconds, where n is the number of exports, if DNS is misbehaving and the clients' hostnames/IP addresses are not present in /etc/hosts. For large numbers of exports, I recommend increasing the quorum daemon's ping interval and listing all clients' hostnames and IP addresses in /etc/hosts.
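To take DNS out of the picture, the client entries can be pre-generated. A sketch (the node names and 10.0.x.y addressing scheme are hypothetical; review the output before appending it to /etc/hosts):

```shell
# Sketch: emit /etc/hosts entries for 2000 compute nodes so
# clumanager's hostname lookups never block on a misbehaving
# DNS server. Prints to stdout; append to /etc/hosts after review.
for i in $(seq 1 2000); do
    printf '10.0.%d.%d\tnode%04d\n' $((i / 256)) $((i % 256)) "$i"
done
```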
Currently, starting one service with one export and 2005 clients takes about three hours. The clients are all in /etc/hosts. What value do you recommend for the quorum ping interval?
4 should be fine; sorry for the delay. Also, try using netgroups if you haven't already done so (Red Hat Support should be sending you a netgroup file to try...). I am under the impression that this is all caused by wildcards not working. I would like to know in which scenarios you have proven that to be true, and on which versions of clumanager. For instance, certain problems specifically surrounding wildcarded NFS exports in clumanager 1.0.11 were fixed in 1.0.16-7, and I was wondering whether you were running 1.0.11-1 when you saw the 'Stale NFS file handle...' errors.
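For illustration, a netgroup-based setup might look like the following (all names here are hypothetical; the actual file should come from Red Hat Support). Each member is a (host,user,domain) triple, and only the host field matters for NFS export matching:

```
# /etc/netgroup (sketch)
beowulf-nodes  (node0001,,) (node0002,,) (node0003,,)
```

The NFS export's client entry can then reference the whole group as @beowulf-nodes, replacing thousands of per-host client entries with one.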
Follow-up. Observations from testing: there are performance problems in clumanager's DB management code, as well as scalability problems in the way clumanager invokes exportfs, that make 2000+ exports prohibitively slow. After applying several optimizations, the DB management code sped up significantly (i.e., less than half the time per query), but that alone was not sufficient to alleviate the problem.

Known Solution: use IP wildcards, hostname wildcards, or netgroups to reduce the number of exports. Netgroups solved the problem in this case.

Resolution: since there are currently multiple workarounds, I see this as a feature request for now. Will escalate if it becomes a more severe problem.
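As an illustration of the workaround (paths, domain, and netgroup name are hypothetical), each of these client specifications in exports(5) syntax covers many hosts with a single export entry:

```
/export  *.cluster.example.com(rw)    # hostname wildcard
/export  192.168.0.0/255.255.0.0(rw)  # IP network (address/netmask)
/export  @beowulf-nodes(rw)           # netgroup
```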
I'll leave this as a known limitation of 1.0.x. Because scaling to large numbers of exports is prohibitively slow (probably due to the TCL->C conversion), I'll encourage people to use wildcards or netgroups for now. In the case for which this bug report was filed, netgroups were used to work around the slowness.