Description of Problem:

Customer with a 2000-node Beowulf cluster needs to define ~2000 NFS exports through cluadmin, but is prevented from doing so by a hard limit in the code.

--copied from tech-list--

A few things to note:

- The maximum size of /etc/cluster.conf is 1 MB (due to the fixed size of the space for configuration data in the shared partition). This can be increased, but that's not the problem...

- Consider the following unreadable TCL code:

    proc _nextexportID { serviceID deviceID } {
        # started w/ _nextclientID
        set ids [lsort -integer [_listexportIDs $serviceID $deviceID]]
        for {set i 0} {$i < 200} {incr i} {
            if { -1 == [lsearch -exact $ids $i] } {
                return $i
            }
        }
    }

    proc _nextclientID { serviceID deviceID exportID } {
        # started w/ _nextdeviceID
        set ids [lsort -integer [_listclientIDs $serviceID $deviceID $exportID]]
        for {set i 0} {$i < 200} {incr i} {
            if { -1 == [lsearch -exact $ids $i] } {
                return $i
            }
        }
    }

Looks like the original author coded in a hard limit of 200. We can, of course, change this, but we wouldn't be able to get an erratum out any time soon: QA is quite backed up doing security errata, and a bugfix erratum won't appear high on their radar.

Hand-editing /etc/cluster.conf is a _bad_ idea, but it _is_ possible, as long as the file size stays <= 1048576 bytes. That leaves about 500 bytes per export, which should be reasonable.

--- end ---

Please increase these limits (to at least 2000) or remove them entirely. Thanks.
Patch looks clean.
Oops - updated wrong bugzilla.
I tested cluadmin with the maximum raised to 10,000. It runs, but is *very* slow, especially when displaying a single mount with several thousand exports. Also be aware that service starts can take up to (n * 120) seconds, where n is the number of exports, if DNS is misbehaving and the clients' hostnames/IP addresses are not present in /etc/hosts. For large numbers of exports, I recommend increasing the quorum daemon's ping interval and listing all clients' hostnames and IP addresses in /etc/hosts.
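To take DNS out of the picture, the client entries can be pre-generated. A sketch (the node names and 10.0.x.y addressing scheme are hypothetical; review the output before appending it to /etc/hosts):

```shell
# Sketch: emit /etc/hosts entries for 2000 compute nodes so
# clumanager's hostname lookups never block on a misbehaving
# DNS server. Prints to stdout; append to /etc/hosts after review.
for i in $(seq 1 2000); do
    printf '10.0.%d.%d\tnode%04d\n' $((i / 256)) $((i % 256)) "$i"
done
```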
Currently, starting one service with one export and 2005 clients takes about three hours. The clients are all in /etc/hosts. What value do you recommend for the quorum ping interval?
4 should be fine; sorry for the delay. Also, try using netgroups if you haven't already done so (Red Hat Support should be sending you a netgroup file to try...). I am under the impression that this is all caused by wildcards not working. I would like to know in which scenarios you have proven that to be true, and on which versions of clumanager. For instance, certain problems specifically surrounding wildcarded NFS exports in clumanager 1.0.11 were fixed in 1.0.16-7, and I was wondering whether you were running 1.0.11-1 when you saw the 'Stale NFS file handle...' errors.
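For illustration, a netgroup-based setup might look like the following (all names here are hypothetical; the actual file should come from Red Hat Support). Each member is a (host,user,domain) triple, and only the host field matters for NFS export matching:

```
# /etc/netgroup (sketch)
beowulf-nodes  (node0001,,) (node0002,,) (node0003,,)
```

The NFS export's client entry can then reference the whole group as @beowulf-nodes, replacing thousands of per-host client entries with one.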
Follow-up. Observations from testing: there are performance problems in clumanager's DB management code, as well as scalability problems in the way clumanager invokes exportfs, that make 2000+ exports prohibitively slow. After applying several optimizations, the DB management code sped up significantly (i.e., less than half the time per query), but that alone was not sufficient to alleviate the problem.

Known Solution: use IP wildcards, hostname wildcards, or netgroups to reduce the number of exports. Netgroups solved the problem in this case.

Resolution: since there are currently multiple workarounds, I see this as a feature request for now. Will escalate if it becomes a more severe problem.
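As an illustration of the workaround (paths, domain, and netgroup name are hypothetical), each of these client specifications in exports(5) syntax covers many hosts with a single export entry:

```
/export  *.cluster.example.com(rw)    # hostname wildcard
/export  192.168.0.0/255.255.0.0(rw)  # IP network (address/netmask)
/export  @beowulf-nodes(rw)           # netgroup
```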
I'll leave this as a known limitation of 1.0.x. Because scaling to large numbers of exports is prohibitively slow (probably due to the TCL->C conversion), I'll encourage people to use wildcards or netgroups for now. In the case for which this bug report was filed, netgroups were used to work around the slowness.