Red Hat Bugzilla – Bug 250718
fs.sh inefficient scripting leads to load peaks and disk saturation
Last modified: 2010-10-22 13:08:52 EDT
Description of problem:
On my x86_64 machines, at least, I get an incomprehensible number of fs.sh execs
per day. Rgmanager invokes the script as configured, once per 10 seconds for
each fs, but each invocation creates so many subshells that the number of fs.sh
execs balloons. I have 26 file systems to check. In each ten-second period, there
are six seconds with 0 fs.sh execs, one second with about 100..200 execs, two
seconds with about 1500..2000 execs, and one second with about 100..200 execs.
This adds up to some tens of millions of fs.sh execs per day.
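As a rough cross-check of the "tens of millions" figure, the per-second estimates above can be multiplied out. This is only back-of-the-envelope arithmetic on the numbers quoted in this report; the variable names are illustrative.

```shell
#!/bin/bash
# Rough daily total from the per-period estimates quoted in the report.
# Each 10-second period: one second at 100..200 execs, two seconds at
# 1500..2000 execs, one second at 100..200 execs, six seconds at 0.
periods_per_day=$(( 24 * 60 * 60 / 10 ))

low_per_period=$(( 100 + 2 * 1500 + 100 ))
high_per_period=$(( 200 + 2 * 2000 + 200 ))

low_per_day=$(( low_per_period * periods_per_day ))
high_per_day=$(( high_per_period * periods_per_day ))

echo "periods per day: $periods_per_day"            # 8640
echo "execs per day:   $low_per_day .. $high_per_day"  # 27648000 .. 38016000
```

That is roughly 28 to 38 million execs per day, consistent with the estimate in the description.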
Something with these fs.sh execs creates a periodic load fluctuation on my
machines. On a mostly idle system, the load increases to about 4 and then
decreases again. (See the attached graph. I inserted an exit 0 at the front of
the fs.sh status switch-case the day before yesterday (Aug 01), and the periodic
peaks disappeared. The smaller peaks yesterday (Aug 02) are 'real' load peaks
(that is, I have an explanation for them) and have nothing to do with fs.sh.)
Also, I caught a beginning load increase in action on another server and
inserted the exit 0; loads promptly fell back from about 10.0 to about 0.5. (Sorry,
The problem with these load peaks is, if I have a lot of real fs/disk load on
the system, it might start acting wildly if the real load peaks coincide with
the fs.sh caused peaks. I've seen loads like 500... and a really badly stuck
system. (I still don't get what it is that periodically increases and decreases.
The number of fs.sh invocations is more or less constant all the time. Somehow
they add up somewhere, then something seems to flow over, and starts to slowly
add up again.) Also, they seem to sharpen real load peaks, even when the system
doesn't get stuck.
Version-Release number of selected component (if applicable):
2.0.27-2.1lhh.el5 (also 2.0.24-1.el5)
Steps to Reproduce:
Enable process accounting.
Create a resource group with an fs resource (ext3, on an FC disk; I've got
QLogic HBAs and an EVA) and start it.
Actual results:
Ten to twenty thousand fs.sh execs per minute, according to process accounting.
Periodic load peaks. System getting stuck on disk operations.
Expected results:
About the same number of fs.sh execs as there are, for example, ip.sh execs. No
excess load or disk saturation.
Additional info:
My system is CentOS 5, but Lon asked me to file a bugzilla anyway... ;)
Created attachment 160582 [details]
Weekly load graph for one system with the fs.sh problem
Excellent, thanks. There are a number of optimizations we can make fairly
quickly, such as replacing pattern matching/substitution utilities
(grep/awk/etc) with pure bash script. This will (by itself!) reduce load a bit,
but there's more we can do for sure.
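To illustrate the kind of change meant here, the following is a minimal sketch of replacing an external pipeline with shell builtins. It is not taken from fs.sh; the function name parse_device_spec, the variables spec_kind and spec_value, and the sample device strings are all hypothetical. A pipeline such as echo "$dev" | sed 's/^LABEL=//' spawns child processes on every call, while case and parameter expansion run inside the current shell.

```shell
#!/bin/bash
# Hypothetical helper, not the real fs.sh code.
# Split a device specification such as LABEL=data01 or UUID=... into its
# kind and value using only builtins. Results are left in the global
# variables spec_kind and spec_value so the caller does not need a
# $(...) command substitution, which would itself fork a subshell.
parse_device_spec() {
    case "$1" in
        LABEL=*) spec_kind=label;   spec_value=${1#LABEL=} ;;
        UUID=*)  spec_kind=uuid;    spec_value=${1#UUID=} ;;
        /*)      spec_kind=path;    spec_value=$1 ;;
        *)       spec_kind=unknown; spec_value=$1; return 1 ;;
    esac
}

parse_device_spec "LABEL=data01" && echo "$spec_kind $spec_value"   # label data01
parse_device_spec "/dev/sdb1"    && echo "$spec_kind $spec_value"   # path /dev/sdb1
```

Returning results through variables is less tidy than printing them, but it is what keeps the call free of forks.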
Apparently, the load caused by the fs.sh execs wasn't the reason my system got
stuck; the reason was plain and simple memory starvation. Now, there were the
load peaks on a mostly idle system that went away as I added the exit 0 into
fs.sh status. But the other system, with real load, stopped getting stuck only
after I added more memory. So the real culprit wasn't fs.sh after all. And this
means I should lower the severity of this bug, too.
As a side note, and this should perhaps be a separate bugzilla, after adding the
memory and being able to see what's actually happening with loads on the busy
system, I noticed that there were still small load peaks left, with a height of
about +6 (that is, they add about 6 units of load to any real load there is), and
with a period of about eleven hours. These peaks don't seem to be reflected in any
other statistics; I can only assume there is something going on inside the kernel...
The not-so-busy system still doesn't have the load peaks. It also doesn't have
as many clustered services running as the other one.
Oh yes, the smaller peaks with the 11-hour period aren't caused by ip.sh, or at
least they didn't go away when I put an exit 0 into the beginning of ip.sh status.
There are other ways we can limit load here, too. For example, we could
disable status checks for the 'service.sh' agent (which is a no-op). That's
one less, although that one only happens once per hour by default.
One way to make this work is to build a FS replacement agent in C.
All cluster version 5 defects should be reported under the Red Hat Enterprise
Linux 5 product name, not Cluster Suite.
I've written a program which might help - it's sort of a drop-in replacement for
* This forks to call the 'findfs' utility
* This *DOES NOT* update /etc/mtab - the standard 'mount' utility is not spawned.
* Specifying your file system type (ext2, ext3) is required in cluster.conf.
* force_unmount, self_fence, etc. are not implemented at this point
* You must move (or chmod -x) fs.sh if you intend to try this out,
* fsc does no logging whatsoever; you should test with rg_test suitably before
trying it in a cluster. See:
...for more information about how to use rg_test to test your services.
Let me know if this is the right direction for you.
If you require them for any testing, I can make the following changes fairly easily:
* make self_fence work
* fstype default to 'ext3'
* alternatively, we could build the mount(1) command
line and fork + exec it. This will reduce performance
a lot, however, it will update mtab and make the fstype
Created attachment 296043 [details]
A patch to readlinkr.c to prevent handling an absolute link as relative
fsc seems good so far, I've yet to gather the courage to apply it in the
production environment. With the patch attached, it seems to work OK with my
I've applied your patch to my source base. All feedback is appreciated; even if
you're not running it in production.
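The distinction named in the readlinkr.c patch title can be sketched as follows. This is not the patch itself, and it is written in shell rather than C; join_link_target and the variable resolved are hypothetical names. A link target that starts with / is already a complete path, while any other target is interpreted relative to the directory that contains the link.

```shell
#!/bin/bash
# Hypothetical helper: combine a symlink's own path ($1) with the target
# text stored in it ($2). The result is left in the global variable
# "resolved"; ".." components are not collapsed.
join_link_target() {
    link_path="$1"
    link_target="$2"
    case "$link_target" in
        /*) resolved=$link_target ;;   # absolute target: use as-is
        *)
            case "$link_path" in
                */*) resolved=${link_path%/*}/$link_target ;;  # relative to the link's directory
                *)   resolved=./$link_target ;;                # link lives in the current directory
            esac
            ;;
    esac
}

join_link_target /dev/disk/by-label/data ../../sdb1
echo "$resolved"    # /dev/disk/by-label/../../sdb1
join_link_target /dev/mydisk /dev/sdb1
echo "$resolved"    # /dev/sdb1
```

Treating the second case like the first would produce /dev//dev/sdb1, which is the sort of error the patch title describes.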
Created attachment 296164 [details]
Weekly load with fs.sh up to 26th and fsc beg. with 27th
I did replace fs.sh with fsc on a not-so-critical production cluster with 48
cluster-controlled ext3 file systems. The results (disappearance of the phantom
load peaks) are clearly visible on the weekly load graphs; see attachment.
*** Bug 474364 has been marked as a duplicate of this bug. ***
Whoops - current agent w/ patch applied:
Also missing is a check to see if the file system is still accessible.
Updated agent. Includes external_mount="[0|1]" option, which forks/execs mount/umount during start/stop. This has the benefit of updating /etc/mtab.
Also includes self_fence support and an auto-generated man page.
Created attachment 333878 [details]
fs.sh which has a quick_status option.
This agent is an updated fs.sh agent which has a quick_status option. The quick_status option trades off verbosity for speed. When quick_status="1" in cluster.conf for a given file system, fs.sh does not fork().
I verified this using 'strace -vf'.
Note: It can fork if you are using symbolic links, LABEL= or UUID=.
Also, because it does not fork, it also does not log.
Lack of logging is a known limitation of fsc, so this new agent does not introduce something which was not already a trade off for using fsc.
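A fork-free status check of the kind described above might look like the following sketch. It is not the code from the attached agent; is_mounted and its optional third argument (a mounts table to read, defaulting to /proc/mounts) are assumptions made so that the example can be tried against a sample file.

```shell
#!/bin/bash
# Hypothetical sketch: succeed if device $1 is listed as mounted on
# mount point $2 in the mounts table $3 (default: /proc/mounts).
# Only builtins are used (read, test), so no child process is created.
# Note: /proc/mounts escapes spaces in paths as \040; read -r keeps the
# escape, so a mount point containing spaces must be passed in that form.
is_mounted() {
    want_dev="$1"
    want_mp="$2"
    table="${3:-/proc/mounts}"
    while read -r dev mp rest; do
        if [ "$dev" = "$want_dev" ] && [ "$mp" = "$want_mp" ]; then
            return 0
        fi
    done < "$table"
    return 1
}
```

A status handler could then use something like is_mounted /dev/sdb1 /mnt/data || exit 1. This only confirms that the mount is listed; it does not test whether the file system is still accessible.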
Created attachment 333890 [details]
strace of old fs.sh without quick_status
Created attachment 333891 [details]
strace of new fs.sh using quick_status="1"
Created attachment 333893 [details]
strace of fsc
fsc is still faster and produces less strace output, but fs.sh with quick_status is pretty good and saves a lot of maintenance that would be introduced if we included fsc directly. Also, it's less confusing to 'turn on' quick_status than it is to swap resource agents around.
A lot of the "bloat" in the newer fs.sh strace output is rt_sigprocmask() which occurs many times.
[root@molly ~]# wc -l fs.sh-old.out
[root@molly ~]# wc -l ./fs.sh-new.out
[root@molly ~]# grep -v rt_sig ./fs.sh-new.out | wc -l
[root@molly ~]# wc -l fsc.out
However, the important parts...
[root@molly ~]# grep ^clone\(Proc ./fs.sh-old.out | wc -l
[root@molly ~]# grep ^clone\(Proc ./fs.sh-new.out | wc -l
[root@molly ~]# grep ^clone\(Proc ./fsc.out | wc -l
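The same kind of count can be taken in one step with grep -c and a quoted pattern. The helper below is a hypothetical convenience, not something used in this report, and it matches any line starting with clone( rather than only the clone(Proc form grepped for above.

```shell
#!/bin/bash
# Hypothetical helper: count the lines in a strace log ($1) that begin
# with a clone() call. grep -c prints the number of matching lines, so a
# separate wc -l is not needed; quoting the pattern passes it to grep
# unchanged, with "(" treated as a literal character.
count_clones() {
    grep -c '^clone(' "$1"
}
```

For example, count_clones ./fs.sh-new.out. Note that grep exits with a non-zero status when the count is 0, which matters in scripts that run under set -e.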
Note that the RelaxNG schema doesn't know about the new quick_status parameter and therefore will be upset about it.
It cannot be added to the schema until the fs.sh change is deemed acceptable.
I updated fsc based on a patch from Eduardo Damato; he noticed that the format string was wrong if there were no mount options specified. Oops :)
Note that fsc more or less got rejected for upstream inclusion on the basis that it's a waste of effort to maintain a second agent to do something we already provide. Furthermore, it's written in C.
This is why fs.sh was carefully (and painfully) updated to eliminate fork() and clone().
*** Bug 487600 has been marked as a duplicate of this bug. ***
Also, a related patch here which allows administrators to cap status check children:
We have a similar problem on our cluster, which has about 25 ext3 resources. The cluster has only two nodes, and the load on both nodes is consistently about 10...
We've just updated fs.sh to the one with the quick_status option, and the first test looks fine. So, is there any chance that this new fs.sh will be released as an official erratum?
Quick_status is slated for RHEL 5.4 inclusion.
~~ Attention - RHEL 5.4 Beta Released! ~~
RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!
If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.
Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.
Questions can be posted to this bug or your customer or partner representative.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.