Bug 490455 - rg_test hangs when running against cluster
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.3
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Keywords: Reopened

Reported: 2009-03-16 10:44 EDT by Shane Bradley
Modified: 2016-04-26 10:15 EDT
Doc Type: Bug Fix
Last Closed: 2009-09-02 07:04:58 EDT


Attachments
strace of rg_test command (5.34 MB, text/plain)
2009-03-16 10:45 EDT, Shane Bradley
sosreport of the cluster node (7.68 MB, application/x-bzip2)
2009-03-16 10:46 EDT, Shane Bradley
Output of rg_test from rhel5 branch (397.10 KB, text/plain)
2009-03-16 13:20 EDT, Lon Hohberger
modified ip.sh file (18.56 KB, text/plain)
2009-04-17 11:20 EDT, Shane Bradley
Updated output. (398.17 KB, text/plain)
2009-04-20 16:10 EDT, Lon Hohberger

Description Shane Bradley 2009-03-16 10:44:13 EDT
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.7) Gecko/2009030503 Fedora/3.0.7-1.fc10 Firefox/3.0.7

The following command hangs forever when run:
$ rg_test test /etc/cluster/cluster.conf


The only way to stop the command is to press Control-C.

Here is a snippet from the end of an strace of the command:
9895  16:45:40.840503 read(255, "\n#\n# Flush NFS request queue.  This might be done in the ip resource in the\n# future, but keep this around for now.\n#\n# clunfsops $nfsop_arg -d ${OCF_RESKEY_device}\n#\n\nexit $rv\n", 5927) = 177 <0.000017>
9895  16:45:40.840580 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
9895  16:45:40.840629 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000014>
9895  16:45:40.840703 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000014>
9895  16:45:40.840765 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
9895  16:45:40.840812 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895  16:45:40.840891 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895  16:45:40.840956 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895  16:45:40.841020 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000017>
9895  16:45:40.841086 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000005>
9895  16:45:40.841138 exit_group(0)     = ?
9538  16:45:40.841297 <... read resumed> "", 4096) = 0 <0.001343>
9538  16:45:40.841334 wait4(9895, NULL, 0, NULL) = 9895 <0.000019>
9538  16:45:40.841387 --- SIGCHLD (Child exited) @ 0 (0) ---
9538  16:45:40.841412 close(4)          = 0 <0.000018>
9538  16:45:40.844009 getdents(3, {}, 4096) = 0 <0.000018>
9538  16:45:40.844104 close(3)          = 0 <0.000025>
9538  16:45:40.975832 write(2, "Out of memory malloc(40960) @ 0x41030c\n", 39) = 39 <0.000023>
9538  16:58:06.743104 --- SIGINT (Interrupt) @ 0 (0) ---
9538  16:58:07.502940 --- SIGINT (Interrupt) @ 0 (0) ---
9538  16:58:08.623072 --- SIGQUIT (Quit) @ 0 (0) ---
9538  16:58:08.951075 --- SIGQUIT (Quit) @ 0 (0) ---
9538  16:58:19.793813 --- SIGTERM (Terminated) @ 0 (0) ---
9538  16:58:25.845004 +++ killed by SIGKILL +++


----------

I will attach the strace and an sosreport from a node.


Reproducible: Always

Steps to Reproduce:
1. $ rg_test test /etc/cluster/cluster.conf
2. Control-C is needed to kill it, or it will hang forever.
Actual Results:
The command appears to hang forever.

Expected Results:
The command should return the results of:
$ rg_test test /etc/cluster/cluster.conf
Comment 1 Shane Bradley 2009-03-16 10:45:31 EDT
Created attachment 335356 [details]
strace of rg_test command

Here is the strace from running the command:
$ rg_test test /etc/cluster/cluster.conf
Comment 2 Shane Bradley 2009-03-16 10:46:37 EDT
Created attachment 335357 [details]
sosreport of the cluster node
Comment 3 Lon Hohberger 2009-03-16 13:12:41 EDT
Hah, rg_test ran out of memory.  It uses a local allocator for debugging purposes.  I've never seen a cluster.conf that large.
Comment 4 Lon Hohberger 2009-03-16 13:20:04 EDT
Created attachment 335379 [details]
Output of rg_test from rhel5 branch

Oddly, it worked for me.  I will need to retest on a RHEL5 node.
Comment 7 Lon Hohberger 2009-03-24 14:42:30 EDT
Retried on 2.0.46-1 again on provided cluster.conf.  Still works for me.  Either there's an architecture-specific bug here or the altered ip.sh is causing problems.
Comment 8 Shane Bradley 2009-04-17 11:20:32 EDT
Created attachment 340019 [details]
modified ip.sh file

This is the modified ip.sh (attached under the file name ip.shdl). They have updated to the latest 46-1 build of rgmanager and get the same results as before.
Comment 9 Lon Hohberger 2009-04-20 16:10:51 EDT
Created attachment 340426 [details]
Updated output.

Output after installing the "modified" ip.sh (if you click the link in bugzilla, its file name is "ip.shdl").

Still works for me.
Comment 11 Lon Hohberger 2009-04-21 14:11:30 EDT
After chasing things down with Shane: rg_test uses a slab allocator for finding memory leaks, and we init+free the parser several times, as well as parse multiple documents.

libxml2-2.6.26-2.1.2.1 -> works 100% of the time

libxml2-2.6.26-2.1.2.7 (and 2.1.9) -> fails some of the time, depending on how the parser reads the conf file (e.g. ./foo.conf fails; ~/foo.conf works)

So what happens is the newer libxml2 allocates more memory and hits a hard 8MB limit in the slab allocator used within rg_test.  Note that rgmanager does not use this slab allocator; it's there primarily for debugging purposes, and serves no particularly useful purpose apart from that.

This isn't actually a bug in either rg_test or libxml2; rather, it's an interaction problem which occurred when a couple of libxml2 buffer resize patches were added.  Libxml2 now can (in certain conditions) temporarily require >16MB of parser space to parse a cluster.conf this large, and so, I recommend either we increase the slab allocator's limit to 32MB -or- we simply disable linking against the slab allocator in rg_test.

Users of rg_test facing this problem have choices for a workaround:
 * downgrade libxml2 to 2.6.26-2.1.2.1 (please see the errata for libxml2 to see what bugs have been fixed since)
 * install an older libxml2 somewhere and LD_PRELOAD it when running rg_test
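
The LD_PRELOAD workaround could be wrapped roughly as follows. This is only a sketch: the function name rg_test_old_libxml2, the OLD_LIBXML2 variable, and the /opt/libxml2-old path are hypothetical and not part of rgmanager; point the variable at wherever the older library was actually unpacked.

```shell
# Hypothetical location of an older libxml2 installed outside the system
# library path (assumption; adjust to your system).
OLD_LIBXML2="${OLD_LIBXML2:-/opt/libxml2-old/usr/lib64/libxml2.so.2}"

# Run rg_test with the older libxml2 preloaded, for this one command only.
rg_test_old_libxml2() {
    if [ ! -r "$OLD_LIBXML2" ]; then
        echo "old libxml2 not found at $OLD_LIBXML2" >&2
        return 1
    fi
    # LD_PRELOAD is set only in rg_test's environment, so the rest of the
    # system keeps using the installed (newer) libxml2.
    LD_PRELOAD="$OLD_LIBXML2" rg_test "$@"
}
```

With the function defined, "rg_test_old_libxml2 test /etc/cluster/cluster.conf" would run the same test as in the original report. Scoping LD_PRELOAD to a single command avoids downgrading the system-wide package.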
Comment 15 Chris Ward 2009-07-03 14:27:22 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 17 errata-xmlrpc 2009-09-02 07:04:58 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html
