490455 – rg_test hangs when running against cluster

Summary: rg_test hangs when running against cluster

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-03-16 14:44 UTC by Shane Bradley
Modified:	2018-10-20 01:39 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-02 11:04:58 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
strace of rg_test command (5.34 MB, text/plain) 2009-03-16 14:45 UTC, Shane Bradley	no flags
sosreport of the cluster node (7.68 MB, application/x-bzip2) 2009-03-16 14:46 UTC, Shane Bradley	no flags
Output of rg_test from rhel5 branch (397.10 KB, text/plain) 2009-03-16 17:20 UTC, Lon Hohberger	no flags
modified ip.sh file (18.56 KB, text/plain) 2009-04-17 15:20 UTC, Shane Bradley	no flags
Updated output. (398.17 KB, text/plain) 2009-04-20 20:10 UTC, Lon Hohberger	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1339	0	normal	SHIPPED_LIVE	Low: rgmanager security, bug fix, and enhancement update	2009-09-01 10:42:29 UTC

Description Shane Bradley 2009-03-16 14:44:13 UTC

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.7) Gecko/2009030503 Fedora/3.0.7-1.fc10 Firefox/3.0.7

The following command hangs forever when run:
$ rg_test test /etc/cluster/cluster.conf


The only way to unhang the command is to control-c it.

Here is a snippet from the end of an strace of the command:
9895  16:45:40.840503 read(255, "\n#\n# Flush NFS request queue.  This might be done in the ip resource in the\n# future, but keep this around for now.\n#\n# clunfsops $nfsop_arg -d ${OCF_RESKEY_device}\n#\n\nexit $rv\n", 5927) = 177 <0.000017>
9895  16:45:40.840580 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
9895  16:45:40.840629 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000014>
9895  16:45:40.840703 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000014>
9895  16:45:40.840765 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
9895  16:45:40.840812 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895  16:45:40.840891 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895  16:45:40.840956 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895  16:45:40.841020 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000017>
9895  16:45:40.841086 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000005>
9895  16:45:40.841138 exit_group(0)     = ?
9538  16:45:40.841297 <... read resumed> "", 4096) = 0 <0.001343>
9538  16:45:40.841334 wait4(9895, NULL, 0, NULL) = 9895 <0.000019>
9538  16:45:40.841387 --- SIGCHLD (Child exited) @ 0 (0) ---
9538  16:45:40.841412 close(4)          = 0 <0.000018>
9538  16:45:40.844009 getdents(3, {}, 4096) = 0 <0.000018>
9538  16:45:40.844104 close(3)          = 0 <0.000025>
9538  16:45:40.975832 write(2, "Out of memory malloc(40960) @ 0x41030c\n", 39) = 39 <0.000023>
9538  16:58:06.743104 --- SIGINT (Interrupt) @ 0 (0) ---
9538  16:58:07.502940 --- SIGINT (Interrupt) @ 0 (0) ---
9538  16:58:08.623072 --- SIGQUIT (Quit) @ 0 (0) ---
9538  16:58:08.951075 --- SIGQUIT (Quit) @ 0 (0) ---
9538  16:58:19.793813 --- SIGTERM (Terminated) @ 0 (0) ---
9538  16:58:25.845004 +++ killed by SIGKILL +++
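
The last thing the process does before it stops making progress is write an "Out of memory" message to stderr (file descriptor 2). In a multi-megabyte strace that line is easy to miss, so a small helper along these lines can pull it out. This is only a sketch: the function name and the trace file name in the usage line are made up for illustration, and the pattern matches the message format seen in the snippet above.

```shell
# Hypothetical helper: print (with line numbers) the strace lines where the
# traced process wrote an "Out of memory" message to stderr (fd 2).
# In a basic regular expression the "(" is matched literally.
find_oom_writes() {
    grep -n 'write(2, "Out of memory' "$1"
}
```

Usage, assuming the trace was saved as rg_test.strace: `find_oom_writes rg_test.strace`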


----------

I will attach the strace and an sosreport from a node.


Reproducible: Always

Steps to Reproduce:
1. $ rg_test test /etc/cluster/cluster.conf
2. Control-C is needed to kill it; otherwise it hangs forever.

Actual Results:
The command appears to hang forever.

Expected Results:
The command should complete and print its results:
$ rg_test test /etc/cluster/cluster.conf

Comment 1 Shane Bradley 2009-03-16 14:45:31 UTC

Created attachment 335356 [details]
strace of rg_test command

Here is the strace from running the command:
$ rg_test test /etc/cluster/cluster.conf

Comment 2 Shane Bradley 2009-03-16 14:46:37 UTC

Created attachment 335357 [details]
sosreport of the cluster node

Comment 3 Lon Hohberger 2009-03-16 17:12:41 UTC

Hah, rg_test ran out of memory.  It uses a local allocator for debugging purposes.  I've never seen a cluster.conf that large.

Comment 4 Lon Hohberger 2009-03-16 17:20:04 UTC

Created attachment 335379 [details]
Output of rg_test from rhel5 branch

Oddly, it worked for me.  I will need to retest on a RHEL5 node.

Comment 7 Lon Hohberger 2009-03-24 18:42:30 UTC

Retried on 2.0.46-1 again on provided cluster.conf.  Still works for me.  Either there's an architecture-specific bug here or the altered ip.sh is causing problems.

Comment 8 Shane Bradley 2009-04-17 15:20:32 UTC

Created attachment 340019 [details]
modified ip.sh file

This is the modified ip.sh (attached as ip.shdl). They have updated to the latest 46-1 build of rgmanager and get the same results as before.

Comment 9 Lon Hohberger 2009-04-20 20:10:51 UTC

Created attachment 340426 [details]
Updated output.

Output after installing the "modified" ip.sh (if you click the link in Bugzilla, its file name is "ip.shdl").

Still works for me.

Comment 11 Lon Hohberger 2009-04-21 18:11:30 UTC

After chasing things down with Shane: rg_test uses a slab allocator for finding memory leaks, and we init+free the parser several times, as well as parse multiple documents.

libxml2-2.6.26-2.1.2.1 -> works 100% of the time

libxml2-2.6.26-2.1.2.7 (and 2.1.9) -> fails some of the time, depending on how the parser reads the conf file (e.g. ./foo.conf fails; ~/foo.conf works)

So what happens is the newer libxml2 allocates more memory and hits a hard 8MB limit in the slab allocator used within rg_test.  Note that rgmanager does not use this slab allocator; it's there primarily for debugging purposes, and serves no particularly useful purpose apart from that.

This isn't actually a bug in either rg_test or libxml2; rather, it's an interaction problem that appeared when a couple of libxml2 buffer-resize patches were added.  libxml2 can now (under certain conditions) temporarily require >16MB of parser space to parse a cluster.conf this large.  I recommend that we either increase the slab allocator's limit to 32MB or simply stop linking rg_test against the slab allocator.

Users of rg_test facing this problem have choices for a workaround:
 * downgrade libxml2 to 2.6.26-2.1.2.1 (please see errata for libxml2 to see what bugs have been fixed since)
 * install an older libxml2 somewhere and LD_PRELOAD it when running rg_test
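
The LD_PRELOAD workaround could look roughly like the wrapper below. Treat it as an untested sketch: the function name is invented, and the default library path is an assumption standing in for wherever the older libxml2 was actually unpacked.

```shell
# Hypothetical wrapper: run rg_test with an older libxml2 preloaded.
# OLD_LIBXML2 is an assumed path; point it at the older library.
OLD_LIBXML2="${OLD_LIBXML2:-/opt/libxml2-old/lib/libxml2.so.2}"

run_rg_test_with_old_libxml2() {
    # Refuse to run if the older library is not readable, so we do not
    # silently fall back to the system libxml2.
    if [ ! -r "$OLD_LIBXML2" ]; then
        echo "older libxml2 not found at $OLD_LIBXML2" >&2
        return 1
    fi
    # The assignment applies only to this one rg_test invocation.
    LD_PRELOAD="$OLD_LIBXML2" rg_test "$@"
}
```

Usage: `run_rg_test_with_old_libxml2 test /etc/cluster/cluster.conf`. Setting LD_PRELOAD on the single command line keeps the older library out of every other process on the node.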

Comment 12 Lon Hohberger 2009-04-27 17:17:00 UTC

http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=0e67106dff2c3fd393e1e2702f97b1ff4c0dea8f

Comment 15 Chris Ward 2009-07-03 18:27:22 UTC

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 17 errata-xmlrpc 2009-09-02 11:04:58 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html
