User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.7) Gecko/2009030503 Fedora/3.0.7-1.fc10 Firefox/3.0.7

The following command hangs forever when run:

$ rg_test test /etc/cluster/cluster.conf

The only way to unhang the command is to control-c it. Here is a snippet of the last part of an strace that was run.

9895 16:45:40.840503 read(255, "\n#\n# Flush NFS request queue. This might be done in the ip resource in the\n# future, but keep this around for now.\n#\n# clunfsops $nfsop_arg -d ${OCF_RESKEY_device}\n#\n\nexit $rv\n", 5927) = 177 <0.000017>
9895 16:45:40.840580 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
9895 16:45:40.840629 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000014>
9895 16:45:40.840703 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000014>
9895 16:45:40.840765 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000005>
9895 16:45:40.840812 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895 16:45:40.840891 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895 16:45:40.840956 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000015>
9895 16:45:40.841020 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 <0.000017>
9895 16:45:40.841086 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000005>
9895 16:45:40.841138 exit_group(0) = ?
9538 16:45:40.841297 <... read resumed> "", 4096) = 0 <0.001343>
9538 16:45:40.841334 wait4(9895, NULL, 0, NULL) = 9895 <0.000019>
9538 16:45:40.841387 --- SIGCHLD (Child exited) @ 0 (0) ---
9538 16:45:40.841412 close(4) = 0 <0.000018>
9538 16:45:40.844009 getdents(3, {}, 4096) = 0 <0.000018>
9538 16:45:40.844104 close(3) = 0 <0.000025>
9538 16:45:40.975832 write(2, "Out of memory malloc(40960) @ 0x41030c\n", 39) = 39 <0.000023>
9538 16:58:06.743104 --- SIGINT (Interrupt) @ 0 (0) ---
9538 16:58:07.502940 --- SIGINT (Interrupt) @ 0 (0) ---
9538 16:58:08.623072 --- SIGQUIT (Quit) @ 0 (0) ---
9538 16:58:08.951075 --- SIGQUIT (Quit) @ 0 (0) ---
9538 16:58:19.793813 --- SIGTERM (Terminated) @ 0 (0) ---
9538 16:58:25.845004 +++ killed by SIGKILL +++

----------

I will attach the strace and an sosreport from a node.

Reproducible: Always

Steps to Reproduce:
1. $ rg_test test /etc/cluster/cluster.conf
2. Control-C is needed to kill it, or it will just hang forever

Actual Results:
The command will appear to hang forever.

Expected Results:
The command should return the results of the command: "$ rg_test test /etc/cluster/cluster.conf"
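For anyone reproducing this, the strace above shows the process sitting through SIGINT, SIGQUIT and SIGTERM and only going away on SIGKILL. A minimal sketch of running the reproducer under a deadline so it cannot tie up a terminal is below. It assumes the `timeout` utility from GNU coreutils is installed, which may not be the case on older systems; `run_with_deadline` is a hypothetical helper name, not part of rgmanager.

```shell
# Hypothetical helper: run a command and send it SIGKILL if it is still
# running after the given number of seconds.
# Assumes GNU coreutils `timeout` is available on the node.
run_with_deadline() {
    deadline_secs="$1"
    shift
    # SIGKILL is used because the strace shows the hung process
    # was not stopped by SIGINT, SIGQUIT or SIGTERM.
    timeout -s KILL "$deadline_secs" "$@"
}
```

Usage would look like `run_with_deadline 60 rg_test test /etc/cluster/cluster.conf`; a non-zero exit status then indicates either a failure or that the deadline was hit.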
Created attachment 335356 [details] strace of rg_test command

Here is the strace from running the command: $ rg_test test /etc/cluster/cluster.conf
Created attachment 335357 [details] sosreport of the cluster node
Hah, rg_test ran out of memory. It uses a local allocator for debugging purposes. I've never seen a cluster.conf that large.
Created attachment 335379 [details] Output of rg_test from rhel5 branch Oddly, it worked for me. I will need to retest on a RHEL5 node.
Retried on 2.0.46-1 again on provided cluster.conf. Still works for me. Either there's an architecture-specific bug here or the altered ip.sh is causing problems.
Created attachment 340019 [details] modified ip.sh file

This is the modified ip.shdl. They have updated to the latest 46-1 build of rgmanager and get the same results as before.
Created attachment 340426 [details] Updated output.

Output after installing the "modified" ip.sh (if you click the link in bugzilla, its file name is "ip.shdl"). Still works for me.
After chasing things down with Shane: rg_test uses a slab allocator for finding memory leaks, and we init+free the parser several times, as well as parse multiple documents.

libxml2-2.6.26-2.1.2.1 -> works 100% of the time
libxml2-2.6.26-2.1.2.7 (and 2.1.9) -> fails some of the time, depending on how the parser reads the conf file (e.g. ./foo.conf fails; ~/foo.conf works)

So what happens is the newer libxml2 allocates more memory and hits a hard 8MB limit in the slab allocator used within rg_test. Note that rgmanager does not use this slab allocator; it's there primarily for debugging purposes, and serves no particularly useful purpose apart from that.

This isn't actually a bug in either rg_test or libxml2; rather, it's an interaction problem which occurred when a couple of libxml2 buffer resize patches were added. Libxml2 now can (in certain conditions) temporarily require >16MB of parser space to parse a cluster.conf this large, and so I recommend we either increase the slab allocator's limit to 32MB -or- simply disable linking against the slab allocator in rg_test.

Users of rg_test facing this problem have two choices for a workaround:
* downgrade libxml2 to 2.6.26-2.1.2.1 (please see errata for libxml2 to see what bugs have been fixed since)
* install an older libxml2 somewhere and LD_PRELOAD it when running rg_test
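The LD_PRELOAD workaround could be sketched roughly as follows. The install path is purely hypothetical (it assumes an older libxml2 package was unpacked under /opt/libxml2-old), and `with_old_libxml2` is an illustrative helper name, not something shipped with rgmanager.

```shell
# Assumption: an older libxml2 build was unpacked into a private directory.
# This default path is hypothetical; set OLD_LIBXML2 to the real location.
OLD_LIBXML2="${OLD_LIBXML2:-/opt/libxml2-old/usr/lib/libxml2.so.2}"

# Hypothetical helper: run one command with the older libxml2 preloaded.
with_old_libxml2() {
    if [ ! -r "$OLD_LIBXML2" ]; then
        echo "older libxml2 not found at $OLD_LIBXML2" >&2
        return 1
    fi
    # LD_PRELOAD is set for this one command only, so the system-wide
    # libxml2 (and its errata fixes) stays in place for everything else.
    LD_PRELOAD="$OLD_LIBXML2" "$@"
}
```

It would be invoked as `with_old_libxml2 rg_test test /etc/cluster/cluster.conf`. Scoping the preload to a single command avoids downgrading the package for the whole node.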
http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=0e67106dff2c3fd393e1e2702f97b1ff4c0dea8f
~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1339.html