From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/125.2 (KHTML, like Gecko) Safari/125.8

Description of problem:
I opened support request 333770 and was asked to submit this bug report. On rebooting a cluster member, clumembd started twice and is now consuming all the CPU. The application service running on the member is still up and running; since this is a production system, I was asked not to fail it over or reboot until the next maintenance window. clustat output on this member shows the two members as UNKNOWN, but clustat on the other cluster member is working fine, recognizing both members.

Support made these comments, between the #### lines:

###################
Also I got this data from our cluster development team:

Looks like clumembd and clusvcmgrd ran amok somewhere. Please install the matching clumanager-debuginfo and file a bugzilla with:
- strace -p 2233
- strace -p 705
- gdb /usr/sbin/clumembd 705
  - bt
  - quit
- gdb /usr/sbin/clusvcmgrd 2233
  - bt
  - quit

In any case, I figure I can have a fix for it really quickly if it's a tight loop (which it looks like -- just look at the run times: 297 minutes). I wasn't aware of any place this could happen.
####################

strace -p 2233 outputs line after line of:
select(1024, [10], NULL, NULL, {0, 0}) = 0 (Timeout)

strace -p 705 outputs line after line of:
wait4(2233, 0xbfffa8a8, WNOHANG, NULL) = 0

gdb 2233 with bt and quit:
#0  0xb747b337 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1  0x080537c3 in pulsar ()
#2  0x08051006 in pulsar ()
#3  0x080524f9 in pulsar ()
#4  0x0804dbdc in pulsar ()
#5  0xb73bb768 in __libc_start_main () from /lib/tls/libc.so.6
#6  0x0804a459 in ?? ()

gdb 705 with bt and quit:
#0  0xb744efb9 in wait4 () from /lib/tls/libc.so.6
#1  0x0805263a in pulsar ()
#2  0x0804db1b in pulsar ()
#3  0xb73bb768 in __libc_start_main () from /lib/tls/libc.so.6
#4  0x0804a459 in ?? ()

Version-Release number of selected component (if applicable):
clumanager-1.2.9-1

How reproducible:
Didn't try

Steps to Reproduce:
1. Since it is a production machine, I have not tried to reproduce it.

Additional info:
The backtrace looks like it was taken without debugging symbols.
Created attachment 101291 [details]
Fixes infinite loop.

There was a bug in the VF code that caused a tight loop in the event of a timeout instead of normal recovery. This patch prevents it and should restore normal operation. The patch is against 1.2.16, but it will apply against 1.2.12, 1.2.9, and 1.2.3 as well.
Adding cperry to CC list.
Adding vanhoof to CC list.
*** This bug has been marked as a duplicate of 125741 ***
1.2.18pre1 patch (unsupported; test only, etc.):
http://people.redhat.com/lhh/clumanager-1.2.16-1.2.18pre1.patch

This includes the fix for this bug and a few others.
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.