Bug 126316 - clumembd running twice on reboot
Status: CLOSED DUPLICATE of bug 125741
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: clumanager
Version: 3
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: Lon Hohberger
Blocks: 125741
Reported: 2004-06-18 17:56 EDT by Robert Reynolds
Modified: 2009-04-16 16:15 EDT (History)
Doc Type: Bug Fix
Last Closed: 2006-02-21 14:04:09 EST

Attachments
Fixes infinite loop. (523 bytes, patch)
2004-06-21 08:40 EDT, Lon Hohberger

Description Robert Reynolds 2004-06-18 17:56:14 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/125.2 (KHTML, like Gecko) Safari/125.8

Description of problem:
I opened support request 333770 and they asked me to submit this bug report.

On rebooting a cluster member, clumembd started twice and is now consuming all the CPU.  The application service running on that member is still up, and because it is a production system I was asked not to fail it over or reboot until the next maintenance window.  On the affected member, clustat shows both members as UNKNOWN; on the other cluster member, clustat works fine and recognizes both members.

Support made these comments, shown between the #### lines:

###################

Also I got this data from our cluster development team:

Looks like clumembd and clusvcmgrd ran amok somewhere. Please install the 
matching clumanager-debuginfo and file a bugzilla with: 

- strace -p 2233
- strace -p 705
- gdb /usr/sbin/clumembd 705
 - bt
 - quit
- gdb /usr/sbin/clusvcmgrd 2233
 - bt
 - quit

In any case, I figure I can have a fix for it really quickly if it's a 
tight-loop (which it looks like -- just look at the run times: 297 minutes). I 
wasn't aware of any place this could happen. 

####################

strace -p 2233 outputs line after line of:
select(1024, [10], NULL, NULL, {0, 0})  = 0 (Timeout)

strace -p 705 outputs line after line of:
wait4(2233, 0xbfffa8a8, WNOHANG, NULL)  = 0
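
For anyone reading these traces later: a select() call with a {0, 0} timeout never blocks, so a loop around it returns "Timeout" as fast as the CPU allows, and a wait4() call with WNOHANG behaves the same way while the child is still alive. The sketch below is only an illustration of that effect in Python, not clumanager code; the function name `count_polls` and the 0.2 second measurement window are made up for the example.

```python
import os
import select
import time


def count_polls(timeout, window=0.2):
    """Count how many select() calls complete within `window` seconds.

    Nothing is ever written to the pipe, so every call ends in a
    timeout, like the `= 0 (Timeout)` lines in the strace output.
    (Hypothetical helper, for illustration only.)
    """
    read_fd, write_fd = os.pipe()
    polls = 0
    deadline = time.monotonic() + window
    try:
        while time.monotonic() < deadline:
            ready, _, _ = select.select([read_fd], [], [], timeout)
            if not ready:  # always the case here: nothing is readable
                polls += 1
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return polls


if __name__ == "__main__":
    # Zero timeout: the loop spins and burns CPU.
    print("zero timeout: ", count_polls(0))
    # Non-zero timeout: the process sleeps in the kernel between polls.
    print("50 ms timeout:", count_polls(0.05))
```

On any reasonable machine the zero-timeout count should be far larger than the 50 ms count, which matches a process that has accumulated hundreds of minutes of run time.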

gdb 2233 with bt and quit:
#0  0xb747b337 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1  0x080537c3 in pulsar ()
#2  0x08051006 in pulsar ()
#3  0x080524f9 in pulsar ()
#4  0x0804dbdc in pulsar ()
#5  0xb73bb768 in __libc_start_main () from /lib/tls/libc.so.6
#6  0x0804a459 in ?? ()

gdb 705 with bt and quit:
#0  0xb744efb9 in wait4 () from /lib/tls/libc.so.6
#1  0x0805263a in pulsar ()
#2  0x0804db1b in pulsar ()
#3  0xb73bb768 in __libc_start_main () from /lib/tls/libc.so.6
#4  0x0804a459 in ?? ()

Version-Release number of selected component (if applicable):
clumanager-1.2.9-1

How reproducible:
Didn't try

Steps to Reproduce:
1. Since it is a production machine, I have not tried to reproduce it.


Additional info:
Comment 1 Lon Hohberger 2004-06-21 08:34:04 EDT
The backtrace looks like it was taken without debugging symbols.
Comment 2 Lon Hohberger 2004-06-21 08:40:43 EDT
Created attachment 101291 [details]
Fixes infinite loop.

There was a bug in the VF code which caused a tight loop in the event of a
timeout instead of a normal recovery.  This patch will prevent it, and should
enable normal operation.

This patch is against 1.2.16, but it will apply against 1.2.12, 1.2.9, and
1.2.3 as well.
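
The attached patch is not reproduced in this report. As a rough illustration of the general pattern only (not the VF code and not the actual fix), a wait loop can avoid spinning by tracking a deadline and reporting the timeout to its caller, rather than polling again with a zero timeout. The name `wait_for_reply` and its return convention are assumptions made for this sketch.

```python
import os
import select
import time


def wait_for_reply(fd, timeout):
    """Wait up to `timeout` seconds for `fd` to become readable.

    Returns True if data is ready and False once the deadline passes.
    Returning False, instead of polling again with a zero timeout,
    lets the caller move on to its recovery path.
    (Hypothetical helper, for illustration only.)
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # timed out: report it instead of spinning
        ready, _, _ = select.select([fd], [], [], remaining)
        if ready:
            return True


if __name__ == "__main__":
    r, w = os.pipe()
    print(wait_for_reply(r, 0.1))  # nothing written yet
    os.write(w, b"x")
    print(wait_for_reply(r, 0.1))  # data is now pending
    os.close(r)
    os.close(w)
```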
Comment 4 Lon Hohberger 2004-06-21 08:52:53 EDT
Adding cperry to cc list.
Comment 5 Lon Hohberger 2004-06-21 08:56:47 EDT
Adding vanhoof to CC list
Comment 7 Lon Hohberger 2004-08-27 13:06:40 EDT

*** This bug has been marked as a duplicate of 125741 ***
Comment 8 Lon Hohberger 2004-09-02 11:58:12 EDT
1.2.18pre1 patch (unsupported; test only, etc.)

http://people.redhat.com/lhh/clumanager-1.2.16-1.2.18pre1.patch

This includes the fix for this bug and a few others.
Comment 9 Red Hat Bugzilla 2006-02-21 14:04:09 EST
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
Comment 10 Lon Hohberger 2007-12-21 10:09:54 EST
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.
