From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/125.2 (KHTML, like Gecko) Safari/125.8

Description of problem:
I opened support request 333770 and was asked to submit this bug report. On rebooting a cluster member, clumembd started twice and is now consuming all the CPU. The application service running on the member is still up and running; since this is a production system, I was asked not to fail it over or reboot until the next maintenance window. clustat output on this member shows the two members as UNKNOWN, but clustat on the other cluster member is working fine, recognizing both members.

Support made these comments, between the #### lines:

###################
Also I got this data from our cluster development team:

Looks like clumembd and clusvcmgrd ran amok somewhere. Please install the matching clumanager-debuginfo and file a bugzilla with:
- strace -p 2233
- strace -p 705
- gdb /usr/sbin/clumembd 705
  - bt
  - quit
- gdb /usr/sbin/clusvcmgrd 2233
  - bt
  - quit

In any case, I figure I can have a fix for it really quickly if it's a tight loop (which it looks like -- just look at the run times: 297 minutes). I wasn't aware of any place this could happen.
####################

strace -p 2233 outputs line after line of:
select(1024, [10], NULL, NULL, {0, 0}) = 0 (Timeout)

strace -p 705 outputs line after line of:
wait4(2233, 0xbfffa8a8, WNOHANG, NULL) = 0

gdb 2233 with bt and quit:
#0  0xb747b337 in ___newselect_nocancel () from /lib/tls/libc.so.6
#1  0x080537c3 in pulsar ()
#2  0x08051006 in pulsar ()
#3  0x080524f9 in pulsar ()
#4  0x0804dbdc in pulsar ()
#5  0xb73bb768 in __libc_start_main () from /lib/tls/libc.so.6
#6  0x0804a459 in ?? ()

gdb 705 with bt and quit:
#0  0xb744efb9 in wait4 () from /lib/tls/libc.so.6
#1  0x0805263a in pulsar ()
#2  0x0804db1b in pulsar ()
#3  0xb73bb768 in __libc_start_main () from /lib/tls/libc.so.6
#4  0x0804a459 in ?? ()

Version-Release number of selected component (if applicable):
clumanager-1.2.9-1

How reproducible:
Didn't try

Steps to Reproduce:
1. Since it is a production machine, I have not tried to reproduce it.

Additional info:
The backtrace looks like it was taken without debugging symbols.
Created attachment 101291 [details]
Fixes infinite loop.

There was a bug in the VF code that caused a tight loop in the event of a timeout instead of normal recovery. This patch prevents it and should restore normal operation. The patch is against 1.2.16, but it will apply against 1.2.12, 1.2.9, and 1.2.3 as well.
Adding cperry to CC list.
Adding vanhoof to CC list.
*** This bug has been marked as a duplicate of 125741 ***
1.2.18pre1 patch (unsupported; test only, etc.):
http://people.redhat.com/lhh/clumanager-1.2.16-1.2.18pre1.patch

This includes the fix for this bug and a few others.
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.