Bug 440006 - rgmanager stuck on stop
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Version: 4
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2008-04-01 06:55 EDT by Juanjo Villaplana
Modified: 2009-04-16 16:36 EDT
Fixed In Version: RHBA-2008-0791
Doc Type: Bug Fix
Last Closed: 2008-07-25 15:16:10 EDT

Attachments
clurgmgrd strace (5.29 KB, application/octet-stream)
2008-04-01 07:03 EDT, Juanjo Villaplana
Cluster config file (2.11 KB, text/plain)
2008-04-01 07:09 EDT, Juanjo Villaplana
'cman_tool status' on clu110 (271 bytes, text/plain)
2008-04-02 16:12 EDT, Juanjo Villaplana
'cman_tool services' on clu110 (329 bytes, text/plain)
2008-04-02 16:13 EDT, Juanjo Villaplana

Description Juanjo Villaplana 2008-04-01 06:55:34 EDT
Description of problem:

Can't stop rgmanager on a 4-node DLM cluster.

Version-Release number of selected component (if applicable):

RHEL 4.6
kernel-smp-2.6.9-67.0.4.EL
rgmanager-1.9.72-1

How reproducible:

Almost always.

Steps to Reproduce:
1. Let the cluster run for at least 3 days
2. service rgmanager stop
  
Actual results:

# service rgmanager stop
Shutting down Cluster Service Manager...
Waiting for services to stop:

and rgmanager never stops.

Expected results:

# service rgmanager stop
Shutting down Cluster Service Manager...
Services are stopped. 

Additional info:

rgmanager stops OK after a cluster restart, but it can't be stopped once the
cluster has been running for some days (I don't know the exact amount of
time needed to reproduce this issue).
Comment 1 Juanjo Villaplana 2008-04-01 07:03:23 EDT
Created attachment 299875
clurgmgrd strace

These are the clurgmgrd processes running on a node:

# ps -elf | grep clurg
5 S root     16331     1  0  79  -1 -  2394 wait   Mar28 ?        00:00:00 clurgmgrd -t 30
5 S root     16332 16331  0  75  -1 -  6093 109952 Mar28 ?        00:00:00 clurgmgrd -t 30
4 S root     31140 15051  0  76   0 - 12768 pipe_w 09:26 pts/3    00:00:00 grep clurg

and the attached file is the output for:

# strace -p 16331 -p 16332 -o /tmp/clurgmgrd.strace
Process 16331 attached - interrupt to quit
Process 16332 attached - interrupt to quit

associated with a stuck "service rgmanager stop".
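
The ps listing above shows two clurgmgrd processes, one of which is the child of the other. A minimal sketch of how the child could be picked out automatically, assuming ps output restricted to the PID, PPID and command name columns; the function name clurgmgrd_child is hypothetical and not part of rgmanager:

```shell
#!/bin/sh
# Sketch: find the clurgmgrd process whose parent is also clurgmgrd.
# Input on stdin: one process per line as "PID PPID COMMAND".

clurgmgrd_child() {
    awk '
        # Remember the parent of every clurgmgrd process.
        $3 == "clurgmgrd" { ppid[$1] = $2 }
        END {
            # A child is a clurgmgrd whose parent is itself a clurgmgrd.
            for (p in ppid)
                if (ppid[p] in ppid)
                    print p
        }
    '
}

# Example (assumes a ps that supports -o with empty headers):
#   ps -eo pid=,ppid=,comm= | clurgmgrd_child
```

With the listing from this comment, the input lines "16331 1 clurgmgrd" and "16332 16331 clurgmgrd" would select 16332.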
Comment 2 Juanjo Villaplana 2008-04-01 07:09:30 EDT
Created attachment 299876
Cluster config file

This is the test cluster we have configured to diagnose this issue.

Note that we have reproduced this issue even before configuring any service on
the cluster.
Comment 3 Lon Hohberger 2008-04-02 14:14:38 EDT
Actually, what I'd like if you can get it is:

 * install the rgmanager-debuginfo-1.9.72-1 package
 * run: gdb /usr/sbin/clurgmgrd <higher-numbered PID>
 * in gdb: thr a a bt
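
The steps above can also be run non-interactively. The following is a sketch only, assuming a gdb that accepts the -batch and -x options and that the matching rgmanager-debuginfo package is installed; the helper names higher_pid and collect_backtrace are hypothetical. "thread apply all bt" is the long form of "thr a a bt".

```shell
#!/bin/sh
# Sketch: dump backtraces for all threads of a running clurgmgrd process.

# Pick the numerically highest PID from a list of PIDs on stdin
# (the "higher-numbered PID" heuristic from the steps above).
higher_pid() {
    sort -n | tail -n 1
}

# Attach gdb to the given PID, print every thread's backtrace, then exit.
collect_backtrace() {
    pid=$1
    cmds=$(mktemp) || return 1
    printf 'thread apply all bt\n' > "$cmds"
    gdb -batch -x "$cmds" /usr/sbin/clurgmgrd "$pid"
    status=$?
    rm -f "$cmds"
    return "$status"
}

# Example (hypothetical output path; run as root on the affected node):
#   collect_backtrace "$(pgrep -x clurgmgrd | higher_pid)" > /tmp/clurgmgrd.bt
```

Because PIDs can wrap around, the highest PID is not guaranteed to be the child process; checking the PPID column of ps is the more reliable test.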
Comment 4 Lon Hohberger 2008-04-02 14:15:06 EDT
Comment on attachment 299876
Cluster config file

Fixing mime type of attachment
Comment 5 Lon Hohberger 2008-04-02 14:23:44 EDT
Also, 'cman_tool status' and 'cman_tool services' would be useful.
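
One way to collect both outputs as files suitable for attaching is sketched below. The capture helper and the /tmp output location are assumptions for illustration; only the two cman_tool invocations come from the request above.

```shell
#!/bin/sh
# Sketch: save diagnostic command output to files that can be attached to the bug.

# capture DIR NAME COMMAND [ARGS...]
# Runs COMMAND, writes its stdout and stderr to DIR/NAME.txt,
# and appends the command's exit status as the last line.
capture() {
    dir=$1
    name=$2
    shift 2
    "$@" > "$dir/$name.txt" 2>&1
    status=$?
    printf 'exit status: %s\n' "$status" >> "$dir/$name.txt"
    return "$status"
}

# Example (hypothetical output directory; run on each cluster node):
#   capture /tmp cman_tool_status   cman_tool status
#   capture /tmp cman_tool_services cman_tool services
```

Recording stderr and the exit status alongside the output makes it visible when a command failed or hung rather than printing nothing.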
Comment 6 Juanjo Villaplana 2008-04-02 16:12:02 EDT
Created attachment 300133
'cman_tool status' on clu110
Comment 7 Juanjo Villaplana 2008-04-02 16:13:03 EDT
Created attachment 300134
'cman_tool services' on clu110
Comment 8 Juanjo Villaplana 2008-04-02 17:16:10 EDT
I couldn't find rgmanager-debuginfo-1.9.72-1 on RHN, so I had to build it from
the rgmanager source RPM, but it doesn't seem to work:

# gdb /usr/sbin/clurgmgrd 16332
[...]
warning: the debug information found in
"/usr/lib/debug//usr/sbin/clurgmgrd.debug" does not match "/usr/sbin/clurgmgrd"
(CRC mismatch).

(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/sbin/clurgmgrd, process 16332
ptrace: Operation not permitted.
/tmp/16332: No such file or directory.
(gdb)
Comment 9 Lon Hohberger 2008-04-03 12:40:38 EDT
I do not know why it is not available on RHN, but here it is:

x86_64 (I think this is the architecture you're using) -

http://people.redhat.com/lhh/rgmanager-debuginfo-1.9.72-1.x86_64.rpm

i386 -

http://people.redhat.com/lhh/rgmanager-debuginfo-1.9.72-1.i386.rpm

Comment 10 Juanjo Villaplana 2008-04-03 12:58:04 EDT
Still no luck. We have solved only half of the problem: the debug information
now matches, but gdb still can't attach to clurgmgrd:

# gdb /usr/sbin/clurgmgrd 16332
GNU gdb Red Hat Linux (6.3.0.0-1.153.el4_6.2rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db
library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/sbin/clurgmgrd, process 16332
ptrace: Operation not permitted.
/root/16332: No such file or directory.
(gdb) where 
No stack.
(gdb)
Comment 11 Juanjo Villaplana 2008-04-03 13:13:30 EDT
As we have 3 other nodes in the cluster, I tried to attach gdb to clurgmgrd
before issuing "service rgmanager stop" (gdb is still attached):

# ps -ef | grep clurg
root     14159     1  0 Mar28 ?        00:00:00 clurgmgrd -t 30
root     14160 14159  0 Mar28 ?        00:00:00 clurgmgrd -t 30
root     18411 12893  0 19:00 pts/3    00:00:00 grep clurg
# gdb /usr/sbin/clurgmgrd 14160
[...]
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) c
Continuing.
[New Thread 1084561728 (LWP 19038)]
[Thread 1084561728 (LWP 19038) exited]
[New Thread 1084561728 (LWP 19094)]
[Thread 1084561728 (LWP 19094) exited]
[New Thread 1084561728 (LWP 19120)]
[Thread 1084561728 (LWP 19120) exited]
Program received signal SIGTERM, Terminated.
[Switching to Thread 182894167264 (LWP 14160)]
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb)
Continuing.

Program received signal SIGTERM, Terminated.
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb)
Continuing.

Program received signal SIG32, Real-time event 32.
[Switching to Thread 1084229984 (LWP 14621)]
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) c
Continuing.
[Thread 1084229984 (LWP 14621) exited]

Program received signal SIGINT, Interrupt.
[Switching to Thread 182894167264 (LWP 14160)]
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6

(gdb) thr a a bt

Thread 1 (Thread 182894167264 (LWP 14160)):
#0  0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x0000002a95701f7a in cluster_plugin_version () from /lib64/magma/magma_sm.so
#2  0x0000002a95702400 in cluster_plugin_version () from /lib64/magma/magma_sm.so
#3  0x000000000041b30c in cp_logout ()
#4  0x0000000000419eab in clu_disconnect ()
#5  0x0000000000405dcd in cleanup (cluster_fd=6) at main.c:630
#6  0x0000000000406699 in main (argc=3, argv=0x7fbffffe18) at main.c:916
#7  0x000000324ba1c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#8  0x000000000040377a in _start ()
#9  0x0000007fbffffe08 in ?? ()
#10 0x000000000000001c in ?? ()
#11 0x0000000000000003 in ?? ()
#12 0x0000007fbfffff7c in ?? ()
#13 0x0000007fbfffff86 in ?? ()
#14 0x0000007fbfffff89 in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Comment 12 Lon Hohberger 2008-04-08 11:07:29 EDT
Ok, thanks.  Could you also tell me your:
  magma
  magma-plugins

...versions?
Comment 13 Lon Hohberger 2008-04-08 11:07:47 EDT
I've reproduced this, but the symptoms (and backtrace) are different.
Comment 14 Juanjo Villaplana 2008-04-08 16:46:25 EDT
# rpm -qa magma*
magma-plugins-1.0.12-0
magma-1.0.8-1
magma-devel-1.0.8-1
Comment 15 Lon Hohberger 2008-04-15 11:07:00 EDT
Pushed to RHEL4 git branch
Comment 16 RHEL Product and Program Management 2008-04-15 11:48:43 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 17 Lon Hohberger 2008-04-15 12:15:16 EDT
Hi,

I believe I've fixed this. We will have a package for you to test either today
or tomorrow.  It could be that I found a different problem; however, the
symptoms were very similar to what you described.

-- Lon
Comment 19 Juanjo Villaplana 2008-06-23 03:49:31 EDT
Hi Lon,

Any progress on this issue?
Comment 21 errata-xmlrpc 2008-07-25 15:16:10 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0791.html
