Description of problem:
Can't stop rgmanager on a 4-node DLM cluster.

Version-Release number of selected component (if applicable):
RHEL 4.6
kernel-smp-2.6.9-67.0.4.EL
rgmanager-1.9.72-1

How reproducible:
Almost always.

Steps to Reproduce:
1. Let the cluster run for at least 3 days
2. service rgmanager stop

Actual results:
# service rgmanager stop
Shutting down Cluster Service Manager...
Waiting for services to stop:

and rgmanager never stops.

Expected results:
# service rgmanager stop
Shutting down Cluster Service Manager...
Services are stopped.

Additional info:
rgmanager stops OK after a cluster restart, but it can't be stopped if the cluster has already been running for some days (I don't know the exact amount of time needed to reproduce this issue).
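To tell a hung stop from a merely slow one when scripting the reproduction, a minimal sketch along these lines may help. It assumes a plain POSIX shell; the function name run_with_deadline and the 124 return code are illustrative choices, not part of rgmanager or its init script. On a deadline the command is deliberately left running so that gdb or strace can still be attached to it.

```shell
#!/bin/sh
# Sketch only: run_with_deadline is a hypothetical helper, not an rgmanager tool.

# Run a command in the background and poll it once per second.
# Returns the command's own exit status if it finishes in time,
# or 124 if it is still running after the given number of seconds.
run_with_deadline() {
    deadline_secs=$1
    shift
    "$@" &
    deadline_pid=$!
    deadline_waited=0
    while kill -0 "$deadline_pid" 2>/dev/null; do
        if [ "$deadline_waited" -ge "$deadline_secs" ]; then
            # Leave the process alone so it can be inspected afterwards.
            return 124
        fi
        sleep 1
        deadline_waited=$((deadline_waited + 1))
    done
    wait "$deadline_pid"
}
```

For example, `run_with_deadline 300 service rgmanager stop` would return 124 if the stop has not completed after five minutes; the 300-second figure is an arbitrary assumption.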
Created attachment 299875 [details]
clurgmgrd strace

These are the clurgmgrd processes running on a node:

# ps -elf | grep clurg
5 S root 16331 1 0 79 -1 - 2394 wait Mar28 ? 00:00:00 clurgmgrd -t 30
5 S root 16332 16331 0 75 -1 - 6093 109952 Mar28 ? 00:00:00 clurgmgrd -t 30
4 S root 31140 15051 0 76 0 - 12768 pipe_w 09:26 pts/3 00:00:00 grep clurg

and the attached file is the output of:

# strace -p 16331 -p 16332 -o /tmp/clurgmgrd.strace
Process 16331 attached - interrupt to quit
Process 16332 attached - interrupt to quit

captured during a stuck "service rgmanager stop".
Created attachment 299876 [details]
Cluster config file

This is the test cluster we have configured to diagnose this issue. Note that we have reproduced this issue even before configuring any service on the cluster.
Actually, what I'd like, if you can get it, is:

* install the rgmanager-debuginfo-1.9.72-1 package
* run: gdb /usr/sbin/clurgmgrd <higher-numbered PID>
* in gdb: thr a a bt
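The steps above could be scripted roughly as follows. This is a sketch under assumptions: the helper names (highest_pid, write_gdb_cmds, dump_backtraces) are made up for illustration, pgrep is assumed to be installed, and a gdb command file is used via -batch -x in case the installed gdb lacks newer command-line options. "thread apply all bt" is the long form of "thr a a bt".

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical helpers.

# Print the numerically largest PID read from stdin (one PID per line).
highest_pid() {
    sort -n | tail -n 1
}

# Write the gdb commands to the given file.
write_gdb_cmds() {
    printf '%s\n' 'thread apply all bt' > "$1"
}

# Attach to the higher-numbered clurgmgrd process, print a backtrace
# for every thread, then let gdb exit (which detaches again).
dump_backtraces() {
    bt_pid=$(pgrep -x clurgmgrd | highest_pid)
    if [ -z "$bt_pid" ]; then
        echo "clurgmgrd is not running" >&2
        return 1
    fi
    bt_cmds=$(mktemp) || return 1
    write_gdb_cmds "$bt_cmds"
    gdb -batch -x "$bt_cmds" /usr/sbin/clurgmgrd "$bt_pid"
    bt_rc=$?
    rm -f "$bt_cmds"
    return "$bt_rc"
}
```

Running `dump_backtraces > /tmp/clurgmgrd.bt 2>&1` as root would save the output for attaching to the bug; the output path is only an example.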
Comment on attachment 299876 [details] Cluster config file Fixing mime type of attachment
Also, 'cman_tool status' and 'cman_tool services' would be useful.
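A small sketch for capturing both outputs to files in one go, so they can be attached as-is. The function name collect_cluster_state and the output file names are assumptions made for illustration; only the two cman_tool subcommands come from the request above.

```shell
#!/bin/sh
# Sketch only: collect_cluster_state and the file naming are hypothetical.

# Save the output of 'cman_tool status' and 'cman_tool services'
# into separate files under the given directory.
collect_cluster_state() {
    state_dir=$1
    mkdir -p "$state_dir" || return 1
    for state_sub in status services; do
        cman_tool "$state_sub" > "$state_dir/cman_tool-$state_sub.txt" 2>&1
    done
}
```

For example, `collect_cluster_state /tmp/clu110-state` would leave cman_tool-status.txt and cman_tool-services.txt in that directory.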
Created attachment 300133 [details] 'cman_tool status' on clu110
Created attachment 300134 [details] 'cman_tool services' on clu110
I couldn't find rgmanager-debuginfo-1.9.72-1 on RHN, so I had to compile it from the rgmanager src rpm, but it doesn't seem to work:

# gdb /usr/sbin/clurgmgrd 16332
[...]
warning: the debug information found in "/usr/lib/debug//usr/sbin/clurgmgrd.debug" does not match "/usr/sbin/clurgmgrd" (CRC mismatch).
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Attaching to program: /usr/sbin/clurgmgrd, process 16332
ptrace: Operation not permitted.
/tmp/16332: No such file or directory.
(gdb)
I do not know why it is not available on RHN, but here it is:

x86_64 (I think this is the architecture you're using) - http://people.redhat.com/lhh/rgmanager-debuginfo-1.9.72-1.x86_64.rpm
i386 - http://people.redhat.com/lhh/rgmanager-debuginfo-1.9.72-1.i386.rpm
Still no luck. We have solved only half of the problem: the debug symbols now load, but gdb still can't attach to clurgmgrd:

# gdb /usr/sbin/clurgmgrd 16332
GNU gdb Red Hat Linux (6.3.0.0-1.153.el4_6.2rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/sbin/clurgmgrd, process 16332
ptrace: Operation not permitted.
/root/16332: No such file or directory.
(gdb) where
No stack.
(gdb)
As we have 3 other nodes in the cluster, I tried to attach gdb to clurgmgrd before issuing "service rgmanager stop" (gdb is still attached):

# ps -ef | grep clurg
root 14159 1 0 Mar28 ? 00:00:00 clurgmgrd -t 30
root 14160 14159 0 Mar28 ? 00:00:00 clurgmgrd -t 30
root 18411 12893 0 19:00 pts/3 00:00:00 grep clurg

# gdb /usr/sbin/clurgmgrd 14160
[...]
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) c
Continuing.
[New Thread 1084561728 (LWP 19038)]
[Thread 1084561728 (LWP 19038) exited]
[New Thread 1084561728 (LWP 19094)]
[Thread 1084561728 (LWP 19094) exited]
[New Thread 1084561728 (LWP 19120)]
[Thread 1084561728 (LWP 19120) exited]

Program received signal SIGTERM, Terminated.
[Switching to Thread 182894167264 (LWP 14160)]
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb)
Continuing.

Program received signal SIGTERM, Terminated.
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb)
Continuing.

Program received signal SIG32, Real-time event 32.
[Switching to Thread 1084229984 (LWP 14621)]
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) c
Continuing.
[Thread 1084229984 (LWP 14621) exited]

Program received signal SIGINT, Interrupt.
[Switching to Thread 182894167264 (LWP 14160)]
0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) thr a a bt

Thread 1 (Thread 182894167264 (LWP 14160)):
#0 0x000000324bac0596 in __select_nocancel () from /lib64/tls/libc.so.6
#1 0x0000002a95701f7a in cluster_plugin_version () from /lib64/magma/magma_sm.so
#2 0x0000002a95702400 in cluster_plugin_version () from /lib64/magma/magma_sm.so
#3 0x000000000041b30c in cp_logout ()
#4 0x0000000000419eab in clu_disconnect ()
#5 0x0000000000405dcd in cleanup (cluster_fd=6) at main.c:630
#6 0x0000000000406699 in main (argc=3, argv=0x7fbffffe18) at main.c:916
#7 0x000000324ba1c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#8 0x000000000040377a in _start ()
#9 0x0000007fbffffe08 in ?? ()
#10 0x000000000000001c in ?? ()
#11 0x0000000000000003 in ?? ()
#12 0x0000007fbfffff7c in ?? ()
#13 0x0000007fbfffff86 in ?? ()
#14 0x0000007fbfffff89 in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
Ok, thanks. Could you also tell me your magma and magma-plugins versions?
I've produced this, but the symptoms (and backtrace) are different.
# rpm -qa magma*
magma-plugins-1.0.12-0
magma-1.0.8-1
magma-devel-1.0.8-1
Pushed to RHEL4 git branch
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Hi, I believe I've fixed this - we will have a package for you to test either today or tomorrow. It could be that I found a different problem, however, the symptoms were very similar to what you described. -- Lon
Hi Lon, Any progress on this issue?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0791.html