Bug 194361 - deadlock with 'service rgmanager stop'
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: ccs
Version: 4
Hardware: All Linux
Priority: medium  Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2006-06-07 11:00 EDT by Corey Marthaler
Modified: 2009-04-16 16:20 EDT (History)
Fixed In Version: RHBA-2006-0554
Doc Type: Bug Fix
Last Closed: 2006-08-10 17:16:28 EDT

Attachments
Fixes infinite loop in ccsd which causes rgmanager (and other processes) to hang in read (673 bytes, patch)
2006-06-15 17:02 EDT, Lon Hohberger

Description Corey Marthaler 2006-06-07 11:00:05 EDT
Description of problem:
When I attempted to shut down rgmanager ('service rgmanager stop') on link-01, I
saw a *ton* of these ccsd messages, which appear to have caused the service
stop command to hang.

Jun  6 15:56:26 link-01 rgmanager: [4415]: <notice> Shutting down Cluster
Service Manager...
Jun  6 15:56:26 link-01 ccsd[2667]: Unable to write package back to sender:
Broken pipe
Jun  6 15:56:26 link-01 last message repeated 11 times
Jun  6 15:56:26 link-01 clurgmgrd[3287]: <notice> Shutting down
Jun  6 15:56:26 link-01 ccsd[2667]: Unable to write package back to sender:
Broken pipe
Jun  6 15:56:27 link-01 last message repeated 15591 times
Jun  6 15:56:27 link-01 clurgmgrd[3287]: <notice> Shutdown complete, exiting
Jun  6 15:56:27 link-01 ccsd[2667]: Unable to write package back to sender:
Broken pipe
Jun  6 15:56:57 link-01 last message repeated 1477955 times
Jun  6 15:57:58 link-01 last message repeated 2886568 times
[...]

Everything else seems to be alright though:
[root@link-01 sbin]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-01
   2    1    3   M   link-08
   3    1    3   M   link-02

[root@link-01 sbin]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 1 2]

DLM Lock Space:  "clvmd"                             3   3 run       -
[3 1 2]

DLM Lock Space:  "LINK_1280"                         4   5 run       -
[3 1 2]

DLM Lock Space:  "LINK_1281"                         6   7 run       -
[3 1 2]

DLM Lock Space:  "LINK_1282"                         8   9 run       -
[3 1 2]

GFS Mount Group: "LINK_1280"                         5   6 run       -
[3 1 2]

GFS Mount Group: "LINK_1281"                         7   8 run       -
[3 1 2]

GFS Mount Group: "LINK_1282"                         9  10 run       -
[3 1 2]

# clustat hangs though:
[root@link-01 ~]# strace clustat
execve("/usr/sbin/clustat", ["clustat"], [/* 21 vars */]) = 0
[...]
stat("/lib64/magma/magma_sm.so", {st_mode=S_IFREG|0755, st_size=24976, ...}) = 0
open("/lib64/magma/magma_sm.so", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\27\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=24976, ...}) = 0
mmap(NULL, 1071744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =
0x2a95576000
mprotect(0x2a9557c000, 1047168, PROT_NONE) = 0
mmap(0x2a9567b000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x2a9567b000
close(3)                                = 0
socket(0x1e /* PF_??? */, SOCK_DGRAM, 3) = 3
rt_sigaction(SIGINT, {0x402298, [INT], SA_RESTORER|SA_RESTART, 0x3f3292e2b0},
{SIG_DFL}, 8) = 0
rt_sigaction(SIGTERM, {0x402298, [TERM], SA_RESTORER|SA_RESTART, 0x3f3292e2b0},
{SIG_DFL}, 8) = 0
ioctl(3, 0x7805, 0)                     = 1
ioctl(3, 0x80107803, 0)                 = 3
ioctl(3, 0x80107803, 0x7fbffff830)      = 3
open("/proc/cluster/services", O_RDONLY) = 4
read(4, "Service          Name           "..., 4096) = 714
read(4, "", 4096)                       = 0
close(4)                                = 0
getuid()                                = 0
ioctl(3, 0xffffffff80107803, 0)         = 3
ioctl(3, 0xffffffff80107803, 0x7fbffff820) = 3
socket(PF_FILE, SOCK_STREAM, 0)         = 4
connect(4, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"}, 110) = 0
write(4, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
read(4,  <unfinished ...>
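
The trace above ends with the client blocked in read(2) on the ccsd socket, waiting for a reply that never arrives. As an illustration only, here is a minimal sketch of how a client could bound that wait with poll(2). The helper name `read_with_timeout`, the -2 timeout return code, and the 100 ms timeout are assumptions made for this example; they are not taken from clustat, magma, or the ccs library.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical client-side helper (not from clustat or the ccs library):
 * wait up to timeout_ms for data on fd, then read it.
 * Returns bytes read, 0 on EOF, -1 on error, -2 on timeout.
 * Note: an EINTR restarts the wait with the full timeout. */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(buf != NULL);
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;      /* poll itself failed */
    if (rc == 0)
        return -2;      /* nothing arrived in time */
    return read(fd, buf, len);
}

/* Demo: the peer stays connected but never replies, as in the trace.
 * Returns whatever read_with_timeout() returned, or -1 on setup failure. */
int demo_silent_peer(void)
{
    int sv[2];
    char reply[20];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    n = read_with_timeout(sv[0], reply, sizeof(reply), 100);
    close(sv[0]);
    close(sv[1]);
    return (int)n;
}
```

With a peer that never answers, `demo_silent_peer()` should give up after the timeout instead of blocking indefinitely. This only limits the symptom on the client side; it does not address why the daemon stopped replying.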

Version-Release number of selected component (if applicable):
[root@link-01 sbin]# rpm -q ccs
ccs-1.0.6-0
[root@link-01 sbin]# uname -ar
Linux link-01 2.6.9-39.ELsmp #1 SMP Thu Jun 1 18:01:55 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux

I'll try to reproduce and gather more info.
Comment 1 Corey Marthaler 2006-06-07 15:45:30 EDT
This is reproducible and looks to be an rgmanager issue.

All I had to do was get a cman cluster up, start rgmanager on all nodes in the
cluster, and then stop rgmanager one node at a time.

[root@link-02 ~]# service rgmanager stop
Shutting down Cluster Service Manager...
Services are stopped.
[HANG]

Jun  7 09:44:55 link-02 clurgmgrd[4223]: <info> Magma Event: Membership Change
Jun  7 09:44:55 link-02 clurgmgrd[4223]: <info> State change: link-08 DOWN
Jun  7 09:44:55 link-02 clurgmgrd[4223]: <info> Event (0:1:0) Processed
Jun  7 09:45:09 link-02 rgmanager: [4331]: <notice> Shutting down Cluster
Service Manager...
Jun  7 09:45:09 link-02 clurgmgrd[4223]: <notice> Shutting down
Jun  7 09:45:09 link-02 ccsd[4017]: Unable to write package back to sender:
Broken pipe
Jun  7 09:45:09 link-02 last message repeated 136 times
Jun  7 09:45:09 link-02 clurgmgrd[4223]: <notice> Shutdown complete, exiting
Jun  7 09:45:09 link-02 ccsd[4017]: Unable to write package back to sender:
Broken pipe
Jun  7 09:45:39 link-02 last message repeated 1417523 times
Comment 2 Corey Marthaler 2006-06-07 16:20:55 EDT
Here is what the init script did:

[root@link-01 ~]# pidof clurgmgrd
4084 4083
[root@link-01 ~]# kill -TERM 4084 4083

Jun  7 10:13:00 link-01 clurgmgrd[4084]: <notice> Shutting down
Jun  7 10:13:00 link-01 ccsd[3966]: Unable to write package back to sender:
Broken pipe
Jun  7 10:13:00 link-01 last message repeated 10 times
Jun  7 10:13:00 link-01 clurgmgrd[4084]: <notice> Shutdown complete, exiting
Jun  7 10:13:00 link-01 ccsd[3966]: Unable to write package back to sender:
Broken pipe
Comment 3 Lon Hohberger 2006-06-14 10:10:53 EDT
Ok, there are at least two problems here:

(a) Some sort of hang during shutdown, and 
(b) infinite retry in a send loop on a dead connection in ccsd.


Clustat normally hangs during shutdown, because rgmanager isn't accepting
requests at that time.  I suppose I could make rgmanager send a "Sorry, shutting
down" message, but (b) needs to be solved irrespective of what I do.
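
To illustrate problem (b), here is a minimal sketch of a send loop that cannot spin forever on a dead connection. It is not the ccsd code or the attached patch; `send_all` and `demo_send_to_closed_peer` are hypothetical names used only for this example. The idea is to retry only on EINTR and to hand every other error, EPIPE included, back to the caller so that the connection can be dropped.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not the ccsd code: write all len bytes to fd.
 * Retries only on EINTR. Any other failure, EPIPE included, is handed
 * back to the caller, who should close the connection. */
static int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data was written */
            return -1;      /* dead or broken connection: do not retry */
        }
        if (n == 0) {
            errno = EIO;    /* no progress: bail out rather than spin */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Demo: send to a socket whose peer has already closed.
 * Returns the errno reported by send_all(), 0 if the send unexpectedly
 * succeeded, or -1 if the socket pair could not be created. */
int demo_send_to_closed_peer(void)
{
    int sv[2];
    int err = 0;

    signal(SIGPIPE, SIG_IGN);   /* get EPIPE instead of being killed */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    close(sv[1]);               /* peer goes away */
    if (send_all(sv[0], "reply", 5) != 0)
        err = errno;
    close(sv[0]);
    return err;
}
```

On Linux, `demo_send_to_closed_peer()` should report EPIPE after a single failed write, so the error is logged once rather than millions of times.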
Comment 4 Lon Hohberger 2006-06-14 10:12:25 EDT
Clustat usually recovers after rgmanager exits, FWIW.
Comment 5 Corey Marthaler 2006-06-14 10:39:15 EDT
Yeah, clustat hanging wasn't the problem, only a symptom of the service shutdown
hanging. I assumed that if the service shut down properly, clustat would
also not hang. :) As mentioned in comment #2, it was the kill of the clurgmgrd
processes, which the stop init script does, that was hanging and thus causing
the other problems with clustat and such.
Comment 6 Lon Hohberger 2006-06-15 16:59:47 EDT
I have a fix which solves the hang, but I don't know what caused it in the first
place.

Here's what the log messages look like w/ the patch:

Jun 15 17:05:31 red clurgmgrd[19323]: <notice> Shutting down
Jun 15 17:05:31 red ccsd[16783]: Unable to write package back to sender: Broken pipe
Jun 15 17:05:31 red ccsd[16783]: Error while processing request: Operation not
permitted
Jun 15 17:05:36 red clurgmgrd[19323]: <info> Event (1:1:1) Processed
Comment 7 Lon Hohberger 2006-06-15 17:02:09 EDT
Created attachment 131001 [details]
Fixes infinite loop in ccsd which causes rgmanager (and other processes) to hang in read
Comment 8 Lon Hohberger 2006-06-15 17:13:35 EDT
Irrespective of the cause of the bad socket problem that the above patch works
around, this particular problem is not unique to rgmanager - anything using ccsd
can be affected.

Somewhere in process_connect, the file descriptor 'afd' is getting messed up,
causing write(2) to return -1/EPIPE (even though the other side of it has not
been closed).  The fd is still in /proc/<pid>/fd, and the descriptor is valid
(and in the descriptor table), and the message is valid.

If we include the above patch for U4 (which we should definitely do), this
bugzilla can be removed from the blocker list, but remain open - until we find
the root cause.
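
For anyone chasing the root cause, the checks described above (descriptor still valid, peer apparently still connected) can also be made from inside the process. The following is a rough diagnostic sketch, not part of ccsd: `fd_is_open`, `fd_has_hangup` and `demo_fd_state` are hypothetical names, and the hang-up check assumes Linux AF_UNIX stream socket behaviour, where poll(2) reports POLLHUP once the peer has closed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical diagnostic helpers, not part of ccsd. */

/* Returns 1 if fd is an open descriptor in this process, else 0. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Returns 1 if the kernel reports hang-up or error on fd, 0 if not,
 * -1 if poll() itself failed. */
static int fd_has_hangup(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = 0 };

    assert(fd >= 0);    /* poll() silently ignores negative descriptors */
    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return (pfd.revents & (POLLHUP | POLLERR | POLLNVAL)) != 0;
}

/* Demo on a local socket pair. Returns a bitmask:
 *   1 = fd open at start            (expected set)
 *   2 = hang-up before peer close   (expected clear)
 *   4 = hang-up after peer close    (expected set)
 *   8 = fd still open after close() (expected clear)
 * or -1 if the socket pair could not be created. */
int demo_fd_state(void)
{
    int sv[2];
    int state = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (fd_is_open(sv[0]))
        state |= 1;
    if (fd_has_hangup(sv[0]) == 1)
        state |= 2;
    close(sv[1]);                   /* peer closes */
    if (fd_has_hangup(sv[0]) == 1)
        state |= 4;
    close(sv[0]);
    if (fd_is_open(sv[0]))
        state |= 8;
    return state;
}
```

Logging both values at the point where write(2) fails would show whether the kernel considers the peer gone even though the other process still appears to hold its end of the socket.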
Comment 9 Lon Hohberger 2006-06-16 16:14:43 EDT
Fixes in CVS
Comment 12 Red Hat Bugzilla 2006-08-10 17:16:30 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0554.html
