Bug 194361

Summary: deadlock with 'service rgmanager stop'
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: ccs
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 4
CC: cluster-maint, lhh
Hardware: All   
OS: Linux   
Fixed In Version: RHBA-2006-0554
Doc Type: Bug Fix
Last Closed: 2006-08-10 21:16:28 UTC
Attachments:
Fixes infinite loop in ccsd which causes rgmanager (and other processes) to hang in read (flags: none)

Description Corey Marthaler 2006-06-07 15:00:05 UTC
Description of problem:
When I attempted to shut down rgmanager ('service rgmanager stop') on link-01, I
saw a *ton* of these ccsd messages, which appear to have caused that service
stop command to hang.

Jun  6 15:56:26 link-01 rgmanager: [4415]: <notice> Shutting down Cluster
Service Manager...
Jun  6 15:56:26 link-01 ccsd[2667]: Unable to write package back to sender:
Broken pipe
Jun  6 15:56:26 link-01 last message repeated 11 times
Jun  6 15:56:26 link-01 clurgmgrd[3287]: <notice> Shutting down
Jun  6 15:56:26 link-01 ccsd[2667]: Unable to write package back to sender:
Broken pipe
Jun  6 15:56:27 link-01 last message repeated 15591 times
Jun  6 15:56:27 link-01 clurgmgrd[3287]: <notice> Shutdown complete, exiting
Jun  6 15:56:27 link-01 ccsd[2667]: Unable to write package back to sender:
Broken pipe
Jun  6 15:56:57 link-01 last message repeated 1477955 times
Jun  6 15:57:58 link-01 last message repeated 2886568 times
[...]

Everything else seems to be fine, though:
[root@link-01 sbin]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   M   link-01
   2    1    3   M   link-08
   3    1    3   M   link-02

[root@link-01 sbin]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 1 2]

DLM Lock Space:  "clvmd"                             3   3 run       -
[3 1 2]

DLM Lock Space:  "LINK_1280"                         4   5 run       -
[3 1 2]

DLM Lock Space:  "LINK_1281"                         6   7 run       -
[3 1 2]

DLM Lock Space:  "LINK_1282"                         8   9 run       -
[3 1 2]

GFS Mount Group: "LINK_1280"                         5   6 run       -
[3 1 2]

GFS Mount Group: "LINK_1281"                         7   8 run       -
[3 1 2]

GFS Mount Group: "LINK_1282"                         9  10 run       -
[3 1 2]

# clustat hangs though:
[root@link-01 ~]# strace clustat
execve("/usr/sbin/clustat", ["clustat"], [/* 21 vars */]) = 0
[...]
stat("/lib64/magma/magma_sm.so", {st_mode=S_IFREG|0755, st_size=24976, ...}) = 0
open("/lib64/magma/magma_sm.so", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\27\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=24976, ...}) = 0
mmap(NULL, 1071744, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2a95576000
mprotect(0x2a9557c000, 1047168, PROT_NONE) = 0
mmap(0x2a9567b000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x2a9567b000
close(3)                                = 0
socket(0x1e /* PF_??? */, SOCK_DGRAM, 3) = 3
rt_sigaction(SIGINT, {0x402298, [INT], SA_RESTORER|SA_RESTART, 0x3f3292e2b0}, {SIG_DFL}, 8) = 0
rt_sigaction(SIGTERM, {0x402298, [TERM], SA_RESTORER|SA_RESTART, 0x3f3292e2b0}, {SIG_DFL}, 8) = 0
ioctl(3, 0x7805, 0)                     = 1
ioctl(3, 0x80107803, 0)                 = 3
ioctl(3, 0x80107803, 0x7fbffff830)      = 3
open("/proc/cluster/services", O_RDONLY) = 4
read(4, "Service          Name           "..., 4096) = 714
read(4, "", 4096)                       = 0
close(4)                                = 0
getuid()                                = 0
ioctl(3, 0xffffffff80107803, 0)         = 3
ioctl(3, 0xffffffff80107803, 0x7fbffff820) = 3
socket(PF_FILE, SOCK_STREAM, 0)         = 4
connect(4, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"}, 110) = 0
write(4, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
read(4,  <unfinished ...>
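
(For reference, the hang point above amounts to the following minimal client,
sketched in C from the strace alone. The 20-byte request layout is an
assumption based only on the bytes visible in the trace ("\1\0\0\0" followed
by zeros); the real header structure lives in the ccs sources.)

/* Hypothetical minimal reproduction of the clustat hang, inferred
 * from the strace above -- not actual ccs client code. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    struct sockaddr_un sun;
    char req[20] = { 1 };   /* "\1\0\0\0" + 16 zero bytes, per the trace */
    char reply[4096];
    int fd;

    fd = socket(PF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    strncpy(sun.sun_path, "/var/run/cluster/ccsd.sock", sizeof(sun.sun_path) - 1);

    if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0) {
        perror("connect");
        return 1;
    }

    if (write(fd, req, sizeof(req)) != sizeof(req)) {
        perror("write");
        return 1;
    }

    /* clustat blocks here: ccsd is spinning in its own send loop on a
     * dead connection and never writes a reply back. */
    if (read(fd, reply, sizeof(reply)) < 0)
        perror("read");

    close(fd);
    return 0;
}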

Version-Release number of selected component (if applicable):
[root@link-01 sbin]# rpm -q ccs
ccs-1.0.6-0
[root@link-01 sbin]# uname -ar
Linux link-01 2.6.9-39.ELsmp #1 SMP Thu Jun 1 18:01:55 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

I'll try to reproduce and gather more info.

Comment 1 Corey Marthaler 2006-06-07 19:45:30 UTC
This is reproducible and looks to be an rgmanager issue.

All I had to do was get a cman cluster up, start rgmanager on all nodes in the
cluster, and then stop rgmanager one node at a time.

[root@link-02 ~]# service rgmanager stop
Shutting down Cluster Service Manager...
Services are stopped.
[HANG]

Jun  7 09:44:55 link-02 clurgmgrd[4223]: <info> Magma Event: Membership Change
Jun  7 09:44:55 link-02 clurgmgrd[4223]: <info> State change: link-08 DOWN
Jun  7 09:44:55 link-02 clurgmgrd[4223]: <info> Event (0:1:0) Processed
Jun  7 09:45:09 link-02 rgmanager: [4331]: <notice> Shutting down Cluster
Service Manager...
Jun  7 09:45:09 link-02 clurgmgrd[4223]: <notice> Shutting down
Jun  7 09:45:09 link-02 ccsd[4017]: Unable to write package back to sender:
Broken pipe
Jun  7 09:45:09 link-02 last message repeated 136 times
Jun  7 09:45:09 link-02 clurgmgrd[4223]: <notice> Shutdown complete, exiting
Jun  7 09:45:09 link-02 ccsd[4017]: Unable to write package back to sender:
Broken pipe
Jun  7 09:45:39 link-02 last message repeated 1417523 times


Comment 2 Corey Marthaler 2006-06-07 20:20:55 UTC
Here is what the init script did:

[root@link-01 ~]# pidof clurgmgrd
4084 4083
[root@link-01 ~]# kill -TERM 4084 4083

Jun  7 10:13:00 link-01 clurgmgrd[4084]: <notice> Shutting down
Jun  7 10:13:00 link-01 ccsd[3966]: Unable to write package back to sender:
Broken pipe
Jun  7 10:13:00 link-01 last message repeated 10 times
Jun  7 10:13:00 link-01 clurgmgrd[4084]: <notice> Shutdown complete, exiting
Jun  7 10:13:00 link-01 ccsd[3966]: Unable to write package back to sender:
Broken pipe


Comment 3 Lon Hohberger 2006-06-14 14:10:53 UTC
OK, there are at least two problems here:

(a) Some sort of hang during shutdown, and 
(b) infinite retry in a send loop on a dead connection in ccsd.


Clustat normally hangs during shutdown, because rgmanager isn't accepting
requests at that time.  I suppose I could make rgmanager send a "Sorry, shutting
down" message, but (b) needs to be solved irrespective of what I do.

Comment 4 Lon Hohberger 2006-06-14 14:12:25 UTC
Clustat usually recovers after rgmanager exits, FWIW.

Comment 5 Corey Marthaler 2006-06-14 14:39:15 UTC
Yeah, clustat hanging wasn't the problem, only a symptom of the service shutdown
hanging. I assumed that if the service shut down properly, clustat would
also not hang. :) As mentioned in comment #2, it was the kill of the clurgmgrd
processes, which the stop init script does, that was hanging and thus causing
the other problems with clustat and such.

Comment 6 Lon Hohberger 2006-06-15 20:59:47 UTC
I have a fix which solves the hang, but I don't know what caused it in the first
place.

Here's what the log messages look like with the patch:

Jun 15 17:05:31 red clurgmgrd[19323]: <notice> Shutting down
Jun 15 17:05:31 red ccsd[16783]: Unable to write package back to sender: Broken pipe
Jun 15 17:05:31 red ccsd[16783]: Error while processing request: Operation not
permitted
Jun 15 17:05:36 red clurgmgrd[19323]: <info> Event (1:1:1) Processed


Comment 7 Lon Hohberger 2006-06-15 21:02:09 UTC
Created attachment 131001 [details]
Fixes infinite loop in ccsd which causes rgmanager (and other processes) to hang in read

Comment 8 Lon Hohberger 2006-06-15 21:13:35 UTC
Irrespective of the cause of the bad socket problem that the above patch works
around, this particular problem is not unique to rgmanager - anything using ccsd
can be affected.

Somewhere in process_connect, the file descriptor 'afd' is getting messed up,
causing write(2) to return -1/EPIPE (even though the other side of it has not
been closed).  The fd is still in /proc/<pid>/fd, and the descriptor is valid
(and in the descriptor table), and the message is valid.

If we include the above patch for U4 (which we should definitely do), this
bugzilla can be removed from the blocker list but remain open until we find
the root cause.
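
(The observation above -- that the fd is still open and valid even though
write() keeps failing with EPIPE -- can be checked with a probe like the
following. This is purely illustrative and not from the ccsd sources.)

#include <errno.h>
#include <fcntl.h>

/* Returns nonzero if fd refers to an open descriptor.  fcntl(F_GETFL)
 * fails with EBADF only when fd is not in the descriptor table; a dead
 * peer on the other end of a socket does not make this call fail,
 * which is consistent with write() returning EPIPE on a "valid" fd. */
static int fd_is_valid(int fd)
{
    return fcntl(fd, F_GETFL) != -1 || errno != EBADF;
}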

Comment 9 Lon Hohberger 2006-06-16 20:14:43 UTC
Fixes in CVS

Comment 12 Red Hat Bugzilla 2006-08-10 21:16:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0554.html