Bug 194361
Summary: deadlock with 'service rgmanager stop'
Product: [Retired] Red Hat Cluster Suite
Component: ccs
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, lhh
Fixed In Version: RHBA-2006-0554
Doc Type: Bug Fix
Last Closed: 2006-08-10 21:16:28 UTC

Description
Corey Marthaler
2006-06-07 15:00:05 UTC
This is reproducible and looks to be an rgmanager issue. All I had to do was get a cman cluster up, start rgmanager on all nodes in the cluster, and then stop rgmanager one node at a time.

[root@link-02 ~]# service rgmanager stop
Shutting down Cluster Service Manager...
Services are stopped.
[HANG]

Jun 7 09:44:55 link-02 clurgmgrd[4223]: <info> Magma Event: Membership Change
Jun 7 09:44:55 link-02 clurgmgrd[4223]: <info> State change: link-08 DOWN
Jun 7 09:44:55 link-02 clurgmgrd[4223]: <info> Event (0:1:0) Processed
Jun 7 09:45:09 link-02 rgmanager: [4331]: <notice> Shutting down Cluster Service Manager...
Jun 7 09:45:09 link-02 clurgmgrd[4223]: <notice> Shutting down
Jun 7 09:45:09 link-02 ccsd[4017]: Unable to write package back to sender: Broken pipe
Jun 7 09:45:09 link-02 last message repeated 136 times
Jun 7 09:45:09 link-02 clurgmgrd[4223]: <notice> Shutdown complete, exiting
Jun 7 09:45:09 link-02 ccsd[4017]: Unable to write package back to sender: Broken pipe
Jun 7 09:45:39 link-02 last message repeated 1417523 times

Here is what the init script did:

[root@link-01 ~]# pidof clurgmgrd
4084 4083
[root@link-01 ~]# kill -TERM 4084 4083

Jun 7 10:13:00 link-01 clurgmgrd[4084]: <notice> Shutting down
Jun 7 10:13:00 link-01 ccsd[3966]: Unable to write package back to sender: Broken pipe
Jun 7 10:13:00 link-01 last message repeated 10 times
Jun 7 10:13:00 link-01 clurgmgrd[4084]: <notice> Shutdown complete, exiting
Jun 7 10:13:00 link-01 ccsd[3966]: Unable to write package back to sender: Broken pipe

Ok, there are at least two problems here: (a) some sort of hang during shutdown, and (b) an infinite retry in a send loop on a dead connection in ccsd.

Clustat normally hangs during shutdown, because rgmanager isn't accepting requests at that time. I suppose I could make rgmanager send a "Sorry, shutting down" message, but (b) needs to be solved irrespective of what I do. Clustat usually recovers after rgmanager exits, FWIW.

Yeah, clustat hanging wasn't the problem, only a symptom of the service shutdown hanging. I assumed that if the service shut down properly, clustat would not hang either. :) As mentioned in comment #2, it was the kill of the clurgmgrd processes, which the stop init script does, that was hanging and thus causing the other problems with clustat and such.

I have a fix which solves the hang, but I don't know what caused it in the first place. Here's what the log messages look like w/ the patch:

Jun 15 17:05:31 red clurgmgrd[19323]: <notice> Shutting down
Jun 15 17:05:31 red ccsd[16783]: Unable to write package back to sender: Broken pipe
Jun 15 17:05:31 red ccsd[16783]: Error while processing request: Operation not permitted
Jun 15 17:05:36 red clurgmgrd[19323]: <info> Event (1:1:1) Processed

Created attachment 131001
Fixes infinite loop in ccsd which causes rgmanager (and other processes) to hang in read
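
For readers without access to the attachment, here is a rough illustration of the general technique of bounding a send loop. This is not the attached patch and not code from ccsd; `write_all` and `ignore_sigpipe` are hypothetical helper names, and treating a zero-byte write as an error is an assumption made for this sketch. The idea is that only EINTR is retried, so a dead peer (EPIPE) ends the loop instead of spinning.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_all() is a hypothetical helper, not a function from ccsd.
 * It tries to write exactly len bytes to fd and retries only on EINTR.
 * Any other error (EPIPE included) ends the loop, so the caller can
 * close the descriptor instead of retrying forever on a dead connection.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: safe to retry */
            return -1;      /* EPIPE, EBADF, ...: give up */
        }
        if (n == 0) {
            /* Assumption for this sketch: no progress counts as an error. */
            errno = EIO;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Ignore SIGPIPE so that writing to a closed peer returns -1/EPIPE
 * rather than terminating the process. Call once at startup.
 */
static void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}
```

A caller would presumably log the error once, close the descriptor, and drop the request when `write_all` returns -1, which is at least consistent with the single "Broken pipe" line in the patched log above.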

Irrespective of the cause of the bad socket problem that the above patch works around, this particular problem is not unique to rgmanager; anything using ccsd can be affected. Somewhere in process_connect, the file descriptor 'afd' is getting messed up, causing write(2) to return -1/EPIPE (even though the other side of it has not been closed). The fd is still in /proc/<pid>/fd, the descriptor is valid (and in the descriptor table), and the message is valid.

If we include the above patch for U4 (which we should definitely do), this bugzilla can be removed from the blocker list but should remain open until we find the root cause.

Fixes in CVS

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0554.html