Description of problem:
We wanted to stop a 3-node cluster. During shutdown we received these messages:

Jan 17 11:33:47 node3 ccsd[2532]: Unable to write package back to sender: Broken pipe
Jan 17 11:33:47 node3 ccsd[2532]: Error while processing request: Operation not permitted

Version-Release number of selected component (if applicable):
ccs-1.0.7-0.i686.rpm

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
as in the log messages above

Expected results:
An explanation of why this happens, whether there is a way to avoid it, and whether it is a problem for the cluster.

Additional info:

Comment #8

I've found Bug 194361 ("deadlock with 'service rgmanager stop'"), which has a similar log message. I'm not reopening that bug because it doesn't look like I have a deadlock.

/log/message:
Jan 17 11:33:44 node1 rgmanager: [24802]: <notice> Shutting down Cluster Service Manager...
Jan 17 11:33:44 node1 clurgmgrd[5221]: <notice> Shutting down
Jan 17 11:33:44 node1 clurgmgrd[5221]: <notice> Shutdown complete, exiting
Jan 17 11:33:44 node1 rgmanager: [24802]: <notice> Cluster Service Manager is stopped.
Jan 17 11:33:45 node2 rgmanager: [1027]: <notice> Shutting down Cluster Service Manager...
Jan 17 11:33:45 node2 clurgmgrd[4986]: <notice> Shutting down
Jan 17 11:33:45 node2 clurgmgrd[4986]: <notice> Shutdown complete, exiting
Jan 17 11:33:45 node2 rgmanager: [1027]: <notice> Cluster Service Manager is stopped.
Jan 17 11:33:47 node3 rgmanager: [1847]: <notice> Shutting down Cluster Service Manager...
Jan 17 11:33:47 node3 clurgmgrd[4823]: <notice> Shutting down
Jan 17 11:33:47 node3 clurgmgrd[4823]: <notice> Shutdown complete, exiting
Jan 17 11:33:47 node3 ccsd[2532]: Unable to write package back to sender: Broken pipe
Jan 17 11:33:47 node3 ccsd[2532]: Error while processing request: Operation not permitted
Jan 17 11:33:47 node3 rgmanager: [1847]: <notice> Cluster Service Manager is stopped.
Jan 17 11:33:55 node1 gfs: Unmounting GFS filesystems: succeeded
Jan 17 11:34:07 node1 fenced: shutdown succeeded
Jan 17 11:34:08 node2 fenced: shutdown succeeded
Jan 17 11:34:15 node2 kernel: CMAN: removing node node1 from the cluster : Removed
Jan 17 11:34:15 node1 ccsd[2833]: Cluster manager shutdown. Attemping to reconnect...
Jan 17 11:34:17 node3 kernel: CMAN: removing node node2 from the cluster : Removed
Jan 17 11:34:17 node2 ccsd[2553]: Cluster manager shutdown. Attemping to reconnect...
Jan 17 11:34:18 node3 ccsd[2532]: Cluster manager shutdown. Attemping to reconnect...
Jan 17 11:34:19 node1 cman: shutdown succeeded
Jan 17 11:34:20 node2 cman: shutdown succeeded
Jan 17 11:34:21 node3 cman: shutdown succeeded
Jan 17 11:34:28 node1 ccsd[2833]: Stopping ccsd, SIGTERM received.
Jan 17 11:34:29 node1 ccsd: shutdown succeeded
Jan 17 11:34:29 node2 ccsd[2553]: Stopping ccsd, SIGTERM received.
Jan 17 11:34:30 node2 ccsd: shutdown succeeded
Jan 17 11:34:31 node3 ccsd[2532]: Stopping ccsd, SIGTERM received.
Jan 17 11:34:32 node3 ccsd: shutdown succeeded
Hi, any ideas how to avoid this problem? (We would like to put this system into production, so an answer to this bug is becoming critical...) Thanks
Lon, any ideas on this?
We'd need more information about where the socket originated, but the error looks benign. Are there any other symptoms besides the log message? I suppose there's a chance that we're exiting while a thread still has a CCS descriptor open. In that case, it's almost certainly not a problem - though it would be nice to clean up the message at some point.
I've found another appearance of this message (another cluster, just 2 nodes):

NODE1:
Jan 17 17:10:38 node1 clurgmgrd[5779]: <notice> Stopping service sv-o
Jan 17 17:10:38 node1 clurgmgrd: [5779]: <info> Executing /opt/o.init stop
Jan 17 17:10:38 node1 clurgmgrd[5779]: <notice> stop on script "o.init" returned 19 (unspecified)
Jan 17 17:10:38 node1 clurgmgrd: [5779]: <debug> 10.4.1.10 is not configured
Jan 17 17:10:38 node1 clurgmgrd: [5779]: <info> /dev/sda1 is not mounted
Jan 17 17:10:43 node1 clurgmgrd: [5779]: <info> /dev/sdb1 is not mounted
Jan 17 17:10:48 node1 clurgmgrd: [5779]: <info> /dev/sdc1 is not mounted
Jan 17 17:10:53 node1 clurgmgrd: [5779]: <info> /dev/sdd1 is not mounted
Jan 17 17:10:58 node1 clurgmgrd[5779]: <crit> #12: RG sv-o failed to stop; intervention required
Jan 17 17:10:58 node1 clurgmgrd[5779]: <notice> Service sv-o is failed
Jan 17 17:11:33 node1 rgmanager: [24241]: <notice> Shutting down Cluster Service Manager...
Jan 17 17:11:33 node1 clurgmgrd[5779]: <notice> Shutting down
Jan 17 17:11:35 node1 clurgmgrd[5779]: <notice> Shutdown complete, exiting
Jan 17 17:11:36 node1 rgmanager: [24241]: <notice> Cluster Service Manager is stopped.
Jan 17 17:11:49 node1 fenced: shutdown succeeded
Jan 17 17:11:54 node1 kernel: CMAN: we are leaving the cluster. Removed
Jan 17 17:11:54 node1 kernel: WARNING: dlm_emergency_shutdown
Jan 17 17:11:54 node1 kernel: WARNING: dlm_emergency_shutdown
Jan 17 17:11:54 node1 ccsd[3037]: Cluster manager shutdown. Attemping to reconnect...
Jan 17 17:11:57 node1 kernel: NET: Unregistered protocol family 30
Jan 17 17:11:57 node1 cman: shutdown succeeded
Jan 17 17:11:59 node1 ccsd[3037]: Stopping ccsd, SIGTERM received.
Jan 17 17:12:00 node1 ccsd: shutdown succeeded

NODE2:
Jan 17 17:11:33 node2 clurgmgrd[5532]: <info> Magma Event: Membership Change
Jan 17 17:11:33 node2 clurgmgrd[5532]: <info> State change: node1 DOWN
Jan 17 17:11:35 node2 rgmanager: [29875]: <notice> Shutting down Cluster Service Manager...
Jan 17 17:11:35 node2 clurgmgrd[5532]: <notice> Shutting down
Jan 17 17:11:35 node2 ccsd[3052]: Unable to write package back to sender: Broken pipe
Jan 17 17:11:35 node2 ccsd[3052]: Error while processing request: Operation not permitted
Jan 17 17:11:36 node2 clurgmgrd[5532]: <err> #34: Cannot get status for service sv-o
Jan 17 17:11:36 node2 clurgmgrd[5532]: <notice> Shutdown complete, exiting
Jan 17 17:11:36 node2 rgmanager: [29875]: <notice> Cluster Service Manager is stopped.
Jan 17 17:11:50 node2 fenced: shutdown succeeded
Jan 17 17:11:54 node2 kernel: CMAN: removing node node1 from the cluster : Removed
Jan 17 17:11:55 node2 kernel: CMAN: we are leaving the cluster. Removed
Jan 17 17:11:55 node2 ccsd[3052]: Cluster manager shutdown. Attemping to reconnect...
Jan 17 17:11:55 node2 kernel: WARNING: dlm_emergency_shutdown
Jan 17 17:11:55 node2 kernel: WARNING: dlm_emergency_shutdown
Jan 17 17:11:58 node2 kernel: NET: Unregistered protocol family 30
Jan 17 17:11:58 node2 cman: shutdown succeeded
Jan 17 17:12:00 node2 ccsd[3052]: Stopping ccsd, SIGTERM received.
Jan 17 17:12:01 node2 ccsd: shutdown succeeded

It happened while stopping both nodes for a package upgrade, by invoking:

  service rgmanager stop
  service fenced stop
  service cman stop
  service ccsd stop

The error 19 while stopping the sv-o service is caused by a bad init script (when stopping a service that was not running, it did not return 0). The cluster mentioned in comment #0 had no services configured; GFS was mounted only on node1.

--- The system stopped, we were able to upgrade what we needed, we rebooted those nodes a few times, and could start the cluster services again.
I've found that message on other clusters as well. It looks like the message appears when all nodes are rebooted one after another within a short time period, or when all cluster services (rgmanager, cman, ccsd) are stopped on all nodes at almost the same time.
What appears to be happening is that ccsd receives a request from rgmanager, and rgmanager exits before ccsd can send the response. This results in an EPIPE (broken pipe) error, since the other end of the socket has gone away. This isn't really an error, but the ccs code does not currently handle EPIPE gracefully; it should be changed to ignore it rather than log it.