Bug 614104

Summary: Starting or stopping corosync blocks cman from starting or stopping - corosync part
Product: Red Hat Enterprise Linux 6 Reporter: Steven Dake <sdake>
Component: corosync Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: low    
Version: 6.0 CC: agk, cluster-maint, djansa, edamato, fdinitto, jkortus, mkelly, nstraz, sdake
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: corosync-1.2.3-23.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 613870
: 617234 (view as bug list) Environment:
Last Closed: 2011-05-19 14:24:03 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 613870    
Bug Blocks: 617234    
Attachments:
Description Flags
Proposed patch for first part of problem none
Proposed patch for first part - take 2 none
Proposed patch for second problem none

Description Steven Dake 2010-07-13 17:04:00 UTC
+++ This bug was initially created as a clone of Bug #613870 +++

Description of problem:

Two variants on the same issue:

First:

If you start corosync manually and then try to start cman, it fails with the error "Starting cman... corosync died: Error, reason code is 1 [FAILED]". If you then stop corosync and again try to start cman, it starts properly.

Second:

Perhaps most seriously: if cman is already started and you try to restart or stop corosync, corosync will sit there endlessly "Waiting for corosync services to unload:...". Hitting ctrl+c to stop it *appears* to abort the corosync restart. However, any time thereafter, trying to stop or restart cman will fail with "Stopping cman... Timed-out waiting for cluster [FAILED]".

Running 'ps aux | grep corosync' shows "root 4262 0.4 1.9 440156 34728 ? SLsl 22:57 0:01 corosync -f". This process can only be killed with '-9'. Once dead though, cman will restart successfully.


Version-Release number of selected component (if applicable):

- cman-3.0.12-2.fc13.x86_64
- corosync-1.2.3-1.fc13.x86_64

How reproducible:

Appears to be 100%.

Steps to Reproduce:
1. Start corosync, then start cman
2. Start cman, stop|restart corosync, stop|restart cman
3.
  
Actual results:

- cman won't stop/start when corosync is running or restarted.
- corosync won't stop/restart when cman is running and then blocks cman from starting/stopping.

Expected results:

- cman should detect when corosync is already running and provide more useful feedback, if not stop corosync itself.
- corosync should detect when cman is available and refuse to start, with an error telling the user to use cman instead.

Additional info:

I've got a disposable test cluster. I can run any tests the developers would like me to try.

--- Additional comment from sdake on 2010-07-13 13:03:31 EDT ---

Thanks for the bug report.

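The common Unix solution (missing from current corosync) is to have corosync create the file LOCALSTATEDIR/lock/corosync and then take a lock on it with the flock(2) call, i.e.:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>

/* Fragment for corosync's startup path; LOCALSTATEDIR comes from the build.
   The open flags and mode are illustrative. */
fd = open (LOCALSTATEDIR "/lock/corosync", O_WRONLY|O_CREAT, 0640);
retry_flock:
res = flock (fd, LOCK_EX|LOCK_NB);
if (res == -1) {
  switch (errno) {
     case EINTR:
           /* interrupted by a signal: try again */
           goto retry_flock;
     case EWOULDBLOCK:
           /* another corosync already holds the lock */
           fprintf (stderr, "corosync is already active\n");
           exit (EXIT_FAILURE);
     default:
           fprintf (stderr, "flock couldn't be obtained\n");
           exit (EXIT_FAILURE);
  }
}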

The lock is released automatically when the process exits, allowing a new start of corosync to grab the lock.
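
For illustration, the same idea can be wrapped in a small standalone function. This is only a sketch, not the code from the attached patches: the name acquire_instance_lock, the open flags and the 0640 mode are assumptions made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper (not corosync's actual code): take an exclusive,
 * non-blocking flock(2) on lock_path.
 *
 * Returns the open descriptor on success; the caller must keep it open for
 * as long as the lock should be held. Returns -1 on failure with errno set;
 * errno == EWOULDBLOCK means another holder already has the lock.
 */
int acquire_instance_lock(const char *lock_path)
{
    /* Flags and mode are illustrative assumptions. */
    int fd = open(lock_path, O_RDWR | O_CREAT, 0640);
    if (fd == -1) {
        return -1;
    }

    for (;;) {
        if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
            return fd;          /* held until fd is closed or the process exits */
        }
        if (errno == EINTR) {
            continue;           /* interrupted by a signal: retry */
        }
        /* Preserve the flock() error across close(). */
        int saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }
}
```

A daemon would call this once at startup, print "already active" and exit when it returns -1 with errno set to EWOULDBLOCK, and otherwise simply never close the returned descriptor.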

Comment 3 Jan Friesse 2010-07-22 14:50:42 UTC
This bug will track the corosync part.

Comment 4 Jan Friesse 2010-07-22 15:45:41 UTC
Created attachment 433738 [details]
Proposed patch for first part of problem

Uses the solution described by Steve.

Comment 5 Jan Friesse 2010-07-28 14:10:34 UTC
Created attachment 435023 [details]
Proposed patch for first part - take 2

Better version of the patch. It also changes the initscript to NOT create the pid file (corosync itself now does).
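
As a rough sketch of what "the daemon writes its own pid file" can look like (this is not taken from the patch; the function name write_pid_file and the 0644 mode are assumptions for the example):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the calling process's pid, followed by a
 * newline, to pidfile_path, replacing any previous contents.
 * Returns 0 on success, -1 on error.
 */
int write_pid_file(const char *pidfile_path)
{
    char buf[32];

    /* Mode 0644 is an illustrative assumption. */
    int fd = open(pidfile_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        return -1;
    }

    int len = snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
    if (len < 0 || (size_t)len >= sizeof(buf)) {
        close(fd);
        return -1;
    }

    ssize_t written = write(fd, buf, (size_t)len);
    int close_res = close(fd);

    /* Treat a short write or a failed close as an error. */
    if (written != (ssize_t)len || close_res == -1) {
        return -1;
    }
    return 0;
}
```

Writing the file from the daemon itself means the recorded pid always belongs to the process that actually holds the lock, rather than to whatever the initscript believed it had started.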

Comment 6 Jan Friesse 2010-07-28 14:12:08 UTC
Created attachment 435026 [details]
Proposed patch for second problem

This patch fixes the second problem in the initscript. If corosync was run by cman, the initscript refuses to exit.

Comment 11 errata-xmlrpc 2011-05-19 14:24:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html