Bug 614104 - Starting or stopping corosync blocks cman from starting or stopping - corosync part
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Jan Friesse
QA Contact: Cluster QE
Depends On: 613870
Blocks: 617234
Reported: 2010-07-13 13:04 EDT by Steven Dake
Modified: 2016-04-26 09:39 EDT

Fixed In Version: corosync-1.2.3-23.el6
Doc Type: Bug Fix
Clone Of: 613870
Clones: 617234
Last Closed: 2011-05-19 10:24:03 EDT


Attachments
Proposed patch for first part of problem (2.62 KB, patch)
2010-07-22 11:45 EDT, Jan Friesse
Proposed patch for first part - take 2 (5.42 KB, patch)
2010-07-28 10:10 EDT, Jan Friesse
Proposed patch for second problem (978 bytes, patch)
2010-07-28 10:12 EDT, Jan Friesse


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2011:0764
Priority: normal
Status: SHIPPED_LIVE
Summary: corosync bug fix update
Last Updated: 2011-05-18 14:08:44 EDT

Description Steven Dake 2010-07-13 13:04:00 EDT
+++ This bug was initially created as a clone of Bug #613870 +++

Description of problem:

Two variants on the same issue:

First:

If you start corosync manually and then try to start cman, it fails with the error "Starting cman... corosync died: Error, reason code is 1 [FAILED]". If you then stop corosync and again try to start cman, it starts properly.

Second:

Perhaps most seriously: if cman is already started and you try to restart or stop corosync, corosync will sit there endlessly at "Waiting for corosync services to unload:...". Hitting ctrl+c to stop it *appears* to abort the corosync restart. However, any time thereafter, trying to stop or restart cman will fail with "Stopping cman... Timed-out waiting for cluster [FAILED]".

Running 'ps aux | grep corosync' shows "root 4262 0.4 1.9 440156 34728 ? SLsl 22:57 0:01 corosync -f". This process can only be killed with '-9'. Once dead though, cman will restart successfully.


Version-Release number of selected component (if applicable):

- cman-3.0.12-2.fc13.x86_64
- corosync-1.2.3-1.fc13.x86_64

How reproducible:

Appears to be 100%.

Steps to Reproduce:
1. Start corosync, then start cman
2. Start cman, stop|restart corosync, stop|restart cman
Actual results:

- cman won't stop/start when corosync is running or restarted.
- corosync won't stop/restart when cman is running and then blocks cman from starting/stopping.

Expected results:

- cman should detect when corosync is already running and provide more useful feedback, if not stop corosync itself.
- corosync should detect when cman is available and refuse to start, with an error telling the user to use cman instead.

Additional info:

I've got a disposable test cluster. I can run any tests the developers would like me to try.

--- Additional comment from sdake@redhat.com on 2010-07-13 13:03:31 EDT ---

Thanks for the bug report.

The common Unix solution (missing from current corosync) is to have corosync create a file in LOCALSTATEDIR/lock/corosync and then take a non-blocking exclusive lock on it with the flock(2) call, i.e.:
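#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>

/* Create the lock file if needed; keep fd open for the daemon's lifetime. */
fd = open (LOCALSTATEDIR "/lock/corosync", O_WRONLY | O_CREAT, 0640);
if (fd == -1) {
  perror ("open");
  exit (EXIT_FAILURE);
}
retry_flock:
res = flock (fd, LOCK_EX | LOCK_NB);
if (res == -1) {
  switch (errno) {
     case EINTR:
           /* interrupted by a signal: try again */
           goto retry_flock;
     case EWOULDBLOCK:
           /* another process already holds the lock */
           fprintf (stderr, "corosync is already active\n");
           exit (EXIT_FAILURE);
     default:
           perror ("flock");
           exit (EXIT_FAILURE);
  }
}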

The lock is released automatically when the process exits, allowing a new start of corosync to grab the lock.
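
The fragment above can be wrapped into a small helper for experimentation. This is only a sketch, not the patch attached to this bug: the function name acquire_instance_lock and the idea of returning the descriptor to the caller are assumptions made for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the corosync sources.
 * Returns an open descriptor holding an exclusive lock on `path`.
 * Returns -1 with errno == EWOULDBLOCK if the lock is already held
 * through another open of the file, or -1 with another errno on failure.
 */
int acquire_instance_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0640);
    if (fd == -1) {
        return -1;
    }

    /* Retry only when a signal interrupted the call. */
    while (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        if (errno != EINTR) {
            int saved = errno;  /* preserve the flock() error across close() */
            close(fd);
            errno = saved;
            return -1;
        }
    }

    /* The caller must keep fd open for as long as it wants the lock. */
    return fd;
}
```

A daemon would call this once at startup and exit with an "already active" message when it returns -1 with errno set to EWOULDBLOCK. Closing the descriptor, or exiting, drops the lock.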
Comment 3 Jan Friesse 2010-07-22 10:50:42 EDT
This bug will track the corosync part.
Comment 4 Jan Friesse 2010-07-22 11:45:41 EDT
Created attachment 433738
Proposed patch for first part of problem

Uses the solution described by Steve.
Comment 5 Jan Friesse 2010-07-28 10:10:34 EDT
Created attachment 435023
Proposed patch for first part - take 2

Better version of the patch. It also changes the initscript so that it does NOT create the pid file (corosync itself now does).
Comment 6 Jan Friesse 2010-07-28 10:12:08 EDT
Created attachment 435026
Proposed patch for second problem

This patch fixes the second problem in the initscript. If corosync was run by cman, the initscript refuses to stop it.
Comment 11 errata-xmlrpc 2011-05-19 10:24:03 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html
