Bug 614104 - Starting or stopping corosync blocks cman from starting or stopping - corosync part
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: Cluster QE
URL:
Whiteboard:
Keywords:
Depends On: 613870
Blocks: 617234
Reported: 2010-07-13 17:04 UTC by Steven Dake
Modified: 2016-04-26 13:39 UTC
Clone Of: 613870
Last Closed: 2011-05-19 14:24:03 UTC


Attachments
Proposed patch for first part of problem (2.62 KB, patch)
2010-07-22 15:45 UTC, Jan Friesse
Proposed patch for first part - take 2 (5.42 KB, patch)
2010-07-28 14:10 UTC, Jan Friesse
Proposed patch for second problem (978 bytes, patch)
2010-07-28 14:12 UTC, Jan Friesse


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2011:0764
Priority: normal
Status: SHIPPED_LIVE
Summary: corosync bug fix update
Last Updated: 2011-05-18 18:08:44 UTC

Description Steven Dake 2010-07-13 17:04:00 UTC
+++ This bug was initially created as a clone of Bug #613870 +++

Description of problem:

Two variants on the same issue:

First:

If you start corosync manually and then try to start cman, it fails with the error "Starting cman... corosync died: Error, reason code is 1 [FAILED]". If you then stop corosync and again try to start cman, it starts properly.

Second:

Perhaps most seriously: if cman is already started and you try to restart or stop corosync, corosync will sit there endlessly at "Waiting for corosync services to unload:...". Hitting Ctrl+C *appears* to abort the corosync restart. However, any time thereafter, trying to stop or restart cman fails with "Stopping cman... Timed-out waiting for cluster [FAILED]".

Running 'ps aux | grep corosync' shows "root 4262 0.4 1.9 440156 34728 ? SLsl 22:57 0:01 corosync -f". This process can only be killed with '-9'. Once dead though, cman will restart successfully.


Version-Release number of selected component (if applicable):

- cman-3.0.12-2.fc13.x86_64
- corosync-1.2.3-1.fc13.x86_64

How reproducible:

Appears to be 100%.

Steps to Reproduce:
1. Start corosync, then start cman
2. Start cman, stop|restart corosync, stop|restart cman
Actual results:

- cman won't stop/start when corosync is running or restarted.
- corosync won't stop/restart when cman is running and then blocks cman from starting/stopping.

Expected results:

- cman should detect when corosync is already running and provide more useful feedback, if not stop corosync itself.
- corosync should detect when cman is available and refuse to start, with an error telling the user to use cman instead.

Additional info:

I've got a disposable test cluster. I can run any tests the developers would like me to try.

--- Additional comment from sdake@redhat.com on 2010-07-13 13:03:31 EDT ---

Thanks for the bug report

The common Unix solution (missing from current corosync) is to have corosync create a file in LOCALSTATEDIR/lock/corosync and then take an exclusive lock on it with flock(2), i.e.:
fd = open (LOCALSTATEDIR "/lock/corosync", O_WRONLY|O_CREAT, 0640);
retry_flock:
res = flock (fd, LOCK_EX|LOCK_NB);
if (res == -1) {
  switch (errno) {
     case EINTR:
           /* interrupted by a signal, try again */
           goto retry_flock;
     case EWOULDBLOCK:
           /* print error that corosync is already active and exit */
           break;
     default:
           /* print error that the flock couldn't be obtained and exit */
           break;
  }
}

The lock is released automatically when the process exits, because the kernel closes the descriptor, allowing a new start of corosync to grab the lock.
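
For illustration only, the locking scheme described above could be wrapped in a small helper along these lines. This is a minimal sketch, not the code from the attached patches: the function name acquire_instance_lock and the file mode are assumptions, and the caller chooses the lock path.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from corosync): take an exclusive,
 * non-blocking flock(2) on lock_path.
 * Returns the open descriptor on success. The caller must keep it open
 * for the lifetime of the process, because closing it drops the lock.
 * Returns -1 on failure with errno set; errno == EWOULDBLOCK means
 * another open file description already holds the lock.
 */
int acquire_instance_lock(const char *lock_path)
{
    assert(lock_path != NULL);

    int fd = open(lock_path, O_WRONLY | O_CREAT, 0640);
    if (fd == -1) {
        return -1;
    }

    while (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        if (errno == EINTR) {
            continue;           /* interrupted by a signal, retry */
        }
        int saved_errno = errno;
        close(fd);              /* do not leak the descriptor */
        errno = saved_errno;    /* report the flock() error, not close()'s */
        return -1;
    }
    return fd;
}
```

A daemon would call this once at startup and exit with an "already running" message when it returns -1 with errno set to EWOULDBLOCK.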

Comment 3 Jan Friesse 2010-07-22 14:50:42 UTC
This bug covers the corosync part.

Comment 4 Jan Friesse 2010-07-22 15:45:41 UTC
Created attachment 433738 [details]
Proposed patch for first part of problem

Uses the solution described by Steve.

Comment 5 Jan Friesse 2010-07-28 14:10:34 UTC
Created attachment 435023 [details]
Proposed patch for first part - take 2

A better version of the patch. It also changes the initscript to NOT create the pid file (corosync itself now does).
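
As a rough illustration of a daemon writing its own pid file instead of leaving that to the initscript, something like the following would do. This is a sketch under assumptions, not the contents of the attachment: write_pid_file is a hypothetical name, and the path and file mode are chosen by the caller and this example respectively.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from corosync): write the calling
 * process's PID, followed by a newline, to pid_path. Any previous
 * contents are truncated. Returns 0 on success, -1 on failure.
 */
int write_pid_file(const char *pid_path)
{
    assert(pid_path != NULL);

    char buf[32];
    int fd = open(pid_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        return -1;
    }

    int len = snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
    ssize_t written = write(fd, buf, (size_t)len);

    /* close() is evaluated first, so the descriptor is always released */
    if (close(fd) == -1 || written != (ssize_t)len) {
        return -1;
    }
    return 0;
}
```

Having the daemon write the file itself means the recorded PID is that of the running process rather than whatever the initscript observed at launch.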

Comment 6 Jan Friesse 2010-07-28 14:12:08 UTC
Created attachment 435026 [details]
Proposed patch for second problem

This patch fixes the second problem in the initscript. If corosync was run by cman, the initscript refuses to exit.

Comment 11 errata-xmlrpc 2011-05-19 14:24:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0764.html

