Description of problem:
Two variants on the same issue:

First: If you start corosync manually and then try to start cman, it fails with the error "Starting cman... corosync died: Error, reason code is 1 [FAILED]". If you then stop corosync and again try to start cman, it starts properly.

Second, and perhaps most serious: If cman is already started and you try to restart or stop corosync, corosync sits there endlessly at "Waiting for corosync services to unload:...". Hitting ctrl+c *appears* to abort the corosync restart. However, any time thereafter, trying to stop or restart cman fails with "Stopping cman... Timed-out waiting for cluster [FAILED]". Running 'ps aux | grep corosync' shows "root 4262 0.4 1.9 440156 34728 ? SLsl 22:57 0:01 corosync -f". This process can only be killed with '-9'. Once it is dead, though, cman will restart successfully.

Version-Release number of selected component (if applicable):
- cman-3.0.12-2.fc13.x86_64
- corosync-1.2.3-1.fc13.x86_64

How reproducible:
Appears to be 100%.

Steps to Reproduce:
1. Start corosync, then start cman.
2. Start cman, then stop/restart corosync, then stop/restart cman.

Actual results:
- cman won't stop/start when corosync is running or has been restarted.
- corosync won't stop/restart when cman is running, and then blocks cman from starting/stopping.

Expected results:
- cman should detect when corosync is already running and provide more useful feedback, if not stop corosync itself.
- corosync should detect when cman is in use and not start, instead printing an error telling the user to use cman.

Additional info:
I've got a disposable test cluster, so I can run any tests the developers would like me to try.
Thanks for the bug report.

The common POSIX solution (missing from current corosync) is to have corosync create a file in LOCALSTATEDIR/lock/corosync and then use the flock(2) call, i.e.:

    fd = open(LOCALSTATEDIR "/lock/corosync", O_CREAT | O_RDWR, 0600);

retry_flock:
    res = flock(fd, LOCK_EX | LOCK_NB);
    if (res == -1) {
        switch (errno) {
        case EINTR:
            goto retry_flock;
        case EWOULDBLOCK:
            /* print error that corosync is already active and exit */
            break;
        default:
            /* print error that the flock couldn't be obtained and exit */
            break;
        }
    }

The flock is released automatically on process exit by POSIX, allowing a new start of corosync to grab the lock.
Thanks Steven. Will this be added in the next release? If so, I guess this ticket can be closed?
Created attachment 433737 [details]
Proposed patch for first part of problem

Uses the solution described by Steve.
Created attachment 435027 [details]
Proposed patch for first part - take 2

A better version of the patch. It also includes a change in the initscript to NOT create the pid file (corosync itself now does).
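For illustration only, here is a minimal self-contained sketch of the lock-plus-pidfile idea described above; the path, function names, and error handling are example placeholders, not the actual patch code:

    /* lockfile.c - hedged sketch of a flock(2)-based "single instance" check
     * that also writes the daemon's PID into the lock file.
     * Build: gcc -o lockfile lockfile.c
     */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/file.h>

    #define LOCKFILE "/var/run/example-daemon.pid"   /* illustrative path */

    static int acquire_lock(void)
    {
            int fd;
            int res;
            char pid_s[17];

            fd = open(LOCKFILE, O_CREAT | O_RDWR, 0640);
            if (fd == -1) {
                    perror("open lockfile");
                    return -1;
            }

    retry_flock:
            res = flock(fd, LOCK_EX | LOCK_NB);
            if (res == -1) {
                    if (errno == EINTR) {
                            goto retry_flock;
                    }
                    if (errno == EWOULDBLOCK) {
                            fprintf(stderr, "Another instance is already running.\n");
                    } else {
                            perror("flock");
                    }
                    close(fd);
                    return -1;
            }

            /* Lock held: record our PID.  The lock (not the file contents)
             * guarantees mutual exclusion, and it is released automatically
             * when the process exits. */
            if (ftruncate(fd, 0) == -1) {
                    perror("ftruncate");
            }
            snprintf(pid_s, sizeof(pid_s), "%u\n", (unsigned int)getpid());
            if (write(fd, pid_s, strlen(pid_s)) == -1) {
                    perror("write pid");
            }
            return fd;      /* keep fd open for the daemon's lifetime */
    }

    int main(void)
    {
            if (acquire_lock() == -1) {
                    return EXIT_FAILURE;
            }
            pause();        /* stand-in for the daemon's main loop */
            return EXIT_SUCCESS;
    }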
Created attachment 435028 [details]
Proposed patch for second problem

This patch fixes the second problem in the initscript. If corosync was started by cman, the initscript refuses to kill corosync and just exits.
Created attachment 450432 [details]
Cman: Handle corosync exit codes
Created attachment 450433 [details]
Cman: Handle "another instance running" error code
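As a rough illustration of what handling a child daemon's exit codes can look like from the parent's side; the exit-code value for "another instance running" and the names below are made up for the example and do not come from the cman patches:

    /* Hedged illustration: a parent forks/execs a daemon and maps its
     * exit status to a message.  EXIT_ALREADY_RUNNING is an assumed,
     * example-only value, not corosync's real exit code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define EXIT_ALREADY_RUNNING 18   /* illustrative value only */

    int main(void)
    {
            pid_t pid = fork();
            int status;

            if (pid == -1) {
                    perror("fork");
                    return EXIT_FAILURE;
            }
            if (pid == 0) {
                    /* Child: start the daemon in the foreground ("-f"). */
                    execlp("corosync", "corosync", "-f", (char *)NULL);
                    perror("execlp corosync");
                    _exit(127);
            }

            /* Parent: wait for the child and report a specific error per exit code. */
            if (waitpid(pid, &status, 0) == -1) {
                    perror("waitpid");
                    return EXIT_FAILURE;
            }
            if (WIFEXITED(status)) {
                    switch (WEXITSTATUS(status)) {
                    case 0:
                            printf("corosync exited cleanly\n");
                            break;
                    case EXIT_ALREADY_RUNNING:
                            fprintf(stderr, "corosync is already running\n");
                            break;
                    default:
                            fprintf(stderr, "corosync died: reason code %d\n",
                                    WEXITSTATUS(status));
                            break;
                    }
            } else if (WIFSIGNALED(status)) {
                    fprintf(stderr, "corosync killed by signal %d\n",
                            WTERMSIG(status));
            }
            return EXIT_SUCCESS;
    }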
Created attachment 450434 [details]
Cman: test that corosync is not already running

The patch fixes the init script so that, before cman is started, it tests whether corosync is already running. If it is, the init script refuses to start.
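The actual change lives in the shell initscript, but the underlying check amounts to something like the following C sketch: read the daemon's pid file and probe the PID with kill(pid, 0). The pid-file path is an assumption for illustration only:

    /* Hedged sketch: "is corosync already running?" via its pid file.
     * The pid-file path is an assumed example; the real check is done
     * in the cman initscript, not in C. */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    static int daemon_is_running(const char *pidfile)
    {
            FILE *f = fopen(pidfile, "r");
            long pid;

            if (f == NULL) {
                    return 0;       /* no pid file -> assume not running */
            }
            if (fscanf(f, "%ld", &pid) != 1 || pid <= 0) {
                    fclose(f);
                    return 0;       /* unreadable pid file */
            }
            fclose(f);

            /* Signal 0 performs error checking only: it tells us whether
             * the process exists without actually sending a signal. */
            if (kill((pid_t)pid, 0) == 0 || errno == EPERM) {
                    return 1;
            }
            return 0;               /* stale pid file */
    }

    int main(void)
    {
            if (daemon_is_running("/var/run/corosync.pid")) {  /* assumed path */
                    fprintf(stderr, "corosync is already running; refusing to start cman\n");
                    return EXIT_FAILURE;
            }
            printf("corosync not running; safe to start cman\n");
            return EXIT_SUCCESS;
    }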
Created attachment 450435 [details]
Cman: Handle INT and TERM signals correctly

The corosync signal handler (for SIGINT and SIGTERM) is replaced by cman's, and that handler was setting quit_threads to 1. The regular cman shutdown sequence (cman_tool leave) tests whether quit_threads is set; if it is, it refuses to continue, so it was not possible to cleanly leave the cluster. Now SIGINT and SIGTERM are ignored, and an (un)intentional kill of corosync is no longer a problem.
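A minimal sketch of the described behavior, ignoring SIGINT and SIGTERM so a stray ctrl+c or kill cannot trip a shutdown-in-progress flag; this only illustrates the technique and is not the actual cman patch:

    /* Hedged sketch: ignore SIGINT/SIGTERM so an accidental signal does
     * not flip a "quitting" flag and break the normal shutdown path. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void ignore_shutdown_signals(void)
    {
            struct sigaction sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_handler = SIG_IGN;        /* discard the signal entirely */
            sigemptyset(&sa.sa_mask);

            if (sigaction(SIGINT, &sa, NULL) == -1 ||
                sigaction(SIGTERM, &sa, NULL) == -1) {
                    perror("sigaction");
                    exit(EXIT_FAILURE);
            }
    }

    int main(void)
    {
            ignore_shutdown_signals();
            printf("SIGINT/SIGTERM ignored; shut down via the managed path only\n");
            pause();        /* stand-in for the daemon's main loop; only SIGKILL
                             * (or a signal we did not ignore) ends this process */
            return EXIT_SUCCESS;
    }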
All patches together should give a complete solution, with no remaining way to reproduce the bugs. The patches are also currently included in the cluster STABLE3 tree and/or corosync trunk, so closing as upstream.