Bug 617234 - Starting or stopping corosync blocks cman from starting or stopping - cman part
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Fabio Massimo Di Nitto
QA Contact: Cluster QE
Depends On: 613870 614104
Blocks:
Reported: 2010-07-22 10:49 EDT by Jan Friesse
Modified: 2016-04-26 11:43 EDT
CC List: 14 users

See Also:
Fixed In Version: cluster-3.0.12-27.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 614104
Environment:
Last Closed: 2011-05-19 09:03:50 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Proposed patch for first part. Meaningful error code is displayed (674 bytes, patch)
2010-07-22 11:40 EDT, Jan Friesse
Proposed patch for test that corosync is not already running (663 bytes, patch)
2010-07-28 10:17 EDT, Jan Friesse
Proposed patch for second problem (1.23 KB, patch)
2010-09-27 09:41 EDT, Jan Friesse

Comment 1 Jan Friesse 2010-07-22 10:50:23 EDT
This bug covers the cman part.
Comment 3 RHEL Product and Program Management 2010-07-22 11:18:21 EDT
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 4 Jan Friesse 2010-07-22 11:40:57 EDT
Created attachment 433736
Proposed patch for first part. Meaningful error code is displayed

Use this patch in conjunction with the patch from https://bugzilla.redhat.com/show_bug.cgi?id=614104.
Comment 5 Jan Friesse 2010-07-22 11:57:21 EDT
The second issue is much more serious and needs to be solved in the cman code.

Although this part of the code has not changed much since RHEL 5, the main problem is in do_cmd_try_shutdown and was caused by patch fc7201e51687b6f357aa6b4dad0f37de1f5d5272.

The corosync signal handler (SIGINT and SIGTERM) is replaced by the cman one, which sets quit_threads to 1.

My proposed solution is to ignore the INT and TERM signals completely, but I'm not sure what side effects this could have in other parts of the code.

Chrissie, any opinion there?
Comment 6 Steven Dake 2010-07-22 13:44:52 EDT
Honza,

Speaking with Fabio, what we need is similar flock code, but in LOCALSTATEDIR/run/corosync.pid file instead.  Inside the corosync.pid file should be the process id for the process so it may be killed by looking at that pid value.  The PID should be that of the child after the fork.

Sorry for not knowing the details earlier.
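
As an illustration of the approach described in this comment, here is a minimal C sketch of a flock-protected pid file. It is not the corosync patch: the function names pidfile_lock_and_write and pidfile_read, the file mode, and the error handling are assumptions made for this example. The caller is assumed to pass the real path (for example LOCALSTATEDIR/run/corosync.pid) and to call the function in the child after fork(), so that getpid() returns the child's PID.

```c
#define _DEFAULT_SOURCE /* exposes flock() and ftruncate() under strict -std modes on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: take an exclusive, non-blocking flock() on the pid
 * file and record the caller's PID in it.
 * Returns the open descriptor on success. The descriptor must stay open for
 * the lifetime of the daemon, because closing it releases the lock.
 * Returns -1 on failure, for example when a running instance already holds
 * the lock.
 */
int pidfile_lock_and_write(const char *path)
{
    char buf[32];
    int fd;
    int len;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT, 0640);
    if (fd < 0)
        return -1;

    /* Fails immediately instead of blocking if someone else holds the lock. */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        close(fd);
        return -1;
    }

    /* Replace any stale content with the current PID. */
    len = snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
    if (ftruncate(fd, 0) < 0 || write(fd, buf, (size_t)len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: read the PID stored in the file, or -1 if unreadable. */
long pidfile_read(const char *path)
{
    long pid = -1;
    FILE *f;

    assert(path != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &pid) != 1)
        pid = -1;
    fclose(f);
    return pid;
}
```

A stop script could then read the PID with pidfile_read and send it a signal, while a second start attempt would fail in pidfile_lock_and_write for as long as the first instance keeps its descriptor open.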
Comment 7 Jan Friesse 2010-07-26 08:14:08 EDT
(In reply to comment #6)
> Honza,
> 
> Speaking with Fabio, what we need is similar flock code, but in
> LOCALSTATEDIR/run/corosync.pid file instead.  Inside the corosync.pid file
> should be the process id for the process so it may be killed by looking at that
> pid value.  The PID should be that of the child after the fork.
> 
> Sorry for not knowing the details earlier.    

Steve,
are you sure that comment is in the right bug? This is the cman part, not corosync.
Comment 8 Jan Friesse 2010-07-28 10:17:13 EDT
Created attachment 435029
Proposed patch for test that corosync is not already running

The patch fixes the init script so that, before cman is started, it tests whether corosync is already running. If so, the init script refuses to start.
Comment 9 Jan Friesse 2010-09-27 09:41:03 EDT
Created attachment 449893
Proposed patch for second problem

cman: Handle INT and TERM signals correctly

The corosync signal handler (SIGINT and SIGTERM) is replaced by the cman one,
and this was setting quit_threads to 1. The regular cman shutdown sequence
(cman_tool leave) tests whether quit_threads is set. If so, it refuses
to continue, so it was not possible to leave the cluster cleanly.

Now SIGINT and SIGTERM are ignored, and an (un)intentional kill of corosync
is no longer a problem.

(We discussed this solution with Chrissie a little over a month ago, which is why I am clearing the needinfo flag.)
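
The failure mode and the fix described in this comment can be modelled in a few lines of C. This is a simplified sketch, not the cman source: try_shutdown, setup_signals and demo are hypothetical names invented for the example, and only the quit_threads flag and the SIGINT/SIGTERM handling come from the report.

```c
#define _POSIX_C_SOURCE 200809L /* exposes sigaction() under strict -std modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Stand-in for cman's flag; sig_atomic_t so a signal handler may set it. */
static volatile sig_atomic_t quit_threads = 0;

/* Old behaviour: the handler only marks the daemon as quitting. */
static void sigint_handler(int sig)
{
    (void)sig;
    quit_threads = 1;
}

/*
 * Hypothetical stand-in for the shutdown request path: returns 0 if a clean
 * leave may proceed, -1 if it is refused because quit_threads is set.
 */
int try_shutdown(void)
{
    return quit_threads ? -1 : 0;
}

/*
 * Install the signal disposition for SIGINT and SIGTERM.
 * ignore == 0: old behaviour, handler sets quit_threads.
 * ignore != 0: proposed behaviour, both signals are ignored.
 * Returns 0 on success, -1 on failure.
 */
int setup_signals(int ignore)
{
    struct sigaction sa;

    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = ignore ? SIG_IGN : sigint_handler;

    if (sigaction(SIGINT, &sa, NULL) < 0)
        return -1;
    if (sigaction(SIGTERM, &sa, NULL) < 0)
        return -1;
    return 0;
}

/* Walk through both behaviours; aborts via assert() if an expectation fails. */
void demo(void)
{
    /* Old behaviour: one stray SIGTERM blocks every later clean leave. */
    assert(setup_signals(0) == 0);
    raise(SIGTERM);
    assert(try_shutdown() == -1);

    /* Proposed behaviour: the signals are ignored, so the flag stays clear. */
    quit_threads = 0;
    assert(setup_signals(1) == 0);
    raise(SIGTERM);
    raise(SIGINT);
    assert(try_shutdown() == 0);
}
```

Calling demo() from a small main() exercises both cases. With the old handler a single signal leaves the flag set and the leave request is refused from then on, whereas with SIG_IGN the leave request still succeeds after both signals.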
Comment 10 David Mair 2010-09-27 16:03:36 EDT
This problem sounds pretty ugly and likely to drive support calls.  I'm flagging this for 6.0.z so we can hopefully release an errata shortly after 6.0 ships especially when it appears we have patches for the issues reported here.
Comment 12 Fabio Massimo Di Nitto 2010-09-28 00:11:00 EDT
(In reply to comment #10)
> This problem sounds pretty ugly and likely to drive support calls.  I'm
> flagging this for 6.0.z so we can hopefully release an errata shortly after 6.0
> ships especially when it appears we have patches for the issues reported here.

Actually, it's not as bad as it looks, since the documentation clearly states how to set up a cluster, and it was decided not to push the fix for 6.0 right away.
Comment 13 Steven Dake 2010-09-28 12:08:02 EDT
David,

I agree with Fabio - we are covered on the docs very well in this case.  But for those that don't read the documentation..

I am always happy to fix problems GSS believes could be problematic regarding support.

Before we can mark this in the done column, we need feedback from Chrissie re comment #5.

Regards
-steve
Comment 14 Christine Caulfield 2010-09-29 03:36:47 EDT
I thought I'd discussed this on IRC some time ago. quit_threads in the cman code doesn't actually do anything in RHEL6 apart from get in the way. Removing it, and anything that sets it should have no impact on operations.

All shutdown checking should be done by corosync so the signal handlers in daemon.c can go too.
Comment 15 Jan Friesse 2010-09-29 06:29:25 EDT
As noted in comment #9, I removed the needinfo flag because we had discussed the problem with Chrissie before my vacation.

Anyway, all cman patches are currently in the STABLE3 git tree as e88da89f1a5cdb8eb5e1924514401dfb91c0363c, c09852206f21ed04806211e49ca9423e10fea1f9 and de0a199f499bec83774ad88765c5e7df487913e9, so I am moving this to POST.
Comment 16 Fabio Massimo Di Nitto 2010-09-30 09:17:37 EDT
I am dropping 6.0.z flag after discussing with other engineers.

The problem is not as bad as it looks: it's well documented, and the dependency chain is not straightforward (the fix requires rebuilding corosync and cman with several patches).
Comment 17 Jeremy West 2010-09-30 15:41:42 EDT
Fabio,

I'm ok with dropping 6.0.z if you can provide a link here in the BZ to the documentation that explains how to resolve this.  From a GSS perspective, we need to be able to quickly point any customers calling in about this in the right direction.

--jwest
Comment 22 errata-xmlrpc 2011-05-19 09:03:50 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html
