Bug 617234

Summary: Starting or stopping corosync blocks cman from starting or stopping - cman part
Product: Red Hat Enterprise Linux 6 Reporter: Jan Friesse <jfriesse>
Component: cluster Assignee: Fabio Massimo Di Nitto <fdinitto>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: low    
Version: 6.0 CC: agk, ccaulfie, cluster-maint, edamato, fdinitto, jkortus, jwest, lhh, mkelly, nstraz, rpeterso, sdake, teigland, uwe.knop
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: cluster-3.0.12-27.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 614104 Environment:
Last Closed: 2011-05-19 13:03:50 UTC Type: ---
Bug Depends On: 613870, 614104    
Bug Blocks:    
Attachments:
Description: Proposed patch for first part. Meaningful error code is displayed Flags: none
Description: Proposed patch for test that corosync is not already running Flags: none
Description: Proposed patch for second problem Flags: none

Comment 1 Jan Friesse 2010-07-22 14:50:23 UTC
This bug covers the cman part.

Comment 3 RHEL Program Management 2010-07-22 15:18:21 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 4 Jan Friesse 2010-07-22 15:40:57 UTC
Created attachment 433736 [details]
Proposed patch for first part. Meaningful error code is displayed

Use this patch in conjunction with the patch from https://bugzilla.redhat.com/show_bug.cgi?id=614104.

Comment 5 Jan Friesse 2010-07-22 15:57:21 UTC
The second issue is much more serious and needs to be solved in the cman code.

Even though this part of the code has not changed much since RHEL 5, the main problem is in do_cmd_try_shutdown and was caused by patch fc7201e51687b6f357aa6b4dad0f37de1f5d5272.

The corosync signal handler (for SIGINT and SIGTERM) is replaced by the cman one, which sets quit_threads to 1.

My proposed solution is to ignore the INT and TERM signals completely, but I'm not sure what side effects this could have in other parts of the code.

Chrissie, any opinion there?
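
For readers following along, the failure mode described above can be sketched roughly as follows. This is an illustration in Python, not the cman code (which is C): `cman_signal_handler` and `try_shutdown` are hypothetical stand-ins for the cman handler and do_cmd_try_shutdown, and the "refused"/"leaving" return values are made up for the example.

```python
import os
import signal

quit_threads = 0  # stand-in for the cman flag named in this comment


def cman_signal_handler(signum, frame):
    """Hypothetical stand-in for the cman handler that replaces corosync's."""
    global quit_threads
    quit_threads = 1


def try_shutdown():
    """Hypothetical stand-in for do_cmd_try_shutdown: refuse once the flag is set."""
    if quit_threads:
        return "refused"
    return "leaving"


previous = signal.signal(signal.SIGTERM, cman_signal_handler)
before = try_shutdown()               # flag not set yet, a clean leave is possible
os.kill(os.getpid(), signal.SIGTERM)  # simulate a stray kill of the daemon
after = try_shutdown()                # flag is now stuck at 1, the leave is refused
signal.signal(signal.SIGTERM, previous)  # restore whatever was installed before
```

Under these assumptions, a single stray SIGTERM is enough to make every later shutdown request fail, which matches the behaviour reported here.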

Comment 6 Steven Dake 2010-07-22 17:44:52 UTC
Honza,

Speaking with Fabio, what we need is similar flock code, but in LOCALSTATEDIR/run/corosync.pid file instead.  Inside the corosync.pid file should be the process id for the process so it may be killed by looking at that pid value.  The PID should be that of the child after the fork.

Sorry for not knowing the details earlier.
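
A minimal sketch of the flock-on-pid-file idea from this comment, in Python for illustration only (corosync is C). The helper name `acquire_pidfile` and the temporary path are assumptions; the real file would be LOCALSTATEDIR/run/corosync.pid, and the helper would be called in the child after the fork so that the recorded PID is the child's.

```python
import fcntl
import os
import tempfile


def acquire_pidfile(path):
    """Lock `path` exclusively and write our PID into it.

    Returns the open file descriptor, or None if another holder has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    os.ftruncate(fd, 0)  # truncate only after the lock is held
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd  # keep it open: closing the descriptor releases the lock


# Temporary stand-in for LOCALSTATEDIR/run/corosync.pid.
path = os.path.join(tempfile.mkdtemp(), "corosync.pid")
first = acquire_pidfile(path)   # succeeds and records the PID
second = acquire_pidfile(path)  # a second attempt is refused while the lock is held
with open(path) as f:
    recorded_pid = int(f.read())
```

Because the lock belongs to the open file description, it disappears automatically when the holder exits, so a stale pid file left by a crash does not block the next start.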

Comment 7 Jan Friesse 2010-07-26 12:14:08 UTC
(In reply to comment #6)
> Honza,
> 
> Speaking with Fabio, what we need is similar flock code, but in
> LOCALSTATEDIR/run/corosync.pid file instead.  Inside the corosync.pid file
> should be the process id for the process so it may be killed by looking at that
> pid value.  The PID should be that of the child after the fork.
> 
> Sorry for not knowing the details earlier.    

Steve,
are you sure that comment is in the right bug? This is the cman part, not corosync.

Comment 8 Jan Friesse 2010-07-28 14:17:13 UTC
Created attachment 435029 [details]
Proposed patch for test that corosync is not already running

The patch fixes the init script so that, before cman is started, it tests whether corosync is already running. If it is, the init script refuses to start cman.
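
The attachment itself is not reproduced in this report, so the following is only a guess at the shape of such a check, written in Python rather than the shell of the real init script. The pid-file approach and the names `pid_is_running` and `should_refuse_start` are assumptions, not what the patch necessarily does.

```python
import os
import tempfile


def pid_is_running(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_refuse_start(pidfile):
    """Return True when the pid file names a live process, i.e. refuse to start."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no usable pid file, nothing known to be running
    return pid_is_running(pid)


demo_dir = tempfile.mkdtemp()
live = os.path.join(demo_dir, "live.pid")
with open(live, "w") as f:
    f.write(f"{os.getpid()}\n")  # this process stands in for a running corosync
refuse_live = should_refuse_start(live)
refuse_missing = should_refuse_start(os.path.join(demo_dir, "missing.pid"))
```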

Comment 9 Jan Friesse 2010-09-27 13:41:03 UTC
Created attachment 449893 [details]
Proposed patch for second problem

cman: Handle INT and TERM signals correctly

The corosync signal handler (for SIGINT and SIGTERM) is replaced by the cman
one, and this was setting quit_threads to 1. The regular cman shutdown sequence
(cman_tool leave) tests whether quit_threads is set. If it is, it refuses to
continue, so it was not possible to leave the cluster cleanly.

Now SIGINT and SIGTERM are ignored, and an (un)intentional kill of corosync
is no longer a problem.
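
One way to read "ignored" here, sketched in Python for illustration (the actual patch is C and is not shown in this report): both signals get the ignore disposition, so a stray kill no longer reaches any handler.

```python
import os
import signal

# Give both signals the "ignore" disposition instead of a handler that sets a flag.
for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, signal.SIG_IGN)

os.kill(os.getpid(), signal.SIGTERM)  # a stray kill now has no effect
survived = True  # reached only because the signal was ignored
```

With no handler left to set a flag, a later cman_tool leave has nothing stale to trip over.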

(Chrissie and I discussed this solution a little over a month ago, which is why I am clearing the needinfo flag.)

Comment 10 David Mair 2010-09-27 20:03:36 UTC
This problem sounds pretty ugly and likely to drive support calls.  I'm flagging this for 6.0.z so we can hopefully release an errata shortly after 6.0 ships especially when it appears we have patches for the issues reported here.

Comment 12 Fabio Massimo Di Nitto 2010-09-28 04:11:00 UTC
(In reply to comment #10)
> This problem sounds pretty ugly and likely to drive support calls.  I'm
> flagging this for 6.0.z so we can hopefully release an errata shortly after 6.0
> ships especially when it appears we have patches for the issues reported here.

Actually, it's not as bad as it looks, since the documentation clearly states how to set up a cluster, and it was decided not to push the fix for 6.0 right away.

Comment 13 Steven Dake 2010-09-28 16:08:02 UTC
David,

I agree with Fabio - we are covered on the docs very well in this case.  But for those that don't read the documentation..

I am always happy to fix problems GSS believes could be problematic regarding support.

Before we can mark this in the done column, we need feedback from Chrissie re comment #5.

Regards
-steve

Comment 14 Christine Caulfield 2010-09-29 07:36:47 UTC
I thought I'd discussed this on IRC some time ago. quit_threads in the cman code doesn't actually do anything in RHEL6 apart from get in the way. Removing it, and anything that sets it should have no impact on operations.

All shutdown checking should be done by corosync so the signal handlers in daemon.c can go too.

Comment 15 Jan Friesse 2010-09-29 10:29:25 UTC
As noted in comment #9, I removed the needinfo flag because Chrissie and I had discussed the problem before my vacation.

Anyway, all cman patches are currently in STABLE3 git tree as e88da89f1a5cdb8eb5e1924514401dfb91c0363c, c09852206f21ed04806211e49ca9423e10fea1f9 and de0a199f499bec83774ad88765c5e7df487913e9 so moving to post.

Comment 16 Fabio Massimo Di Nitto 2010-09-30 13:17:37 UTC
I am dropping 6.0.z flag after discussing with other engineers.

The problem is not as bad as it looks: it's well documented, and the dependency chain is not straightforward (it requires a rebuild of corosync and cman with several patches).

Comment 17 Jeremy West 2010-09-30 19:41:42 UTC
Fabio,

I'm OK with dropping 6.0.z if you can provide a link here in the BZ to the documentation that explains how to resolve this. From a GSS perspective, we need to be able to quickly point any customers calling in about this in the right direction.

--jwest

Comment 22 errata-xmlrpc 2011-05-19 13:03:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html