This bug is for the cman part.
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.
** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Created attachment 433736 [details]
Proposed patch for the first part. A meaningful error code is now displayed.
Use this patch in conjunction with the patch from https://bugzilla.redhat.com/show_bug.cgi?id=614104.
The second issue is much more serious and needs to be solved in the cman code.
Even though this part of the code hasn't changed much since RHEL 5, the main problem is in do_cmd_try_shutdown and is caused by patch fc7201e51687b6f357aa6b4dad0f37de1f5d5272.
The corosync signal handler (SIGINT and SIGTERM) is replaced by the cman one, which sets quit_threads to 1.
My proposed solution is to ignore the INT and TERM signals completely, but I'm not sure what side effects this could have in other parts.
Chrissie, any opinion there?
Speaking with Fabio, what we need is similar flock code, but in LOCALSTATEDIR/run/corosync.pid file instead. Inside the corosync.pid file should be the process id for the process so it may be killed by looking at that pid value. The PID should be that of the child after the fork.
Sorry for not knowing the details earlier.
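For illustration, the PID file handling described above could look roughly like this. It is only a sketch under assumptions, not the actual corosync code: the function name pidfile_lock_and_write is made up here, and the caller is assumed to pass the LOCALSTATEDIR/run/corosync.pid path and to call it in the child after the fork.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from corosync): take an exclusive,
 * non-blocking flock() on the PID file and write our own PID into it.
 * Returns the open descriptor on success, or -1 if the file cannot be
 * opened, locked, or written. The lock is held until the descriptor is
 * closed, so the daemon should keep it open for its whole lifetime.
 */
static int pidfile_lock_and_write(const char *path)
{
    char buf[32];
    int fd, len;

    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* LOCK_NB: fail immediately if another instance already holds the lock. */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        close(fd);
        return -1;
    }

    /* Called in the child after fork(), getpid() is the daemon's PID. */
    len = snprintf(buf, sizeof(buf), "%ld\n", (long)getpid());
    if (ftruncate(fd, 0) < 0 || write(fd, buf, (size_t)len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A second instance calling the same helper on the same path gets -1 while the first one is still running, and the PID stored in the file is the one to signal.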
(In reply to comment #6)
> Speaking with Fabio, what we need is similar flock code, but in
> LOCALSTATEDIR/run/corosync.pid file instead. Inside the corosync.pid file
> should be the process id for the process so it may be killed by looking at that
> pid value. The PID should be that of the child after the fork.
> Sorry for not knowing the details earlier.
Are you sure that comment is in the right bug? This is the cman part, not corosync.
Created attachment 435029 [details]
Proposed patch to test that corosync is not already running
The patch fixes the init script so that, before cman is started, it tests whether corosync is already running. If it is, the init script refuses to start.
Created attachment 449893 [details]
Proposed patch for second problem
cman: Handle INT and TERM signals correctly
The corosync signal handler (SIGINT and SIGTERM) is replaced by the cman
one, and this was setting quit_threads to 1. The regular cman shutdown
sequence (cman_tool leave) tests whether quit_threads is set. If so, it
refuses to continue, so it was not possible to leave the cluster cleanly.
Now SIGINT and SIGTERM are ignored, and an (un)intentional kill of corosync
is no longer a problem.
(We talked about this solution with Chrissie a little over a month ago, which is why I'm clearing the needinfo flag.)
This problem sounds pretty ugly and likely to drive support calls. I'm flagging this for 6.0.z so we can hopefully release an errata shortly after 6.0 ships especially when it appears we have patches for the issues reported here.
(In reply to comment #10)
> This problem sounds pretty ugly and likely to drive support calls. I'm
> flagging this for 6.0.z so we can hopefully release an errata shortly after 6.0
> ships especially when it appears we have patches for the issues reported here.
Actually, it's not as bad as it looks, since the documentation clearly states how to set up a cluster, and it was decided not to push the fix for 6.0 right away.
I agree with Fabio - we are covered by the docs very well in this case. But for those that don't read the documentation...
I am always happy to fix problems GSS believes could be problematic regarding support.
Before we can mark this in the done column, we need feedback from Chrissie re comment #5.
I thought I'd discussed this on IRC some time ago. quit_threads in the cman code doesn't actually do anything in RHEL6 apart from getting in the way. Removing it, and anything that sets it, should have no impact on operations.
All shutdown checking should be done by corosync so the signal handlers in daemon.c can go too.
As noted in comment #9, I removed the needinfo flag because we had talked about the problem with Chrissie before my vacation.
Anyway, all cman patches are currently in the STABLE3 git tree as e88da89f1a5cdb8eb5e1924514401dfb91c0363c, c09852206f21ed04806211e49ca9423e10fea1f9 and de0a199f499bec83774ad88765c5e7df487913e9, so moving to post.
I am dropping 6.0.z flag after discussing with other engineers.
The problem is not as bad as it looks: it's well documented, and the dependency chain is not straightforward (it requires a rebuild of corosync and cman with several patches).
I'm ok with dropping 6.0.z if you can provide a link here in the BZ to the documentation that explains how to resolve this. From a GSS perspective, we need to be able to quickly point any customers calling in about this in the right direction.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.