Bug 499826 - master termination not stopping HA daemon acquisition
Summary: master termination not stopping HA daemon acquisition
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.1.1
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: 1.2
Target Release: ---
Assignee: Matthew Farrellee
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On:
Blocks: 527551
Reported: 2009-05-08 13:16 UTC by Matthew Farrellee
Modified: 2018-10-20 03:53 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix
C: The master is in charge of starting HA daemons. It uses a lock file shared between multiple masters to determine which one can run the HA daemon. When a master receives a signal to terminate, it does not stop trying to acquire an HA daemon lock and start the HA daemon.
C: A master may receive a termination signal (TERM, QUIT), exit most daemons below it, but then acquire an HA lock and start an HA daemon. This prevents the master from successfully exiting and blocks reception of the termination signal for subsequent shutdown attempts.
F: Corrected problem with master termination not stopping HA daemon acquisition.
R: The code in condor_master now prevents daemons waiting on locks from acquiring them while the condor_master is trying to shut down, for example from a TERM or QUIT signal, or from condor_off.
A master was able to receive a termination signal and exit most of the daemons below it, but then acquire a High Availability (HA) lock and start an HA daemon. This prevented the master from successfully exiting. The master now terminates the HA daemon acquisition successfully, and prevents daemons waiting on locks from acquiring them while the condor_master is trying to shut down.
Clone Of:
Environment:
Last Closed: 2009-12-03 09:19:30 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2009:1633 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise MRG Messaging and Grid Version 1.2 | 2009-12-03 09:15:33 UTC

Description Matthew Farrellee 2009-05-08 13:16:55 UTC
Description of problem:

The master is in charge of starting HA daemons. It uses a lock file shared between multiple masters to determine which one can run the HA daemon. When a master receives a signal to terminate, it does not stop trying to acquire an HA daemon lock and start the HA daemon. As a result, a master may receive a termination signal (TERM, QUIT), exit most daemons below it, but then acquire an HA lock and start an HA daemon. This prevents the master from successfully exiting and blocks reception of the termination signal for subsequent shutdown attempts.

A workaround for this is to terminate the master with a QUIT followed a bit later by a TERM.


Version-Release number of selected component (if applicable):

All those prior to 7.3.1-0.5


How reproducible:

Non-deterministic


Steps to Reproduce:
1. Set up an HA Scheduler
2. Change HA_LOCK_HOLD_TIME to 30 and HA_POLL_PERIOD to 3 (this increases the likelihood of the bug)
3. Run two condor_master instances
4. Observe that the schedd is started by only one of them
5. kill -QUIT <master with schedd> && sleep 1 && kill -QUIT <master without schedd>

Notes: sleep 1 may or may not be necessary. This needs to be verified for QUIT, TERM, and condor_off -master.  
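
Step 5 can also be scripted, which makes it easier to repeat the attempt for QUIT and TERM. The following is only a minimal sketch: the function name quit_masters and its parameters are hypothetical, and the two PIDs must be looked up by hand. It uses only the standard os.kill, signal and time.sleep calls.

```python
import os
import signal
import time


def quit_masters(pid_with_schedd, pid_without_schedd, sig=signal.SIGQUIT,
                 delay=1.0, kill=os.kill, sleep=time.sleep):
    """Signal the master that runs the schedd first, then the other master.

    kill and sleep are parameters only so the ordering can be checked
    without signalling real processes.
    """
    kill(pid_with_schedd, sig)
    sleep(delay)  # mirrors "sleep 1"; it may or may not be necessary
    kill(pid_without_schedd, sig)
```

Passing sig=signal.SIGTERM repeats the same sequence for TERM. condor_off -master is a separate command and is not covered by this sketch.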

Actual results:

If the failure occurs you will see the master without the schedd actually start the schedd before exiting, e.g.

Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 20809
Got SIGQUIT.  Performing fast shutdown.
Sent SIGQUIT to STARTD (pid 20809)
Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 20890
The STARTD (pid 20809) exited with status 0

The master will then stay around ignoring future QUIT signals and managing the Schedd. A TERM will exit both.
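
Because the failure is non-deterministic, it can help to scan the master log for the signature shown above: a "Started DaemonCore process" line that appears after the shutdown signal was logged. The sketch below is an illustration, not part of condor. The function name is hypothetical, the patterns are derived only from the sample lines in this report, and the "Got SIGTERM" form is an assumption by analogy with the "Got SIGQUIT" line.

```python
import re

# Patterns taken from the sample log lines in this report.
STARTED = re.compile(r'Started DaemonCore process "([^"]+)", pid and pgroup = (\d+)')
# "Got SIGTERM" is an assumed wording; only "Got SIGQUIT" appears in the report.
SHUTDOWN = re.compile(r'Got SIG(QUIT|TERM)\b')


def daemons_started_after_shutdown(log_lines):
    """Return (path, pid) for every daemon started after a shutdown signal was logged."""
    shutting_down = False
    late_starts = []
    for line in log_lines:
        if SHUTDOWN.search(line):
            shutting_down = True
            continue
        match = STARTED.search(line)
        if match and shutting_down:
            late_starts.append((match.group(1), int(match.group(2))))
    return late_starts


SAMPLE = [
    'Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 20809',
    'Got SIGQUIT.  Performing fast shutdown.',
    'Sent SIGQUIT to STARTD (pid 20809)',
    'Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 20890',
    'The STARTD (pid 20809) exited with status 0',
]

print(daemons_started_after_shutdown(SAMPLE))
# prints [('/usr/sbin/condor_schedd', 20890)]
```

An empty result means no daemon was started after the shutdown signal in the lines scanned.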


Expected results:

The schedd is never started.

After the fix, with D_FULLDEBUG:

SCHEDD: Got HA lock (poll); starting
...
::RealStart; SCHEDD stop_state=1, ignoring

Comment 3 Irina Boverman 2009-10-22 19:16:20 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Corrected problem with master termination not stopping HA daemon acquisition (499826)

Comment 4 Martin Kudlej 2009-11-02 14:16:38 UTC
I've tested it on RHEL 5.4/4.8 i386/x86_64 with condor-7.2.2-0.9 and it doesn't work as it should.
I'm waiting for the fix for BZ528544.

Comment 5 Matthew Farrellee 2009-11-02 15:00:48 UTC
You can work around BZ528544 by setting QMF_DELETE_ON_SHUTDOWN=FALSE in your config.

Comment 6 Martin Kudlej 2009-11-03 13:17:22 UTC
I've tested it on RHEL 5.4/4.8 i386/x86_64 with condor-7.4.1-0.2 with the workaround from comment 5 and it works as expected. --> VERIFIED

Comment 7 Lana Brindley 2009-11-26 20:43:36 UTC
So has QMF_DELETE_ON_SHUTDOWN been set to FALSE as default? If not, what was the fix?

LKB

Comment 8 Lana Brindley 2009-11-26 20:43:36 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,12 @@
+Grid bug fix
+
+C: The master is in charge of starting HA daemons. It uses a shared lock file between multiple masters to determine which can run the HA daemon. When a master receives a signal to terminate it does not stop the process of trying to acquire an HA daemon lock and start the HA daemon.
+C: a master may receive a termination signal (TERM, QUIT), exit most daemons below it, but then
+acquire a HA lock and start an HA daemon. This prevents the master from successfully exiting, and clock reception of the termination signal for
+subsequent shutdown attempts.
+F:
+R:
+
+NEED FURTHER INFO FOR RELNOTE
+
 Corrected problem with master termination not stopping HA daemon acquisition (499826)

Comment 9 Matthew Farrellee 2009-11-30 04:28:05 UTC
(In reply to comment #7)
> So has QMF_DELETE_ON_SHUTDOWN been set to FALSE as default? If not, what was
> the fix?
> 
> LKB  

This is not related to that bug in any way.

Comment 10 Matthew Farrellee 2009-11-30 04:30:29 UTC
The code in the condor_master now prevents daemons waiting on locks from acquiring them when the condor_master is trying to shut down, for example by a TERM or QUIT signal or by condor_off.
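
To make the change concrete, here is a toy model of the behavior described above. It is not the condor_master code: the class ToyMaster, its methods, and the guard_on_shutdown switch are all hypothetical names. It only illustrates checking a shutdown flag before acting on a lock that became free.

```python
class ToyMaster:
    """Toy model of a master that polls a shared lock for one HA daemon."""

    def __init__(self, guard_on_shutdown):
        # True models the fixed behavior, False the behavior reported in this bug.
        self.guard_on_shutdown = guard_on_shutdown
        self.shutting_down = False
        self.started = []

    def on_signal(self):
        """TERM, QUIT or condor_off was received: begin shutting down."""
        self.shutting_down = True

    def on_lock_poll(self, lock_is_free, daemon="SCHEDD"):
        """Called once per poll period; returns True if the daemon was started."""
        if not lock_is_free:
            return False
        if self.guard_on_shutdown and self.shutting_down:
            return False  # shutting down: do not start the HA daemon
        self.started.append(daemon)
        return True


buggy = ToyMaster(guard_on_shutdown=False)
buggy.on_signal()
print(buggy.on_lock_poll(lock_is_free=True))  # prints True: daemon started during shutdown

fixed = ToyMaster(guard_on_shutdown=True)
fixed.on_signal()
print(fixed.on_lock_poll(lock_is_free=True))  # prints False: start is ignored
```

In the unguarded case the poll after the signal still starts the daemon, which is the sequence seen in the actual results. With the guard, the start request is ignored once shutdown has begun.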

Comment 11 Pete MacKinnon 2009-12-01 15:29:46 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -9,4 +9,9 @@
 
 NEED FURTHER INFO FOR RELNOTE
 
-Corrected problem with master termination not stopping HA daemon acquisition (499826)
+Corrected problem with master termination not stopping HA daemon acquisition (499826)
+
+RELEASE NOTE:
+"The code in condor_master now prevents daemons waiting on locks from
+acquiring them while the condor_master is trying to shut down, foe example by a TERM or
+QUIT signal, or by condor_off."

Comment 12 Pete MacKinnon 2009-12-01 15:39:22 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -13,5 +13,5 @@
 
 RELEASE NOTE:
 "The code in condor_master now prevents daemons waiting on locks from
-acquiring them while the condor_master is trying to shut down, foe example by a TERM or
+acquiring them while the condor_master is trying to shut down, for example from a TERM or
-QUIT signal, or by condor_off."
+QUIT signal, or from condor_off."

Comment 13 Lana Brindley 2009-12-01 23:19:25 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -4,14 +4,10 @@
 C: a master may receive a termination signal (TERM, QUIT), exit most daemons below it, but then
 acquire a HA lock and start an HA daemon. This prevents the master from successfully exiting, and clock reception of the termination signal for
 subsequent shutdown attempts.
-F:
-R:
-
-NEED FURTHER INFO FOR RELNOTE
-
-Corrected problem with master termination not stopping HA daemon acquisition (499826)
-
-RELEASE NOTE:
-"The code in condor_master now prevents daemons waiting on locks from
+F: Corrected problem with master termination not stopping HA daemon acquisition
+R: The code in condor_master now prevents daemons waiting on locks from
 acquiring them while the condor_master is trying to shut down, for example from a TERM or
-QUIT signal, or from condor_off."
+QUIT signal, or from condor_off
+
+A master was able to receive a termination signal and exit most of the daemons below it, but then acquire a High Availability (HA) lock and start an HA daemon. This prevents the master from successfully exiting. The master now terminates the HA daemon acquisition successfully, and prevents daemons waiting on locks from
+acquiring them while the condor_master is trying to shut down.

Comment 15 errata-xmlrpc 2009-12-03 09:19:30 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html

