Bug 864637 - 'condor_restart -subsystem had' causes had and negotiator to shutdown
Summary: 'condor_restart -subsystem had' causes had and negotiator to shutdown
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.2
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: 2.3
Target Release: ---
Assignee: Robert Rati
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On:
Blocks: 845292
Reported: 2012-10-09 20:12 UTC by Robert Rati
Modified: 2013-01-14 19:40 UTC

Fixed In Version: condor-7.8.6-0.2
Doc Type: Bug Fix
Doc Text:
Cause: Issuing 'condor_restart -subsystem had' on an HACM node running the negotiator.
Consequence: The had and negotiator would stop, but the had would not restart.
Fix: Ensure the had daemon will restart.
Result: The had daemon will restart.
Clone Of:
Environment:
Last Closed: 2013-01-14 19:40:50 UTC
Target Upstream Version:
Embargoed:



Links
System: Red Hat Product Errata
ID: RHSA-2013:0564
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Low: Red Hat Enterprise MRG Grid 2.3 security update
Last Updated: 2013-03-06 23:37:09 UTC

Description Robert Rati 2012-10-09 20:12:59 UTC
Description of problem:
Sending a condor_restart to the had subsystem in an HACM setup seems to cause the had, replication, and negotiator daemons to shut down and not get restarted.

Version-Release number of selected component (if applicable):
condor-7.6.5-0.22

How reproducible:
100%

Steps to Reproduce:
1. Set up HACM with wallaby
2. On the node running the negotiator, execute: 'condor_restart -subsystem had'
3. Notice that the had, replication, and negotiator daemons are stopped and not restarted
  
Actual results:


Expected results:


Additional info:

Comment 1 Robert Rati 2012-10-09 20:18:47 UTC
The replication daemon actually isn't going down, just the had and negotiator.

Comment 5 Lubos Trilety 2013-01-09 16:59:47 UTC
I tested that on condor-7.8.8-0.1; it still takes about 6 minutes until the negotiator and HAD start again.

# cat NegotiatorLog
...
01/09/13 10:38:37 **** condor_negotiator (condor_NEGOTIATOR) pid 24179 EXITING WITH STATUS 0
01/09/13 10:44:54 OpSysMajorVersion:  6
...

# cat HADLog
...
01/09/13 10:38:33 **** condor_had (condor_HAD) pid 24092 EXITING WITH STATUS 0
01/09/13 10:44:34 OpSysMajorVersion:  6
...

>>> assigned

Comment 6 Robert Rati 2013-01-09 20:47:51 UTC
The issue was that the negotiator and had went down and never came back up.  Since the had and negotiator daemons are restarting, it seems like things are working as expected.  I suspect the reason it takes ~6 minutes for the had to restart is that MASTER_HAD_BACKOFF_CONSTANT = 360 (6 minutes) in the HACentralManager configuration.
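
As a rough check of that explanation, the gaps between the exit lines and the next logged lines quoted in comment 5 can be compared against the 360-second constant. This is a minimal Python sketch, assuming that the first line logged after the exit approximates the restart time and that the constant is in seconds; the helper name gap_seconds is only for illustration.

```python
from datetime import datetime

# Timestamp layout used in the HADLog / NegotiatorLog excerpts (MM/DD/YY HH:MM:SS).
FMT = "%m/%d/%y %H:%M:%S"

# Value quoted in comment 6 for MASTER_HAD_BACKOFF_CONSTANT (assumed to be seconds).
BACKOFF_CONSTANT = 360


def gap_seconds(exit_stamp, next_stamp):
    """Return the whole seconds between an exit line and the next logged line."""
    delta = datetime.strptime(next_stamp, FMT) - datetime.strptime(exit_stamp, FMT)
    return int(delta.total_seconds())


# Timestamps copied from the log excerpts in comment 5.
had_gap = gap_seconds("01/09/13 10:38:33", "01/09/13 10:44:34")
negotiator_gap = gap_seconds("01/09/13 10:38:37", "01/09/13 10:44:54")

print("had gap:", had_gap, "s")                # 361 s
print("negotiator gap:", negotiator_gap, "s")  # 377 s
print("both >= backoff constant:",
      had_gap >= BACKOFF_CONSTANT and negotiator_gap >= BACKOFF_CONSTANT)
```

Both gaps come out slightly above 360 seconds, which is consistent with the backoff explanation but does not by itself confirm it.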

Comment 7 Robert Rati 2013-01-14 19:40:50 UTC
Unable to reproduce the original issue.

