Bug 864637
Summary: | 'condor_restart -subsystem had' causes had and negotiator to shutdown | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Robert Rati <rrati> |
Component: | condor | Assignee: | Robert Rati <rrati> |
Status: | CLOSED WORKSFORME | QA Contact: | Lubos Trilety <ltrilety> |
Severity: | unspecified | Docs Contact: | |
Priority: | low | ||
Version: | 2.2 | CC: | esammons, ltrilety, matt, trusnak, tstclair |
Target Milestone: | 2.3 | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | condor-7.8.6-0.2 | Doc Type: | Bug Fix |
Doc Text: |
Cause: Issuing 'condor_restart -subsystem had' on a HACM node running the negotiator
Consequence: The had and negotiator would stop, but the had would not restart
Fix: Ensure the had daemon will restart
Result: The had daemon will restart
|
Last Closed: | 2013-01-14 19:40:50 UTC | Type: | Bug |
Bug Blocks: | 845292 |
Description
Robert Rati
2012-10-09 20:12:59 UTC
The replication isn't actually going down, just the had and negotiator. I tested this on condor-7.8.8-0.1; it still takes about 6 minutes until the negotiator and HAD start again.
# cat NegotiatorLog
...
01/09/13 10:38:37 **** condor_negotiator (condor_NEGOTIATOR) pid 24179 EXITING WITH STATUS 0
01/09/13 10:44:54 OpSysMajorVersion: 6
...
# cat HADLog
...
01/09/13 10:38:33 **** condor_had (condor_HAD) pid 24092 EXITING WITH STATUS 0
01/09/13 10:44:34 OpSysMajorVersion: 6
...
>>> assigned
The original issue was that the negotiator and had went down and never came back up. Since the had and negotiator daemons are now restarting, things appear to be working as expected. I suspect it takes ~6 minutes for the had to restart because MASTER_HAD_BACKOFF_CONSTANT = 360 (6 minutes) in the HACentralManager configuration. Unable to reproduce the original issue.
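The ~6 minute figure can be checked against the log timestamps quoted above. The following is a rough sketch, not part of the original report: it assumes the first line logged after "EXITING WITH STATUS" marks the restarted daemon, and the helper names (timestamp, restart_gap_seconds) are made up for illustration.

```python
from datetime import datetime

# Log lines copied verbatim from the HADLog and NegotiatorLog excerpts above.
HAD_LOG = [
    "01/09/13 10:38:33 **** condor_had (condor_HAD) pid 24092 EXITING WITH STATUS 0",
    "01/09/13 10:44:34 OpSysMajorVersion: 6",
]
NEGOTIATOR_LOG = [
    "01/09/13 10:38:37 **** condor_negotiator (condor_NEGOTIATOR) pid 24179 EXITING WITH STATUS 0",
    "01/09/13 10:44:54 OpSysMajorVersion: 6",
]

# Value quoted in the comment above (MASTER_HAD_BACKOFF_CONSTANT), in seconds.
BACKOFF_CONSTANT = 360


def timestamp(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' stamp of a log line."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


def restart_gap_seconds(lines):
    """Seconds between the EXITING line and the next logged line, or None."""
    for current, following in zip(lines, lines[1:]):
        if "EXITING WITH STATUS" in current:
            return int((timestamp(following) - timestamp(current)).total_seconds())
    return None


for name, log in (("HAD", HAD_LOG), ("NEGOTIATOR", NEGOTIATOR_LOG)):
    gap = restart_gap_seconds(log)
    print(f"{name}: down for {gap}s ({gap - BACKOFF_CONSTANT}s over the backoff constant)")
# HAD: down for 361s (1s over the backoff constant)
# NEGOTIATOR: down for 377s (17s over the backoff constant)
```

Both gaps land just above 360 seconds, which is consistent with the backoff explanation. The negotiator comes up about 20 seconds after the had.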