Red Hat Bugzilla – Bug 472320
Hung Primary Collector Causes Serious Delays
Last modified: 2016-05-24 12:48:06 EDT
Description of problem:
Extreme pauses were observed at the start of execution for condor_submit, condor_q and condor_status (likely all tools). The initial theory was that the primary collector in an HA setup was not running, so the tools had to wait for a timeout before trying the secondary collector. That turned out to be only partially true: the primary collector was running but hung, and as soon as it was killed the CLI tools began operating quickly again. The current theory is that the hung primary collector still had port 9618 open and was queuing connections without accepting or rejecting them.
Thanks to jross for noticing this.
Version-Release number of selected component (if applicable):
Unsure, likely 100%
Steps to Reproduce:
1. Hang the primary Collector in an HA setup, or possibly run nc -l 9618 on the primary host instead
2. Use command-line tools
3. Observe delays
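The hung-collector theory from the description can be approximated locally without Condor. The following is a minimal sketch, not taken from the Condor sources: start_hung_listener and time_client_attempt are hypothetical helper names, and the listener binds to an ephemeral local port rather than 9618. It shows a socket that listens but never calls accept(), so the kernel completes the TCP handshake and queues the connection while the client waits for a reply until its own timeout expires.

```python
import socket
import time


def start_hung_listener(host="127.0.0.1", port=0, backlog=5):
    """Bind and listen but never call accept(), imitating a hung daemon.

    port=0 asks the OS for a free port (assumption: a real test would use 9618).
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(backlog)
    return srv


def time_client_attempt(addr, timeout):
    """Connect, then wait for one byte of reply.

    Returns (connected, replied, elapsed_seconds).
    """
    start = time.monotonic()
    connected = False
    replied = False
    try:
        with socket.create_connection(addr, timeout=timeout) as conn:
            connected = True
            conn.settimeout(timeout)
            # Blocks until data arrives, the peer closes, or the timeout fires.
            replied = bool(conn.recv(1))
    except OSError:
        # Covers connection errors and socket timeouts.
        pass
    return connected, replied, time.monotonic() - start


if __name__ == "__main__":
    srv = start_hung_listener()
    print(time_client_attempt(srv.getsockname(), timeout=1.0))
    srv.close()
```

If the theory holds, the client reports a successful connect, no reply, and an elapsed time close to the full timeout, which matches a tool that stalls before falling back to the secondary collector.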
Expected results:
Faster response from tools
Additional info:
Possible fixes: the connection timeout used by the command-line tools when contacting a Collector could be set very low; a means to access the collectors in parallel could be implemented; ???
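The two ideas above (a short timeout and parallel access) can be combined. What follows is only a sketch under stated assumptions, not how the Condor tools work: probe and first_responsive are hypothetical names, and the collector wire protocol is not modelled, so any byte received from the peer is treated as a sign of life. Because a hung collector may still complete the TCP handshake, the timeout has to cover the wait for a reply as well as the connect.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def probe(addr, timeout):
    """Return addr if the peer sends any bytes within timeout, else None.

    Assumption: a real probe would send a collector query first.
    """
    try:
        with socket.create_connection(addr, timeout=timeout) as conn:
            conn.settimeout(timeout)
            return addr if conn.recv(1) else None
    except OSError:
        # Refused, unreachable, or timed out: treat as unresponsive.
        return None


def first_responsive(addrs, timeout=1.0):
    """Probe all addresses in parallel and return the first one that replies."""
    if not addrs:
        return None
    pool = ThreadPoolExecutor(max_workers=len(addrs))
    try:
        futures = [pool.submit(probe, addr, timeout) for addr in addrs]
        for fut in as_completed(futures):
            result = fut.result()
            if result is not None:
                return result
        return None
    finally:
        # Do not block on probes still waiting on a hung peer.
        pool.shutdown(wait=False)
```

With this shape a hung primary costs the caller nothing as long as a secondary replies; the stalled probe thread simply runs out its own timeout in the background.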
One unexplored, and potentially more significant, issue is what happens to Negotiator->Collector communication: is the effect purely a delay in the negotiation cycle? Updates from daemons should not be a problem, since they are sent via UDP to all collectors at once.