472320 – Hung Primary Collector Causes Serious Delays

Summary: Hung Primary Collector Causes Serious Delays

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	grid
Sub Component:
Version:	1.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	grid-maint-list
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-11-20 02:06 UTC by Matthew Farrellee
Modified:	2016-05-24 16:48 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-05-24 16:48:06 UTC
Target Upstream Version:
Embargoed:

Description Matthew Farrellee 2008-11-20 02:06:18 UTC

Description of problem:

Extreme pauses were observed at the start of execution for condor_submit, condor_q and condor_status (likely all tools). The initial theory was that the primary collector in an HA setup was not running, so the tools had to wait for a timeout before trying the secondary collector. That turned out to be only partially true: the primary collector was still running but hung, and as soon as it was killed the CLI tools began operating quickly again. The current theory is that the hung primary collector still had port 9618 open and was queuing connections without accepting or rejecting them.

Thanks to jross for noticing this.


Version-Release number of selected component (if applicable):

7.2.0-0.2


How reproducible:

Unsure, likely 100%


Steps to Reproduce:
1. Hang the primary Collector in an HA setup, or maybe run nc -l 9618 on the primary instead
2. Use command-line tools
3. Observe delays
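
A minimal Python sketch of the suspected failure mode, for anyone who wants to see the symptom without an HA pool. It is not Condor code and only illustrates the theory above: a socket that listens but never calls accept() still lets clients connect, so a client only notices the problem when its read times out. The names start_hung_listener and probe are made up for this illustration, the request bytes are arbitrary, and an ephemeral port stands in for 9618.

```python
import socket
import time


def start_hung_listener(host="127.0.0.1"):
    """Bind and listen, but never call accept(); the kernel still queues connections."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # ephemeral port stands in for 9618 (assumption)
    srv.listen(5)
    return srv


def probe(host, port, timeout):
    """Connect, send a request, wait for a reply; return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"query")  # arbitrary bytes, not a real Condor command
            data = s.recv(1)     # blocks until a reply, a close, or the timeout
            outcome = "replied" if data else "closed"
    except socket.timeout:
        outcome = "timed out"    # listener queued the connection but never served it
    except OSError:
        outcome = "refused"      # nothing is listening, so the failure is immediate
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    srv = start_hung_listener()
    port = srv.getsockname()[1]
    print("hung listener:", probe("127.0.0.1", port, timeout=1.0))
    srv.close()
    print("dead listener:", probe("127.0.0.1", port, timeout=1.0))
```

If the theory holds, the hung case costs the full timeout while the dead case fails right away, which would match the tools speeding up once the hung collector was killed.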


Expected results:

Faster response from tools


Additional info:

Possible fixes: the timeout used when connecting to a Collector could be set very low for the command-line tools, or a means of accessing collectors in parallel could be implemented. Other approaches may exist.
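
A rough Python sketch of the parallel-access idea, assuming each collector can be queried independently over TCP. This is not how the Condor tools are implemented; query_collector is a hypothetical stand-in for a real collector query, and first_responder, the request bytes and the timeout value are likewise illustrative.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_collector(address, timeout):
    """Hypothetical stand-in for a collector query: connect, send, read one reply."""
    with socket.create_connection(address, timeout=timeout) as s:
        s.sendall(b"query")  # arbitrary bytes, not a real Condor command
        reply = s.recv(64)
        if not reply:
            raise ConnectionError("collector closed the connection")
        return reply


def first_responder(addresses, timeout=1.0):
    """Query every collector at once; return (address, reply) from the first success."""
    if not addresses:
        raise ValueError("at least one collector address is required")
    pool = ThreadPoolExecutor(max_workers=len(addresses))
    futures = {pool.submit(query_collector, a, timeout): a for a in addresses}
    try:
        for fut in as_completed(futures):
            try:
                return futures[fut], fut.result()
            except OSError:
                continue  # hung, refused, or closed: wait for the next collector
        raise ConnectionError("no collector responded")
    finally:
        # Do not wait for the slow collectors; their threads end at their own timeout.
        pool.shutdown(wait=False)
```

With this shape a hung primary costs nothing as long as the secondary answers, at the price of sending every query to every collector. The per-query timeout still bounds how long the leftover worker threads linger.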

One unexplored and potentially more significant issue is what happens to Negotiator->Collector communication: is the effect purely a delay in the negotiation cycle? Updates from daemons should not be a problem, since they are sent via UDP to all collectors at once.
