477210 – shadowd: more proactive about reconnecting to a restarted startd listening on a new port

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	1.0
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	2.1.1
Target Release:	---
Assignee:	Timothy St. Clair
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-12-19 17:09 UTC by Pete MacKinnon
Modified:	2011-12-08 19:59 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-11-03 20:56:38 UTC
Target Upstream Version:
Embargoed:

Description Pete MacKinnon 2008-12-19 17:09:08 UTC

Description of problem:
Currently, when a startd restarts and begins listening on a new port, a shadowd with running jobs that has become disconnected from it keeps trying, in vain, to reconnect on the old startd port. The lease eventually expires, the shadowd goes back through a full reconnection sequence, and the shadow jobs are restarted successfully. This process could be expedited if, after a configured amount of time or number of failed reconnect attempts, the shadowd tried to obtain the new port from some configuration lookup (i.e., before the lease expires).
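
A minimal sketch of the proposed policy, in Python for illustration only. It is not condor code: the function name, the thresholds, and the lookup_current callback (standing in for "some config lookup") are all hypothetical.

```python
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]  # (host, port)


def choose_reconnect_address(
    last_known: Address,
    failed_attempts: int,
    seconds_disconnected: float,
    lookup_current: Callable[[], Optional[Address]],  # hypothetical lookup
    max_failed_attempts: int = 3,   # assumed threshold
    max_seconds: float = 60.0,      # assumed threshold, shorter than the lease
) -> Address:
    """Pick the address to use for the next reconnect attempt.

    Keep using the last known startd address until either threshold is
    reached, then ask the lookup for a fresher address.
    """
    threshold_reached = (
        failed_attempts >= max_failed_attempts
        or seconds_disconnected >= max_seconds
    )
    if not threshold_reached:
        return last_known
    refreshed = lookup_current()
    # Fall back to the old address if the lookup has nothing to offer.
    return refreshed if refreshed is not None else last_known
```

The thresholds would need to be well below the lease duration for this to help; otherwise the existing lease-expiry path runs first.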

Version-Release number of selected component (if applicable):
$CondorVersion: 7.2.0 Dec 16 2008 BuildID: RH-7.2.0-0.13.el5 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-LINUX_RHEL5 $

How reproducible:
Should be 100%

Steps to Reproduce:
1. Start personal condor as service
2. Submit 30 sleep jobs
3. Note normal slot/job dispatch behavior in steady state
4. Restart condor services
5. Run condor_q and note that the jobs are in the R state
6. Run condor_status and note all slots are available and idle
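
For step 2, a small Python helper like the following could generate the submit description. This is a sketch under assumptions: the report does not give the original submit file, so the sleep duration, the log file name, and the helper name are made up for illustration.

```python
def sleep_submit_description(job_count: int = 30, sleep_seconds: int = 600) -> str:
    """Return submit description text that queues identical sleep jobs.

    The 600-second sleep and the log file name are assumptions; the
    report only says "30 sleep jobs".
    """
    lines = [
        "universe = vanilla",
        "executable = /bin/sleep",
        f"arguments = {sleep_seconds}",
        "log = sleep.log",
        f"queue {job_count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the file, then submit it with: condor_submit sleep.sub
    with open("sleep.sub", "w") as handle:
        handle.write(sleep_submit_description())
```

The sleep should be long enough that the jobs are still running when the condor services are restarted in step 4.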
  
Actual results:
After a condor restart, condor_q reports the jobs in the R state, but condor_status reports the slots as Unclaimed. The ShadowLog records the reconnection attempts and failures.

Expected results:
The queued jobs should run after a condor restart, assuming slots are available.

Additional info:
Couldn't this just be achieved by tweaking the lease expiry time?

Comment 1 Matthew Farrellee 2010-02-05 21:07:59 UTC

A workaround may be to use STARTD_ARGS = -p 1234 and have the Startd always come up on the same port.

Comment 3 Timothy St. Clair 2011-11-02 21:33:48 UTC

Does this issue even exist any more if one uses shared_port?

Comment 4 Pete MacKinnon 2011-11-02 21:41:41 UTC

Perhaps not.

Comment 5 Timothy St. Clair 2011-11-03 20:56:38 UTC

I cannot reproduce the above by following the instructions (default configuration, using shared port).
