Description of problem:
Currently, a shadowd with jobs that has become disconnected from a startd that is now listening on a new port tries in vain to reconnect on the old startd port. The lease will eventually expire, the shadowd will go back through a full reconnection sequence, and the shadow jobs will be restarted successfully. This process could be expedited if the shadowd, after a configured amount of time or number of failed reconnect attempts, tried to get the new port from some config lookup (i.e., before the lease expires).

Version-Release number of selected component (if applicable):
$CondorVersion: 7.2.0 Dec 16 2008 BuildID: RH-7.2.0-0.13.el5 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-LINUX_RHEL5 $

How reproducible:
Should be 100%

Steps to Reproduce:
1. Start personal condor as service
2. Submit 30 sleep jobs
3. Note normal slot/job dispatch behavior in steady state
4. Restart condor services
5. Run condor_q and note jobs are in R
6. Run condor_status and note all slots are available and idle

Actual results:
After a condor restart, condor_q reports jobs in R state but condor_status says slots are Unclaimed. The ShadowLog reports the reconnection attempts and failures.

Expected results:
The queued jobs should be run after a condor restart, assuming slots are available.

Additional info:

Couldn't this just be achieved by tweaking the lease expiry time?
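For discussion, the reconnect behavior requested in the description could look roughly like the sketch below. This is only an illustration of the idea, not Condor code: `reconnect`, `try_connect`, `lookup_port`, `max_failures`, and `max_attempts` are hypothetical names, and the port lookup is left as a caller-supplied callback because the report does not say where the new port would come from.

```python
from typing import Callable, Optional


def reconnect(
    try_connect: Callable[[int], bool],
    lookup_port: Callable[[], Optional[int]],
    old_port: int,
    max_failures: int = 3,   # hypothetical knob: failures before re-checking the port
    max_attempts: int = 10,  # hypothetical stand-in for "until the lease expires"
) -> Optional[int]:
    """Return the port that accepted the connection, or None if every attempt failed."""
    port = old_port
    failures = 0
    for _ in range(max_attempts):
        if try_connect(port):
            return port
        failures += 1
        if failures >= max_failures:
            # Instead of retrying the stale port until the lease expires,
            # ask for the startd's current port and switch to it if it changed.
            new_port = lookup_port()
            if new_port is not None and new_port != port:
                port = new_port
            failures = 0
    return None


if __name__ == "__main__":
    # Simulated startd that came back up on port 40002 after a restart.
    tried = []

    def fake_connect(p: int) -> bool:
        tried.append(p)
        return p == 40002

    print(reconnect(fake_connect, lambda: 40002, old_port=40001), tried)
```

In the simulated run, the old port is tried `max_failures` times before the lookup redirects the next attempt to the new port. Whether a threshold on elapsed time or on failure count is the better trigger is left open, as in the description.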
A workaround may be to set STARTD_ARGS = -p 1234 so that the startd always comes up on the same port.
Does this issue even exist any more if one uses shared_port?
Perhaps not.
I cannot reproduce the above by following the instructions (using the default configuration, which uses shared port).