Bug 725990 - High latency when negotiator not available
Summary: High latency when negotiator not available
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-wallaby-base-db
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.0.1
Assignee: Matthew Farrellee
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-07-27 09:11 UTC by Martin Kudlej
Modified: 2011-07-28 13:03 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-28 13:03:12 UTC
Target Upstream Version:


Attachments
condor logs, configuration via wallaby and diff between condor configuration with and without wallaby (60.57 KB, application/x-gzip)
2011-07-27 09:11 UTC, Martin Kudlej
wallaby configuration (5.09 KB, application/octet-stream)
2011-07-27 13:23 UTC, Martin Kudlej

Description Martin Kudlej 2011-07-27 09:11:16 UTC
Created attachment 515448
condor logs, configuration via wallaby and diff between condor configuration with and without wallaby

Description of problem:
If I configure condor via wallaby, it takes too long (~10 minutes) to complete a simple "sleep 1" job. I think there is a problem with the authentication settings in the wallaby db.

Configured by Wallaby:

$  time su condor -s /bin/bash -c "condor_run /bin/sleep 1"

real    10m6.166s
user    0m0.024s
sys     0m0.021s

$ cat NegotiatorLog
...
07/27/11 10:34:08 Socket to condor@_host_ (<_ip_:38659>) not in cache, creating one
07/27/11 10:34:08 SocketCache:  Found unused slot 0 <--- THIS IS NOT TRUE, BECAUSE THERE ARE 2 UNCLAIMED SLOTS


07/27/11 10:34:08     Sending SEND_JOB_INFO/eom
07/27/11 10:34:08     Getting reply from schedd ...
07/27/11 10:34:08 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from schedd condor@_host_.
07/27/11 10:34:08 IO: Failed to read packet header
07/27/11 10:34:08     Failed to get reply from schedd
07/27/11 10:34:08   Error: Ignoring submitter for this cycle
07/27/11 10:34:08  resources used by condor@_host_ are 0.000000
07/27/11 10:34:08  resources used scheddUsed= 0.000000
07/27/11 10:34:08  negotiateWithGroup resources used scheddAds length 0
...

$  condor_configure_pool -g pokus -l
Group "pokus":
Group ID: 2
Name: pokus
Members:
  _host_
Features (priority: name):
  0: Master
  1: NodeAccess
  2: ExecuteNode
  3: CentralManager
  4: Scheduler
Parameters:
  ALLOW_WRITE = *
  CONDOR_HOST = 127.0.0.1
  ALLOW_READ = *

Without wallaby:
$  time su condor -s /bin/bash -c "condor_run /bin/sleep 1"

real    1m5.116s
user    0m0.014s
sys     0m0.020s

Version-Release number of selected component (if applicable):
qpid-qmf-0.10-10.el5
qpid-cpp-server-0.10-8.el5
wallaby-utils-0.10.5-6.el5
wallaby-0.10.5-6.el5
python-condorutils-1.5-4.el5
condor-wallaby-client-4.1-4.el5
qpid-cpp-client-0.10-8.el5
ruby-qpid-qmf-0.10-10.el5
python-qpid-qmf-0.10-10.el5
condor-7.6.3-0.2.el5
condor-wallaby-tools-4.1-4.el5
ruby-wallaby-0.10.5-6.el5
condor-classads-7.6.3-0.2.el5
condor-wallaby-base-db-1.14-1.el5
python-qpid-0.10-1.el5
python-wallabyclient-4.1-4.el5

How reproducible:
100%

Steps to Reproduce:
1. Set up condor via wallaby with these features: 0: Master, 1: NodeAccess, 2: ExecuteNode, 3: CentralManager, 4: Scheduler
2. Run a simple job, "sleep 1".
3. Watch the logs.

Actual results:
Running a simple job takes too long (~10 minutes) because of the wallaby configuration. Normally it takes ~1 minute to complete the job.

Expected results:
Condor configured by wallaby completes jobs in the same time as condor not configured by wallaby.

Comment 1 Martin Kudlej 2011-07-27 13:02:28 UTC
I think this is connected to https://bugzilla.redhat.com/show_bug.cgi?id=652772

Comment 2 Martin Kudlej 2011-07-27 13:23:20 UTC
Created attachment 515526
wallaby configuration

Comment 3 Robert Rati 2011-07-27 18:01:40 UTC
After setting CONDOR_HOST = dhcp-37-168.lab.eng.brq.redhat.com:

time su condor -s /bin/bash -c "condor_run /bin/sleep 1"

real	0m25.333s
user	0m0.026s
sys	0m0.033s

The issue is condor's selection of the network interface on a multi-homed system. On the node being tested, condor is choosing eth0 as the interface, whereas the configuration through wallaby was setting CONDOR_HOST=127.0.0.1 (loopback). The security settings in condor use CONDOR_HOST as an allowed value, so on the node under test 127.0.0.1 was allowed according to the config. Since condor was choosing eth0, the identity used by the client was NOT 127.0.0.1, so authentication was denied.
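
A rough way to catch this kind of configuration before it reaches a node is to scan the parameter list for a loopback CONDOR_HOST. The sketch below is only an illustration: the function names are hypothetical and not part of condor or wallaby, it parses plain "NAME = value" lines like those printed by condor_configure_pool above, and it does not resolve hostnames.

```python
import ipaddress


def condor_host_is_loopback(value: str) -> bool:
    """Return True if value is a literal loopback IP address or 'localhost'."""
    host = value.strip()
    if host.lower() in ("localhost", "localhost.localdomain"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a fully qualified hostname).
        # Assumption: this sketch does no DNS lookup, so names are treated as non-loopback.
        return False


def find_loopback_condor_host(config_text: str):
    """Return the CONDOR_HOST value if it is a loopback address, else None.

    Hypothetical helper: expects one 'NAME = value' pair per line.
    """
    for line in config_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip().upper() == "CONDOR_HOST" and condor_host_is_loopback(value):
            return value.strip()
    return None
```

With the parameters from the "pokus" group this returns "127.0.0.1"; with the hostname used in this comment it returns None.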

Comment 4 Matthew Farrellee 2011-07-28 12:58:28 UTC
Given the use of CONDOR_HOST = 127.0.0.1, I'll assume you were running everything on a single system. When the Schedd could not get a match via the Negotiator, it likely fell back on local claiming (see SCHEDD_ASSUME_NEGOTIATOR_GONE). Look for "Negotiator gone, trying to use our local startd" and "Haven't heard from negotiator, trying to claim local startd @".
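
The search for those two messages can be sketched as a small script. This is a minimal illustration, not a condor tool: the function name is hypothetical, and which log file to feed it (presumably the Schedd's log from the attached archive) is an assumption, since the comment does not name the file.

```python
# The two messages quoted in the comment above.
FALLBACK_MARKERS = (
    "Negotiator gone, trying to use our local startd",
    "Haven't heard from negotiator, trying to claim local startd @",
)


def find_local_claim_fallback(lines):
    """Return (line_number, text) pairs for lines that contain either marker.

    Hypothetical helper: 'lines' is any iterable of log lines, such as an open file.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in FALLBACK_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

For example, `with open(path) as log: print(find_local_claim_fallback(log))`, where `path` points at the log to inspect. An empty list means neither message was logged.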