Bug 725990

Summary: High latency when negotiator not available
Product: Red Hat Enterprise MRG
Reporter: Martin Kudlej <mkudlej>
Component: condor-wallaby-base-db
Assignee: Matthew Farrellee <matt>
Status: CLOSED NOTABUG
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high
Priority: high
Version: 2.0
CC: matt
Target Milestone: 2.0.1
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-07-28 13:03:12 UTC

Attachments:

Description	Flags
condor logs, configuration via wallaby and diff between condor configuration with and without wallaby	none
wallaby configuration	none

Description Martin Kudlej 2011-07-27 09:11:16 UTC

Created attachment 515448
condor logs, configuration via wallaby and diff between condor configuration with and without wallaby

Description of problem:
If I configure condor via wallaby, a simple "sleep 1" job takes too long (~10 minutes) to complete. I think there is a problem with the authentication settings in the wallaby db.

Configured by Wallaby:

$  time su condor -s /bin/bash -c "condor_run /bin/sleep 1"

real    10m6.166s
user    0m0.024s
sys     0m0.021s

$ cat NegotiatorLog
...
07/27/11 10:34:08 Socket to condor@_host_ (<_ip_:38659>) not in cache, creating one
07/27/11 10:34:08 SocketCache:  Found unused slot 0 <--- THIS IS NOT TRUE, BECAUSE THERE ARE 2 UNCLAIMED SLOTS


07/27/11 10:34:08     Sending SEND_JOB_INFO/eom
07/27/11 10:34:08     Getting reply from schedd ...
07/27/11 10:34:08 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from schedd condor@_host_.
07/27/11 10:34:08 IO: Failed to read packet header
07/27/11 10:34:08     Failed to get reply from schedd
07/27/11 10:34:08   Error: Ignoring submitter for this cycle
07/27/11 10:34:08  resources used by condor@_host_ are 0.000000
07/27/11 10:34:08  resources used scheddUsed= 0.000000
07/27/11 10:34:08  negotiateWithGroup resources used scheddAds length 0
...

$  condor_configure_pool -g pokus -l
Group "pokus":
Group ID: 2
Name: pokus
Members:
  _host_
Features (priority: name):
  0: Master
  1: NodeAccess
  2: ExecuteNode
  3: CentralManager
  4: Scheduler
Parameters:
  ALLOW_WRITE = *
  CONDOR_HOST = 127.0.0.1
  ALLOW_READ = *

Without wallaby:
$  time su condor -s /bin/bash -c "condor_run /bin/sleep 1"

real    1m5.116s
user    0m0.014s
sys     0m0.020s

Version-Release number of selected component (if applicable):
qpid-qmf-0.10-10.el5
qpid-cpp-server-0.10-8.el5
wallaby-utils-0.10.5-6.el5
wallaby-0.10.5-6.el5
python-condorutils-1.5-4.el5
condor-wallaby-client-4.1-4.el5
qpid-cpp-client-0.10-8.el5
ruby-qpid-qmf-0.10-10.el5
python-qpid-qmf-0.10-10.el5
condor-7.6.3-0.2.el5
condor-wallaby-tools-4.1-4.el5
ruby-wallaby-0.10.5-6.el5
condor-classads-7.6.3-0.2.el5
condor-wallaby-base-db-1.14-1.el5
python-qpid-0.10-1.el5
python-wallabyclient-4.1-4.el5

How reproducible:
100%

Steps to Reproduce:
1. Set up condor via wallaby with these features:  0: Master, 1: NodeAccess, 2: ExecuteNode, 3: CentralManager, 4: Scheduler
2. Run a simple job, "sleep 1"
3. Watch the logs
  
Actual results:
A simple job takes too long (~10 minutes) to run because of the wallaby configuration. Normally it takes ~1 minute to complete.

Expected results:
Condor configured by wallaby completes jobs in the same time as condor not configured by wallaby.

Comment 1 Martin Kudlej 2011-07-27 13:02:28 UTC

I think this is connected to https://bugzilla.redhat.com/show_bug.cgi?id=652772

Comment 2 Martin Kudlej 2011-07-27 13:23:20 UTC

Created attachment 515526
wallaby configuration

Comment 3 Robert Rati 2011-07-27 18:01:40 UTC

After setting CONDOR_HOST = dhcp-37-168.lab.eng.brq.redhat.com:

time su condor -s /bin/bash -c "condor_run /bin/sleep 1"

real	0m25.333s
user	0m0.026s
sys	0m0.033s

The issue is condor's selection of the network interface on a multi-homed system.  On the node being tested, condor chose eth0 as its interface, whereas the configuration through wallaby set CONDOR_HOST=127.0.0.1 (loopback).  The security settings in condor use CONDOR_HOST as an allowed value, so on the node under test 127.0.0.1 was allowed according to the config.  Since condor chose eth0, the identity presented by the client was NOT 127.0.0.1, so authentication was denied.
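
To make the mismatch concrete, here is a minimal Python sketch of the comparison described above. It is not condor's code: the exact-match rule is a simplification, the allow list built from CONDOR_HOST is an assumption about how the security settings expand, and the eth0 address is a hypothetical value from the documentation range.

```python
import ipaddress
from typing import Iterable


def is_allowed(peer_ip: str, allow_list: Iterable[str]) -> bool:
    """Return True if peer_ip equals an IP entry in allow_list, or an entry is '*'."""
    peer = ipaddress.ip_address(peer_ip)
    for entry in allow_list:
        if entry == "*":
            return True
        try:
            if peer == ipaddress.ip_address(entry):
                return True
        except ValueError:
            # Hostnames and wildcard patterns are out of scope for this sketch.
            continue
    return False


CONDOR_HOST = "127.0.0.1"     # value set through wallaby in this report
ETH0_ADDRESS = "192.0.2.10"   # hypothetical eth0 address, not taken from the logs

# Assumed: an allow list that expands to the value of CONDOR_HOST.
allow_from_condor_host = [CONDOR_HOST]

# A daemon that bound to eth0 presents the eth0 address and does not match.
print(is_allowed(ETH0_ADDRESS, allow_from_condor_host))
# A daemon that bound to loopback would match.
print(is_allowed("127.0.0.1", allow_from_condor_host))
```

With CONDOR_HOST set to the node's real hostname, the allowed value and the address condor presents line up again, which is consistent with the 25-second run above.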

Comment 4 Matthew Farrellee 2011-07-28 12:58:28 UTC

Given the use of CONDOR_HOST = 127.0.0.1, I'll assume you were running everything on a single system. When the Schedd could not get a match via the Negotiator, it likely fell back on local claiming once the SCHEDD_ASSUME_NEGOTIATOR_GONE period had passed. Look for "Negotiator gone, trying to use our local startd" and "Haven't heard from negotiator, trying to claim local startd @".
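
One way to check for those two messages is a small Python sketch like the one below. It only does substring matching on the strings quoted above; the function names are made up for illustration, and which log file to pass in (presumably the Schedd's log) is an assumption.

```python
from typing import Iterable, List

# The two messages quoted in this comment.
FALLBACK_MARKERS = (
    "Negotiator gone, trying to use our local startd",
    "Haven't heard from negotiator, trying to claim local startd @",
)


def find_fallback_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain either local-claim fallback message."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in FALLBACK_MARKERS)
    ]


def scan_log(path: str) -> List[str]:
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as log:
        return find_fallback_lines(log)
```

Calling scan_log() on the Schedd's log file and printing the result should show whether the job ran through local claiming rather than through a negotiation cycle.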