| Summary: | High latency when negotiator not available | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Martin Kudlej <mkudlej> |
| Component: | condor-wallaby-base-db | Assignee: | Matthew Farrellee <matt> |
| Status: | CLOSED NOTABUG | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.0 | CC: | matt |
| Target Milestone: | 2.0.1 | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-07-28 13:03:12 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |

I think this is connected to https://bugzilla.redhat.com/show_bug.cgi?id=652772

Created attachment 515526 [details]
wallaby configuration

After setting CONDOR_HOST = dhcp-37-168.lab.eng.brq.redhat.com:

    time su condor -s /bin/bash -c "condor_run /bin/sleep 1"
    real 0m25.333s
    user 0m0.026s
    sys 0m0.033s

The issue is condor's selection of the network interface on a multi-homed system. On the node being tested, condor chose eth0 as the interface, whereas the configuration applied through wallaby set CONDOR_HOST = 127.0.0.1 (loopback). The security settings in condor use CONDOR_HOST as an allowed value, so on the node under test 127.0.0.1 was allowed according to the config. Since condor chose eth0, the identity used by the client was NOT 127.0.0.1, so authentication was denied.

Given the use of CONDOR_HOST = 127.0.0.1, I'll assume you were running everything on a single system. When the Schedd could not get a match via the Negotiator, it likely fell back on local claiming, SCHEDD_ASSUME_NEGOTIATOR_GONE. Look for "Negotiator gone, trying to use our local startd" and "Haven't heard from negotiator, trying to claim local startd @".

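The two messages quoted above can be searched for mechanically. The following is a minimal Python sketch, assuming the schedd log is available as an iterable of text lines. The function name `find_fallback_lines` is hypothetical and not part of condor; the marker strings are the ones quoted in the comment above.

```python
# Marker strings quoted in the comment above; they indicate that the Schedd
# gave up on the Negotiator and tried to claim the local startd.
FALLBACK_MARKERS = (
    "Negotiator gone, trying to use our local startd",
    "Haven't heard from negotiator, trying to claim local startd @",
)


def find_fallback_lines(lines):
    """Return (line_number, text) pairs for lines containing a fallback marker.

    `lines` is any iterable of strings, for example an open log file.
    Hypothetical helper for illustration, not a condor tool.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in FALLBACK_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

A typical call would pass an open file object for the schedd log (commonly named SchedLog; its location on a given node is an assumption here) and print the returned pairs.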
Created attachment 515448 [details]
condor logs, configuration via wallaby and diff between condor configuration with and without wallaby

Description of problem:
If I configure condor via wallaby, it takes too long (~10 mins) to complete a simple "sleep 1" job. I think there is a problem with the authentication settings in the wallaby db.

Configured by Wallaby:

    $ time su condor -s /bin/bash -c "condor_run /bin/sleep 1"
    real 10m6.166s
    user 0m0.024s
    sys 0m0.021s

    $ cat NegotiatorLog
    ...
    07/27/11 10:34:08 Socket to condor@_host_ (<_ip_:38659>) not in cache, creating one
    07/27/11 10:34:08 SocketCache: Found unused slot 0   <--- THIS IS NOT TRUE, BECAUSE THERE ARE 2 UNCLAIMED SLOTS
    07/27/11 10:34:08 Sending SEND_JOB_INFO/eom
    07/27/11 10:34:08 Getting reply from schedd ...
    07/27/11 10:34:08 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from schedd condor@_host_.
    07/27/11 10:34:08 IO: Failed to read packet header
    07/27/11 10:34:08 Failed to get reply from schedd
    07/27/11 10:34:08 Error: Ignoring submitter for this cycle
    07/27/11 10:34:08 resources used by condor@_host_ are 0.000000
    07/27/11 10:34:08 resources used scheddUsed= 0.000000
    07/27/11 10:34:08 negotiateWithGroup resources used scheddAds length 0
    ...

    $ condor_configure_pool -g pokus -l
    Group "pokus":
    Group ID: 2
    Name: pokus
    Members: _host_
    Features (priority: name):
      0: Master
      1: NodeAccess
      2: ExecuteNode
      3: CentralManager
      4: Scheduler
    Parameters:
      ALLOW_WRITE = *
      CONDOR_HOST = 127.0.0.1
      ALLOW_READ = *

Without wallaby:

    $ time su condor -s /bin/bash -c "condor_run /bin/sleep 1"
    real 1m5.116s
    user 0m0.014s
    sys 0m0.020s

Version-Release number of selected component (if applicable):
qpid-qmf-0.10-10.el5
qpid-cpp-server-0.10-8.el5
wallaby-utils-0.10.5-6.el5
wallaby-0.10.5-6.el5
python-condorutils-1.5-4.el5
condor-wallaby-client-4.1-4.el5
qpid-cpp-client-0.10-8.el5
ruby-qpid-qmf-0.10-10.el5
python-qpid-qmf-0.10-10.el5
condor-7.6.3-0.2.el5
condor-wallaby-tools-4.1-4.el5
ruby-wallaby-0.10.5-6.el5
condor-classads-7.6.3-0.2.el5
condor-wallaby-base-db-1.14-1.el5
python-qpid-0.10-1.el5
python-wallabyclient-4.1-4.el5

How reproducible:
100%

Steps to Reproduce:
1. Set up condor via wallaby with these features: 0: Master, 1: NodeAccess, 2: ExecuteNode, 3: CentralManager, 4: Scheduler
2. Run a simple "sleep 1" job
3. Watch the logs

Actual results:
A simple job takes too long (~10 mins) to run because of the wallaby configuration. Normally it takes ~1 min to complete a job.

Expected results:
Condor configured by wallaby completes jobs in the same time as condor not configured by wallaby.
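
The failed schedd exchange in the NegotiatorLog excerpt from the description can be picked out with a small parser. The following is a minimal Python sketch, assuming every log line starts with a `MM/DD/YY HH:MM:SS` timestamp as in the excerpt. The names `parse_line` and `failed_exchanges` are hypothetical, and the failure markers are only the substrings visible in the excerpt, not a complete list of condor error messages.

```python
import re
from datetime import datetime

# Timestamp layout taken from the excerpt: MM/DD/YY HH:MM:SS, then the message.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Substrings from the excerpt that mark the failed exchange with the schedd.
FAILURE_MARKERS = (
    "condor_read() failed",
    "Failed to read packet header",
    "Failed to get reply from schedd",
    "Error: Ignoring submitter for this cycle",
)


def parse_line(line):
    """Split one log line into (datetime, message); return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return stamp, match.group(2)


def failed_exchanges(lines):
    """Return the (datetime, message) pairs whose message contains a failure marker."""
    failures = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and any(m in parsed[1] for m in FAILURE_MARKERS):
            failures.append(parsed)
    return failures
```

Run against the excerpt, this would return the four lines from `condor_read() failed` through `Error: Ignoring submitter for this cycle`, which is the point where the negotiator drops the submitter for the cycle.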