Bug 682442

Summary: condor_* tools exhibiting 5 to 15 second stalls
Product: Red Hat Enterprise MRG Reporter: Matthew Farrellee <matt>
Component: condorAssignee: Matthew Farrellee <matt>
Status: CLOSED NOTABUG QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: Development   
Target Milestone: 2.0   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-03-08 19:13:23 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Awk script to detect gaps none

Description Matthew Farrellee 2011-03-05 15:23:04 UTC
Created attachment 482455 [details]
Awk script to detect gaps

Description of problem:

condor_q, condor_rm, condor_hold, condor_release, condor_reconfig, condor_restart, condor_reschedule, condor_qedit, condor_off, condor_on, condor_vacate, condor_submit and condor_status are exhibiting inconsistent response time behavior. It is expected that when processing no data, for instance condor_q on an empty queue, condor_q should return quickly with only potential delays during Schedd communication.

condor_q with fast communication to the Schedd is frequently taking 5 to 15 seconds to execute.

The delays are suspiciously multiples of 5.

Presumed also condor_userprio and condor_prio (no -debug option to verify)


Version-Release number of selected component (if applicable):

condor 7.6.0-0.1 for sure, likely back to 7.5.6-0.1


How reproducible:

20%


Steps to Reproduce:

1. time _CONDOR_TOOL_DEBUG=D_ALL condor_q -debug 2>&1 | awk -f LogTimeGap.awk 1

Or one of...
condor_rm -debug -a
condor_hold -debug -a
condor_release -debug -a
condor_reconfig -debug -subsys none
condor_restart -debug -subsys none
condor_reschedule
condor_qedit -debug 1 A 1
condor_off -debug -subsys none
condor_on -debug -subsys none
condor_vacate -debug -a
condor_status -debug -total
echo "cmd=/bin/true\nqueue" | condor_submit -debug

  
Actual results:

NOTE: grid, grid0 and grid1 represent fully qualified domain names, and the 10.10.10.10 is a real 10/8 IP that has been munged

Maximum allowable gap: 1 seconds
Found gap of 5 seconds:
 03/05/11 09:50:54 (fd:2) (pid:26671) IPVERIFY: allow ADMINISTRATOR: grid, grid0, grid1, grid0 (from config value ALLOW_ADMINISTRATOR)
 03/05/11 09:50:59 (fd:2) (pid:26671) IPVERIFY: allow OWNER: grid0, grid, grid0, grid1, grid0 (from config value ALLOW_OWNER)
Found gap of 10 seconds:
 03/05/11 09:50:59 (fd:2) (pid:26671) IPVERIFY: allow OWNER: grid0, grid, grid0, grid1, grid0 (from config value ALLOW_OWNER)
 03/05/11 09:51:09 (fd:2) (pid:26671) ipverify: CONFIG optimized to deny everyone

real	0m15.060s
user	0m0.055s
sys	0m0.011s


Maximum allowable gap: 1 seconds
Found gap of 5 seconds:
 03/05/11 09:44:11 (fd:2) (pid:26562) Calling gethostbyname(grid)
 03/05/11 09:44:16 (fd:2) (pid:26562) Found IP addr in hostent: 10.10.10.10

real	0m5.102s
user	0m0.050s
sys	0m0.021s


Expected results:

Maximum allowable gap: 1 seconds

real	0m0.052s
user	0m0.052s
sys	0m0.016s

Comment 1 Matthew Farrellee 2011-03-05 17:22:48 UTC
Digging showed getaddrinfo() was the culprit, and only when resolving grid. grid is actually a CNAME for grid0.

Eliminating grid from configuration, e.g. _CONDOR_CONDOR_HOST=grid0, and running 10,000 condor_reschedule's showed zero of these delays.

stracre showed the delay occurred when waiting (sendto, poll [delay], recvfrom) for a response from the DNS server.

Comment 2 Matthew Farrellee 2011-03-05 17:27:07 UTC
The run of 10,000 condor_reschedules did show,

Sat Mar  5 12:04:00 EST 2011
4656
Maximum allowable gap: 1 seconds
Found gap of 3 seconds:
 03/05/11 12:04:00 (fd:2) (pid:25170) Using port 37009 based on address "<10.10.10.10:37009>"
 03/05/11 12:04:03 (fd:2) (pid:25170) CONNECT bound to <10.10.10.10:33448> fd=3 peer=<10.10.10.10:37009>

real	0m3.019s
user	0m0.036s
sys	0m0.006s

Sat Mar  5 12:04:30 EST 2011
5398
Maximum allowable gap: 1 seconds
Found gap of 9 seconds:
 03/05/11 12:04:30 (fd:2) (pid:27417) Using port 37009 based on address "<10.10.10.10:37009>"
 03/05/11 12:04:39 (fd:2) (pid:27417) CONNECT bound to <10.10.10.10:41647> fd=3 peer=<10.10.10.10:37009>

real	0m9.020s
user	0m0.035s
sys	0m0.008s

Sat Mar  5 12:05:30 EST 2011
6799
Maximum allowable gap: 1 seconds
Found gap of 3 seconds:
 03/05/11 12:05:30 (fd:2) (pid:31686) Using port 37009 based on address "<10.10.10.10:37009>"
 03/05/11 12:05:33 (fd:2) (pid:31686) CONNECT bound to <10.10.10.10:46311> fd=3 peer=<10.10.10.10:37009>

real	0m3.020s
user	0m0.038s
sys	0m0.007s

Sat Mar  5 12:05:40 EST 2011
6992
Maximum allowable gap: 1 seconds
Found gap of 3 seconds:
 03/05/11 12:05:40 (fd:2) (pid:32269) Using port 37009 based on address "<10.10.10.10:37009>"
 03/05/11 12:05:43 (fd:2) (pid:32269) CONNECT bound to <10.10.10.10:39197> fd=3 peer=<10.10.10.10:37009>

real	0m3.020s
user	0m0.037s
sys	0m0.009s

NOTE: 37009 is the Schedd's command port

Comment 3 Matthew Farrellee 2011-03-05 18:11:37 UTC
The CONNECT delays coincide with delays in the SchedLog.

condor_reschedule,

Found gap of 3 seconds:
 03/05/11 13:07:38 (fd:2) (pid:16124) Using port 37009 based on address "<10.10.10.10:37009>"
 03/05/11 13:07:41 (fd:2) (pid:16124) CONNECT bound to <10.10.10.10:51644> fd=3 peer=<10.10.10.10:37009>

SchedLog,

Found gap of 5 seconds:
 03/05/11 13:07:36 (fd:20) (pid:27463) Calling gethostbyname(grid)
 03/05/11 13:07:41 (fd:20) (pid:27463) Found IP addr in hostent: 10.10.10.10

Comment 4 Matthew Farrellee 2011-03-05 18:24:52 UTC
With CONDOR_HOST=grid0 for the Schedd, a run of 10,000 condor_reschedules shows no CONNECT gaps.