Bug 682442
Summary: | condor_* tools exhibiting 5 to 15 second stalls | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> | ||||
Component: | condor | Assignee: | Matthew Farrellee <matt> | ||||
Status: | CLOSED NOTABUG | QA Contact: | MRG Quality Engineering <mrgqe-bugs> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | Development | ||||||
Target Milestone: | 2.0 | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2011-03-08 19:13:23 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Digging showed getaddrinfo() was the culprit, and only when resolving grid. grid is actually a CNAME for grid0. Eliminating grid from configuration, e.g. _CONDOR_CONDOR_HOST=grid0, and running 10,000 condor_reschedule's showed zero of these delays. stracre showed the delay occurred when waiting (sendto, poll [delay], recvfrom) for a response from the DNS server. The run of 10,000 condor_reschedules did show, Sat Mar 5 12:04:00 EST 2011 4656 Maximum allowable gap: 1 seconds Found gap of 3 seconds: 03/05/11 12:04:00 (fd:2) (pid:25170) Using port 37009 based on address "<10.10.10.10:37009>" 03/05/11 12:04:03 (fd:2) (pid:25170) CONNECT bound to <10.10.10.10:33448> fd=3 peer=<10.10.10.10:37009> real 0m3.019s user 0m0.036s sys 0m0.006s Sat Mar 5 12:04:30 EST 2011 5398 Maximum allowable gap: 1 seconds Found gap of 9 seconds: 03/05/11 12:04:30 (fd:2) (pid:27417) Using port 37009 based on address "<10.10.10.10:37009>" 03/05/11 12:04:39 (fd:2) (pid:27417) CONNECT bound to <10.10.10.10:41647> fd=3 peer=<10.10.10.10:37009> real 0m9.020s user 0m0.035s sys 0m0.008s Sat Mar 5 12:05:30 EST 2011 6799 Maximum allowable gap: 1 seconds Found gap of 3 seconds: 03/05/11 12:05:30 (fd:2) (pid:31686) Using port 37009 based on address "<10.10.10.10:37009>" 03/05/11 12:05:33 (fd:2) (pid:31686) CONNECT bound to <10.10.10.10:46311> fd=3 peer=<10.10.10.10:37009> real 0m3.020s user 0m0.038s sys 0m0.007s Sat Mar 5 12:05:40 EST 2011 6992 Maximum allowable gap: 1 seconds Found gap of 3 seconds: 03/05/11 12:05:40 (fd:2) (pid:32269) Using port 37009 based on address "<10.10.10.10:37009>" 03/05/11 12:05:43 (fd:2) (pid:32269) CONNECT bound to <10.10.10.10:39197> fd=3 peer=<10.10.10.10:37009> real 0m3.020s user 0m0.037s sys 0m0.009s NOTE: 37009 is the Schedd's command port The CONNECT delays coincide with delays in the SchedLog. condor_reschedule, Found gap of 3 seconds: 03/05/11 13:07:38 (fd:2) (pid:16124) Using port 37009 based on address "<10.10.10.10:37009>" 03/05/11 13:07:41 (fd:2) (pid:16124) CONNECT bound to <10.10.10.10:51644> fd=3 peer=<10.10.10.10:37009> SchedLog, Found gap of 5 seconds: 03/05/11 13:07:36 (fd:20) (pid:27463) Calling gethostbyname(grid) 03/05/11 13:07:41 (fd:20) (pid:27463) Found IP addr in hostent: 10.10.10.10 With CONDOR_HOST=grid0 for the Schedd, a run of 10,000 condor_reschedules shows no CONNECT gaps. |
Created attachment 482455 [details] Awk script to detect gaps Description of problem: condor_q, condor_rm, condor_hold, condor_release, condor_reconfig, condor_restart, condor_reschedule, condor_qedit, condor_off, condor_on, condor_vacate, condor_submit and condor_status are exhibiting inconsistent response time behavior. It is expected that when processing no data, for instance condor_q on an empty queue, condor_q should return quickly with only potential delays during Schedd communication. condor_q with fast communication to the Schedd is frequently taking 5 to 15 seconds to execute. The delays are suspiciously multiples of 5. Presumed also condor_userprio and condor_prio (no -debug option to verify) Version-Release number of selected component (if applicable): condor 7.6.0-0.1 for sure, likely back to 7.5.6-0.1 How reproducible: 20% Steps to Reproduce: 1. time _CONDOR_TOOL_DEBUG=D_ALL condor_q -debug 2>&1 | awk -f LogTimeGap.awk 1 Or one of... condor_rm -debug -a condor_hold -debug -a condor_release -debug -a condor_reconfig -debug -subsys none condor_restart -debug -subsys none condor_reschedule condor_qedit -debug 1 A 1 condor_off -debug -subsys none condor_on -debug -subsys none condor_vacate -debug -a condor_status -debug -total echo "cmd=/bin/true\nqueue" | condor_submit -debug Actual results: NOTE: grid, grid0 and grid1 represent fully qualified domain names, and the 10.10.10.10 is a real 10/8 IP that has been munged Maximum allowable gap: 1 seconds Found gap of 5 seconds: 03/05/11 09:50:54 (fd:2) (pid:26671) IPVERIFY: allow ADMINISTRATOR: grid, grid0, grid1, grid0 (from config value ALLOW_ADMINISTRATOR) 03/05/11 09:50:59 (fd:2) (pid:26671) IPVERIFY: allow OWNER: grid0, grid, grid0, grid1, grid0 (from config value ALLOW_OWNER) Found gap of 10 seconds: 03/05/11 09:50:59 (fd:2) (pid:26671) IPVERIFY: allow OWNER: grid0, grid, grid0, grid1, grid0 (from config value ALLOW_OWNER) 03/05/11 09:51:09 (fd:2) (pid:26671) ipverify: CONFIG optimized to deny everyone real 0m15.060s user 0m0.055s sys 0m0.011s Maximum allowable gap: 1 seconds Found gap of 5 seconds: 03/05/11 09:44:11 (fd:2) (pid:26562) Calling gethostbyname(grid) 03/05/11 09:44:16 (fd:2) (pid:26562) Found IP addr in hostent: 10.10.10.10 real 0m5.102s user 0m0.050s sys 0m0.021s Expected results: Maximum allowable gap: 1 seconds real 0m0.052s user 0m0.052s sys 0m0.016s