Description of problem:
http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1316

How reproducible:
100%

Steps to Reproduce:
1. Add numerous dynamic slots to a large pool.
2. Submit a large number of jobs that require little time.
3. Track processing on the collector as the slots try to invalidate their ads.

Actual results:
condor_collector lags in responding to the invalidates, causing a backup in response time. Each remove is O(n).

Expected results:
Invalidates should be O(1).
Fix built into 7.4.3-0.9
Updated the startd finalize to add Name and MyAddress to the invalidate classad, and updated the collector to try and parse and create a hash key for removal. If it fails it will fall back to an O(n) search.
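To illustrate the idea, here is a minimal Python sketch of a hash-keyed removal with an O(n) fallback. It is not the collector's actual C++ code: the AdTable class, the make_hash_key helper, and the use of a Python callable as the "Requirements" predicate are all hypothetical stand-ins chosen for this example.

```python
def make_hash_key(query):
    """Return (Name, MyAddress) if the query carries both, else None.

    Hypothetical helper: the real collector parses a ClassAd instead.
    """
    name = query.get("Name")
    addr = query.get("MyAddress")
    if name is None or addr is None:
        return None
    return (name, addr)


class AdTable:
    """Toy stand-in for the collector's startd ad table."""

    def __init__(self):
        self._ads = {}

    def update(self, ad):
        # Ads are stored under the same key an invalidate will try to build.
        self._ads[(ad["Name"], ad["MyAddress"])] = ad

    def invalidate(self, query):
        """Remove matching ads; return (number_removed, path_taken)."""
        key = make_hash_key(query)
        if key is not None:
            # O(1): direct removal by hash key.
            removed = 1 if self._ads.pop(key, None) is not None else 0
            return removed, "hash"
        # O(n) fallback: walk every ad and test it against the query.
        predicate = query["Requirements"]  # assumed to be a callable here
        doomed = [k for k, ad in self._ads.items() if predicate(ad)]
        for k in doomed:
            del self._ads[k]
        return len(doomed), "walk"

    def __len__(self):
        return len(self._ads)
```

An invalidate that carries both Name and MyAddress takes the "hash" path; any other query falls through to the table walk.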
Fix actually built into 7.4.3-0.10; 0.9 had issues finding StartdIpAddr.
Could you please specify what "large pool" and "large number of jobs" mean here? Is there a reproducer available?
We were doing this with 20K slots, which is difficult to reproduce in an automated way. I would recommend using condor_advertise with UPDATE_STARTD_AD (*20K) and then INVALIDATE_STARTD_ADS (*200).

According to src/condor_collector.V6/collector_engine.h:

 * remove () - attempts to construct a hashkey from a query
 * to remove in O(1) for INVALIDATE* vs. O(n). The query must contain
 * TARGET.Name && TARGET.MyAddress

You will need to make sure your advertised ad has a unique Name and MyAddress combination.
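As a rough aid for building such a reproducer, the Python sketch below generates ad text with a unique Name and MyAddress per slot. The host name, IP, port scheme, and the exact set of attributes in each ad are assumptions made for illustration; they have not been checked against what a real collector requires.

```python
def slot_identity(i, host="testhost.example.com", ip="10.0.0.1", base_port=20000):
    """Return a unique (Name, MyAddress) pair for slot number i.

    The host, IP and port scheme are hypothetical test values.
    """
    name = f"slot_{i}@{host}"
    my_address = f"<{ip}:{base_port + i}>"
    return name, my_address


def update_ad(i):
    """Text of a minimal startd ad (assumed attribute set)."""
    name, addr = slot_identity(i)
    lines = [
        'MyType = "Machine"',
        f'Name = "{name}"',
        f'MyAddress = "{addr}"',
    ]
    return "\n".join(lines) + "\n"


def invalidate_ad(i):
    """Text of a query ad naming exactly one slot (assumed layout)."""
    name, addr = slot_identity(i)
    lines = [
        'MyType = "Query"',
        'TargetType = "Machine"',
        f'Requirements = TARGET.Name == "{name}" && TARGET.MyAddress == "{addr}"',
    ]
    return "\n".join(lines) + "\n"
```

Each generated text could be written to a file and passed to condor_advertise with UPDATE_STARTD_AD or INVALIDATE_STARTD_ADS, as suggested above.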
Reproduced on:
$CondorVersion: 7.4.1 Dec 11 2009 BuildID: RH-7.4.1-0.7.1.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

Hashing is not implemented in this version, so no O(1) invalidate is performed.
Tested on all combinations of RHEL4/5 and x86/x86_64 arch using:
condor-7.4.4-0.7
classads-1.0.8

Log output:
08/04 10:33:09 Got INVALIDATE_STARTD_ADS
08/04 10:33:09 **** Removed(1) ad(s): "< slot_1790@system , IP >"
08/04 10:33:09 (Invalidated 1 ads)
08/04 10:33:09 Walking tables to invalidate... O(n)
08/04 10:33:09 (Invalidated 0 ads)

I was doing this with 20k slots and called invalidate for 2000. Each slot was removed with O(1).

>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completing jobs in a large pool of nodes.
With this update, the dynamic slots parse without further failure.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,2 +1 @@
-Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completing jobs in a large pool of nodes.
+Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completing jobs in a large pool of nodes. With this update, the dynamic slots parse without further failure.
-With this update, the dynamic slots parse without further failure.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html