Bug 578600 - Dynamic Slot INVALIDATE_STARTD_ADS causes collector pegging
Summary: Dynamic Slot INVALIDATE_STARTD_ADS causes collector pegging
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-03-31 18:53 UTC by Timothy St. Clair
Modified: 2010-10-14 16:10 UTC

Fixed In Version: 7.4.3-0.10
Doc Type: Bug Fix
Doc Text:
Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completed jobs in a large pool of nodes. With this update, the dynamic slots parse without further failure.
Clone Of:
Environment:
Last Closed: 2010-10-14 16:10:58 UTC
Target Upstream Version:
Embargoed:



Links
System: Red Hat Product Errata
ID: RHSA-2010:0773
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3
Last Updated: 2010-10-14 15:56:44 UTC

Description Timothy St. Clair 2010-03-31 18:53:27 UTC
Description of problem:

http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1316


How reproducible:
100%

Steps to Reproduce:
1. Add numerous dynamic slots to a large pool.
2. Submit a large number of jobs that require little time.
3. Track processing on the collector as the slots try to invalidate their ads.
  
Actual results:

condor_collector lags in responding to the invalidates, so responses back up. Each remove is O(n).

Expected results:

invalidates should be O(1)

Comment 1 Matthew Farrellee 2010-03-31 18:59:41 UTC
Fix built into 7.4.3-0.9

Comment 2 Timothy St. Clair 2010-03-31 19:01:40 UTC
Updated the startd finalize to add Name and MyAddress to the invalidate classad, and updated the collector to try to parse the ad and create a hash key for removal. If that fails, it falls back to an O(n) search.
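
A minimal sketch of the idea in this comment, not the actual collector code: ads are kept in a table keyed by (Name, MyAddress), an invalidate that carries both attributes is removed by direct lookup, and anything else falls back to walking the whole table. The class AdTable, its methods, and the matches callback are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

AdKey = Tuple[str, str]  # (Name, MyAddress); key layout is an assumption


class AdTable:
    """Toy stand-in for the collector's startd ad table (not Condor code)."""

    def __init__(self) -> None:
        self.ads: Dict[AdKey, dict] = {}

    def update(self, ad: dict) -> None:
        # Store or overwrite the ad under its (Name, MyAddress) key.
        self.ads[(ad["Name"], ad["MyAddress"])] = ad

    @staticmethod
    def make_key(invalidate_ad: dict) -> Optional[AdKey]:
        # A key can only be built when both attributes are present.
        name = invalidate_ad.get("Name")
        addr = invalidate_ad.get("MyAddress")
        if name is None or addr is None:
            return None
        return (name, addr)

    def invalidate(
        self,
        invalidate_ad: dict,
        matches: Optional[Callable[[dict], bool]] = None,
    ) -> Tuple[int, str]:
        key = self.make_key(invalidate_ad)
        if key is not None:
            # Fast path: one dictionary lookup, independent of table size.
            removed = 1 if self.ads.pop(key, None) is not None else 0
            return removed, "O(1)"
        # Fallback: no key could be built, so every ad is examined.
        doomed = [k for k, ad in self.ads.items() if matches and matches(ad)]
        for k in doomed:
            del self.ads[k]
        return len(doomed), "O(n)"


table = AdTable()
for i in range(5):
    table.update({"Name": f"slot{i}@host", "MyAddress": "<192.0.2.1:9618>"})

fast = table.invalidate({"Name": "slot3@host", "MyAddress": "<192.0.2.1:9618>"})
slow = table.invalidate({}, matches=lambda ad: ad["Name"].startswith("slot1"))
```

What the sketch does when a key is built but no ad is found under it (report zero removals, no walk) is an assumption; the comment above only states that a failure to build the key triggers the O(n) search.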

Comment 3 Matthew Farrellee 2010-04-07 13:47:17 UTC
Fix actually built into 7.4.3-0.10; 0.9 had issues finding StartdIpAddr.

Comment 4 Tomas Rusnak 2010-06-10 09:47:29 UTC
Could you please specify what counts as a "large pool" and a "large number of jobs"? Is there a reproducer available?

Comment 5 Matthew Farrellee 2010-06-10 11:23:48 UTC
We were doing this with 20K slots, which is difficult to reproduce in an automated way. I would recommend using condor_advertise with UPDATE_STARTD_AD (*20K) and then INVALIDATE_STARTD_ADS (*200).

According to src/condor_collector.V6/collector_engine.h

   * remove () - attempts to construct a hashkey from a query
   * to remove in O(1) for INVALIDATE* vs. O(n). The query must contain
   * TARGET.Name && TARGET.MyAddress

you will need to make sure your advertised ad will have a unique Name and MyAddress combination.
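
A rough sketch of how the ad text for such a reproducer might be generated, assuming each generated ad is written to a file and passed to condor_advertise by hand (for example `condor_advertise UPDATE_STARTD_AD <file>`). The helper names, the host name, the IP address, and the port scheme are all hypothetical, and the exact attribute set the collector expects in an invalidate ad is an assumption; only Name and MyAddress come from the header comment quoted above.

```python
from typing import Tuple


def slot_identity(index: int, host: str = "system",
                  ip: str = "192.0.2.1", base_port: int = 20000) -> Tuple[str, str]:
    """Return a (Name, MyAddress) pair that is unique per index."""
    name = f"slot_{index}@{host}"
    my_address = f"<{ip}:{base_port + index}>"
    return name, my_address


def make_update_ad(index: int) -> str:
    """ClassAd text for a fake startd ad (attribute set is an assumption)."""
    name, my_address = slot_identity(index)
    return "\n".join([
        'MyType = "Machine"',
        f'Name = "{name}"',
        f'MyAddress = "{my_address}"',
    ]) + "\n"


def make_invalidate_ad(index: int) -> str:
    """ClassAd text for the matching invalidate (attribute set is an assumption)."""
    name, my_address = slot_identity(index)
    return "\n".join([
        'MyType = "Query"',
        'TargetType = "Machine"',
        f'Name = "{name}"',
        f'MyAddress = "{my_address}"',
    ]) + "\n"


# 20K update ads and 200 invalidates, mirroring the counts in this comment.
update_ads = [make_update_ad(i) for i in range(20000)]
invalidate_ads = [make_invalidate_ad(i) for i in range(200)]
```

A real pool may require further attributes in these ads; the sketch only guarantees the unique Name and MyAddress combination.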

Comment 6 Tomas Rusnak 2010-08-04 09:45:10 UTC
Reproduced on:

$CondorVersion: 7.4.1 Dec 11 2009 BuildID: RH-7.4.1-0.7.1.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

Hashing is not implemented in this version, so no O(1) invalidate is performed.

Comment 7 Tomas Rusnak 2010-08-04 14:44:29 UTC
Tested on all combinations of RHEL4/5 and x86/x86_64 architectures using:
condor-7.4.4-0.7
classads-1.0.8

Log output:
08/04 10:33:09 Got INVALIDATE_STARTD_ADS
08/04 10:33:09          **** Removed(1) ad(s): "< slot_1790@system , IP >"
08/04 10:33:09 (Invalidated 1 ads)
08/04 10:33:09 Walking tables to invalidate... O(n)
08/04 10:33:09 (Invalidated 0 ads)

I did this with 20k slots and called invalidate for 2000 of them. Each slot was removed in O(1).
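
One way to tally such a log, as a sketch only: it assumes that a "Removed(N) ad(s)" line marks removals done through the hash key and that a "Walking tables to invalidate... O(n)" line marks a fallback walk. That reading of the log lines, and the function name summarize, are assumptions rather than documented collector behavior.

```python
import re

# Patterns taken from the log excerpt in this comment.
REMOVED = re.compile(r"\*\*\*\* Removed\((\d+)\) ad\(s\)")
WALKED = "Walking tables to invalidate... O(n)"


def summarize(log_text: str) -> dict:
    """Count ads removed via 'Removed(N)' lines and the number of table walks."""
    hashed = 0
    walks = 0
    for line in log_text.splitlines():
        match = REMOVED.search(line)
        if match:
            hashed += int(match.group(1))
        elif WALKED in line:
            walks += 1
    return {"hashed_removals": hashed, "table_walks": walks}


SAMPLE = """\
08/04 10:33:09 Got INVALIDATE_STARTD_ADS
08/04 10:33:09          **** Removed(1) ad(s): "< slot_1790@system , IP >"
08/04 10:33:09 (Invalidated 1 ads)
08/04 10:33:09 Walking tables to invalidate... O(n)
08/04 10:33:09 (Invalidated 0 ads)
"""

result = summarize(SAMPLE)
```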

>>> VERIFIED

Comment 8 Florian Nadge 2010-10-07 16:43:56 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completed jobs in a large pool of nodes.
With this update, the dynamic slots parse without further failure.

Comment 9 Florian Nadge 2010-10-07 16:44:09 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,2 +1 @@
-Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completing jobs in a large pool of nodes. 
+Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completing jobs in a large pool of nodes. With this update, the dynamic slots parse without further failure.
-With this update, the dynamic slots parse without further failure.

Comment 11 errata-xmlrpc 2010-10-14 16:10:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

