Bug 578600 - Dynamic Slot INVALIDATE_STARTD_ADS causes collector pegging
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Target Release: ---
Assigned To: Timothy St. Clair
QA Contact: Tomas Rusnak
Depends On:
Blocks:
Reported: 2010-03-31 14:53 EDT by Timothy St. Clair
Modified: 2010-10-14 12:10 EDT

See Also:
Fixed In Version: 7.4.3-0.10
Doc Type: Bug Fix
Doc Text:
Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completed jobs in a large pool of nodes. With this update, the dynamic slots parse without further failure.
Last Closed: 2010-10-14 12:10:58 EDT
External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2010:0773 | normal | SHIPPED_LIVE | Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 | 2010-10-14 11:56:44 EDT

Description Timothy St. Clair 2010-03-31 14:53:27 EDT
Description of problem:

http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1316


How reproducible:
100%

Steps to Reproduce:
1. Add numerous dynamic slots to a large pool.
2. Submit a large number of jobs that require little time.
3. Track processing on the collector as the slots try to invalidate their ads.
  
Actual results:

condor_collector lags in responding to the invalidates, causing requests to back up. Each remove is O(n).

Expected results:

invalidates should be O(1)
Comment 1 Matthew Farrellee 2010-03-31 14:59:41 EDT
Fix built into 7.4.3-0.9
Comment 2 Timothy St. Clair 2010-03-31 15:01:40 EDT
Updated the startd finalize to add Name and MyAddress to the invalidate ClassAd, and updated the collector to try to parse the ad and create a hash key for removal. If that fails, it falls back to an O(n) search.
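
The approach described in this comment can be illustrated with a minimal sketch. This is not the Condor source: the `AdTable` class, the `make_hash_key` helper, and the dictionary-based ad representation are all hypothetical names chosen for illustration. The sketch assumes the hash key is built from Name and MyAddress, and that the fallback scan simply compares whatever attributes the query carries.

```python
from typing import Dict, Optional, Tuple

Ad = Dict[str, str]
HashKey = Tuple[str, str]


def make_hash_key(ad: Ad) -> Optional[HashKey]:
    """Build a (Name, MyAddress) key, or return None if either is missing."""
    name = ad.get("Name")
    addr = ad.get("MyAddress")
    if not name or not addr:
        return None
    return (name, addr)


class AdTable:
    """Toy stand-in for the collector's startd ad table (hypothetical)."""

    def __init__(self) -> None:
        self._ads: Dict[HashKey, Ad] = {}

    def __len__(self) -> int:
        return len(self._ads)

    def update(self, ad: Ad) -> None:
        key = make_hash_key(ad)
        if key is None:
            raise ValueError("ad needs both Name and MyAddress")
        self._ads[key] = ad

    def invalidate(self, query: Ad) -> Tuple[int, str]:
        """Remove matching ads; return (number removed, path taken)."""
        key = make_hash_key(query)
        if key is not None:
            # Fast path: one dictionary lookup, independent of pool size.
            removed = 1 if self._ads.pop(key, None) is not None else 0
            return removed, "O(1)"
        # Fallback: no key could be built, so walk every ad and compare
        # the attributes that the query does carry.
        doomed = [
            k for k, ad in self._ads.items()
            if all(ad.get(attr) == val for attr, val in query.items())
        ]
        for k in doomed:
            del self._ads[k]
        return len(doomed), "O(n)"
```

With this shape, an invalidate that carries both attributes costs a single lookup no matter how many slots are in the table, while an invalidate missing either attribute still works but pays for a full walk.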
Comment 3 Matthew Farrellee 2010-04-07 09:47:17 EDT
Fix actually built into 7.4.3-0.10; 0.9 had issues finding StartdIpAddr.
Comment 4 Tomas Rusnak 2010-06-10 05:47:29 EDT
Could you please specify what counts as a "large pool" and a "large number of jobs"? Is a reproducer available?
Comment 5 Matthew Farrellee 2010-06-10 07:23:48 EDT
We were doing this with 20K slots, which is difficult to reproduce in an automated way. I would recommend using condor_advertise with UPDATE_STARTD_AD (*20K) and then INVALIDATE_STARTD_ADS (*200).

According to src/condor_collector.V6/collector_engine.h

   * remove () - attempts to construct a hashkey from a query
   * to remove in O(1) for INVALIDATE* vs. O(n). The query must contain
   * TARGET.Name && TARGET.MyAddress

you will need to make sure your advertised ad will have a unique Name and MyAddress combination.
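
One way to prepare input for this reproducer is sketched below. It only generates ad text and command lines; it does not contact a collector. The host name, address range, and helper names (`slot_identity`, `machine_ad`, `invalidate_ad`, `advertise_command`) are made up for illustration, and the attribute set shown is an assumption: a real pool may need further attributes (Comment 3 mentions StartdIpAddr, for example).

```python
from typing import List, Tuple


def slot_identity(i: int) -> Tuple[str, str]:
    """Return a unique (Name, MyAddress) pair for fake slot i (made-up values)."""
    name = f"slot_{i}@fake-node.example.com"
    # Vary the port so that every address string is distinct as well.
    address = f"<192.0.2.1:{10000 + i}>"
    return name, address


def machine_ad(i: int) -> str:
    """Text of a minimal machine ad for slot i (attribute set is an assumption)."""
    name, address = slot_identity(i)
    return "\n".join([
        'MyType = "Machine"',
        f'Name = "{name}"',
        f'MyAddress = "{address}"',
    ]) + "\n"


def invalidate_ad(i: int) -> str:
    """Text of an invalidate ad carrying the same Name and MyAddress."""
    name, address = slot_identity(i)
    return "\n".join([
        'MyType = "Query"',          # assumption: query-style invalidate ad
        'TargetType = "Machine"',    # assumption
        f'Name = "{name}"',
        f'MyAddress = "{address}"',
    ]) + "\n"


def advertise_command(command: str, ad_path: str) -> List[str]:
    """Argument list for condor_advertise with an update command and an ad file."""
    return ["condor_advertise", command, ad_path]
```

A driver would write `machine_ad(i)` to a file for each of the 20K slots and run `advertise_command("UPDATE_STARTD_AD", path)`, then do the same with `invalidate_ad(i)` and `INVALIDATE_STARTD_ADS` for the subset to be removed, for instance via `subprocess.run`.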
Comment 6 Tomas Rusnak 2010-08-04 05:45:10 EDT
Reproduced on:

$CondorVersion: 7.4.1 Dec 11 2009 BuildID: RH-7.4.1-0.7.1.el5 PRE-RELEASE $
$CondorPlatform: I386-LINUX_RHEL5 $

Hashing is not implemented in this version, so no O(1) invalidate is performed.
Comment 7 Tomas Rusnak 2010-08-04 10:44:29 EDT
Tested on all combinations of RHEL 4/5 and x86/x86_64 architectures using:
condor-7.4.4-0.7
classads-1.0.8

Log output:
08/04 10:33:09 Got INVALIDATE_STARTD_ADS
08/04 10:33:09          **** Removed(1) ad(s): "< slot_1790@system , IP >"
08/04 10:33:09 (Invalidated 1 ads)
08/04 10:33:09 Walking tables to invalidate... O(n)
08/04 10:33:09 (Invalidated 0 ads)

I did this with 20k slots and called invalidate for 2000 of them. Each slot was removed in O(1).

>>> VERIFIED
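
For a run with thousands of invalidations, the log lines quoted above could be tallied with a small script. This is a rough sketch that assumes the collector log uses exactly the message formats shown in this comment; the function name `tally_invalidations` is hypothetical.

```python
import re
from collections import Counter

REMOVED_RE = re.compile(r"Removed\((\d+)\) ad\(s\)")


def tally_invalidations(log_text: str) -> Counter:
    """Count invalidate requests, removed ads, and O(n) table walks in a log."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        if "Got INVALIDATE_STARTD_ADS" in line:
            counts["requests"] += 1
        elif "Walking tables to invalidate" in line:
            counts["table_walks"] += 1
        else:
            match = REMOVED_RE.search(line)
            if match:
                # The number in parentheses is how many ads this line removed.
                counts["ads_removed"] += int(match.group(1))
    return counts
```

Comparing `ads_removed` against `requests` gives a quick check that each invalidate removed the ad it targeted.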
Comment 8 Florian Nadge 2010-10-07 12:43:56 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completed jobs in a large pool of nodes.
With this update, the dynamic slots parse without further failure.
Comment 9 Florian Nadge 2010-10-07 12:44:09 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,2 +1 @@
-Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completed jobs in a large pool of nodes. 
-With this update, the dynamic slots parse without further failure.
+Previously, the dynamic slot INVALIDATE_STARTD_ADS caused collector pegging when dynamic slots completed jobs in a large pool of nodes. With this update, the dynamic slots parse without further failure.
Comment 11 errata-xmlrpc 2010-10-14 12:10:58 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html
