Bug 1080491
| Summary: | Concurrency limits exceed their setting | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Lubos Trilety <ltrilety> |
| Component: | condor | Assignee: | grid-maint-list <grid-maint-list> |
| Status: | CLOSED NOTABUG | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 2.5 | CC: | eerlands, matt, sgraf |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Doc Type: | Bug Fix |
| Last Closed: | 2014-03-27 13:19:02 UTC | Regression: | --- |
**Comment 1 (Erik Erlandson):**

From the attached log, it looks like the accountant is picking up stale ads from the collector in its "(Accountant) Checking Matches" phase. The concurrency limits are being properly respected at the point where they are checked, but the accountant is seeing 'extra' ads that are probably stale in the collector. It is not adding new matches against the cc-limits, which is the desired behavior.

In the configuration above, NEGOTIATOR_INTERVAL is set to 20 seconds, which is a tight interval that can get ahead of the state changes in the startds and the collector. That can cause the negotiator's counting of various resources to get a bit out of sync. Setting NEGOTIATOR_INTERVAL to be longer (e.g. 60 seconds), and making sure the startd update interval, UPDATE_INTERVAL, is shorter than NEGOTIATOR_INTERVAL, should prevent the negotiator from getting ahead of the collector and startds, and the cc-limit accounting will stop looking out of sync.

**Comment 2 (Lubos Trilety):**

(In reply to Erik Erlandson from comment #1)

I took the configuration from Bug 721110. It worked pretty well in previous builds, or perhaps we were just really 'lucky'. I thought that the negotiator communicates directly with the startds, not via the collector. Anyway, it also happens with UPDATE_INTERVAL set to 28 and NEGOTIATOR_INTERVAL set to 30. However, I was not able to reproduce the bug with NEGOTIATOR_INTERVAL set to 60 and UPDATE_INTERVAL set to 55, or with the default settings (both intervals unset).

**Comment 3 (Erik Erlandson):**

(In reply to Lubos Trilety from comment #2)
> I thought that negotiator communicates directly with startd not via
> collector.

The negotiator gets all its information about the state of resource usage across the pool from the collector. The startds only update their state to the collector at intervals, so the negotiator technically never sees the 'true' instantaneous state of the system. For example, the negotiator may not see resources that have freed up very recently.

On a related note: the negotiator sends its match information to the scheduler(s), where it gets passed to the various startds. The startds then go through the process of spinning up starters, possibly creating dynamic slots, changing slot states to 'claimed', and eventually sending claim information back to the collector. So when negotiator intervals are short, it is even possible for the negotiator to make a match whose information has not yet circulated back to the collector when the negotiator begins its next cycle.

At any rate, when testing the behavior of things like accounting groups and cc-limits, it is important to take these various propagation latencies into account. Any changes to resource usage, whether using resources or freeing them up, need time to get back to the collector before the negotiator will see them and respond.

**Comment 4 (Lubos Trilety):**

(In reply to Erik Erlandson from comment #3)

This seems sufficiently explained; closing as not a bug.
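The interval tuning described in the comments above can be sketched as a condor configuration fragment. The specific values (60/55) are the ones comment #2 reports as not reproducing the problem; they are illustrative, not a universal recommendation:

```
# Keep the startd update interval shorter than the negotiation cycle,
# so that claim-state changes have time to propagate back to the
# collector before the negotiator counts resource usage again.
NEGOTIATOR_INTERVAL = 60
UPDATE_INTERVAL = 55
```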
**Description (Lubos Trilety):**

Created attachment 878485 [details]: NegotiatorLog

Description of problem:
Sometimes the negotiator matches more jobs against a concurrency limit than the actual size of that limit.

Version-Release number of selected component (if applicable):
condor-7.8.9-0.8

How reproducible:
30%

Steps to Reproduce:
1. Configure condor:

```
NUM_CPUS=30
NEGOTIATOR_INTERVAL=20
TEST_LIMIT=3
NEGOTIATOR_CYCLE_DELAY=5
NEGOTIATOR_DEBUG=D_ACCOUNTANT | D_FULLDEBUG
CONCURRENCY_LIMIT_DEFAULT_small=2
CONCURRENCY_LIMIT_DEFAULT_medium=5
CONCURRENCY_LIMIT_DEFAULT=1
CONCURRENCY_LIMIT_DEFAULT_large=11
```

2. Submit jobs with the following submit file:

```
should_transfer_files=IF_NEEDED
concurrency_limits=large.test
executable=/bin/sleep
iwd=/tmp
requirements=(FileSystemDomain =!= UNDEFINED && Arch =!= UNDEFINED)
transfer_executable=False
universe=vanilla
arguments=6000
when_to_transfer_output=ON_EXIT
queue 20
concurrency_limits=medium.test
queue 20
concurrency_limits=small.test
queue 20
concurrency_limits=test
queue 20
concurrency_limits=undef.test
queue 20
concurrency_limits=undef
queue 20
concurrency_limits=medium.undef
queue 20
```

3. Inspect the limits:

```
# condor_userprio -l | grep "ConcurrencyLimit"
ConcurrencyLimit_medium_test = 5.000000
ConcurrencyLimit_small_test = 2.000000
ConcurrencyLimit_medium_undef = 5.000000
ConcurrencyLimit_large_test = 13.000000
ConcurrencyLimit_undef = 1.000000
ConcurrencyLimit_undef_test = 1.000000
ConcurrencyLimit_test = 3.000000

# condor_q -c 'JobStatus == 2' -l | grep "ConcurrencyLimits " | uniq -c
  13 ConcurrencyLimits = "large.test"
   5 ConcurrencyLimits = "medium.test"
   2 ConcurrencyLimits = "small.test"
   3 ConcurrencyLimits = "test"
   1 ConcurrencyLimits = "undef.test"
   1 ConcurrencyLimits = "undef"
   5 ConcurrencyLimits = "medium.undef"
```

Actual results:
There are more jobs running than the settings allow. In this case the 'large.test' concurrency limit is exceeded; in other runs it is, for example, 'medium.test'.

Expected results:
No concurrency limit should be exceeded.

Additional info:
See NegotiatorLog in attachment.
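The stale-snapshot mechanism described in the comments can be illustrated with a deliberately simplified toy model. This is not HTCondor code: the `negotiate` function, the usage counts, and the two-cycle timeline are assumptions chosen to mirror the reported symptom (13 running jobs against a limit of 11), not the negotiator's real accounting logic.

```python
# Toy model of how a stale collector snapshot can let a negotiator
# exceed a concurrency limit across consecutive cycles.

LIMIT = 11  # mirrors CONCURRENCY_LIMIT_DEFAULT_large=11 in the report


def negotiate(limit, collector_usage, idle_jobs):
    """Match idle jobs against a limit, trusting a possibly stale
    usage count taken from the collector snapshot."""
    matched = 0
    while collector_usage + matched < limit and matched < idle_jobs:
        matched += 1
    return matched


# Cycle 1: the collector reports 0 slots in use, so the negotiator
# matches jobs right up to the limit.
first = negotiate(LIMIT, collector_usage=0, idle_jobs=20)

# Cycle 2 starts before all of those claims have propagated back to the
# collector: the snapshot still reports only 9 in use, so the negotiator
# believes there is headroom and matches 2 more jobs.
second = negotiate(LIMIT, collector_usage=9, idle_jobs=20 - first)

total_running = first + second
print(total_running)  # 13 jobs running against a limit of 11
```

The point of the sketch is only that each cycle respects the limit against the *snapshot* it sees, yet the combined result exceeds the limit because the snapshot lags the true state, which is consistent with Erik's analysis in comment #1.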