Bug 707335
| Summary: | Negotiator crashes with hierarchical group quotas after reconfig | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Erik Erlandson <eerlands> |
| Component: | condor | Assignee: | Erik Erlandson <eerlands> |
| Status: | CLOSED ERRATA | QA Contact: | Lubos Trilety <ltrilety> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.0 | CC: | claudiol, jneedle, jthomas, ltrilety, matt, mkudlej, tstclair, whenry |
| Target Milestone: | 2.0.1 | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | condor-7.6.2-0.1 | Doc Type: | Bug Fix |
| Doc Text: |
Cause:
Negotiator depends on dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle.
Consequence:
If a reconfig event occurs mid-negotiation cycle, the negotiator's structure pointers on stack could be invalidated, causing a memory read error and crash.
Fix:
Logic was added to delay reconfig events until any current negotiation cycle completes, and then execute the reconfig directly after the cycle.
Result:
Potential for reconfig causing negotiator read error and crash is eliminated.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-09-07 16:41:33 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 723887 | ||
|
Description
Erik Erlandson
2011-05-24 17:46:27 UTC
Upstream fix is on 7.6 branch:
https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=21901

I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.
I used configuration mentioned in upstream ticket with NUM_CPUS set to 40.
I submitted following description file:

universe = vanilla
cmd = /bin/sleep
args = 30
#should_transfer_files = if_needed
#when_to_transfer_output = on_exit
queue 200
+AccountingGroup="group_gatekpr.user"
queue 200
+AccountingGroup="group_gatekpr.prod.user"
queue 200
+AccountingGroup="group_gatekpr.other.user"
queue 200
+AccountingGroup="group_calibrate.user"
queue 200
+AccountingGroup="group_opporA.user"
queue 200
+AccountingGroup="group_opporA.ligo.user"
queue 200
+AccountingGroup="group_opporA.CMS.user"
queue 200
+AccountingGroup="group_opporB.user"
queue 200
+AccountingGroup="group_opporB.SBGrid.user"
queue 200
+AccountingGroup="group_VOgener.user"
queue 200
+AccountingGroup="group_T3gen.user"
queue 200
+AccountingGroup="group_prod.user"
queue 200
+AccountingGroup="group_prod.hggs.user"
queue 200
+AccountingGroup="group_prod.ww.user"
queue 200
+AccountingGroup="group_prod.muon.user"
queue 200
+AccountingGroup="group_T3gen.other.user"
queue 200
+AccountingGroup="group_T3gen.eID.user"
queue 200
+AccountingGroup="group_T3gen.hggs.user"
queue 200
+AccountingGroup="group_T3gen.BSM.user"
queue 200
+AccountingGroup="group_T3gen.general.user"
queue 200

then run

while true; do condor_reschedule; condor_reconfig; done > /dev/null

or

while true; do condor_reschedule; condor_reconfig; sleep $(( $RANDOM % 5 )); done > /dev/null

I tried also other random sleep intervals, but in all cases I was not able to simulate crash of negotiator. Am I doing something wrong? Is there something else needed?

(In reply to comment #3)
> I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.
>

I'm seeing slightly ambiguous results comparing which branches versus which tags the fix appears in. It's on the upstream 7.6.1 branch, so it might help to repro against an earlier build.

>
> I tried also other random sleep intervals, but in all cases I was not able to
> simulate crash of negotiator. Am I doing something wrong? Is there something
> else needed?

Repro of this bug is timing dependent: A reconfig needs to happen during the negotiation cycle. The best bet is setting things up to create as long a negotiation cycle as possible, and attempting to specifically time the reconfig to land inside the cycle. Lots of slots, jobs, groups, etc. Also setting "GROUP_QUOTA_ROUND_ROBIN_RATE = 1" is a nice expensive thing to do that will increase the length of the cycle and make repro more likely.

Successfully reproduced on:
condor-7.6.0-0.3

07/22/11 12:12:28 This submitter hit its submitterLimit.
Stack dump for process 14023 at timestamp 1311329548 (11 frames)
condor_negotiator(dprintf_dump_stack+0x56)[0x537856]
condor_negotiator[0x524682]
/lib64/libpthread.so.0[0x354c60eb10]
condor_negotiator(_ZN14compat_classad27ClassAdListDoesNotDeleteAds4NextEv+0xf)[0x52aa1f]
condor_negotiator(_ZN10Matchmaker18negotiateWithGroupEiddRN14compat_classad27ClassAdListDoesNotDeleteAdsER9HashTableI8MyStringS4_ES2_ffPKc+0x2a9)[0x480fb9]
condor_negotiator(_ZN10Matchmaker15negotiationTimeEv+0xa70)[0x4829f0]
condor_negotiator(_ZN12TimerManager7TimeoutEv+0x155)[0x4a6305]
condor_negotiator(_ZN10DaemonCore6DriverEv+0x248)[0x490c18]
condor_negotiator(main+0xe57)[0x4a52f7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_negotiator(__gxx_personality_v0+0x441)[0x468f49]

Tested on:
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $
$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $
No crash of negotiator. Reconfig run after end of negotiator cycle.
>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Cause:
Negotiator depends on dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle.
Consequence:
If a reconfig event occurs mid-negotiation cycle, the negotiator's structure pointers on stack could be invalidated, causing a memory read error and crash.
Fix:
Logic was added to delay reconfig events until any current negotiation cycle completes, and then execute the reconfig directly after the cycle.
Result:
Potential for reconfig causing negotiator read error and crash is eliminated.
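
The fix described in the note above amounts to a deferred-reconfig pattern. The following is a minimal C++ sketch of that idea, not the actual Condor patch: the class `NegotiatorSketch`, the struct `GroupEntry`, and every method name are hypothetical and chosen only for illustration. It assumes a single-threaded event loop in which a reconfig request can be delivered while a cycle is iterating over the group structure.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one accounting group entry.
struct GroupEntry {
    std::string name;
    double quota;
};

// Illustrative only: none of these names come from the Condor source.
class NegotiatorSketch {
public:
    explicit NegotiatorSketch(std::vector<GroupEntry> groups)
        : groups_(std::make_unique<std::vector<GroupEntry>>(std::move(groups))) {}

    // Called when a reconfig event arrives. Mid-cycle the request is only
    // recorded; outside a cycle it is applied immediately.
    void requestReconfig(std::vector<GroupEntry> new_groups) {
        auto fresh = std::make_unique<std::vector<GroupEntry>>(std::move(new_groups));
        if (in_cycle_) {
            pending_ = std::move(fresh);  // a later request replaces an earlier one
            return;
        }
        applyReconfig(std::move(fresh));
    }

    // Runs one negotiation cycle. on_group is invoked once per group and may
    // call requestReconfig() to simulate a reconfig landing mid-cycle.
    template <typename F>
    std::size_t negotiate(F on_group) {
        in_cycle_ = true;
        std::size_t visited = 0;
        for (const GroupEntry& g : *groups_) {
            // groups_ is never reallocated while in_cycle_ is true,
            // so this reference stays valid for the whole loop.
            on_group(*this, g);
            ++visited;
        }
        in_cycle_ = false;
        if (pending_) {
            applyReconfig(std::move(pending_));  // run the deferred reconfig
        }
        return visited;
    }

    std::size_t groupCount() const { return groups_->size(); }
    int reconfigCount() const { return reconfig_count_; }
    bool reconfigPending() const { return pending_ != nullptr; }

private:
    void applyReconfig(std::unique_ptr<std::vector<GroupEntry>> fresh) {
        assert(!in_cycle_);  // reallocation must never happen mid-cycle
        groups_ = std::move(fresh);
        ++reconfig_count_;
    }

    std::unique_ptr<std::vector<GroupEntry>> groups_;
    std::unique_ptr<std::vector<GroupEntry>> pending_;
    bool in_cycle_ = false;
    int reconfig_count_ = 0;
};
```

In this sketch, a callback that calls `requestReconfig()` on every group visit still lets the cycle finish over the old group list; the replacement is applied exactly once, directly after the loop ends.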
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1249.html