Description of problem:
(from upstream): What's going on here is that the matchmaker calls DaemonCore::ServiceCommandThread in the middle of the negotiation cycle. This recursively calls the DaemonCore event loop, and if we get a reconfig there we recompute all the hierarchical group data structures, throw out the old ones, and return to the negotiator, which is still using the old ones, and bad things happen.

How reproducible:
"low prob"

Steps to Reproduce:
Submit a bunch of jobs, then run condor_reschedule ; condor_reconfig a bunch of times until the negotiator crashes

Actual results:
negotiator crashes

Expected results:
negotiator avoids reconfig in the middle of a cycle and does not crash

Additional info:
upstream: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2172
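To make the failure mode in the description concrete, here is a minimal toy model of it, not Condor code. All names (GroupTree, NaiveNegotiator, service_commands) are hypothetical. In the real C++ negotiator the stale pointer leads to an invalid memory read; Python cannot reproduce that, so this sketch marks the old structure as freed and raises an exception when it is touched.

```python
class GroupTree:
    """Hypothetical stand-in for the hierarchical group data structures."""

    def __init__(self, names):
        self.names = list(names)
        self.freed = False  # set when a reconfig throws this tree away

    def entries(self):
        if self.freed:
            # Models the invalid read that crashes the real negotiator.
            raise RuntimeError("use of freed group tree")
        return self.names


class NaiveNegotiator:
    """Toy negotiator that applies a reconfig immediately, even mid-cycle."""

    def __init__(self, names):
        self.tree = GroupTree(names)

    def reconfig(self, names):
        self.tree.freed = True          # old structures are thrown out
        self.tree = GroupTree(names)    # and recomputed from the new config

    def negotiation_cycle(self, service_commands):
        tree = self.tree  # local reference, like a pointer held on the stack
        visited = []
        for i in range(len(tree.names)):
            # Models the re-entrant call into the event loop mid-cycle.
            service_commands(self)
            visited.append(tree.entries()[i])
        return visited
```

Passing a service_commands callback that calls reconfig() makes the cycle fail on its next access to the old tree, while a callback that does nothing lets the cycle finish normally.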
Upstream fix is on 7.6 branch: https://condor-wiki.cs.wisc.edu/index.cgi/chngview?cn=21901
I tried to reproduce the bug on condor-7.6.1-0.4.el5, without success. I used the configuration mentioned in the upstream ticket, with NUM_CPUS set to 40. I submitted the following submit description file:

    universe = vanilla
    cmd = /bin/sleep
    args = 30
    #should_transfer_files = if_needed
    #when_to_transfer_output = on_exit
    queue 200
    +AccountingGroup="group_gatekpr.user"
    queue 200
    +AccountingGroup="group_gatekpr.prod.user"
    queue 200
    +AccountingGroup="group_gatekpr.other.user"
    queue 200
    +AccountingGroup="group_calibrate.user"
    queue 200
    +AccountingGroup="group_opporA.user"
    queue 200
    +AccountingGroup="group_opporA.ligo.user"
    queue 200
    +AccountingGroup="group_opporA.CMS.user"
    queue 200
    +AccountingGroup="group_opporB.user"
    queue 200
    +AccountingGroup="group_opporB.SBGrid.user"
    queue 200
    +AccountingGroup="group_VOgener.user"
    queue 200
    +AccountingGroup="group_T3gen.user"
    queue 200
    +AccountingGroup="group_prod.user"
    queue 200
    +AccountingGroup="group_prod.hggs.user"
    queue 200
    +AccountingGroup="group_prod.ww.user"
    queue 200
    +AccountingGroup="group_prod.muon.user"
    queue 200
    +AccountingGroup="group_T3gen.other.user"
    queue 200
    +AccountingGroup="group_T3gen.eID.user"
    queue 200
    +AccountingGroup="group_T3gen.hggs.user"
    queue 200
    +AccountingGroup="group_T3gen.BSM.user"
    queue 200
    +AccountingGroup="group_T3gen.general.user"
    queue 200

then ran

    while true; do condor_reschedule; condor_reconfig; done > /dev/null

or

    while true; do condor_reschedule; condor_reconfig; sleep $(( $RANDOM % 5 )); done > /dev/null

I also tried other random sleep intervals, but in no case was I able to trigger a negotiator crash. Am I doing something wrong? Is there something else needed?
(In reply to comment #3)
> I tried to reproduce the bug on condor-7.6.1-0.4.el5 with no results.
>

I'm seeing slightly ambiguous results comparing which branches versus which tags the fix appears in. It's on the upstream 7.6.1 branch, so it might help to repro against an earlier build.

>
> I tried also other random sleep intervals, but in all cases I was not able to
> simulate crash of negotiator. Am I doing something wrong? Is there something
> else needed?

Repro of this bug is timing dependent: a reconfig needs to happen during the negotiation cycle. The best bet is setting things up to create as long a negotiation cycle as possible, and attempting to specifically time the reconfig to land inside the cycle. Lots of slots, jobs, groups, etc. Also, setting "GROUP_QUOTA_ROUND_ROBIN_RATE = 1" is a nice expensive thing to do that will increase the length of the cycle and make repro more likely.
Successfully reproduced on: condor-7.6.0-0.3

07/22/11 12:12:28 This submitter hit its submitterLimit.
Stack dump for process 14023 at timestamp 1311329548 (11 frames)
condor_negotiator(dprintf_dump_stack+0x56)[0x537856]
condor_negotiator[0x524682]
/lib64/libpthread.so.0[0x354c60eb10]
condor_negotiator(_ZN14compat_classad27ClassAdListDoesNotDeleteAds4NextEv+0xf)[0x52aa1f]
condor_negotiator(_ZN10Matchmaker18negotiateWithGroupEiddRN14compat_classad27ClassAdListDoesNotDeleteAdsER9HashTableI8MyStringS4_ES2_ffPKc+0x2a9)[0x480fb9]
condor_negotiator(_ZN10Matchmaker15negotiationTimeEv+0xa70)[0x4829f0]
condor_negotiator(_ZN12TimerManager7TimeoutEv+0x155)[0x4a6305]
condor_negotiator(_ZN10DaemonCore6DriverEv+0x248)[0x490c18]
condor_negotiator(main+0xe57)[0x4a52f7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x354ba1d994]
condor_negotiator(__gxx_personality_v0+0x441)[0x468f49]
Tested on:

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: I686-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el5 $
$CondorPlatform: X86_64-RedHat_5.7 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: I686-RedHat_6.1 $

$CondorVersion: 7.6.3 Jul 13 2011 BuildID: RH-7.6.3-0.2.el6 $
$CondorPlatform: X86_64-RedHat_6.1 $

No negotiator crash. The reconfig runs after the end of the negotiation cycle.

>>> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: The negotiator depends on a dynamically allocated data structure of accounting groups, which could be re-allocated by a reconfig event mid-cycle.
Consequence: If a reconfig event occurs in the middle of a negotiation cycle, the negotiator's structure pointers on the stack could be invalidated, causing a memory read error and a crash.
Fix: Logic was added to delay reconfig events until any current negotiation cycle completes, and then execute the reconfig directly after the cycle.
Result: The potential for a reconfig to cause a negotiator read error and crash is eliminated.
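The deferral described in the Fix line can be sketched as follows. This is an illustrative toy model under stated assumptions, not the upstream patch: the class, its attributes (in_cycle, pending_config) and the service_commands callback are all hypothetical names, and the real change is in the C++ negotiator.

```python
class DeferringNegotiator:
    """Toy model: reconfig requests that arrive mid-cycle are deferred."""

    def __init__(self, config):
        self.groups = list(config["groups"])
        self.in_cycle = False        # hypothetical "cycle in progress" flag
        self.pending_config = None   # most recent reconfig deferred mid-cycle
        self.log = []

    def request_reconfig(self, new_config):
        if self.in_cycle:
            # Do not touch the group structures while the cycle uses them.
            self.pending_config = new_config
            self.log.append("reconfig deferred")
        else:
            self._apply_reconfig(new_config)

    def _apply_reconfig(self, new_config):
        self.groups = list(new_config["groups"])
        self.log.append("reconfig applied")

    def negotiation_cycle(self, service_commands):
        self.in_cycle = True
        try:
            visited = []
            for group in self.groups:
                # Models the re-entrant call into the event loop mid-cycle.
                service_commands(self)
                visited.append(group)
        finally:
            self.in_cycle = False
        self.log.append("cycle finished")
        # Run any deferred reconfig directly after the cycle completes.
        if self.pending_config is not None:
            pending, self.pending_config = self.pending_config, None
            self._apply_reconfig(pending)
        return visited
```

In this model, several reconfig requests that land inside one cycle collapse into a single reconfig afterwards, using the most recent configuration. Whether the real fix coalesces requests the same way is an assumption of the sketch.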
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html