Description of problem:
A bug in the state management logic of ring_buffer<> causes an exception when a daemon is reconfigured. The failure is triggered through the SetSize() method.

How reproducible:
100%

Steps to Reproduce:
Start with the following configuration:

# I believe any size reduction will work, but here's my configuration
STATISTICS_WINDOW_QUANTUM = 1
STATISTICS_WINDOW_SECONDS = 30

Start up a pool with a schedd. Then change STATISTICS_WINDOW_SECONDS to 15 (which reduces the size of the ring buffers) and run:

$ condor_reconfig -schedd

You should see an exception in SchedLog:

01/07/13 15:59:22 (pid:1068) ERROR "Unexpected call to empty ring_buffer

Expected results:
Resizing via reconfig should work properly, with no exceptions.
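For readers unfamiliar with the failure mode, here is a minimal sketch of a ring buffer whose resize keeps its bookkeeping consistent when the capacity shrinks. It is not the ring_buffer<> code from generic_stats.h, and it does not claim to reproduce the actual defect, which this report does not show. The class name SketchRing and all of its members are hypothetical. It only illustrates the invariant a shrinking SetSize() presumably has to preserve: after the resize, the stored item count and the write position must both be valid for the new capacity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical illustration only; NOT HTCondor's ring_buffer<>.
template <typename T>
class SketchRing {
public:
    explicit SketchRing(std::size_t capacity = 0) : buf_(capacity) {}

    std::size_t Capacity() const { return buf_.size(); }
    std::size_t Length() const { return count_; }

    // Append a value, overwriting the oldest item once the ring is full.
    void Push(const T& value) {
        if (buf_.empty()) throw std::logic_error("push on zero-capacity ring");
        buf_[next_] = value;
        next_ = (next_ + 1) % buf_.size();
        if (count_ < buf_.size()) ++count_;
    }

    // Read by age: At(0) is the newest item, At(Length() - 1) the oldest.
    const T& At(std::size_t age) const {
        if (age >= count_) throw std::out_of_range("age beyond stored items");
        return buf_[(next_ + buf_.size() - 1 - age) % buf_.size()];
    }

    // Resize while keeping the newest items that still fit.
    // Both count_ and next_ are recomputed for the new capacity, so a
    // later Push() or At() never sees indices from the old layout.
    void SetSize(std::size_t new_capacity) {
        const std::size_t kept = std::min(count_, new_capacity);
        std::vector<T> fresh(new_capacity);
        for (std::size_t i = 0; i < kept; ++i) {
            fresh[i] = At(kept - 1 - i);  // copy oldest kept item first
        }
        buf_.swap(fresh);
        count_ = kept;
        next_ = (new_capacity == 0) ? 0 : kept % new_capacity;
        assert(count_ <= buf_.size());
    }

private:
    std::vector<T> buf_;
    std::size_t next_ = 0;   // slot the next Push() writes to
    std::size_t count_ = 0;  // number of valid items
};
```

With this layout, filling a 30-slot ring and then calling SetSize(15) leaves the 15 newest values readable, and subsequent pushes continue to overwrite the oldest one. This mirrors the 30-to-15 reduction in the reproduction steps.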
Confirmed on:
$CondorVersion: 7.8.8 Dec 14 2012 BuildID: RH-7.8.8-0.1.el5 $
$CondorPlatform: X86_64-RedHat_5.9 $

SchedLog:
01/14/13 10:18:17 (pid:21385) ============ End clean_shadow_recs =============
01/14/13 10:18:17 (pid:21385) DaemonCore: No more children processes to reap.
01/14/13 10:18:18 (pid:21385) Getting monitoring info for pid 21385
01/14/13 10:18:18 (pid:21385) ERROR "Unexpected call to empty ring_buffer " at line 290 in file /builddir/build/BUILD/condor-7.8.6/src/condor_utils/generic_stats.h
Stack dump for process 21385 at timestamp 1358155098 (15 frames)
/usr/lib64/libcondor_utils_7_8_8.so(dprintf_dump_stack+0x58)[0x333a1075c8]
/usr/lib64/libcondor_utils_7_8_8.so[0x333a165ec2]
/lib64/libpthread.so.0[0x378900ebe0]
/lib64/libc.so.6(abort+0x28f)[0x3788431eaf]
/usr/lib64/libcondor_utils_7_8_8.so(_EXCEPT_+0x130)[0x333a11c710]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN18stats_entry_recentIiE9AdvanceByEi+0x14c)[0x333a13d08c]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN26stats_recent_counter_timer9AdvanceByEi+0x21)[0x333a143cd1]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN14StatisticsPool7AdvanceEi+0x63)[0x333a13b703]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN10DaemonCore5Stats4TickEl+0x56)[0x333a262ad6]
/usr/lib64/libcondor_utils_7_8_8.so[0x333a267820]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN12TimerManager7TimeoutEPiPd+0x405)[0x333a288eb5]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN10DaemonCore6DriverEv+0x56a)[0x333a27444a]
/usr/lib64/libcondor_utils_7_8_8.so(_Z7dc_mainiPPc+0xffa)[0x333a28e14a]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x378841d994]
condor_schedd(_ZN7CronJob6ReaperEii+0x1f9)[0x42cef9]
01/14/13 10:18:29 (pid:21458) OpSysMajorVersion: 5
01/14/13 10:18:29 (pid:21458) OpSysShortName: RedHat
01/14/13 10:18:29 (pid:21458) OpSysLongName: Red Hat Enterprise Linux Server release 5.8 (Tikanga)
01/14/13 10:18:29 (pid:21458) OpSysAndVer: RedHat5
01/14/13 10:18:29 (pid:21458) OpSysLegacy: LINUX
01/14/13 10:18:29 (pid:21458) OpSysName: RedHat
01/14/13 10:18:29 (pid:21458) OpSysVer: 508
01/14/13 10:18:29 (pid:21458) OpSys: LINUX
01/14/13 10:18:29 (pid:21458) Using processor count: 1 processors, 1 CPUs, 0 HTs
01/14/13 10:18:29 (pid:21458) Reading condor configuration from '/etc/condor/condor_config'

The scheduler daemon was restarted after the stack dump and is working again.
Retested with RH-7.8.8-0.4 on RHEL5/6, x86_64/x86

01/23/13 04:47:31 (pid:7745) Getting monitoring info for pid 7745
01/23/13 04:47:31 (pid:7745) Using processor count: 1 processors, 1 CPUs, 0 HTs
01/23/13 04:47:31 (pid:7745) Reading condor configuration from '/etc/condor/condor_config'
01/23/13 04:47:31 (pid:7745) Setting maximum accepts per cycle 8.
01/23/13 04:47:31 (pid:7745) Will use UDP to update collector node <IP:9618>
01/23/13 04:47:31 (pid:7745) Not using shared port because USE_SHARED_PORT=false
01/23/13 04:47:31 (pid:7745) Using name: node
01/23/13 04:47:31 (pid:7745) No Accountant host specified in config file
01/23/13 04:47:31 (pid:7745) History file rotation is enabled.
01/23/13 04:47:31 (pid:7745) Maximum history file size is: 2000000 bytes
01/23/13 04:47:31 (pid:7745) Number of rotated history files is: 10
01/23/13 04:47:31 (pid:7745) Count per interval for SelfDrainingQueue stop_job_queue set to 1
01/23/13 04:47:31 (pid:7745) Queue Management Super Users:
01/23/13 04:47:31 (pid:7745) root
01/23/13 04:47:31 (pid:7745) condor
01/23/13 04:47:31 (pid:7745) Failed to execute /usr/sbin/condor_shadow.std, ignoring
01/23/13 04:47:31 (pid:7745) Changing period of timer 12 (CkptWallClock) from 3600 to 3600 (added 0s to time of next scheduled call)
01/23/13 04:47:31 (pid:7745) Registering PeriodicExprHandler(), next callback in 26 seconds
01/23/13 04:47:31 (pid:7745) AutoCluster:config((null)) invoked
01/23/13 04:47:31 (pid:7745) JobsRunning = 0
01/23/13 04:47:31 (pid:7745) JobsIdle = 0
01/23/13 04:47:31 (pid:7745) JobsHeld = 0
01/23/13 04:47:31 (pid:7745) JobsRemoved = 0
01/23/13 04:47:31 (pid:7745) LocalUniverseJobsRunning = 0
01/23/13 04:47:31 (pid:7745) LocalUniverseJobsIdle = 0
01/23/13 04:47:31 (pid:7745) SchedUniverseJobsRunning = 0
01/23/13 04:47:31 (pid:7745) SchedUniverseJobsIdle = 0
01/23/13 04:47:31 (pid:7745) N_Owners = 0
01/23/13 04:47:31 (pid:7745) MaxJobsRunning = 5089
01/23/13 04:47:31 (pid:7745) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
01/23/13 04:47:31 (pid:7745) Trying to update collector <IP:9618>

>>> VERIFIED