894074 – ring_buffer<> internal state management problem with SetSize() method

Summary: ring_buffer<> internal state management problem with SetSize() method

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	Development
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	2.3
Target Release:	---
Assignee:	Erik Erlandson
QA Contact:	Tomas Rusnak
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-01-10 16:11 UTC by Erik Erlandson
Modified:	2013-03-19 16:39 UTC
CC List:	3 users
Fixed In Version:	condor-7.8.8-0.4
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-03-19 16:39:48 UTC
Target Upstream Version:
Embargoed:

Description Erik Erlandson 2013-01-10 16:11:52 UTC

Description of problem:

    A bug in the state management logic of ring_buffer<>::SetSize() causes an exception when a daemon is reconfigured.


How reproducible:
100%

Steps to Reproduce:
    Start with this configuration:

    # I believe any size reduction will work, but here's my configuration
    STATISTICS_WINDOW_QUANTUM = 1
    STATISTICS_WINDOW_SECONDS = 30

    Start up a pool with a schedd. Then change STATISTICS_WINDOW_SECONDS to 15 (which reduces the size of the ring buffers) and run:

    $ condor_reconfig -schedd

    You should see an exception in SchedLog:

    01/07/13 15:59:22 (pid:1068) ERROR "Unexpected call to empty ring_buffer  


Expected results:
Resizing via reconfig should work properly, with no exceptions.
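
For illustration only, here is a minimal C++ sketch of the invariant a shrinking resize has to preserve: after the resize, the stored item count, the write position, and the capacity must still agree with each other, so that later reads do not find the buffer unexpectedly empty. This is not Condor's ring_buffer<> and not the actual fix; the class SketchRing and its members Push(), FromNewest(), Capacity(), and Length() are hypothetical names chosen for this example, and only the name SetSize() is taken from the report.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration; NOT the Condor ring_buffer<> implementation.
template <typename T>
class SketchRing {
public:
    explicit SketchRing(std::size_t capacity)
        : slots_(capacity), head_(0), count_(0) {}

    std::size_t Capacity() const { return slots_.size(); }
    std::size_t Length() const { return count_; }

    // Store a value, overwriting the oldest one once the buffer is full.
    void Push(const T& value) {
        if (slots_.empty()) return;  // zero-capacity buffer stores nothing
        slots_[head_] = value;
        head_ = (head_ + 1) % slots_.size();
        if (count_ < slots_.size()) ++count_;
    }

    // Read the i-th newest value (0 is the most recent).
    const T& FromNewest(std::size_t i) const {
        assert(i < count_);  // reading an empty slot is a caller error
        const std::size_t cap = slots_.size();
        return slots_[(head_ + cap - 1 - i) % cap];
    }

    // Resize while keeping the newest values and a consistent state.
    void SetSize(std::size_t new_capacity) {
        const std::size_t keep = std::min(count_, new_capacity);
        std::vector<T> fresh(new_capacity);
        // Copy the surviving values oldest-first into slots 0..keep-1.
        for (std::size_t k = 0; k < keep; ++k) {
            fresh[k] = FromNewest(keep - 1 - k);
        }
        slots_.swap(fresh);
        // Update count and write position together with the storage,
        // so all three describe the same contents.
        count_ = keep;
        head_ = (new_capacity == 0) ? 0 : keep % new_capacity;
    }

private:
    std::vector<T> slots_;  // backing storage
    std::size_t head_;      // index of the next slot to write
    std::size_t count_;     // number of valid values stored
};
```

With this sketch, filling a 30-slot buffer and then calling SetSize(15), which mirrors the change of STATISTICS_WINDOW_SECONDS from 30 to 15 above, leaves the 15 newest values readable and lets further Push() calls proceed normally.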

Comment 1 Tomas Rusnak 2013-01-14 15:14:32 UTC

Confirmed on:

$CondorVersion: 7.8.8 Dec 14 2012 BuildID: RH-7.8.8-0.1.el5 $
$CondorPlatform: X86_64-RedHat_5.9 $

SchedLog:
01/14/13 10:18:17 (pid:21385) ============ End clean_shadow_recs =============
01/14/13 10:18:17 (pid:21385) DaemonCore: No more children processes to reap.
01/14/13 10:18:18 (pid:21385) Getting monitoring info for pid 21385
01/14/13 10:18:18 (pid:21385) ERROR "Unexpected call to empty ring_buffer
" at line 290 in file /builddir/build/BUILD/condor-7.8.6/src/condor_utils/generic_stats.h
Stack dump for process 21385 at timestamp 1358155098 (15 frames)
/usr/lib64/libcondor_utils_7_8_8.so(dprintf_dump_stack+0x58)[0x333a1075c8]
/usr/lib64/libcondor_utils_7_8_8.so[0x333a165ec2]
/lib64/libpthread.so.0[0x378900ebe0]
/lib64/libc.so.6(abort+0x28f)[0x3788431eaf]
/usr/lib64/libcondor_utils_7_8_8.so(_EXCEPT_+0x130)[0x333a11c710]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN18stats_entry_recentIiE9AdvanceByEi+0x14c)[0x333a13d08c]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN26stats_recent_counter_timer9AdvanceByEi+0x21)[0x333a143cd1]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN14StatisticsPool7AdvanceEi+0x63)[0x333a13b703]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN10DaemonCore5Stats4TickEl+0x56)[0x333a262ad6]
/usr/lib64/libcondor_utils_7_8_8.so[0x333a267820]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN12TimerManager7TimeoutEPiPd+0x405)[0x333a288eb5]
/usr/lib64/libcondor_utils_7_8_8.so(_ZN10DaemonCore6DriverEv+0x56a)[0x333a27444a]
/usr/lib64/libcondor_utils_7_8_8.so(_Z7dc_mainiPPc+0xffa)[0x333a28e14a]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x378841d994]
condor_schedd(_ZN7CronJob6ReaperEii+0x1f9)[0x42cef9]
01/14/13 10:18:29 (pid:21458) OpSysMajorVersion:  5 
01/14/13 10:18:29 (pid:21458) OpSysShortName:  RedHat 
01/14/13 10:18:29 (pid:21458) OpSysLongName:  Red Hat Enterprise Linux Server release 5.8 (Tikanga) 
01/14/13 10:18:29 (pid:21458) OpSysAndVer:  RedHat5 
01/14/13 10:18:29 (pid:21458) OpSysLegacy:  LINUX 
01/14/13 10:18:29 (pid:21458) OpSysName:  RedHat 
01/14/13 10:18:29 (pid:21458) OpSysVer:  508 
01/14/13 10:18:29 (pid:21458) OpSys:  LINUX 
01/14/13 10:18:29 (pid:21458) Using processor count: 1 processors, 1 CPUs, 0 HTs
01/14/13 10:18:29 (pid:21458) Reading condor configuration from '/etc/condor/condor_config'

The scheduler daemon was restarted after the stack dump and is working.

Comment 5 Tomas Rusnak 2013-01-23 09:42:57 UTC

Retested with RH-7.8.8-0.4 on RHEL5/6, x86_64/x86

01/23/13 04:47:31 (pid:7745) Getting monitoring info for pid 7745 
01/23/13 04:47:31 (pid:7745) Using processor count: 1 processors, 1 CPUs, 0 HTs
01/23/13 04:47:31 (pid:7745) Reading condor configuration from '/etc/condor/condor_config'
01/23/13 04:47:31 (pid:7745) Setting maximum accepts per cycle 8.
01/23/13 04:47:31 (pid:7745) Will use UDP to update collector node <IP:9618>
01/23/13 04:47:31 (pid:7745) Not using shared port because USE_SHARED_PORT=false
01/23/13 04:47:31 (pid:7745) Using name: node
01/23/13 04:47:31 (pid:7745) No Accountant host specified in config file
01/23/13 04:47:31 (pid:7745) History file rotation is enabled.
01/23/13 04:47:31 (pid:7745)   Maximum history file size is: 2000000 bytes
01/23/13 04:47:31 (pid:7745)   Number of rotated history files is: 10
01/23/13 04:47:31 (pid:7745) Count per interval for SelfDrainingQueue stop_job_queue set to 1
01/23/13 04:47:31 (pid:7745) Queue Management Super Users:
01/23/13 04:47:31 (pid:7745)    root
01/23/13 04:47:31 (pid:7745)    condor
01/23/13 04:47:31 (pid:7745) Failed to execute /usr/sbin/condor_shadow.std, ignoring
01/23/13 04:47:31 (pid:7745) Changing period of timer 12 (CkptWallClock) from 3600 to 3600 (added 0s to time of next scheduled call)
01/23/13 04:47:31 (pid:7745) Registering PeriodicExprHandler(), next callback in 26 seconds
01/23/13 04:47:31 (pid:7745) AutoCluster:config((null)) invoked
01/23/13 04:47:31 (pid:7745) JobsRunning = 0
01/23/13 04:47:31 (pid:7745) JobsIdle = 0
01/23/13 04:47:31 (pid:7745) JobsHeld = 0
01/23/13 04:47:31 (pid:7745) JobsRemoved = 0
01/23/13 04:47:31 (pid:7745) LocalUniverseJobsRunning = 0
01/23/13 04:47:31 (pid:7745) LocalUniverseJobsIdle = 0
01/23/13 04:47:31 (pid:7745) SchedUniverseJobsRunning = 0
01/23/13 04:47:31 (pid:7745) SchedUniverseJobsIdle = 0
01/23/13 04:47:31 (pid:7745) N_Owners = 0
01/23/13 04:47:31 (pid:7745) MaxJobsRunning = 5089
01/23/13 04:47:31 (pid:7745) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
01/23/13 04:47:31 (pid:7745) Trying to update collector <IP:9618>

>>> VERIFIED
