Bug 784573

Summary: QueryServer crash on x86_64
Product: Red Hat Enterprise MRG
Reporter: Martin Kudlej <mkudlej>
Component: condor-aviary
Assignee: grid-maint-list <grid-maint-list>
Status: CLOSED WONTFIX
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium
Docs Contact:
Priority: low
Version: Development
CC: esammons, jneedle, matt, tstclair
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: condor-7.8.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: Logging a warning when a DestroyClassAd event occurs in the Query Server.
Consequence: Query Server crashes.
Fix: Fixed a bad string format in error logging.
Result: Query server doesn't crash.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-26 20:01:54 UTC
Type: ---
Attachments:
Description Flags
logs and configuration none

Description Martin Kudlej 2012-01-25 12:22:48 UTC
Created attachment 557440
logs and configuration

Description of problem:
I have 4 machines configured as follows:
Features (priority: name):
  0: Master
  1: NodeAccess
  2: ExecuteNode
  3: CentralManager
  4: Scheduler
  5: QMF
  6: JobHooks
  7: QueryServer
  8: JobServer
Parameters:
  QMF_BROKER_HOST = _host_1
  SUSPEND = false
  ALLOW_WRITE = *
  START = true
  CREATE_CORE_FILES = true
  CONTINUE = true
  ALLOW_READ = *
  CONDOR_HOST = _host_1
  SCHEDD_CLUSTER_MAXIMUM_VALUE = 3

Core file generation is enabled in the OS, but I don't see any core file.
I periodically submit a simple job and check the number of jobs in the queue. While submitting, I also run condor_q every 2 seconds. I see this stack dump in QueryServerLog:

01/25/12 06:00:47 HistoryFile::init:1:Failed to stat /var/lib/condor/spool//history: 2 (No such file or directory)

Stack dump for process 21195 at timestamp 1327489367 (21 frames)
aviary_query_server(dprintf_dump_stack+0x56)[0x4fd296]
aviary_query_server[0x4ff192]
/lib64/libpthread.so.0[0x3eee20eb70]
/lib64/libc.so.6(strlen+0x10)[0x3eeda79b60]
/lib64/libc.so.6(_IO_vfprintf+0x4479)[0x3eeda46cb9]
/lib64/libc.so.6(vsnprintf+0x9a)[0x3eeda699da]
aviary_query_server(vprintf_length+0x32)[0x502dc2]
aviary_query_server(vsprintf_realloc+0x52)[0x502e22]
aviary_query_server[0x4fdd23]
aviary_query_server(_condor_dprintf_va+0x313)[0x4fedb3]
aviary_query_server(dprintf+0x86)[0x4ea186]
aviary_query_server(_ZN23JobServerJobLogConsumer14DestroyClassAdEPKc+0x69)[0x461369]
aviary_query_server(_ZN16ClassAdLogReader15ProcessLogEntryEP15ClassAdLogEntryP16ClassAdLogParser+0xa2)[0x5265a2]
aviary_query_server(_ZN16ClassAdLogReader15IncrementalLoadEv+0x36)[0x5265e6]
aviary_query_server(_ZN16ClassAdLogReader4PollEv+0xbf)[0x52677f]
aviary_query_server(_ZN12JobLogMirror26TimerHandler_JobLogPollingEv+0x21)[0x4fee71]
aviary_query_server(_ZN12TimerManager7TimeoutEv+0x155)[0x48e005]
aviary_query_server(_ZN10DaemonCore6DriverEv+0x248)[0x47ae78]
aviary_query_server(main+0xed0)[0x471030]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3eeda1d994]
aviary_query_server[0x45d4f9]
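
The C++ frames in the dump above are printed as mangled symbol names. As a reading aid, here is a minimal sketch that demangles them with abi::__cxa_demangle from <cxxabi.h> (available with GCC and Clang). The helper name demangle is an assumption made for this illustration and is not part of Condor.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Hypothetical helper: demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if demangling fails.
std::string demangle(const std::string &mangled) {
    int status = 0;
    char *out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the buffer returned by __cxa_demangle is malloc'ed
    return result;
}
```

For example, demangle("_ZN23JobServerJobLogConsumer14DestroyClassAdEPKc") gives JobServerJobLogConsumer::DestroyClassAd(char const*), which is the frame that calls dprintf just before the fault in strlen.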

I see this only on x86_64 systems.

Version-Release number of selected component (if applicable):
condor-wallaby-client-4.1.2-1.el5
qpid-qmf-devel-0.10-11.el5
condor-low-latency-1.2-2.el5
condor-ec2-enhanced-1.3.0-1.el5
condor-wallaby-base-db-1.19-1.el5
condor-kbdd-7.6.5-0.12.el5
python-qpid-qmf-0.10-11.el5
condor-job-hooks-1.5-4.el5
python-qpid-0.10-1.el5
qpid-cpp-client-0.10-9.el5
python-wallabyclient-4.1.2-1.el5
qpid-cpp-client-devel-0.10-9.el5
ruby-qpid-qmf-0.10-11.el5
condor-wallaby-tools-4.1.2-1.el5
qpid-qmf-debuginfo-0.10-11.el5
python-condorec2e-1.3.0-1.el5
condor-ec2-enhanced-hooks-1.3.0-1.el5
wallaby-utils-0.12.5-1.el5
wallaby-0.12.5-1.el5
condor-classads-7.6.5-0.12.el5
condor-aviary-7.6.5-0.12.el5
condor-debuginfo-7.6.5-0.12.el5
condor-vm-gahp-7.6.5-0.12.el5
python-condorutils-1.5-4.el5
qpid-cpp-server-0.10-9.el5
qpid-qmf-0.10-11.el5
qpid-tools-0.10-6.el5
ruby-wallaby-0.12.5-1.el5
python-wallaby-0.12.5-1.el5
condor-7.6.5-0.12.el5
condor-qmf-7.6.5-0.12.el5


How reproducible:
100%

Steps to Reproduce:
1. Install condor with QMF and Aviary support.
2. Configure it as described above.
3. service condor stop
4. rm -f /var/log/condor/*
5. rm -f /var/lib/condor/spool/*
6. service condor start
7. Periodically submit a simple job.
8. Wait until the stack dump appears in QueryServerLog.
  
Actual results:
The Aviary server crashes, and the master should then start it again.

Expected results:
The Aviary server does not crash.

Comment 1 Pete MacKinnon 2012-01-31 23:56:42 UTC
A bad dprintf format string on a particular code path is the culprit.
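
To illustrate this class of bug, here is a minimal C++ sketch. It is not the actual Condor code; the real change is in the upstream commit referenced below. The stack dump shows strlen faulting inside vsnprintf, which is what typically happens when a %s conversion is given an argument that is not a valid C string. The names log_warning and format_destroy_warning and the message text are hypothetical. The format attribute is a GCC/Clang extension that lets -Wformat catch such mismatches at compile time.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical printf-style helper (not Condor's dprintf). With GCC or
// Clang, the `format` attribute makes -Wformat check every call's
// arguments against its format string at compile time.
#if defined(__GNUC__)
__attribute__((format(printf, 1, 2)))
#endif
std::string log_warning(const char *fmt, ...) {
    char buf[256];
    va_list ap;
    va_start(ap, fmt);
    std::vsnprintf(buf, sizeof buf, fmt, ap);
    va_end(ap);
    return std::string(buf);
}

// Hypothetical handler: build a warning for a destroy event on some key.
std::string format_destroy_warning(const char *key) {
    // Broken variant (undefined behavior, do not run): the argument does
    // not match the %s conversion, so vsnprintf may call strlen on garbage.
    //   log_warning("Warning: unknown ClassAd '%s'\n", 42);

    // Correct variant: %s is paired with a non-null C string.
    return log_warning("Warning: unknown ClassAd '%s'\n", key ? key : "(null)");
}
```

Compiling with -Wall (which enables -Wformat) would flag the commented-out call if it were enabled, so this kind of mistake can be caught before it reaches a rarely exercised code path.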

UW f3604d8

Comment 3 Pete MacKinnon 2012-03-15 15:04:15 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Logging a warning when a DestroyClassAd event occurs in the Query Server.
Consequence: Query Server crashes.
Fix: Fixed a bad string format in error logging.
Result: Query server doesn't crash.