Created attachment 557440 [details]
logs and configuration

Description of problem:
I have 4 machines configured with the following features and parameters.

Features (priority: name):
  0: Master
  1: NodeAccess
  2: ExecuteNode
  3: CentralManager
  4: Scheduler
  5: QMF
  6: JobHooks
  7: QueryServer
  8: JobServer

Parameters:
  QMF_BROKER_HOST = _host_1
  SUSPEND = false
  ALLOW_WRITE = *
  START = true
  CREATE_CORE_FILES = true
  CONTINUE = true
  ALLOW_READ = *
  CONDOR_HOST = _host_1
  SCHEDD_CLUSTER_MAXIMUM_VALUE = 3

Core file generation is enabled in the OS, but I don't see any core file.

I periodically submit a simple job and check the number of jobs in the queue. While submitting, I also run condor_q every 2 seconds. I see this stack dump in QueryServerLog:

01/25/12 06:00:47 HistoryFile::init:1:Failed to stat /var/lib/condor/spool//history: 2 (No such file or directory)
Stack dump for process 21195 at timestamp 1327489367 (21 frames)
aviary_query_server(dprintf_dump_stack+0x56)[0x4fd296]
aviary_query_server[0x4ff192]
/lib64/libpthread.so.0[0x3eee20eb70]
/lib64/libc.so.6(strlen+0x10)[0x3eeda79b60]
/lib64/libc.so.6(_IO_vfprintf+0x4479)[0x3eeda46cb9]
/lib64/libc.so.6(vsnprintf+0x9a)[0x3eeda699da]
aviary_query_server(vprintf_length+0x32)[0x502dc2]
aviary_query_server(vsprintf_realloc+0x52)[0x502e22]
aviary_query_server[0x4fdd23]
aviary_query_server(_condor_dprintf_va+0x313)[0x4fedb3]
aviary_query_server(dprintf+0x86)[0x4ea186]
aviary_query_server(_ZN23JobServerJobLogConsumer14DestroyClassAdEPKc+0x69)[0x461369]
aviary_query_server(_ZN16ClassAdLogReader15ProcessLogEntryEP15ClassAdLogEntryP16ClassAdLogParser+0xa2)[0x5265a2]
aviary_query_server(_ZN16ClassAdLogReader15IncrementalLoadEv+0x36)[0x5265e6]
aviary_query_server(_ZN16ClassAdLogReader4PollEv+0xbf)[0x52677f]
aviary_query_server(_ZN12JobLogMirror26TimerHandler_JobLogPollingEv+0x21)[0x4fee71]
aviary_query_server(_ZN12TimerManager7TimeoutEv+0x155)[0x48e005]
aviary_query_server(_ZN10DaemonCore6DriverEv+0x248)[0x47ae78]
aviary_query_server(main+0xed0)[0x471030]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3eeda1d994]
aviary_query_server[0x45d4f9]

I see this only on x86_64 systems.

Version-Release number of selected component (if applicable):
condor-wallaby-client-4.1.2-1.el5
qpid-qmf-devel-0.10-11.el5
condor-low-latency-1.2-2.el5
condor-ec2-enhanced-1.3.0-1.el5
condor-wallaby-base-db-1.19-1.el5
condor-kbdd-7.6.5-0.12.el5
python-qpid-qmf-0.10-11.el5
condor-job-hooks-1.5-4.el5
python-qpid-0.10-1.el5
qpid-cpp-client-0.10-9.el5
python-wallabyclient-4.1.2-1.el5
qpid-cpp-client-devel-0.10-9.el5
ruby-qpid-qmf-0.10-11.el5
condor-wallaby-tools-4.1.2-1.el5
qpid-qmf-debuginfo-0.10-11.el5
python-condorec2e-1.3.0-1.el5
condor-ec2-enhanced-hooks-1.3.0-1.el5
wallaby-utils-0.12.5-1.el5
wallaby-0.12.5-1.el5
condor-classads-7.6.5-0.12.el5
condor-aviary-7.6.5-0.12.el5
condor-debuginfo-7.6.5-0.12.el5
condor-vm-gahp-7.6.5-0.12.el5
python-condorutils-1.5-4.el5
qpid-cpp-server-0.10-9.el5
qpid-qmf-0.10-11.el5
qpid-tools-0.10-6.el5
ruby-wallaby-0.12.5-1.el5
python-wallaby-0.12.5-1.el5
condor-7.6.5-0.12.el5
condor-qmf-7.6.5-0.12.el5

How reproducible:
100%

Steps to Reproduce:
1. install condor, qmf and aviary support for condor
2. set it up as described above
3. service condor stop
4. rm -f /var/log/condor/*
5. rm -f /var/lib/condor/spool/*
6. service condor start
7. periodically submit a simple job
8. wait until the stack dump appears

Actual results:
The Aviary server crashes and the master should start it again.

Expected results:
The Aviary server does not crash.
Bad dprintf format down a particular code path is the culprit. UW f3604d8
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Logging a warning when a DestroyClassAd event occurs in the Query Server.
Consequence: Query Server crashes.
Fix: Fixed a bad string format in error logging.
Result: Query server doesn't crash.
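
For illustration only, here is a minimal C++ sketch of the kind of failure the stack dump suggests: strlen faulting inside vsnprintf, which is typical when a %s conversion receives an argument that is not a valid string pointer. This is not the actual Condor code or the actual patch. The names format_log, safe_str and destroy_warning, and the message text, are hypothetical. The sketch assumes GCC or Clang for the format attribute, which lets the compiler flag mismatched arguments when built with -Wformat.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a dprintf-style logger; NOT Condor's API.
// The format attribute (GCC/Clang) asks the compiler to check every
// argument against its conversion specifier in the format string.
#if defined(__GNUC__)
__attribute__((format(printf, 1, 2)))
#endif
std::string format_log(const char* fmt, ...) {
    assert(fmt != nullptr);
    char buf[256];
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buf, sizeof(buf), fmt, args);  // truncates safely at 255 chars
    va_end(args);
    return std::string(buf);
}

// Hypothetical helper: never hand a null pointer to %s.
const char* safe_str(const char* s) {
    return s ? s : "(null)";
}

// Illustrative warning for a destroy event; the wording is invented.
std::string destroy_warning(const char* key) {
    // A mismatched call such as format_log("key '%s'", 42) is undefined
    // behaviour: the integer would be read as a pointer and passed to
    // strlen. With the attribute above, -Wformat reports it at compile time.
    return format_log("WARNING: no job found for key '%s'\n", safe_str(key));
}
```

Calling destroy_warning("1.0") returns the formatted message, and destroy_warning(nullptr) returns the same message with "(null)" in place of the key instead of dereferencing a null pointer.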