Bug 853945

Summary: RHHAv2 collector daemon SEGFAULT
Product: Red Hat Enterprise MRG
Reporter: Tomas Rusnak <trusnak>
Component: condor
Assignee: Timothy St. Clair <tstclair>
Status: CLOSED NOTABUG
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 2.2
CC: matt, rrati, tstclair
Target Milestone: 2.2
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-09-04 17:08:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 799474, 852537
Attachments:
Description Flags
collector daemon strace none

Description Tomas Rusnak 2012-09-03 11:36:27 UTC
Description of problem:

Condor was upgraded from 7.6.5-0.18.el6 to 7.6.5-0.22.el6 on an RHHAv2 setup with multiple schedulers. After a condor restart, the collector fails to start and logs the following:

09/03/12 11:28:33 ******************************************************
09/03/12 11:28:33 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
09/03/12 11:28:33 ** /usr/sbin/condor_collector
09/03/12 11:28:33 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
09/03/12 11:28:33 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
09/03/12 11:28:33 ** $CondorVersion: 7.6.5 Aug 30 2012 BuildID: RH-7.6.5-0.22.el6 $
09/03/12 11:28:33 ** $CondorPlatform: X86_64-RedHat_6.3 $
09/03/12 11:28:33 ** PID = 25162
09/03/12 11:28:33 ** Log last touched 9/3 11:27:20
09/03/12 11:28:33 ******************************************************
09/03/12 11:28:33 Using config source: /etc/condor/condor_config
09/03/12 11:28:33 Using local config sources: 
09/03/12 11:28:33    /etc/condor/config.d/00personal_condor.config
09/03/12 11:28:33    /etc/condor/config.d/50ha.config
09/03/12 11:28:33    /etc/condor/config.d/60condor-qmf.config
09/03/12 11:28:33    /etc/condor/config.d/61aviary.config
09/03/12 11:28:33    /etc/condor/config.d/99configd.config
09/03/12 11:28:33    /var/lib/condor/wallaby_node.config
09/03/12 11:28:33 DaemonCore: command socket at <10.34.1.106:9618>
09/03/12 11:28:33 DaemonCore: private command socket at <10.34.1.106:9618>
09/03/12 11:28:33 Setting maximum accepts per cycle 8.
09/03/12 11:28:33 In ViewServer::Init()
09/03/12 11:28:33 In CollectorDaemon::Init()
09/03/12 11:28:33 In ViewServer::Config()
09/03/12 11:28:33 In CollectorDaemon::Config()
09/03/12 11:28:33 OfflineCollectorPlugin::configure: no persistent store was defined for off-line ads.
09/03/12 11:28:33 enable: Creating stats hash table
09/03/12 11:28:33 Enabling CCB Server.
09/03/12 11:28:33 Plugin registration succeeded
09/03/12 11:28:33 Successfully loaded plugin: /usr/lib64/condor/plugins/MgmtCollectorPlugin-plugin.so
09/03/12 11:28:33 WARNING: forward resolution of localhost.localdomain doesn't match 6a01220a!
Stack dump for process 25162 at timestamp 1346664513 (17 frames)
condor_collector(dprintf_dump_stack+0x63)[0x4e87b3]
condor_collector[0x527252]
/lib64/libpthread.so.0(+0xf500)[0x7fd5f52de500]
/lib64/libc.so.6(__nss_hostname_digits_dots+0x49)[0x7fd5f50390e9]
/lib64/libc.so.6(gethostbyname+0x90)[0x7fd5f503e990]
condor_collector(_Z18verify_name_has_ipPc7in_addr+0x31)[0x4a2f91]
condor_collector(_ZN8IpVerify6VerifyE12DCpermissionPK11sockaddr_inPKcP8MyStringS7_+0x4f4)[0x4a5484]
condor_collector(_ZN10DaemonCore6VerifyEPKc12DCpermissionPK11sockaddr_inS1_+0x85)[0x47b8a5]
condor_collector(_ZN10DaemonCore9HandleReqEP6StreamS1_+0xc21)[0x48a131]
condor_collector(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x73d)[0x48d5bd]
condor_collector(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1a)[0x48d5fa]
condor_collector(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40)[0x524e50]
condor_collector(_ZN10DaemonCore17CallSocketHandlerERib+0x135)[0x483095]
condor_collector(_ZN10DaemonCore6DriverEv+0x2012)[0x487d62]
condor_collector(main+0x116b)[0x47660b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fd5f4f5acdd]
condor_collector[0x461879]
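
The warning just before the stack dump prints the address as raw hex (`6a01220a`), and the trace shows the crash inside `gethostbyname`, called from `verify_name_has_ip`. The sketch below decodes that hex word and runs a comparable forward-resolution check. It assumes the value was printed as a 32-bit integer on a little-endian host; under that assumption it decodes to the same address the log reports for the command socket. The helper names `decode_hex_ipv4` and `name_has_ip` are hypothetical and use Python's resolver; this is not Condor's implementation.

```python
import socket
import struct


def decode_hex_ipv4(hex_word):
    """Decode a hex-printed IPv4 address to dotted-quad form.

    Assumption: the word is an in_addr printed as an integer on a
    little-endian host, so the first octet is the lowest byte.
    """
    value = int(hex_word, 16)
    return socket.inet_ntoa(struct.pack("<I", value))


def name_has_ip(name, ip, resolve=socket.gethostbyname_ex):
    """Return True if forward resolution of `name` yields `ip`.

    Hypothetical stand-in for the check the trace calls
    verify_name_has_ip; `resolve` is injectable for testing.
    """
    try:
        _canonical, _aliases, addresses = resolve(name)
    except OSError:
        # Resolution failure (gaierror/herror) counts as "no match".
        return False
    return ip in addresses


if __name__ == "__main__":
    ip = decode_hex_ipv4("6a01220a")
    print(ip)  # 10.34.1.106
    # Result depends on the resolver configuration of the local host.
    print(name_has_ip("localhost.localdomain", ip))
```

On a node where `localhost.localdomain` resolves only to a loopback address, the second call would be expected to return `False`, which is consistent with the warning in the log.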

Version-Release number of selected component (if applicable):

$CondorVersion: 7.6.5 Aug 30 2012 BuildID: RH-7.6.5-0.22.el6 $
$CondorPlatform: X86_64-RedHat_6.3 $

How reproducible:
100%

Steps to Reproduce:
1. Install or upgrade to 7.6.5-0.22.
2. Set up multiple schedulers on 3 nodes with 1 central manager.
3. Restart condor.
4. Check /var/log/condor/CollectorLog.
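
For step 4, a crash can be spotted by looking for the "Stack dump for process" header line in the collector log. The following is a minimal sketch, assuming the header format shown in the log excerpt of this report; the function name `find_stack_dumps` is hypothetical.

```python
import re

# Header format taken from the log excerpt in this report.
STACK_HEADER = re.compile(
    r"^Stack dump for process (\d+) at timestamp (\d+) \((\d+) frames\)"
)


def find_stack_dumps(log_text):
    """Return (pid, timestamp, frame_count) for each stack dump header."""
    dumps = []
    for line in log_text.splitlines():
        match = STACK_HEADER.match(line)
        if match:
            dumps.append(tuple(int(group) for group in match.groups()))
    return dumps


if __name__ == "__main__":
    sample = (
        "09/03/12 11:28:33 Enabling CCB Server.\n"
        "Stack dump for process 25162 at timestamp 1346664513 (17 frames)\n"
        "condor_collector(dprintf_dump_stack+0x63)[0x4e87b3]\n"
    )
    print(find_stack_dumps(sample))  # [(25162, 1346664513, 17)]
```

To run it against a real node, read the contents of /var/log/condor/CollectorLog and pass the text to `find_stack_dumps`.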
  
Actual results:
The collector crashes on startup and is not running.

Expected results:
The collector starts and stays available without crashing.

Additional info:

No core dump was generated even with:

ALL_DEBUG="D_FULLDEBUG"
ABORT_ON_EXCEPTION = True

Related packages:

condor-classads-7.6.5-0.22.el6.x86_64
condor-wallaby-tools-4.1.3-1.el6.noarch
python-condorutils-1.5-4.el6.noarch
condor-7.6.5-0.22.el6.x86_64
condor-qmf-7.6.5-0.22.el6.x86_64
condor-wallaby-client-4.1.3-1.el6.noarch
condor-aviary-7.6.5-0.22.el6.x86_64
condor-cluster-resource-agent-7.6.5-0.22.el6.x86_64
condor-wallaby-base-db-1.23-1.el6.noarch
wallaby-utils-0.12.5-10.el6.noarch
ruby-wallaby-0.12.5-10.el6.noarch
python-wallabyclient-4.1.3-1.el6.noarch
wallaby-0.12.5-10.el6.noarch

qpid-cpp-server-xml-0.14-21.el6_3.x86_64
qpid-java-client-0.18-1.el6.noarch
qpid-cpp-client-0.14-21.el6_3.x86_64
qpid-cpp-server-0.14-21.el6_3.x86_64
python-qpid-qmf-0.14-14.el6_3.x86_64
qpid-cpp-server-store-0.14-21.el6_3.x86_64
qpid-java-example-0.18-1.el6.noarch
qpid-cpp-client-devel-docs-0.14-21.el6_3.noarch
qpid-java-common-0.18-1.el6.noarch
qpid-cpp-client-devel-0.14-21.el6_3.x86_64
qpid-cpp-server-devel-0.14-21.el6_3.x86_64
ruby-qpid-qmf-0.14-14.el6_3.x86_64

Comment 1 Tomas Rusnak 2012-09-03 11:41:10 UTC
Created attachment 609338
collector daemon strace