Bug 705343 - condor_triggerd segfault after initialization using the condor_trigger_config
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 2.0
Assignee: Robert Rati
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On:
Blocks: 602766 693778 705722
Reported: 2011-05-17 12:30 UTC by Tomas Rusnak
Modified: 2011-06-27 15:32 UTC

Fixed In Version: condor-7.6.1-0.5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 705722
Environment:
Last Closed: 2011-06-27 15:32:43 UTC
Target Upstream Version:
Embargoed:



Description Tomas Rusnak 2011-05-17 12:30:39 UTC
Description of problem:
The condor_triggerd segfaults without any core dump after /usr/sbin/condor_trigger_config -i `hostname` was called.

Version-Release number of selected component (if applicable):
ruby-wallaby-0.10.5-4.el5
condor-wallaby-tools-4.0-6.el5
qpid-cpp-server-devel-0.10-6.el5
qpid-qmf-devel-0.10-6.el5
wallaby-utils-0.10.5-4.el5
qpid-qmf-0.10-6.el5
qpid-cpp-client-devel-docs-0.10-6.el5
qpid-tools-0.10-4.el5
python-condorutils-1.5-3.el5
ruby-qpid-qmf-0.10-6.el5
qpid-cpp-client-0.10-6.el5
qpid-cpp-server-cluster-0.10-6.el5
qpid-cpp-server-store-0.10-6.el5
qpid-java-common-0.10-4.el5
condor-7.6.1-0.4.el5
python-wallabyclient-4.0-6.el5
condor-wallaby-base-db-1.12-1.el5
python-qpid-qmf-0.10-6.el5
qpid-cpp-client-ssl-0.10-6.el5
qpid-cpp-server-xml-0.10-6.el5
condor-classads-7.6.1-0.4.el5
qpid-java-client-0.10-4.el5
condor-wallaby-client-4.0-6.el5
qpid-cpp-server-ssl-0.10-6.el5
qpid-java-example-0.10-4.el5
wallaby-0.10.5-4.el5
python-qpid-0.10-1.el5
qpid-cpp-server-0.10-6.el5
qpid-cpp-client-devel-0.10-6.el5
condor-qmf-7.6.1-0.4.el5

How reproducible:
100%

Steps to Reproduce:
1. set up condor for triggerd 
2. run /usr/sbin/condor_trigger_config -i `hostname` to initialize default triggers
3. tail -f /var/log/condor/TriggerLog
  
Actual results:
condor_triggerd segfault

Expected results:
no segfault

Additional info:

Config:

CREATE_CORE_FILES=True
ABORT_ON_EXCEPTION=True

QMF_BROKER_HOST=localhost
ALL_DEBUG=D_FULLDEBUG
CONFIGD_ARGS = -d

ALLOW_WRITE = *
ALLOW_READ = *
ALLOW_NEGOTIATOR = *
ALLOW_ADMINISTRATOR_READ = *

STARTD_CRON_NAME = TRIGGER_DATA
STARTD_CRON_AUTOPUBLISH = If_Changed
TRIGGER_DATA_JOBLIST = GetData
TRIGGER_DATA_GETDATA_PREFIX = Triggerd
TRIGGER_DATA_GETDATA_EXECUTABLE = $(BIN)/get_trigger_data
TRIGGER_DATA_GETDATA_PERIOD = 5m
TRIGGER_DATA_GETDATA_RECONFIG = FALSE

DAEMON_LIST = $(DAEMON_LIST),  TRIGGERD
ENABLE_ABSENT_NODES_DETECTION=True
DC_DAEMON_LIST = $(DAEMON_LIST)

QMF_BROKER_AUTH_MECH = ANONYMOUS

qpid: list
Summary of Objects by Type:
    Package          Class                 Active  Deleted
    ========================================================
    com.redhat.grid  condortriggerservice  1       0
    com.redhat.grid  master                1       0
    com.redhat.grid  negotiator            1       0
    com.redhat.grid  collector             1       0

# condor_trigger_config -i `hostname`
Connecting to broker 'hostname'...
Initializing, adding default triggers...
Adding trigger 'High CPU Usage'...
Adding trigger 'Low Free Mem'...
Adding trigger 'Low Free Disk Space (/)'...
Adding trigger 'Busy and Swapping'...
Adding trigger 'Busy but Idle'...
Adding trigger 'Idle for long time'...
Adding trigger 'Logs with ERROR entries'...
Adding trigger 'Logs with error entries'...
Adding trigger 'Logs with DENIED entries'...
Adding trigger 'Logs with denied entries'...
Adding trigger 'Logs with WARNING entries'...
Adding trigger 'Logs with warning entries'...
Adding trigger 'dprintf Logs'...
Adding trigger 'Logs with stack dumps'...
Adding trigger 'Core Files'...

TriggerLog:
05/17/11 14:49:24 Triggerd::AddTriggerToCollection called
05/17/11 14:49:24 Triggerd::AddTriggerToCollection exited with return value 0
05/17/11 14:49:24 Triggerd::config called
05/17/11 14:49:24 Triggerd::SetInterval called
05/17/11 14:49:24 Triggerd: Registered PerformQueries() to evaluate triggers every 10 seconds
05/17/11 14:49:24 Updating collector every 300 seconds
05/17/11 14:49:24 Will use UDP to update collector rhel5_64.mrg-qe-12.lab.eng.brq.redhat.com <IP:9618>
05/17/11 14:49:24 DaemonCore: in SendAliveToParent()
05/17/11 14:49:24 Initialized the following authorization table:
05/17/11 14:49:24 Authorizations yet to be resolved:
05/17/11 14:49:24 allow ADMINISTRATOR:  */IP */IP */IP */hostname */hostname
05/17/11 14:49:24 allow OWNER:  */IP */IP */IP */IP */hostname */hostname */hostname
05/17/11 14:49:24 Completed DC_CHILDALIVE to daemon at <IP:55842>
05/17/11 14:49:24 DaemonCore: Leaving SendAliveToParent() - success
05/17/11 14:49:24 Triggerd::UpdateCollector called
05/17/11 14:49:24 Trying to update collector <IP:9618>
05/17/11 14:49:24 Attempting to send update via UDP to collector hostname <IP:9618>
05/17/11 14:49:34 Triggerd: Evaluating 15 triggers
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Parsing trigger text '$(Machine) has $(TriggerdCondorLogCapitalErrorCount) ERROR messages in the following log files: $(TriggerdCondorLogCapitalError)'
05/17/11 14:49:34 Adding text string prior to variable substitution to event text
Stack dump for process 12272 at timestamp 1305636574 (10 frames)
condor_triggerd(dprintf_dump_stack+0x56)[0x529986]
condor_triggerd[0x51f662]
/lib64/libpthread.so.0[0x353020eb10]
condor_triggerd(_ZN3com6redhat4grid8Triggerd8RemoveWSEPKc+0xc)[0x46904c]
condor_triggerd(_ZN3com6redhat4grid8Triggerd14PerformQueriesEv+0x398)[0x46b608]
condor_triggerd(_ZN12TimerManager7TimeoutEv+0x155)[0x49abb5]
condor_triggerd(_ZN10DaemonCore6DriverEv+0x248)[0x4853b8]
condor_triggerd(main+0xe57)[0x4993a7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x352f61d994]
condor_triggerd[0x464fe9]

No core file was generated.
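
The mangled frame names in the stack dump above are easier to read once demangled. The following is a minimal sketch, assuming a GCC or Clang toolchain (it relies on `abi::__cxa_demangle` from `<cxxabi.h>`); the helper names `demangle` and `frame_symbol` are hypothetical and are not part of Condor.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Demangle one Itanium-ABI symbol name (GCC/Clang only).
// Names that do not start with "_Z", or that fail to demangle, are returned unchanged.
std::string demangle(const std::string& mangled) {
    if (mangled.rfind("_Z", 0) != 0) {
        return mangled;  // plain C names such as "main" are left alone
    }
    int status = 0;
    // __cxa_demangle returns a malloc()-allocated buffer, so it is released with free().
    std::unique_ptr<char, decltype(&std::free)> out(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        &std::free);
    return (status == 0 && out) ? std::string(out.get()) : mangled;
}

// Extract the symbol from a frame such as "condor_triggerd(_ZN...+0xc)[0x46904c]".
// Frames without a "(symbol+offset)" part yield an empty string.
std::string frame_symbol(const std::string& frame) {
    const std::string::size_type open = frame.find('(');
    if (open == std::string::npos) return "";
    const std::string::size_type plus = frame.find('+', open);
    if (plus == std::string::npos) return "";
    return frame.substr(open + 1, plus - open - 1);
}
```

Applied to the fourth frame, `demangle(frame_symbol(...))` gives `com::redhat::grid::Triggerd::RemoveWS(char const*)`, called from `com::redhat::grid::Triggerd::PerformQueries()`, which places the fault in the whitespace handling that runs during trigger evaluation.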

# ps ax | grep condor
 7433 pts/0    S+     0:00 grep condor
22730 ?        Ssl    0:03 condor_master -pidfile /var/run/condor/condor_master.pid
22734 ?        Ssl    0:01 condor_collector -f
22737 ?        Ssl    0:00 condor_negotiator -f
22738 ?        Ssl    0:00 condor_schedd -f
22739 ?        Ssl    0:00 condor_startd -f
22741 ?        S      0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 64

condor_triggerd is down - MasterLog:

05/17/11 15:20:28 DaemonCore: No more children processes to reap.
05/17/11 15:20:28 The TRIGGERD (pid 20398) died due to signal 11 (Segmentation fault)

Comment 1 Robert Rati 2011-05-17 18:56:06 UTC
The triggerd would attempt to dereference a null pointer when processing
whitespace if a trigger evaluation returned ClassAd data.  The issue was
introduced when the triggerd was modified to handle new ClassAds.

Fixed upstream and on:
UPSTREAM-7.6.1-BZ705343-triggerd-segfault
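
For illustration only, the kind of guard that avoids this class of crash can be sketched as below. This is not the upstream patch: `remove_ws` is a hypothetical stand-in, and the assumption that the helper trims leading and trailing whitespace is a guess, since the report only gives the symbol name `RemoveWS(char const*)`.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical whitespace-trimming helper; NOT the actual Condor code.
// Returns `text` without leading/trailing whitespace, or "" for a null pointer.
std::string remove_ws(const char* text) {
    if (text == nullptr) {
        return std::string();  // guard: never dereference a null pointer
    }
    std::string s(text);
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };
    // First non-whitespace character from the front.
    auto first = std::find_if_not(s.begin(), s.end(), is_space);
    // One past the last non-whitespace character, searching from the back.
    auto last = std::find_if_not(s.rbegin(),
                                 std::string::reverse_iterator(first),
                                 is_space).base();
    return std::string(first, last);
}
```

With the null check in place, a caller that has no string to pass (for example, an attribute lookup that produced nothing) gets an empty result instead of a segfault.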

Comment 4 Tomas Rusnak 2011-05-18 09:36:21 UTC
Retested over RHEL5/x86,x86_64:

ruby-wallaby-0.10.5-4.el5
condor-wallaby-tools-4.0-6.el5
qpid-java-common-0.10-6.el5
qpid-qmf-devel-0.10-6.el5
wallaby-utils-0.10.5-4.el5
qpid-qmf-0.10-6.el5
qpid-cpp-client-ssl-0.10-7.el5
qpid-cpp-server-cluster-0.10-7.el5
python-condorutils-1.5-3.el5
ruby-qpid-qmf-0.10-6.el5
qpid-cpp-client-0.10-7.el5
condor-7.6.1-0.5.el5
qpid-cpp-server-ssl-0.10-7.el5
qpid-java-client-0.10-6.el5
qpid-java-example-0.10-6.el5
python-wallabyclient-4.0-6.el5
condor-wallaby-base-db-1.12-1.el5
python-qpid-qmf-0.10-6.el5
qpid-cpp-server-0.10-7.el5
qpid-cpp-client-devel-0.10-7.el5
qpid-cpp-server-store-0.10-7.el5
qpid-cpp-server-devel-0.10-7.el5
qpid-tools-0.10-5.el5
condor-wallaby-client-4.0-6.el5
condor-classads-7.6.1-0.5.el5
qpid-cpp-server-xml-0.10-7.el5
wallaby-0.10.5-4.el5
python-qpid-0.10-1.el5
condor-qmf-7.6.1-0.5.el5
qpid-cpp-client-devel-docs-0.10-7.el5

# tail -f /var/log/condor/TriggerLog 
05/18/11 12:32:46 Adding classad value to event text
05/18/11 12:32:46 Adding text string prior to variable substitution to event text
05/18/11 12:32:46 token: 'TriggerdCondorLogStackDump'
05/18/11 12:32:46 Adding classad value to event text
05/18/11 12:32:46 Triggerd: Raised event with text '"hostname" has 4507 stack dumps in the following log files: "MasterLog,ShadowLog,ShadowLog.old,TriggerLog"'
05/18/11 12:32:46 Trying to query collector <IP:9618>
05/18/11 12:32:46 Query successful.  Parsing results
05/18/11 12:32:46 Triggerd: Found 1 nodes in the pool
05/18/11 12:32:46 Triggerd: 1 nodes expected to be in the pool
05/18/11 12:32:46 Triggerd: Found 0 missing nodes
05/18/11 12:32:56 Triggerd: Evaluating 15 triggers
05/18/11 12:32:56 Trying to query collector <IP:9618>
05/18/11 12:32:56 Query successful.  Parsing results
05/18/11 12:32:56 Trying to query collector <IP:9618>
05/18/11 12:32:56 Query successful.  Parsing results
05/18/11 12:32:56 Trying to query collector <IP:9618>
05/18/11 12:32:56 Query successful.  Parsing results
05/18/11 12:32:56 Trying to query collector <IP:9618>
05/18/11 12:32:56 Query successful.  Parsing results
05/18/11 12:32:56 Trying to query collector <IP:9618>
05/18/11 12:32:56 Query successful.  Parsing results
05/18/11 12:32:56 Trying to query collector <IP:9618>
05/18/11 12:32:56 Query successful.  Parsing results
05/18/11 12:32:56 Trying to query collector <IP:9618>
05/18/11 12:32:56 Query successful.  Parsing results
05/18/11 12:32:56 Parsing trigger text '$(Machine) has $(TriggerdCondorLogCapitalErrorCount) ERROR messages in the following log files: $(TriggerdCondorLogCapitalError)'
05/18/11 12:32:56 Adding text string prior to variable substitution to event text
05/18/11 12:32:56 token: 'Machine'
05/18/11 12:32:56 Adding classad value to event text
05/18/11 12:32:56 Adding text string prior to variable substitution to event text
05/18/11 12:32:56 token: 'TriggerdCondorLogCapitalErrorCount'
05/18/11 12:32:56 Adding classad value to event text

5472 ?        Ssl    0:00 condor_triggerd -f

Daemon is still alive. No crash from condor_triggerd found.
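
The TriggerLog lines above suggest that the triggerd builds event text by copying literal text and replacing each `$(Name)` token with a ClassAd value. A minimal sketch of that expansion follows, assuming attribute values are already available as strings; `expand_tokens` and the `attrs` map are hypothetical and do not reflect the real triggerd or ClassAd API.

```cpp
#include <map>
#include <string>

// Expand "$(Name)" tokens in `text` using values from `attrs`.
// Unknown tokens expand to nothing; an unterminated "$(" is copied verbatim.
std::string expand_tokens(const std::string& text,
                          const std::map<std::string, std::string>& attrs) {
    std::string out;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        const std::string::size_type start = text.find("$(", pos);
        if (start == std::string::npos) {
            out.append(text, pos, std::string::npos);  // no more tokens
            break;
        }
        const std::string::size_type end = text.find(')', start + 2);
        if (end == std::string::npos) {
            out.append(text, pos, std::string::npos);  // unterminated token
            break;
        }
        out.append(text, pos, start - pos);  // literal text before the token
        const std::string name = text.substr(start + 2, end - start - 2);
        const auto it = attrs.find(name);
        if (it != attrs.end()) {
            out += it->second;  // substituted attribute value
        }
        pos = end + 1;
    }
    return out;
}
```

For example, with `Machine` mapped to `"hostname"` and a hypothetical `Count` mapped to `4507`, the text `$(Machine) has $(Count) stack dumps` expands to `"hostname" has 4507 stack dumps`, similar in shape to the raised event in the log.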

>>> VERIFIED

