Bug 602766

Summary: condor_triggerd: re-enable absent node feature
Product: Red Hat Enterprise MRG Reporter: Matthew Farrellee <matt>
Component: condor Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA QA Contact: Tomas Rusnak <trusnak>
Severity: medium Docs Contact:
Priority: low    
Version: 1.2 CC: iboverma, mhusnain, mkudlej, trusnak
Target Milestone: 2.0 Keywords: FutureFeature
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: condor-7.5.6-0.2 Doc Type: Enhancement
Doc Text:
C: Added the ability to detect node expected to be in the pool but aren't found (absent nodes)
C: Absent nodes were not detected
C: The condor_triggerd can detect absent nodes if ENABLE_ABSENT_NODES_DETECTION is set to TRUE
R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected

Release Note Entry:

Previously, _triggerd's C++ Console interface in Condor could not detect and report absent nodes because ENABLE_ABSENT_NODES_DETECTION was set to FALSE as a default. The ENABLE_ABSENT_NODES_DETECTION is now set to TRUE as a default in Condor, which allows _triggerd to raise an event for each node in wallaby that does not have a corresponding master qmf object.
Story Points: ---
Clone Of:
: 705325 Environment:
Last Closed: 2011-06-23 15:41:24 UTC Type: ---
Bug Depends On: 705343, 705722    
Bug Blocks: 693778, 705325    

Description Matthew Farrellee 2010-06-10 17:30:36 UTC
The triggerd's C++ Console interface is currently disabled because it cannot communicate with v2 Agents to find the set of existing Masters.

Without the ability to locate Masters, the triggerd cannot effectively implement its feature to report on absent nodes.

Comment 2 Robert Rati 2011-03-02 17:19:54 UTC
Fixed upstream. The feature requires a reachable configuration store (wallaby), so it is controlled by the ENABLE_ABSENT_NODES_DETECTION setting, which defaults to FALSE in condor.

Comment 3 Robert Rati 2011-03-15 17:38:46 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: The MRG Grid 2.0 added the ability to detect node expected to be in the pool but aren't found (absent nodes)
C: Absent nodes were not detected
C: The condor_triggerd can detect absent nodes if ENABLE_ABSENT_NODES_DETECTION is set to TRUE
R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected
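
The rule in the note above amounts to a set difference between the nodes configured in wallaby and the nodes for which a master object is seen. A minimal Python sketch, assuming both sets have already been fetched as plain hostname strings; the function names are hypothetical and this is not the triggerd's C++ code or any QMF API. The event wording follows the TriggerLog line quoted later in this report.

```python
def find_absent_nodes(configured_nodes, master_nodes):
    """Return configured nodes with no detected master object, sorted."""
    return sorted(set(configured_nodes) - set(master_nodes))


def absent_node_events(configured_nodes, master_nodes):
    """Build one event text per absent node."""
    return [f"{node} is missing from the pool"
            for node in find_absent_nodes(configured_nodes, master_nodes)]


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    configured = ["node1.example.com", "node2.example.com"]
    detected = ["node1.example.com"]
    for text in absent_node_events(configured, detected):
        print(text)
```

Hostnames are compared exactly here; a real comparison might need to normalize case or domain suffixes first.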

Comment 4 Matthew Farrellee 2011-03-15 17:55:47 UTC
    Technical note updated.
    
    Diffed Contents:
@@ -1,4 +1,4 @@
-C: The MRG Grid 2.0 added the ability to detect node expected to be in the pool but aren't found (absent nodes)
+C: Added the ability to detect node expected to be in the pool but aren't found (absent nodes)
 C: Absent nodes were not detected
 C: The condor_triggerd can detect absent nodes if ENABLE_ABSENT_NODES_DETECTION is set to TRUE
 R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected

Comment 8 Tomas Rusnak 2011-05-17 12:31:32 UTC
A new bug, bz705325, was created for RHEL6 based on this one because of a dependent QMF issue.

Retested on RHEL5/x86_64,x86: 

ruby-wallaby-0.10.5-4.el5
condor-wallaby-tools-4.0-6.el5
qpid-cpp-server-devel-0.10-6.el5
qpid-qmf-devel-0.10-6.el5
wallaby-utils-0.10.5-4.el5
qpid-qmf-0.10-6.el5
qpid-cpp-client-devel-docs-0.10-6.el5
qpid-tools-0.10-4.el5
python-condorutils-1.5-3.el5
ruby-qpid-qmf-0.10-6.el5
qpid-cpp-client-0.10-6.el5
qpid-cpp-server-cluster-0.10-6.el5
qpid-cpp-server-store-0.10-6.el5
qpid-java-common-0.10-4.el5
condor-7.6.1-0.4.el5
python-wallabyclient-4.0-6.el5
condor-wallaby-base-db-1.12-1.el5
python-qpid-qmf-0.10-6.el5
qpid-cpp-client-ssl-0.10-6.el5
qpid-cpp-server-xml-0.10-6.el5
condor-classads-7.6.1-0.4.el5
qpid-java-client-0.10-4.el5
condor-wallaby-client-4.0-6.el5
qpid-cpp-server-ssl-0.10-6.el5
qpid-java-example-0.10-4.el5
wallaby-0.10.5-4.el5
python-qpid-0.10-1.el5
qpid-cpp-server-0.10-6.el5
qpid-cpp-client-devel-0.10-6.el5
condor-qmf-7.6.1-0.4.el5

Config:

CREATE_CORE_FILES=True
ABORT_ON_EXCEPTION=True

QMF_BROKER_HOST=localhost
ALL_DEBUG=D_FULLDEBUG
CONFIGD_ARGS = -d

ALLOW_WRITE = *
ALLOW_READ = *
ALLOW_NEGOTIATOR = *
ALLOW_ADMINISTRATOR_READ = *

STARTD_CRON_NAME = TRIGGER_DATA
STARTD_CRON_AUTOPUBLISH = If_Changed
TRIGGER_DATA_JOBLIST = GetData
TRIGGER_DATA_GETDATA_PREFIX = Triggerd
TRIGGER_DATA_GETDATA_EXECUTABLE = $(BIN)/get_trigger_data
TRIGGER_DATA_GETDATA_PERIOD = 5m
TRIGGER_DATA_GETDATA_RECONFIG = FALSE

DAEMON_LIST = $(DAEMON_LIST),  TRIGGERD
ENABLE_ABSENT_NODES_DETECTION=True
DC_DAEMON_LIST = $(DAEMON_LIST)

QMF_BROKER_AUTH_MECH = ANONYMOUS
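
For this test to exercise the feature, the config above has to both enable ENABLE_ABSENT_NODES_DETECTION and add TRIGGERD to DAEMON_LIST. A rough Python sketch of that check, assuming a simplified NAME = VALUE parser: it does not expand $(...) macros, it upper-cases names as a simplification, and the helper names are hypothetical. The FALSE default is taken from comment 2.

```python
import re


def parse_config(text):
    """Parse simple NAME = VALUE lines; later assignments override earlier ones."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().upper()] = value.strip()
    return settings


def absent_detection_ready(settings):
    """True if detection is enabled and TRIGGERD appears in DAEMON_LIST."""
    enabled = settings.get("ENABLE_ABSENT_NODES_DETECTION", "FALSE").upper() == "TRUE"
    daemons = re.split(r"[,\s]+", settings.get("DAEMON_LIST", ""))
    return enabled and "TRIGGERD" in daemons


if __name__ == "__main__":
    sample = "DAEMON_LIST = $(DAEMON_LIST),  TRIGGERD\nENABLE_ABSENT_NODES_DETECTION=True\n"
    print(absent_detection_ready(parse_config(sample)))
```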

qpid: list
Summary of Objects by Type:
    Package          Class                 Active  Deleted
    ========================================================
    com.redhat.grid  condortriggerservice  1       0
    com.redhat.grid  master                1       0
    com.redhat.grid  negotiator            1       0
    com.redhat.grid  collector             1       0

# condor_trigger_config -i `hostname`
Connecting to broker 'hostname'...
Initializing, adding default triggers...
Adding trigger 'High CPU Usage'...
Adding trigger 'Low Free Mem'...
Adding trigger 'Low Free Disk Space (/)'...
Adding trigger 'Busy and Swapping'...
Adding trigger 'Busy but Idle'...
Adding trigger 'Idle for long time'...
Adding trigger 'Logs with ERROR entries'...
Adding trigger 'Logs with error entries'...
Adding trigger 'Logs with DENIED entries'...
Adding trigger 'Logs with denied entries'...
Adding trigger 'Logs with WARNING entries'...
Adding trigger 'Logs with warning entries'...
Adding trigger 'dprintf Logs'...
Adding trigger 'Logs with stack dumps'...
Adding trigger 'Core Files'...

TriggerLog:
05/17/11 14:49:24 Triggerd::AddTriggerToCollection called
05/17/11 14:49:24 Triggerd::AddTriggerToCollection exited with return value 0
05/17/11 14:49:24 Triggerd::config called
05/17/11 14:49:24 Triggerd::SetInterval called
05/17/11 14:49:24 Triggerd: Registered PerformQueries() to evaluate triggers every 10 seconds
05/17/11 14:49:24 Updating collector every 300 seconds
05/17/11 14:49:24 Will use UDP to update collector rhel5_64.mrg-qe-12.lab.eng.brq.redhat.com <IP:9618>
05/17/11 14:49:24 DaemonCore: in SendAliveToParent()
05/17/11 14:49:24 Initialized the following authorization table:
05/17/11 14:49:24 Authorizations yet to be resolved:
05/17/11 14:49:24 allow ADMINISTRATOR:  */IP */IP */IP */hostname */hostname
05/17/11 14:49:24 allow OWNER:  */IP */IP */IP */IP */hostname */hostname */hostname
05/17/11 14:49:24 Completed DC_CHILDALIVE to daemon at <IP:55842>
05/17/11 14:49:24 DaemonCore: Leaving SendAliveToParent() - success
05/17/11 14:49:24 Triggerd::UpdateCollector called
05/17/11 14:49:24 Trying to update collector <IP:9618>
05/17/11 14:49:24 Attempting to send update via UDP to collector hostname <IP:9618>
05/17/11 14:49:34 Triggerd: Evaluating 15 triggers
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful.  Parsing results
05/17/11 14:49:34 Parsing trigger text '$(Machine) has $(TriggerdCondorLogCapitalErrorCount) ERROR messages in the following log files: $(TriggerdCondorLogCapitalError)'
05/17/11 14:49:34 Adding text string prior to variable substitution to event text
Stack dump for process 12272 at timestamp 1305636574 (10 frames)
condor_triggerd(dprintf_dump_stack+0x56)[0x529986]
condor_triggerd[0x51f662]
/lib64/libpthread.so.0[0x353020eb10]
condor_triggerd(_ZN3com6redhat4grid8Triggerd8RemoveWSEPKc+0xc)[0x46904c]
condor_triggerd(_ZN3com6redhat4grid8Triggerd14PerformQueriesEv+0x398)[0x46b608]
condor_triggerd(_ZN12TimerManager7TimeoutEv+0x155)[0x49abb5]
condor_triggerd(_ZN10DaemonCore6DriverEv+0x248)[0x4853b8]
condor_triggerd(main+0xe57)[0x4993a7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x352f61d994]
condor_triggerd[0x464fe9]

No core file was generated.
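
The mangled C++ frames in the stack dump above can be read without a debugger. The sketch below decodes only the simple `_ZN<length><name>...E` nested-name form of Itanium C++ ABI mangling, which is all this dump uses; templates, substitutions and parameter types are ignored, and the helper names are hypothetical. Where it is installed, `c++filt` does the complete job.

```python
import re

# A frame such as "condor_triggerd(_ZN...+0xc)[0x46904c]" carries the symbol in parentheses.
FRAME_RE = re.compile(r"\((_Z\w+)\+0x[0-9a-fA-F]+\)")


def nested_name(mangled):
    """Decode the simple `_ZN<len><id>...E` form; return None for anything else."""
    if not mangled.startswith("_ZN"):
        return None
    pos, parts = 3, []
    while pos < len(mangled) and mangled[pos] != "E":
        match = re.match(r"\d+", mangled[pos:])
        if match is None:
            return None  # templates, substitutions, etc. are out of scope
        length = int(match.group())
        start = pos + len(match.group())
        parts.append(mangled[start:start + length])
        pos = start + length
    return "::".join(parts)


def cxx_frames(lines):
    """Return decoded names for the mangled C++ frames found in the dump lines."""
    names = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            name = nested_name(match.group(1))
            if name:
                names.append(name)
    return names
```

Applied to the dump above, the mangled frames decode to com::redhat::grid::Triggerd::RemoveWS, com::redhat::grid::Triggerd::PerformQueries, TimerManager::Timeout and DaemonCore::Driver, which suggests the fault occurred in RemoveWS while PerformQueries was running from a timer.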

# ps ax | grep condor
 7433 pts/0    S+     0:00 grep condor
22730 ?        Ssl    0:03 condor_master -pidfile /var/run/condor/condor_master.pid
22734 ?        Ssl    0:01 condor_collector -f
22737 ?        Ssl    0:00 condor_negotiator -f
22738 ?        Ssl    0:00 condor_schedd -f
22739 ?        Ssl    0:00 condor_startd -f
22741 ?        S      0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 64

condor_triggerd is down - MasterLog:

05/17/11 15:20:28 DaemonCore: No more children processes to reap.
05/17/11 15:20:28 The TRIGGERD (pid 20398) died due to signal 11 (Segmentation fault)

A new Bugzilla entry, bz705343, was created for this error and added as a blocker.

Comment 9 Tomas Rusnak 2011-05-18 09:43:58 UTC
Retested with current packages on RHEL5/x86,x86_64:

ruby-wallaby-0.10.5-4.el5
condor-wallaby-tools-4.0-6.el5
qpid-java-common-0.10-6.el5
qpid-qmf-devel-0.10-6.el5
wallaby-utils-0.10.5-4.el5
qpid-qmf-0.10-6.el5
qpid-cpp-client-ssl-0.10-7.el5
qpid-cpp-server-cluster-0.10-7.el5
python-condorutils-1.5-3.el5
ruby-qpid-qmf-0.10-6.el5
qpid-cpp-client-0.10-7.el5
condor-7.6.1-0.5.el5
qpid-cpp-server-ssl-0.10-7.el5
qpid-java-client-0.10-6.el5
qpid-java-example-0.10-6.el5
python-wallabyclient-4.0-6.el5
condor-wallaby-base-db-1.12-1.el5
python-qpid-qmf-0.10-6.el5
qpid-cpp-server-0.10-7.el5
qpid-cpp-client-devel-0.10-7.el5
qpid-cpp-server-store-0.10-7.el5
qpid-cpp-server-devel-0.10-7.el5
qpid-tools-0.10-5.el5
condor-wallaby-client-4.0-6.el5
condor-classads-7.6.1-0.5.el5
qpid-cpp-server-xml-0.10-7.el5
wallaby-0.10.5-4.el5
python-qpid-0.10-1.el5
condor-qmf-7.6.1-0.5.el5

# kill -9 `pidof condor_master`

# tail -f /var/log/condor/TriggerLog | grep -i missing
05/18/11 12:38:50 Triggerd: Found 1 missing nodes
05/18/11 12:38:50 Triggerd: Raised event with text 'hostname is missing from the pool'

>>> VERIFIED
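
The grep-based check above could be scripted for repeated runs. A small Python sketch, assuming the TriggerLog keeps exactly the two message formats shown in this comment; the regular expressions and function name are hypothetical and not part of condor.

```python
import re

COUNT_RE = re.compile(r"Triggerd: Found (\d+) missing nodes")
EVENT_RE = re.compile(r"Raised event with text '(.+) is missing from the pool'")


def missing_node_report(lines):
    """Return (total of reported counts, node names taken from raised events)."""
    reported, nodes = 0, []
    for line in lines:
        count = COUNT_RE.search(line)
        if count:
            reported += int(count.group(1))
        event = EVENT_RE.search(line)
        if event:
            nodes.append(event.group(1))
    return reported, nodes


if __name__ == "__main__":
    log = [
        "05/18/11 12:38:50 Triggerd: Found 1 missing nodes",
        "05/18/11 12:38:50 Triggerd: Raised event with text 'hostname is missing from the pool'",
    ]
    reported, nodes = missing_node_report(log)
    # The check passes when every reported missing node also raised an event.
    print(reported == len(nodes), nodes)
```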

Comment 10 Misha H. Ali 2011-05-30 05:08:24 UTC
    Technical note updated.
    
    Diffed Contents:
@@ -1,4 +1,8 @@
 C: Added the ability to detect node expected to be in the pool but aren't found (absent nodes)
 C: Absent nodes were not detected
 C: The condor_triggerd can detect absent nodes if ENABLE_ABSENT_NODES_DETECTION is set to TRUE
-R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected
+R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected
+
+Release Note Entry:
+
+Previously, _triggerd's C++ Console interface in Condor could not detect and report absent nodes because ENABLE_ABSENT_NODES_DETECTION was set to FALSE as a default. The ENABLE_ABSENT_NODES_DETECTION is now set to TRUE as a default in Condor, which allows _triggerd to raise an event for each node in wallaby that does not have a corresponding master qmf object.

Comment 11 Misha H. Ali 2011-06-06 03:24:35 UTC
Technical note can be viewed in the release notes for 2.0 at the documentation stage here:

http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2.0/html-single/MRG_Release_Notes/index.html#tabl-MRG_Release_Notes-GRID_Update_Notes-RHM_Known_Issues

Comment 12 errata-xmlrpc 2011-06-23 15:41:24 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html