The triggerd's C++ Console interface is currently disabled because it cannot communicate with v2 Agents to find the set of existing Masters. Without the ability to locate Masters, the triggerd cannot effectively implement its feature to report on absent nodes.
Fixed upstream. A configuration store (wallaby) needs to be contactable, so the feature is controlled by setting ENABLE_ABSENT_NODES_DETECTION which defaults to false in condor.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: The MRG Grid 2.0 added the ability to detect node expected to be in the pool but aren't found (absent nodes)
C: Absent nodes were not detected
C: The condor_triggerd can detect absent nodes if ENABLE_ABSENT_NODES_DETECTION is set to TRUE
R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1,4 @@
-C: The MRG Grid 2.0 added the ability to detect node expected to be in the pool but aren't found (absent nodes)
+C: Added the ability to detect node expected to be in the pool but aren't found (absent nodes)
 C: Absent nodes were not detected
 C: The condor_triggerd can detect absent nodes if ENABLE_ABSENT_NODES_DETECTION is set to TRUE
 R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected
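The result described in the technical note amounts to a set difference: nodes configured in wallaby minus nodes for which a master qmf object is visible. The following is only a minimal sketch of that comparison, not the condor_triggerd implementation. The function names and the plain lists of hostnames are hypothetical; the real daemon obtains these sets from wallaby and from QMF. The event text follows the wording seen in the TriggerLog during verification.

```python
def find_absent_nodes(configured_nodes, master_nodes):
    """Return configured node names that have no matching master object.

    Both arguments are plain iterables of hostnames (an assumption made
    for this sketch; the real daemon queries wallaby and QMF instead).
    """
    present = set(master_nodes)
    return sorted(node for node in configured_nodes if node not in present)


def absent_node_events(configured_nodes, master_nodes):
    """Build one event text per absent node."""
    return [
        f"{node} is missing from the pool"
        for node in find_absent_nodes(configured_nodes, master_nodes)
    ]


if __name__ == "__main__":
    # Hypothetical pool: three nodes configured, one master reporting.
    configured = ["node1.example.com", "node2.example.com", "node3.example.com"]
    masters = ["node2.example.com"]
    for text in absent_node_events(configured, masters):
        print(text)
```

Nodes that report a master but are not configured in wallaby are ignored here, since the note only describes events for configured nodes that are not detected.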
New bug created for RHEL6 based on this, due to a dependent issue with QMF: bz705325

Retested on RHEL5/x86_64,x86:
ruby-wallaby-0.10.5-4.el5
condor-wallaby-tools-4.0-6.el5
qpid-cpp-server-devel-0.10-6.el5
qpid-qmf-devel-0.10-6.el5
wallaby-utils-0.10.5-4.el5
qpid-qmf-0.10-6.el5
qpid-cpp-client-devel-docs-0.10-6.el5
qpid-tools-0.10-4.el5
python-condorutils-1.5-3.el5
ruby-qpid-qmf-0.10-6.el5
qpid-cpp-client-0.10-6.el5
qpid-cpp-server-cluster-0.10-6.el5
qpid-cpp-server-store-0.10-6.el5
qpid-java-common-0.10-4.el5
condor-7.6.1-0.4.el5
python-wallabyclient-4.0-6.el5
condor-wallaby-base-db-1.12-1.el5
python-qpid-qmf-0.10-6.el5
qpid-cpp-client-ssl-0.10-6.el5
qpid-cpp-server-xml-0.10-6.el5
condor-classads-7.6.1-0.4.el5
qpid-java-client-0.10-4.el5
condor-wallaby-client-4.0-6.el5
qpid-cpp-server-ssl-0.10-6.el5
qpid-java-example-0.10-4.el5
wallaby-0.10.5-4.el5
python-qpid-0.10-1.el5
qpid-cpp-server-0.10-6.el5
qpid-cpp-client-devel-0.10-6.el5
condor-qmf-7.6.1-0.4.el5

Config:
CREATE_CORE_FILES=True
ABORT_ON_EXCEPTION=True
QMF_BROKER_HOST=localhost
ALL_DEBUG=D_FULLDEBUG
CONFIGD_ARGS = -d
ALLOW_WRITE = *
ALLOW_READ = *
ALLOW_NEGOTIATOR = *
ALLOW_ADMINISTRATOR_READ = *
STARTD_CRON_NAME = TRIGGER_DATA
STARTD_CRON_AUTOPUBLISH = If_Changed
TRIGGER_DATA_JOBLIST = GetData
TRIGGER_DATA_GETDATA_PREFIX = Triggerd
TRIGGER_DATA_GETDATA_EXECUTABLE = $(BIN)/get_trigger_data
TRIGGER_DATA_GETDATA_PERIOD = 5m
TRIGGER_DATA_GETDATA_RECONFIG = FALSE
DAEMON_LIST = $(DAEMON_LIST), TRIGGERD
ENABLE_ABSENT_NODES_DETECTION=True
DC_DAEMON_LIST = $(DAEMON_LIST)
QMF_BROKER_AUTH_MECH = ANONYMOUS

qpid: list
Summary of Objects by Type:
    Package          Class                 Active  Deleted
    ========================================================
    com.redhat.grid  condortriggerservice  1       0
    com.redhat.grid  master                1       0
    com.redhat.grid  negotiator            1       0
    com.redhat.grid  collector             1       0

# condor_trigger_config -i `hostname`
Connecting to broker 'hostname'...
Initializing, adding default triggers...
Adding trigger 'High CPU Usage'...
Adding trigger 'Low Free Mem'...
Adding trigger 'Low Free Disk Space (/)'...
Adding trigger 'Busy and Swapping'...
Adding trigger 'Busy but Idle'...
Adding trigger 'Idle for long time'...
Adding trigger 'Logs with ERROR entries'...
Adding trigger 'Logs with error entries'...
Adding trigger 'Logs with DENIED entries'...
Adding trigger 'Logs with denied entries'...
Adding trigger 'Logs with WARNING entries'...
Adding trigger 'Logs with warning entries'...
Adding trigger 'dprintf Logs'...
Adding trigger 'Logs with stack dumps'...
Adding trigger 'Core Files'...

TriggerLog:
05/17/11 14:49:24 Triggerd::AddTriggerToCollection called
05/17/11 14:49:24 Triggerd::AddTriggerToCollection exited with return value 0
05/17/11 14:49:24 Triggerd::config called
05/17/11 14:49:24 Triggerd::SetInterval called
05/17/11 14:49:24 Triggerd: Registered PerformQueries() to evaluate triggers every 10 seconds
05/17/11 14:49:24 Updating collector every 300 seconds
05/17/11 14:49:24 Will use UDP to update collector rhel5_64.mrg-qe-12.lab.eng.brq.redhat.com <IP:9618>
05/17/11 14:49:24 DaemonCore: in SendAliveToParent()
05/17/11 14:49:24 Initialized the following authorization table:
05/17/11 14:49:24 Authorizations yet to be resolved:
05/17/11 14:49:24 allow ADMINISTRATOR: */IP */IP */IP */hostname */hostname
05/17/11 14:49:24 allow OWNER: */IP */IP */IP */IP */hostname */hostname */hostname
05/17/11 14:49:24 Completed DC_CHILDALIVE to daemon at <IP:55842>
05/17/11 14:49:24 DaemonCore: Leaving SendAliveToParent() - success
05/17/11 14:49:24 Triggerd::UpdateCollector called
05/17/11 14:49:24 Trying to update collector <IP:9618>
05/17/11 14:49:24 Attempting to send update via UDP to collector hostname <IP:9618>
05/17/11 14:49:34 Triggerd: Evaluating 15 triggers
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful. Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful. Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful. Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful. Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful. Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful. Parsing results
05/17/11 14:49:34 Trying to query collector <IP:9618>
05/17/11 14:49:34 Query successful. Parsing results
05/17/11 14:49:34 Parsing trigger text '$(Machine) has $(TriggerdCondorLogCapitalErrorCount) ERROR messages in the following log files: $(TriggerdCondorLogCapitalError)'
05/17/11 14:49:34 Adding text string prior to variable substitution to event text
Stack dump for process 12272 at timestamp 1305636574 (10 frames)
condor_triggerd(dprintf_dump_stack+0x56)[0x529986]
condor_triggerd[0x51f662]
/lib64/libpthread.so.0[0x353020eb10]
condor_triggerd(_ZN3com6redhat4grid8Triggerd8RemoveWSEPKc+0xc)[0x46904c]
condor_triggerd(_ZN3com6redhat4grid8Triggerd14PerformQueriesEv+0x398)[0x46b608]
condor_triggerd(_ZN12TimerManager7TimeoutEv+0x155)[0x49abb5]
condor_triggerd(_ZN10DaemonCore6DriverEv+0x248)[0x4853b8]
condor_triggerd(main+0xe57)[0x4993a7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x352f61d994]
condor_triggerd[0x464fe9]

No core file was generated.

# ps ax | grep condor
 7433 pts/0    S+     0:00 grep condor
22730 ?        Ssl    0:03 condor_master -pidfile /var/run/condor/condor_master.pid
22734 ?        Ssl    0:01 condor_collector -f
22737 ?        Ssl    0:00 condor_negotiator -f
22738 ?        Ssl    0:00 condor_schedd -f
22739 ?        Ssl    0:00 condor_startd -f
22741 ?        S      0:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 64

condor_triggerd is down.

MasterLog:
05/17/11 15:20:28 DaemonCore: No more children processes to reap.
05/17/11 15:20:28 The TRIGGERD (pid 20398) died due to signal 11 (Segmentation fault)

New bugzilla created for this error and added as a blocker: bz705343
Retested over current packages on RHEL5/x86,x86_64:
ruby-wallaby-0.10.5-4.el5
condor-wallaby-tools-4.0-6.el5
qpid-java-common-0.10-6.el5
qpid-qmf-devel-0.10-6.el5
wallaby-utils-0.10.5-4.el5
qpid-qmf-0.10-6.el5
qpid-cpp-client-ssl-0.10-7.el5
qpid-cpp-server-cluster-0.10-7.el5
python-condorutils-1.5-3.el5
ruby-qpid-qmf-0.10-6.el5
qpid-cpp-client-0.10-7.el5
condor-7.6.1-0.5.el5
qpid-cpp-server-ssl-0.10-7.el5
qpid-java-client-0.10-6.el5
qpid-java-example-0.10-6.el5
python-wallabyclient-4.0-6.el5
condor-wallaby-base-db-1.12-1.el5
python-qpid-qmf-0.10-6.el5
qpid-cpp-server-0.10-7.el5
qpid-cpp-client-devel-0.10-7.el5
qpid-cpp-server-store-0.10-7.el5
qpid-cpp-server-devel-0.10-7.el5
qpid-tools-0.10-5.el5
condor-wallaby-client-4.0-6.el5
condor-classads-7.6.1-0.5.el5
qpid-cpp-server-xml-0.10-7.el5
wallaby-0.10.5-4.el5
python-qpid-0.10-1.el5
condor-qmf-7.6.1-0.5.el5

# kill -9 `pidof condor_master`
# tail -f /var/log/condor/TriggerLog | grep -i missing
05/18/11 12:38:50 Triggerd: Found 1 missing nodes
05/18/11 12:38:50 Triggerd: Raised event with text 'hostname is missing from the pool'

>>> VERIFIED
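The grep check used for verification could also be scripted. The sketch below is only an illustration that assumes the two TriggerLog message formats shown in this comment; the function name and the regular expressions are hypothetical and have not been checked against other condor versions.

```python
import re

# Patterns derived from the two TriggerLog lines quoted in this comment.
FOUND_RE = re.compile(r"Triggerd: Found (\d+) missing nodes")
EVENT_RE = re.compile(
    r"Triggerd: Raised event with text '(.+) is missing from the pool'"
)


def summarize_missing(lines):
    """Return (last reported missing count, hosts named in raised events)."""
    count = 0
    hosts = []
    for line in lines:
        found = FOUND_RE.search(line)
        if found:
            count = int(found.group(1))
            continue
        event = EVENT_RE.search(line)
        if event:
            hosts.append(event.group(1))
    return count, hosts


if __name__ == "__main__":
    # Log path as used in the verification steps of this comment.
    with open("/var/log/condor/TriggerLog") as log:
        print(summarize_missing(log))
```

The count returned is the most recent one reported in the log, while the host list accumulates every raised event, so a long-running log may list the same host more than once.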
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,4 +1,8 @@
 C: Added the ability to detect node expected to be in the pool but aren't found (absent nodes)
 C: Absent nodes were not detected
 C: The condor_triggerd can detect absent nodes if ENABLE_ABSENT_NODES_DETECTION is set to TRUE
-R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected
+R: If absent node detection is enabled, the condor_triggerd will raise an event for each node configured in wallaby for which a master qmf object is not detected
+
+Release Note Entry:
+
+Previously, _triggerd's C++ Console interface in Condor could not detect and report absent nodes because ENABLE_ABSENT_NODES_DETECTION was set to FALSE as a default. The ENABLE_ABSENT_NODES_DETECTION is now set to TRUE as a default in Condor, which allows _triggerd to raise an event for each node in wallaby that does not have a corresponding master qmf object.
Technical note can be viewed in the release notes for 2.0 at the documentation stage here: http://documentation-stage.bne.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2.0/html-single/MRG_Release_Notes/index.html#tabl-MRG_Release_Notes-GRID_Update_Notes-RHM_Known_Issues
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html