If a user provides a bad value for MAX_<SUBSYS>_LOG, the <SUBSYS> may exit and not be able to log the fact that its log configuration is invalid. Currently, the only means of detection is exit status 4 from the negotiator in the MasterLog. Some options - GOOD The condor_master already attempts to check to see <SUBSYS>_LOG is defined as a precaution. The condor_master could also attempt to detect the validity of MAX_<SUBSYS>_LOG. BETTER The <SUBSYS> itself could provide a reasonable default when given an invalid configuration, at least enough to log the error and exit. In the atoi days invalid meant infinite (MAX_..._LOG=0). BEST The <SUBSYS> could log the error to dprintf_failure.<SUBSYS>
TEST: configure a daemon with max-log using an expression, e.g.: MAX_COLLECTOR_LOG = 1024*1024 Now, start up a pool: $ condor_master Prior to the fix, you will see a silent failure of the daemon (in this case the collector) to start, with a return code in the master log. After the fix, when the daemon fails, you should see an informative error in $(LOG)/dprintf_failure.<SUBSYS>: $ ls CollectorLog dprintf_failure.COLLECTOR InstanceLock MasterLog $ cat dprintf_failure.COLLECTOR dprintf() had a fatal error in pid 11418 Invalid config MAX_COLLECTOR_LOG = 1024*1024: MAX_COLLECTOR_LOG must be an integer literal >= 0 errno: 22 (Invalid argument) euid: 501, ruid: 0
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cause: dprintf configuration reads MAX_<SUBSYS>_LOG as an integer literal to bypass the classad expression limitation of 32-bit integers. Consequence: Configuring MAX_<SUBSYS>_LOG with an expression, such as "1024*1024", caused a call to EXCEPT(), which was intercepted by parent condor_master process and made failure hard to diagnose. Fix: The call to EXCEPT() was replaced with a call to _condor_dprintf_exit(), which leaves an informative message in the dprintf_failure.<SUBSYS> file in the log directory. Result: Configuration failures for MAX_<SUBSYS>_LOG are now more informative.
This feature doesn't work with procd, using this configuration: ~~~ PROCD_LOG = $(LOG)/ProcLog MAX_PROCD_LOG = 1024*1024 ~~~ condor procd doesn't create any dprintf_failure.PROCD file and MAX_PROCD_LOG is ignored. Tested on both condor-7.8.8-0.3.el5 and condor-7.8.8-0.3.el6.i686. On the other hand I have verified that following components works: MAX_COLLECTOR_LOG MAX_MASTER_LOG MAX_NEGOTIATOR_LOG MAX_SCHEDD_LOG MAX_STARTER_LOG MAX_STARTD_LOG MAX_SHADOW_LOG MAX_NEGOTIATOR_LOG_MATCH Was this behaviour implemented for each condor subystem? Is the following subsystem list correct for this release (already verified subsystems are not included)? C_GAHP CKPT_SERVER DBMSD DEFRAG EC2_GAHP GRIDMANAGER HAD HDFS JOB_ROUTER KBDD LEASEMANAGER QUILL REPLICATION ROOSTER SUBMIT TOOL TRANSFERER and TRIGGERD VM_GAHP Note: list of subsystems comes from documentation: http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#sec:Pre-Defined-Macros
The PROCD is a separate entity that is not part of daemoncore. Also it's logging should be disabled by default.
This is expected only to work for DC-based processes. PROCD and TRIGGERD are not DC-based. Some of the subsystems listed in comment 6 do not exist for us: DBMSD, HDFS, LEASEMANAGER, QUILL, C_GAHP Priority should be to test a DC-based process, low priority for covering the full set.
Bad value for MAX_MASTER_LOG: # cat /var/log/condor/dprintf_failure.MASTER dprintf() had a fatal error in pid 17578 Invalid config MAX_MASTER_LOG = 300*1024*1024: MAX_MASTER_LOG must be an integer literal >= 0 errno: 22 (Invalid argument) euid: 64, ruid: 0 The MASTER failed to start Bad value for MAX_COLLECTOR_LOG: # cat dprintf_failure.COLLECTOR dprintf() had a fatal error in pid 17627 Invalid config MAX_COLLECTOR_LOG = 300*1024*1024: MAX_COLLECTOR_LOG must be an integer literal >= 0 errno: 22 (Invalid argument) euid: 64, ruid: 0 # ps ax | grep condor 17597 ? Ss 0:00 condor_master -pidfile /var/run/condor/condor_master.pid MasterLog: 01/16/13 04:15:30 The COLLECTOR (pid 17694) exited with status 44 01/16/13 04:15:30 ProcAPI::buildFamily failed: parent 17694 not found on system. 01/16/13 04:15:30 restarting /usr/sbin/condor_collector in 3600 seconds 01/16/13 04:15:30 enter Daemons::UpdateCollector 01/16/13 04:15:30 Trying to update collector <IP:9618> 01/16/13 04:15:30 Attempting to send update via UDP to collector tr-rhel6-64.usersys.redhat.com <IP:9618> 01/16/13 04:15:30 exit Daemons::UpdateCollector 01/16/13 04:15:30 attempt to connect to <IP:9618> failed: Connection refused (connect errno = 111). 01/16/13 04:15:30 ERROR: SECMAN:2004:Failed to create security session to <IP:9618> with TCP. |SECMAN:2003:TCP connection to <IP:9618> failed. 01/16/13 04:15:30 Failed to start non-blocking update to <IP:9618 The MASTER daemon periodically tries to restart COLLECTOR even if configuration is broken. Is it working as we are expecting? Should be better to wait for daemons to start and let the master fail, too, in case of failure of deamoncore services due to wrong configuration. In current situation the init script reports that services was started, which is not true.
Due to LSB the init scripts should be atomic in start() and MASTER was started. But alone MASTER deamon without any other one (COLLECTOR in this case) is useless. That's why I'm little bit confused.
(In reply to comment #9) > > The MASTER daemon periodically tries to restart COLLECTOR even if > configuration is broken. Is it working as we are expecting? That is the expected behavior. Condor doesn't distinguish reasons for a daemon halting, with respect to making a decision if it should continue trying to restart. > > Should be better to wait for daemons to start and let the master fail, too, > in case of failure of deamoncore services due to wrong configuration. > > In current situation the init script reports that services was started, > which is not true.
Retested on: RH-7.8.8-0.3 All daemoncores services are reporting wrong settings in MAX_<subsystem>_LOG variable, now. >>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0564.html