755765 – RFE: Gracefully handle MAX_..._LOG configuration errors

Bug 755765 - RFE: Gracefully handle MAX_..._LOG configuration errors

Summary: RFE: Gracefully handle MAX_..._LOG configuration errors

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	condor
Sub Component:
Version:	Development
Hardware:	All
OS:	All
Priority:	medium
Severity:	medium
Target Milestone:	2.3
Target Release:	---
Assignee:	Erik Erlandson
QA Contact:	Tomas Rusnak
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	850563
TreeView+	depends on / blocked

Reported:	2011-11-22 01:01 UTC by Matthew Farrellee
Modified:	2013-03-06 18:39 UTC (History)
CC List:	6 users (show)
Fixed In Version:	condor-7.8.2-0.1
Doc Type:	Enhancement
Doc Text:	Cause: dprintf configuration reads MAX_<SUBSYS>_LOG as an integer literal to bypass the classad expression limitation of 32-bit integers. Consequence: Configuring MAX_<SUBSYS>_LOG with an expression, such as "1024*1024", caused a call to EXCEPT(), which was intercepted by parent condor_master process and made failure hard to diagnose. Fix: The call to EXCEPT() was replaced with a call to _condor_dprintf_exit(), which leaves an informative message in the dprintf_failure.<SUBSYS> file in the log directory. Result: Configuration failures for MAX_<SUBSYS>_LOG are now more informative.
Clone Of:
Environment:
Last Closed:	2013-03-06 18:39:47 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Condor	2836	0	None	None	None	Never
Red Hat Product Errata	RHSA-2013:0564	0	normal	SHIPPED_LIVE	Low: Red Hat Enterprise MRG Grid 2.3 security update	2013-03-06 23:37:09 UTC

Internal Links: 752355

Description Matthew Farrellee 2011-11-22 01:01:22 UTC

If a user provides a bad value for MAX_<SUBSYS>_LOG, the <SUBSYS> may exit and not be able to log the fact that its log configuration is invalid. Currently, the only means of detection is exit status 4 from the negotiator in the MasterLog.

Some options -

 GOOD The condor_master already attempts to check to see <SUBSYS>_LOG is defined as a precaution. The condor_master could also attempt to detect the validity of MAX_<SUBSYS>_LOG.
 BETTER The <SUBSYS> itself could provide a reasonable default when given an invalid configuration, at least enough to log the error and exit. In the atoi days invalid meant infinite (MAX_..._LOG=0).
 BEST The <SUBSYS> could log the error to dprintf_failure.<SUBSYS>

Comment 1 Erik Erlandson 2012-02-24 20:54:59 UTC

TEST:

configure a daemon with max-log using an expression, e.g.:

MAX_COLLECTOR_LOG = 1024*1024


Now, start up a pool:

$ condor_master


Prior to the fix, you will see a silent failure of the daemon (in this case the collector) to start, with a return code in the master log.

After the fix, when the daemon fails, you should see an informative error in $(LOG)/dprintf_failure.<SUBSYS>:

$ ls
CollectorLog  dprintf_failure.COLLECTOR  InstanceLock  MasterLog
$ cat dprintf_failure.COLLECTOR
dprintf() had a fatal error in pid 11418
Invalid config MAX_COLLECTOR_LOG = 1024*1024: MAX_COLLECTOR_LOG must be an integer literal >= 0
errno: 22 (Invalid argument)
euid: 501, ruid: 0

Comment 2 Erik Erlandson 2012-02-24 20:59:48 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause:
dprintf configuration reads MAX_<SUBSYS>_LOG as an integer literal to bypass the classad expression limitation of 32-bit integers.

Consequence:
Configuring MAX_<SUBSYS>_LOG with an expression, such as "1024*1024", caused a call to EXCEPT(), which was intercepted by parent condor_master process and made failure hard to diagnose.

Fix:
The call to EXCEPT() was replaced with a call to _condor_dprintf_exit(), which leaves an informative message in the dprintf_failure.<SUBSYS> file in the log directory.

Result:
Configuration failures for MAX_<SUBSYS>_LOG are now more informative.

Comment 6 Martin Bukatovic 2013-01-11 18:34:43 UTC

This feature doesn't work with procd, using this configuration:

~~~
PROCD_LOG = $(LOG)/ProcLog
MAX_PROCD_LOG = 1024*1024
~~~

condor procd doesn't create any dprintf_failure.PROCD file and MAX_PROCD_LOG is ignored.

Tested on both condor-7.8.8-0.3.el5 and condor-7.8.8-0.3.el6.i686.

On the other hand I have verified that following components works:

MAX_COLLECTOR_LOG
MAX_MASTER_LOG
MAX_NEGOTIATOR_LOG
MAX_SCHEDD_LOG
MAX_STARTER_LOG
MAX_STARTD_LOG
MAX_SHADOW_LOG
MAX_NEGOTIATOR_LOG_MATCH

Was this behaviour implemented for each condor subystem?
Is the following subsystem list correct for this release (already verified subsystems are not included)?

C_GAHP
CKPT_SERVER
DBMSD
DEFRAG
EC2_GAHP
GRIDMANAGER
HAD
HDFS
JOB_ROUTER
KBDD
LEASEMANAGER
QUILL
REPLICATION
ROOSTER
SUBMIT
TOOL
TRANSFERER

and

TRIGGERD
VM_GAHP

Note: list of subsystems comes from documentation: http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#sec:Pre-Defined-Macros

Comment 7 Timothy St. Clair 2013-01-11 18:41:36 UTC

The PROCD is a separate entity that is not part of daemoncore.  

Also it's logging should be disabled by default.

Comment 8 Matthew Farrellee 2013-01-15 14:16:14 UTC

This is expected only to work for DC-based processes.

PROCD and TRIGGERD are not DC-based.

Some of the subsystems listed in comment 6 do not exist for us: DBMSD, HDFS, LEASEMANAGER, QUILL, C_GAHP

Priority should be to test a DC-based process, low priority for covering the full set.

Comment 9 Tomas Rusnak 2013-01-16 09:13:28 UTC

Bad value for MAX_MASTER_LOG:

# cat /var/log/condor/dprintf_failure.MASTER 
dprintf() had a fatal error in pid 17578
Invalid config MAX_MASTER_LOG = 300*1024*1024: MAX_MASTER_LOG must be an integer literal >= 0
errno: 22 (Invalid argument)
euid: 64, ruid: 0

The MASTER failed to start

Bad value for MAX_COLLECTOR_LOG:

# cat dprintf_failure.COLLECTOR
dprintf() had a fatal error in pid 17627
Invalid config MAX_COLLECTOR_LOG = 300*1024*1024: MAX_COLLECTOR_LOG must be an integer literal >= 0
errno: 22 (Invalid argument)
euid: 64, ruid: 0

# ps ax | grep condor
17597 ?        Ss     0:00 condor_master -pidfile /var/run/condor/condor_master.pid

MasterLog:
01/16/13 04:15:30 The COLLECTOR (pid 17694) exited with status 44
01/16/13 04:15:30 ProcAPI::buildFamily failed: parent 17694 not found on system.
01/16/13 04:15:30 restarting /usr/sbin/condor_collector in 3600 seconds
01/16/13 04:15:30 enter Daemons::UpdateCollector
01/16/13 04:15:30 Trying to update collector <IP:9618>
01/16/13 04:15:30 Attempting to send update via UDP to collector tr-rhel6-64.usersys.redhat.com <IP:9618>
01/16/13 04:15:30 exit Daemons::UpdateCollector
01/16/13 04:15:30 attempt to connect to <IP:9618> failed: Connection refused (connect errno = 111).
01/16/13 04:15:30 ERROR: SECMAN:2004:Failed to create security session to <IP:9618> with TCP.
|SECMAN:2003:TCP connection to <IP:9618> failed.
01/16/13 04:15:30 Failed to start non-blocking update to <IP:9618

The MASTER daemon periodically tries to restart COLLECTOR even if configuration is broken. Is it working as we are expecting?

Should be better to wait for daemons to start and let the master fail, too, in case of failure of deamoncore services due to wrong configuration.

In current situation the init script reports that services was started, which is not true.

Comment 10 Tomas Rusnak 2013-01-16 09:39:43 UTC

Due to LSB the init scripts should be atomic in start() and MASTER was started. But alone MASTER deamon without any other one (COLLECTOR in this case) is useless. That's why I'm little bit confused.

Comment 11 Erik Erlandson 2013-01-16 14:57:08 UTC

(In reply to comment #9)

> 
> The MASTER daemon periodically tries to restart COLLECTOR even if
> configuration is broken. Is it working as we are expecting?

That is the expected behavior.   Condor doesn't distinguish reasons for a daemon halting, with respect to making a decision if it should continue trying to restart.


> 
> Should be better to wait for daemons to start and let the master fail, too,
> in case of failure of deamoncore services due to wrong configuration.
> 
> In current situation the init script reports that services was started,
> which is not true.

Comment 12 Tomas Rusnak 2013-01-18 13:39:02 UTC

Retested on:
RH-7.8.8-0.3

All daemoncores services are reporting wrong settings in MAX_<subsystem>_LOG variable, now.

>>> VERIFIED

Comment 14 errata-xmlrpc 2013-03-06 18:39:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html

Note You need to log in before you can comment on or make changes to this bug.