Bug 178436 - network service interruption can cause initgroups() to delay cron job execution by more than one minute
Summary: network service interruption can cause initgroups() to delay cron job executi...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: vixie-cron
Version: 3.0
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Jason Vas Dias
QA Contact: Brock Organ
URL:
Whiteboard:
Depends On:
Blocks: 168424
TreeView+ depends on / blocked
 
Reported: 2006-01-20 15:51 UTC by Jason Vas Dias
Modified: 2018-12-02 16:08 UTC (History)
1 user (show)

Fixed In Version: RHSA-2006-0117
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-15 15:31:34 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2006:0117 0 qe-ready SHIPPED_LIVE Low: vixie-cron security update 2006-03-15 05:00:00 UTC

Description Jason Vas Dias 2006-01-20 15:51:25 UTC
Description of problem:

During errata QA testing of vixie-cron-4.1-8.EL3, it was found that in 4 days
of running crond, bug 159800 could occur ( the same per-minute job could be
run more than once per minute ) if the initgroups() call done in the job
process is delayed, eg. by an interruption in network service if the 
/etc/nsswitch.conf file contained a network lookup method for the groups
database.

When a run of a job to be run every minute is delayed for more than one minute
by authentication / setuid / setgid / initgroups, crond should skip that run
of the per-minute job.

Here's an example of this happening during QA testing of errata RHSA-2006:0117:

crond performs these steps when running a job:

1. Forks "minder" process, which opens the PAM session, and PAM emits messages
of the form:

   Jan 13 21:22:01 bender crond(pam_unix)[26072]: session opened for user root
by (uid=0)

   to /var/log/messages

2. "minder" process forks the "job" process, which performs the LAuS logging (if
LAuS is enabled),
   and then emits a message of the form:

   Jan 13 21:22:01 bender crond[26073]: (root) CMD (/bin/date +%s >>
/tmp/159800.test.log)

   to /var/log/cron .

3. "job" process does  'setgid(...);  initgroups(); ... setuid(...);' and then
execve()s the job command.

4. "minder" process waits for "job" process to exit, and then closes the PAM
session, which makes PAM
   emit messages of the form:

   Jan 13 21:25:10 bender crond(pam_unix)[26072]: session closed for user root

   to /var/log/messages.

As shown with the messages above, when the failure occurs, a job that reaches
step (2) in one minute, is delayed
for more than a minute before the execve() is done - so it must be the
    'setgid(...);  initgroups(); ... setuid(...);'
that is potentially delaying the job for more than one minute.

All the jobs which fail to run in separate minutes complete Step(2) above at the
correct time:

 Failure 1:
 FAILED at line 860: 1137205474 (Fri Jan 13 21:24:34 2006) is in same minute as
1137205462 (Fri Jan 13 21:24:22 2006)

   Jan 13 21:23:01 bender crond(pam_unix)[26074]: session opened for user root
by (uid=0)
   Jan 13 21:23:01 bender crond[26075]: (root) CMD (/bin/date +%s >>
/tmp/159800.test.log)
   Jan 13 21:24:01 bender crond(pam_unix)[26076]: session opened for user root
by (uid=0)
   Jan 13 21:24:01 bender crond[26077]: (root) CMD (/bin/date +%s >>
/tmp/159800.test.log)
   (
       26077 job delayed by 21 seconds, and actually runs here, logging 1137205462
   )
   Jan 13 21:24:22 bender crond(pam_unix)[26076]: session closed for user root
   (
       26075 job delayed by 1min,23seconds, now runs, logging 1137205474
   )
   Jan 13 21:24:34 bender crond(pam_unix)[26074]: session closed for user root

   Times logged:
   Fri Jan 13 21:24:22 2006
   Fri Jan 13 21:24:34 2006

 Failure 2:
 FAILED at line 862: 1137205510 (Fri Jan 13 21:25:10 2006) is in same minute as
1137205501 (Fri Jan 13 21:25:01 2006)

   Jan 13 21:22:01 bender crond(pam_unix)[26072]: session opened for user root
by (uid=0)
   Jan 13 21:22:01 bender crond[26073]: (root) CMD (/bin/date +%s >>
/tmp/159800.test.log)
   (
       Meanwhile, Failure(1) occurs, and the job for 21:25 is scheduled and run
normally,
       as is the mrtg job
   )
   Jan 13 21:25:01 bender crond(pam_unix)[26090]: session opened for user root
by (uid=0)
   Jan 13 21:25:01 bender crond[26093]: (root) CMD (/bin/date +%s >>
/tmp/159800.test.log)
   Jan 13 21:25:02 bender crond(pam_unix)[26090]: session closed for user root
   (
       job 26073 actually runs here, and logs 1137205510 (Fri Jan 13 21:25:10
2006) -
       this job was meant to run at 21:22:01, but was delayed more than 3 minutes!
   )
   Jan 13 21:25:10 bender crond(pam_unix)[26072]: session closed for user root

   Times logged:
   Fri Jan 13 21:25:01 2006
   Fri Jan 13 21:25:10 2006


Version-Release number of selected component (if applicable):
vixie-cron-4.1-8.EL3

How reproducible:
Very rarely

Steps to Reproduce:
Make network authentication services time out during run of a cron job
  
Actual results:
A per-minute job is delayed into the next minute

Expected results:
A per-minute cron job that gets delayed into the next minute by network
timeouts should be skipped .

Comment 1 Jason Vas Dias 2006-01-20 16:09:12 UTC
fixed with vixie-cron-4.1-10.EL3

Comment 7 Red Hat Bugzilla 2006-03-15 15:31:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0117.html


Comment 8 Need Real Name 2006-03-31 14:10:26 UTC
The fix for this bug does not take account of situations where the crontab entry
is like this:

*/10 * * * * root /usr/local/sbin/CSFsure >/dev/null

With such entries the cron job is still occasionally suppressed; here is output
of the command "grep delay /var/log/cron":

Mar 28 07:51:34 pat crond[2272]: (root) error: Job execution of per-minute job
scheduled for 07:50 delayed into subsequent minute 07:51. Skipping job run.
Mar 29 10:33:10 pat crond[22412]: (root) error: Job execution of per-minute job
scheduled for 10:30 delayed into subsequent minute 10:33. Skipping job run.
Mar 29 10:33:10 pat crond[22413]: (root) error: Job execution of per-minute job
scheduled for 10:30 delayed into subsequent minute 10:33. Skipping job run.

I have looked at the source and think that the problem lies in the test in
routine do.command.c which works for all cron commands with * in the crontab
minutes field and does not exclude those that have a step count as well.

Comment 9 Jason Vas Dias 2006-04-06 23:47:31 UTC
RE: Comment #8 :

Sorry that some of your cron jobs were skipped.

But, the rationale was, jobs that are scheduled to be run at any specific minute
should not be run if execution is delayed by eg. network authentication until
the next minute. Many programs, eg. logwatch / logrotate, have problems if they
are meant to be run in distinct minutes, but are run in the same minute . 

I've just re-tested 'skip minute' jobs (ie. '*/2...' '*/10...'), and they all 
ran fine for me over a 48 hour period - none were skipped.

Your skipped jobs were delayed by network authentication for over one
minute (and in one case 3 minutes) . The time interval measurement in
the code is quite precise - it measures the time between authentication
start and the job being run, and for jobs scheduled to be run at a 
specific minute, if the run cannot occur until the next minute, the
run is skipped. There is nothing else in the code that could cause
such a delay - only a network authentication method (eg. NIS, LDAP, Kerberos)
experiencing a timeout could do this. If your skip had been '*/2', then
the job that was delayed by 3 minutes could have been run in the same minute
as a subsequent job run - this can cause real problems for some cron jobs.
To make the code second guess if a 'skip minute' job will conflict or not
would be horrendously complex and wrong - if jobs are to be run in a specific
minute, they must be run in that minute or not at all.

I am working on a better solution, to enhance cron to provide a class of
'exclusive' jobs that cannot be run if a previous instance is running; then
we could remove the potential to skip per-minute jobs if they are delayed
past the minute for which they were scheduled.

I think it is better for cron not to run a job scheduled to be run in
a specific minute if the job cannot be run in that minute . 

To avoid the potential for network timeouts to cause cron jobs to be skipped,
start crond when /etc/nsswitch.conf contains only :
 'passwd: files
  group:  files
 '
Then cron will use only the 'files' lookup method for users and groups, even
if /etc/nsswitch.conf is later changed to use network lookup methods.

Comment 10 Need Real Name 2006-05-04 10:27:26 UTC
Further to your last comment, it seems to me that if it is important only one
instance of a cron job is run at any one time, then the cron job itself should
have some sort of check to prevent this happening, e.g. a lock file of some sort.

We have worked around the problem introduced by your recent change by using
syntax like 0-59/5 in the minutes field of the crontab entry rather than */5,
but we could have used 0,5,10... etc.  This shows that a check to prevent more
than one instance of a cron job running is (as you say) very diffcult to generalise.


Note You need to log in before you can comment on or make changes to this bug.