Description of problem: During errata QA testing of vixie-cron-4.1-8.EL3, it was found that in 4 days of running crond, bug 159800 could occur ( the same per-minute job could be run more than once per minute ) if the initgroups() call done in the job process is delayed, eg. by an interruption in network service if the /etc/nsswitch.conf file contained a network lookup method for the groups database. When a run of a job to be run every minute is delayed for more than one minute by authentication / setuid / setgid / initgroups, crond should skip that run of the per-minute job. Here's an example of this happening during QA testing of errata RHSA-2006:0117: crond performs these steps when running a job: 1. Forks "minder" process, which opens the PAM session, and PAM emits messages of the form: Jan 13 21:22:01 bender crond(pam_unix)[26072]: session opened for user root by (uid=0) to /var/log/messages 2. "minder" process forks the "job" process, which performs the LAuS logging (if LAuS is enabled), and then emits a message of the form: Jan 13 21:22:01 bender crond[26073]: (root) CMD (/bin/date +%s >> /tmp/159800.test.log) to /var/log/cron . 3. "job" process does 'setgid(...); initgroups(); ... setuid(...);' and then execve()s the job command. 4. "minder" process waits for "job" process to exit, and then closes the PAM session, which makes PAM emit messages of the form: Jan 13 21:25:10 bender crond(pam_unix)[26072]: session closed for user root to /var/log/messages. As shown with the messages above, when the failure occurs, a job that reaches step (2) in one minute, is delayed for more than a minute before the execve() is done - so it must be the 'setgid(...); initgroups(); ... setuid(...);' that is potentially delaying the job for more than one minute. All the jobs which fail to run in separate minutes complete Step(2) above at the correct time: Failure 1: FAILED at line 860: 1137205474 (Fri Jan 13 21:24:34 2006) is in same minute as 1137205462 (Fri Jan 13 21:24:22 2006) Jan 13 21:23:01 bender crond(pam_unix)[26074]: session opened for user root by (uid=0) Jan 13 21:23:01 bender crond[26075]: (root) CMD (/bin/date +%s >> /tmp/159800.test.log) Jan 13 21:24:01 bender crond(pam_unix)[26076]: session opened for user root by (uid=0) Jan 13 21:24:01 bender crond[26077]: (root) CMD (/bin/date +%s >> /tmp/159800.test.log) ( 26077 job delayed by 21 seconds, and actually runs here, logging 1137205462 ) Jan 13 21:24:22 bender crond(pam_unix)[26076]: session closed for user root ( 26075 job delayed by 1min,23seconds, now runs, logging 1137205474 ) Jan 13 21:24:34 bender crond(pam_unix)[26074]: session closed for user root Times logged: Fri Jan 13 21:24:22 2006 Fri Jan 13 21:24:34 2006 Failure 2: FAILED at line 862: 1137205510 (Fri Jan 13 21:25:10 2006) is in same minute as 1137205501 (Fri Jan 13 21:25:01 2006) Jan 13 21:22:01 bender crond(pam_unix)[26072]: session opened for user root by (uid=0) Jan 13 21:22:01 bender crond[26073]: (root) CMD (/bin/date +%s >> /tmp/159800.test.log) ( Meanwhile, Failure(1) occurs, and the job for 21:25 is scheduled and run normally, as is the mrtg job ) Jan 13 21:25:01 bender crond(pam_unix)[26090]: session opened for user root by (uid=0) Jan 13 21:25:01 bender crond[26093]: (root) CMD (/bin/date +%s >> /tmp/159800.test.log) Jan 13 21:25:02 bender crond(pam_unix)[26090]: session closed for user root ( job 26073 actually runs here, and logs 1137205510 (Fri Jan 13 21:25:10 2006) - this job was meant to run at 21:22:01, but was delayed more than 3 minutes! ) Jan 13 21:25:10 bender crond(pam_unix)[26072]: session closed for user root Times logged: Fri Jan 13 21:25:01 2006 Fri Jan 13 21:25:10 2006 Version-Release number of selected component (if applicable): vixie-cron-4.1-8.EL3 How reproducible: Very rarely Steps to Reproduce: Make network authentication services time out during run of a cron job Actual results: A per-minute job is delayed into the next minute Expected results: A per-minute cron job that gets delayed into the next minute by network timeouts should be skipped .
fixed with vixie-cron-4.1-10.EL3
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0117.html
The fix for this bug does not take account of situations where the crontab entry is like this: */10 * * * * root /usr/local/sbin/CSFsure >/dev/null With such entries the cron job is still occasionally suppressed; here is output of the command "grep delay /var/log/cron": Mar 28 07:51:34 pat crond[2272]: (root) error: Job execution of per-minute job scheduled for 07:50 delayed into subsequent minute 07:51. Skipping job run. Mar 29 10:33:10 pat crond[22412]: (root) error: Job execution of per-minute job scheduled for 10:30 delayed into subsequent minute 10:33. Skipping job run. Mar 29 10:33:10 pat crond[22413]: (root) error: Job execution of per-minute job scheduled for 10:30 delayed into subsequent minute 10:33. Skipping job run. I have looked at the source and think that the problem lies in the test in routine do.command.c which works for all cron commands with * in the crontab minutes field and does not exclude those that have a step count as well.
RE: Comment #8 : Sorry that some of your cron jobs were skipped. But, the rationale was, jobs that are scheduled to be run at any specific minute should not be run if execution is delayed by eg. network authentication until the next minute. Many programs, eg. logwatch / logrotate, have problems if they are meant to be run in distinct minutes, but are run in the same minute . I've just re-tested 'skip minute' jobs (ie. '*/2...' '*/10...'), and they all ran fine for me over a 48 hour period - none were skipped. Your skipped jobs were delayed by network authentication for over one minute (and in one case 3 minutes) . The time interval measurement in the code is quite precise - it measures the time between authentication start and the job being run, and for jobs scheduled to be run at a specific minute, if the run cannot occur until the next minute, the run is skipped. There is nothing else in the code that could cause such a delay - only a network authentication method (eg. NIS, LDAP, Kerberos) experiencing a timeout could do this. If your skip had been '*/2', then the job that was delayed by 3 minutes could have been run in the same minute as a subsequent job run - this can cause real problems for some cron jobs. To make the code second guess if a 'skip minute' job will conflict or not would be horrendously complex and wrong - if jobs are to be run in a specific minute, they must be run in that minute or not at all. I am working on a better solution, to enhance cron to provide a class of 'exclusive' jobs that cannot be run if a previous instance is running; then we could remove the potential to skip per-minute jobs if they are delayed past the minute for which they were scheduled. I think it is better for cron not to run a job scheduled to be run in a specific minute if the job cannot be run in that minute . To avoid the potential for network timeouts to cause cron jobs to be skipped, start crond when /etc/nsswitch.conf contains only : 'passwd: files group: files ' Then cron will use only the 'files' lookup method for users and groups, even if /etc/nsswitch.conf is later changed to use network lookup methods.
Further to your last comment, it seems to me that if it is important only one instance of a cron job is run at any one time, then the cron job itself should have some sort of check to prevent this happening, e.g. a lock file of some sort. We have worked around the problem introduced by your recent change by using syntax like 0-59/5 in the minutes field of the crontab entry rather than */5, but we could have used 0,5,10... etc. This shows that a check to prevent more than one instance of a cron job running is (as you say) very diffcult to generalise.