Bug 144133
Summary: | crond in loop creating new crond processes and hanging the system. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 3 | Reporter: | Celso Medina Kern <celso.kern> |
Component: | vixie-cron | Assignee: | Marcela Mašláňová <mmaslano> |
Status: | CLOSED CANTFIX | QA Contact: | Brock Organ <borgan> |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 3.0 | CC: | celso.kern |
Target Milestone: | --- | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | vixie-cron-4.1-1_EL3 | Doc Type: | Enhancement |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2006-10-30 09:42:47 UTC | Type: | --- |
Description
Celso Medina Kern
2005-01-04 19:33:36 UTC
Please upgrade your cron to the latest version for RHEL-3, which is vixie-cron-4.1-1_EL3. This version does not have this problem and should be in RHEL-3-U5. Meanwhile, you can download it from:

    http://people.redhat.com/~jvdias/cron/RHEL-3/

An RHN update to RHEL-3-U4 would give you vixie-cron-3.0.1-75.1, which I think might also fix this problem. Please try vixie-cron-4.1-1_EL3 and let me know if it works OK.

We installed vixie-cron 4.1.1 on January 7th. On January 11th the system hung again. This time we could not collect a SysRq dump, because the serial console was hung as well. The only thing the customer could do was switch from the graphical console to virtual console F1, but the system was not responsive after that. We intend to disable cron entirely and monitor the system to see whether the hangs stop.

It would appear that this problem can still occur with the latest cron releases - I just got another report of it today:

> > I was never able to reproduce this bug here - I just
> > suggested that people try the latest version, and it
> > seemed to fix the problem - now it appears not.
>
> Ah. Well it's not reproducible here. It just seems to take down oracle
> machines from time to time.
>
> > It sounds like you have a cron job that is executed
> > frequently but which never completes. The parent
> > cron process will wait for completion of the cron
> > job child; if this never occurs, then a situation
> > as you describe could result. The problem is with
> > the cron job that never completes. I'm also working
> > on some major enhancements for cron - one of them
> > should be that if the process from a previous run
> > of the job is still active, it should not initiate
> > another run of the job - this would be a major
> > change in behavior from all previous cron releases,
> > and needs extensive testing.
>
> Yeh. I found a lot of instances of the mailman qrunner running. I just
> got hold of a top -d output from the machine, from the day it crashed.
>
> > Please can you send me:
> > - The compressed /var/log/cron file from the system
> >   and your cron configuration
> > # tar -cpf - /var/log/cron /etc/cron.d /etc/crontab /var/spool/cron > /tmp/cron.tar.gz
> > The latest cron version for RHEL-3 is vixie-cron-4.1-6_EL3,
> > available from:
> > http://people.redhat.com/~jvdias/cron/RHEL-3
> > If possible, please try out this version and let me know
> > if you can reproduce the problem with it.
>

Yes, it is possible to create an ever-increasing number of crond processes, e.g. with this job:

    * * * * * root while /bin/true; do sleep 62; done

The problem is with the job that never completes. If cron finds a previous run of a job still running when it comes time to start the next run, what should it do?

 o kill the previous process - but what if there is a process that depends on being run at periodic intervals, but sometimes takes a bit longer than its interval?
 o not run the next process - again, some processes might really depend on being kicked off at regular intervals.

So I don't think it should be the default for all cron jobs to be treated this way. The best way of fixing this might be to create an explicit tag in the cron job file, such as:

    ?* * * * * root while /bin/true; do sleep 62; done

meaning "Don't run this job if a previous instance is still running", and

    ?!* * * * * root while /bin/true; do sleep 62; done

meaning "Kill the previous job instance and run the next instance".

I'll investigate such an enhancement for the next cron version - it would need extensive testing, as it would be a major departure from the behaviour of all previous cron releases. Really, the best short-term solution is to make cron jobs ensure that they complete.

NOTE: There is a problem with some configurations of the "mailman" system which can cause this problem to occur.
The mailman cron job SHOULD NOT contain this line:

    * * * * * /usr/bin/python -S /var/mailman/cron/qrunner

Versions of mailman that install this crontab have never been shipped by Red Hat for RHEL-3 and are not supported by Red Hat. The qrunner process should be run from the mailman controller daemon, not from cron. Please ensure that you have a supported version of mailman installed (later than 2.1), e.g. mailman-2.1.5-25.rhel3.

Since there are insufficient details provided in this report for us to investigate the issue further, and we have not received the feedback we requested, we will assume the problem was not reproducible or has been fixed in a later update for this product.
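
Until cron itself can skip a job whose previous instance is still running, a job can guard itself with a lock. What follows is only a minimal sketch of that idea, not something shipped with vixie-cron: the function name run_exclusive, the lock directory argument and the "skipped" status 75 are all hypothetical choices made for illustration. It relies on mkdir being atomic. A job that is killed before it removes the lock directory leaves a stale lock behind, which would have to be cleaned up by hand.

```shell
#!/bin/sh
# run_exclusive LOCKDIR COMMAND [ARGS...]
# Hypothetical helper (not part of cron): runs COMMAND only when no
# earlier run still holds LOCKDIR. mkdir either creates the directory
# or fails, so at most one caller can take the lock at a time.
run_exclusive() {
    lockdir=$1
    shift
    if mkdir "$lockdir" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$lockdir"        # release the lock for the next run
        return "$status"
    fi
    # Could not take the lock: assume a previous run is still active
    # and skip this run instead of starting another instance.
    echo "previous run still active, skipping: $*" >&2
    return 75                   # arbitrary "skipped" status (assumption)
}
```

With the function saved in a file that the job sources, a crontab entry could then call something like "run_exclusive /var/run/myjob.lock /usr/local/bin/myjob", where both paths are placeholders. This only covers the "don't run if a previous instance is still running" case. It does nothing for a job that never completes, which still has to be fixed in the job itself.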