Description of problem:
"Job 'cron.daily' locked by another anacron - skipping". As a result logwatch cannot deliver. This has been reported before and closed without remedy but remains a problem.

Version-Release number of selected component (if applicable):
anacron 2.3, F11

How reproducible:
Always

Actual results:
Failure to run cron.daily
Log entries:

Aug 13 15:00:01 mercury CROND[4811]: (root) CMD (/usr/lib/sa/sa1 -d 1 1)
Aug 13 15:01:01 mercury CROND[6926]: (root) CMD (run-parts /etc/cron.hourly)
Aug 13 15:01:01 mercury run-parts(/etc/cron.hourly)[6926]: starting 0anacron
Aug 13 15:01:01 mercury anacron[6937]: Anacron 2.3 started on 2009-08-13
Aug 13 15:01:01 mercury run-parts(/etc/cron.hourly)[6939]: finished 0anacron
Aug 13 15:01:01 mercury anacron[6937]: Job `cron.daily' locked by another anacron - skipping
Aug 13 15:01:01 mercury anacron[6937]: Normal exit (0 jobs run)
I suppose you still have regular jobs in /etc/crontab, which should be empty.
The /etc/crontab contains only this:

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  *  command to be executed
I can't reproduce it. Could you give me more information?

Locks:
cat /var/spool/anacron/cron.*

and /var/log/cron, at least the part where anacron is trying to run. Does the job sometimes finish, or does this message show up every day? Is your computer always on, or do you suspend/hibernate or switch it off?
Created attachment 357618 [details] cron log (part)
Created attachment 357619 [details] cron.daily from /var/spool/
Created attachment 357620 [details] cron.monthly
Created attachment 357621 [details] cron.weekly
Hm, I don't see any reason why it shouldn't work. Could you post:

cat /etc/cron.d/0hourly
cat /etc/anacrontab
cron.d/0hourly
---------------
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
01 * * * * root run-parts /etc/cron.hourly

anacrontab
-----------
# /etc/anacrontab: configuration file for anacron
# See anacron(8) and anacrontab(5) for details.
SHELL=/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# the maximal random delay added to the base delay of the jobs
RANDOM_DELAY=45
# the jobs will be started during the following hours only
START_HOURS_RANGE=3-22

#period in days  delay in minutes  job-identifier  command
1        5   cron.daily    nice run-parts /etc/cron.daily
7        25  cron.weekly   nice run-parts /etc/cron.weekly
@monthly 45  cron.monthly  nice run-parts /etc/cron.monthly
For information: if I yum reinstall cronie, it all works fine again - but only for a few weeks, then it starts to fail. Just thought you should know that.
Could you provide the output of:

ps axu | grep anacron
lslk /var/spool/anacron/*
This information may not be useful to you as I have reinstalled cronie to get past these problems.

[root@mercury ~]# ps axu | grep anacron
root 13342 0.0 0.0 4196 704 pts/0 R+ 15:36 0:00 grep anacron
[root@mercury ~]#

There is no lslk command. Did you mean ls -l /var/spool/anacron/* ? If so, the result is below.

-rw------- 1 root root 9 2009-08-18 03:28 /var/spool/anacron/cron.daily
-rw------- 1 root root 9 2009-08-14 03:53 /var/spool/anacron/cron.monthly
-rw------- 1 root root 9 2009-08-18 03:48 /var/spool/anacron/cron.weekly
lslk is a package which lists local locks. We are interested in the locks that currently exist on your system.
Ah, OK, thanks. I got the package. There was no output from the requested command.
This was the output from using lslk with no parameters. The system was rebooted in the last hour after a new kernel and updates were installed.

SRC      PID   DEV   INUM     SZ      TY M ST WH END LEN NAME
zebra    1875  253,0 131332   5       w  0 0  0  0   0   /var/run/quagga/zebra.pid
ripd     1883  253,0 131334   5       w  0 0  0  0   0   /var/run/quagga/ripd.pid
mysqld   2364  8,33  12288002 5242880 w  0 0  0  0   0   /disc-1/MySQL_live/ib_logfile1
mysqld   2364  8,33  12288003 5242880 w  0 0  0  0   0   /disc-1/MySQL_live/ib_logfile0
mysqld   2364  8,33  12288004         w  0 0  0  0   0   /disc-1 (/dev/sdc1)
postgrey 2408  253,0 118693   0       w  0 0  0  0   0   /var/spool/postfix/postgrey/postgrey.lock
master   2471  253,0 124393   17      w  0 0  0  0   0   /var/lib/postfix/master.lock
master   2471  253,0 124392   17      w  0 0  0  0   0   /var/spool/postfix/pid/master.pid
nmbd     2592  253,0 131199   32768   r  0 4  0  4   0   /var/lib/samba/unexpected.tdb
nmbd     2592  253,0 131178   110592  r  0 4  0  4   0   /var/lib/samba/connections.tdb
nmbd     2592  253,0 13920    5       w  0 0  0  0   0   /var/run/nmbd.pid
smbd     2600  253,0 131180   98304   r  0 4  0  4   0   /var/lib/samba/locking.tdb
smbd     2600  253,0 131179   40200   r  0 4  0  4   0   /var/lib/samba/brlock.tdb
smbd     2600  253,0 131178   110592  r  0 4  0  4   0   /var/lib/samba/connections.tdb
smbd     2600  253,0 131177   200704  r  0 4  0  4   0   /var/lib/samba/sessionid.tdb
smbd     2600  253,0 14036    5       w  0 0  0  0   0   /var/run/smbd.pid
atd      2612  253,0 14163    5       w  0 0  0  0   0   /var/run/atd.pid
amavisd  2808  253,0 131362   0       w  0 0  0  0   0   /var/run/amavisd/amavisd.lock
smbd     11007 253,0 131221   16384   r  0 4  0  4   0   /var/lib/samba/notify.tdb
smbd     13360 253,0 131221   16384   r  0 4  0  4   0   /var/lib/samba/notify.tdb
Can you run the lslk command above as soon as the system gets into the state where the jobs are not getting run? And also attach /var/log/cron, covering the time the jobs last ran OK through to when the first "Job `cron.daily' locked by another anacron - skipping" appeared?
Hi. I will send the data you asked for; I just need to collect it. Of interest: I have observed the same behaviour 3 times in the logs - this may help you. It would appear that all is well until there is a log rotation, after which, the very next time anacron starts, it reports this error every time it runs. Perhaps the restart in logrotate needs attention? Not clean?
Created attachment 358434 [details] lslk, ps aux and 3 log files

Zipped files include lslk, ps aux and 3 log files, each of which shows the problem at rotation and thereafter. Thanks. Ian
Thanks for the logs. Now we can see where it failed, but we have no clue why. Could you try:

strace -o logfile -p <pid-of-locked-anacron>
You could also try to obtain a backtrace from the hung process with gdb to see the exact place where it is waiting. Use 'gdb anacron <pid-of-the-hung-anacron>' and the command 'bt full' for that, and please install the anacron-debuginfo and glibc-debuginfo packages before that.
I will try this later. A new kernel and reboot have undone the environment. I will post later.
Here are the results:

backtrace:

gdb anacron 953
GNU gdb (GDB) Fedora (6.8.50.20090302-38.fc11)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "i586-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Attaching to program: /usr/sbin/anacron, process 953
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...Reading symbols from /usr/lib/debug/lib/libnss_files-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libnss_files.so.2
0x00e38424 in __kernel_vsyscall ()
(gdb) bt full
#0  0x00e38424 in __kernel_vsyscall ()
No symbol table info available.
#1  0x00bcab97 in do_sigsuspend (set=<value optimized out>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
        resultvar = <value optimized out>
#2  *__GI___sigsuspend (set=<value optimized out>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:74
        oldtype = <value optimized out>
        result = <value optimized out>
#3  0x0804a2f8 in wait_signal () at main.c:296
        ss = {__val = {0, 0, 3217548600, 134521592, 3217548464, 10, 3217548464, 0 <repeats 25 times>}}
#4  0x0804a36d in wait_children () at main.c:305
No locals.
#5  0x0804a9cc in main (argc=2, argv=0xbfc7e654) at main.c:525
        j = 1
        cwd = <value optimized out>
        dropped_jobs = 0
        tv = {tv_sec = 1252202461, tv_usec = 617594}
        tz = {tz_minuteswest = -60, tz_dsttime = 0}
(gdb)

strace:

rt_sigsuspend([]
--end--
Do you still have the hung anacron there? Do you see that it has any child processes?

pstree $(pidof anacron)

This is still really weird - looking at the anacron code I don't see any race condition, although the code plays with sigsuspend() + waitpid() calls rather unconventionally. And if the race is there, it's strange that nobody else reports it.
No, I'm sorry, but the process has gone after a new kernel (2.6.30.5-43) this morning followed by a reboot. Should it happen again, I will try again to get the information you sought last time, plus the pstree. I can say it was the only anacron process present, so no children of the same name existed. Thanks
Thank you for your cooperation. I've also wondered: what kind of mail setup are you using?
Mail? I use postfix and dovecot on two Linux servers with MS Outlook clients, or Evolution, or squirrelmail if away (on iPhone or web). The usual postgrey, clamav and amavis also, of course. I run my own domains - a very small-scale hoster for me and a few buddies/family. Does that help?
I have the condition again. The output from pstree is:

anacron───run-parts───awk

That's all. I will try to leave the anacron there in case you wish to try anything else. Thanks
So from comment 28 it is obvious that some script in /etc/cron.daily causes this. Can you find, in /var/log/cron*, a run of the /etc/cron.daily scripts which fully completed, and the run which got stuck (the one started by the anacron PID from the comment above)? Can you compare those parts of the logs to find out which script is stuck? Or attach the logs here. You could also strace the awk process from above to find which syscall it is stuck on; however, I am not sure it will give us much information.
After a cronie reinstall, and up to 13th Sept, anacron runs fine. On the 13th at 03:39:09 anacron starts, runs through some scripts and gets as far as starting logrotate. There is no entry to say logrotate finishes in the log or the next log (if the syslog gets rotated), nor do any of the scripts after logrotate start. From this point on anacron says it cannot run until I reinstall cronie. The logs I have sent you already show this. This is as I described in comment 18 above. My only other observation is that, from the point of failure onwards, after cron.hourly starts, cron.daily attempts to start before cron.hourly completes. This differs from the other behaviour, where during a cron.hourly run cron.daily gets scheduled for the future rather than run. Well, that's how it looks, as I have not seen the code to establish exactly what goes on there. Hope that helps.
Here is the awk trace.

This GDB was configured as "i586-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Attaching to program: /usr/bin/awk, process 7389
Reading symbols from /lib/libdl.so.2...Reading symbols from /usr/lib/debug/lib/libdl-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...Reading symbols from /usr/lib/debug/lib/libm-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/ld-linux.so.2
0x00b65424 in __kernel_vsyscall ()
(gdb) bt full
#0  0x00b65424 in __kernel_vsyscall ()
No symbol table info available.
#1  0x00538cf3 in __read_nocancel () from /lib/libc.so.6
No symbol table info available.
#2  0x0806c91f in read (__fd=<value optimized out>, __buf=<value optimized out>, __nbytes=<value optimized out>) at /usr/include/bits/unistd.h:45
No locals.
#3  get_a_record (__fd=<value optimized out>, __buf=<value optimized out>, __nbytes=<value optimized out>) at io.c:3007
        recm = {start = 0x8069d3b "\241", len = 142786112, rt_start = 0x882c2c0 "Starting up pp_dns daemon: [ OK ]\r\n\n0: echo: write error: Broken pipe\n\n", rt_len = 36}
        state = <value optimized out>
        ret = <value optimized out>
        retval = <value optimized out>
        rtval = <value optimized out>
        lastmatchrec = 0x806da00 <rs1scan>
        __PRETTY_FUNCTION__ = "get_a_record"
#4  0x0806cd8e in inrec (iop=<value optimized out>) at io.c:388
        begin = 0x809be00 "\tP\t\b"
        cnt = <value optimized out>
        retval = <value optimized out>
#5  0x08070c9c in do_input () at io.c:498
        iop = 0x809be00
        rval1 = <value optimized out>
        rval2 = <value optimized out>
        rval3 = -512
#6  0x08072ecf in main (argc=4, argv=0xbf9506c4) at main.c:630
        c = <value optimized out>
        scan = <value optimized out>
        stopped_early = 4
        old_optind = <value optimized out>
        i = <value optimized out>
        stdio_problem = 0
(gdb)

Thanks again
Reassigning to logrotate. It should not get stuck waiting on a read of something. If we did not wait in anacron for logrotate to finish, the hung logrotate jobs would simply accumulate on the machine. On the other hand, we should discuss the possibility of removing run-parts from the call sequence when running the .daily/.weekly etc. jobs. If anacron tracked each job on its own, one hung job in cron.daily would not prevent the remaining jobs in cron.daily from running appropriately.
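The per-job idea above could look roughly like this (a sketch only, not anacron's actual code - the directory, redirections and loop are illustrative):

```shell
#!/bin/sh
# Sketch: run each job script independently instead of through a single
# run-parts pipeline, so one hung script cannot block its siblings.
for job in /etc/cron.daily/*; do
    [ -x "$job" ] || continue
    # Each job runs in its own background process with its own output;
    # a stuck job leaves only itself hanging.
    "$job" > /dev/null 2>&1 &
done
wait   # still waits for stragglers, but the healthy jobs have already run
```

A real implementation would of course also track per-job timestamps and locks, which run-parts cannot do.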
Ian, is there a logrotate process stuck anywhere?

ps ax | grep logrotate
Daniel, there is the awk process:

7389 ?  SN  0:00 awk -v progname=/etc/cron.daily/logrotate progname { print progname ":\n" progname=""; } { print; }

Looks suspicious! Thanks again for your help. Ian
/usr/bin/run-parts contains:

for i in $(LC_ALL=C; echo $1/*[^~,]) ; do
    #...
    $i 2>&1 | awk -v "progname=$i" \
              'progname {
                   print progname ":\n"
                   progname="";
              }
              { print; }'

The "awk" is there just to print the program's name before the program's output. When "$i" gets stuck, poor awk on the other side of the pipe just waits... "$i" is logrotate in this case. Can you show me your /etc/logrotate.conf?
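The effect is easy to reproduce in any POSIX shell (a minimal sketch, unrelated to logrotate itself): a backgrounded child inherits the pipe's write end, so the reader gets no EOF until that child also exits.

```shell
#!/bin/sh
# The backgrounded sleep keeps the pipe's write end open, so cat blocks
# for ~5 seconds even though the interesting output finished instantly:
( echo "rotation done"; sleep 5 & ) | cat

# Redirecting the child's stdio breaks the inheritance; cat gets EOF as
# soon as the subshell exits:
( echo "rotation done"; sleep 5 >/dev/null 2>&1 & ) | cat
```

Replace `sleep 5` with a daemon restarted from a logrotate postrotate script and you have exactly this bug.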
Here it is:

# see "man logrotate" for details
# rotate log files weekly
weekly

# keep 4 weeks worth of backlogs
rotate 4

# create new (empty) log files after rotating old ones
create

# use date as a suffix of the rotated file
dateext

# uncomment this if you want your log files compressed
#compress

# RPM packages drop log rotation information into this directory
include /etc/logrotate.d

# no packages own wtmp and btmp -- we'll rotate them here
/var/log/wtmp {
    monthly
    create 0664 root utmp
    minsize 1M
    rotate 1
}

/var/log/btmp {
    missingok
    monthly
    create 0600 root utmp
    rotate 1
}

# system-specific logs may be also be configured here.
The source of the problem is probably the following: the write end of the pipe is still held open by some previously spawned child process of logrotate. This child process is probably a daemon which was spawned by logrotate when restarting the rotated services. logrotate then exits, but awk does not get EOF on the read from its stdin because the pipe is still open in the daemon. The solution is either to find out which daemon it is and daemonize it properly - so it opens /dev/null for its standard descriptors:

http://www-theorie.physik.unizh.ch/~dpotter/howto/daemonize

The other possibility is to change logrotate (or the rotate script of the daemon) to always redirect stdout and stderr to /dev/null when restarting the rotated services.
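A restart hook can avoid the leak without touching the daemon's own code, by replacing its stdio before it forks away (a sketch; "mydaemon" is a placeholder name, and setsid(1) comes from util-linux):

```shell
#!/bin/sh
# Sketch of a postrotate/init-script restart that cannot leak the pipe:
# stdio is swapped for /dev/null and the daemon is detached into its own
# session, so nothing downstream (run-parts' awk) waits for EOF on it.
restart_daemon() {
    setsid /usr/sbin/mydaemon </dev/null >/dev/null 2>&1 &
}
```

This is the shell-level equivalent of the "open /dev/null for the standard descriptors" step in the daemonize howto linked above.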
I shall change the logrotate script to:

#!/bin/sh
/usr/sbin/logrotate /etc/logrotate.conf > /dev/null 2>&1
EXITVALUE=$?
if [ $EXITVALUE != 0 ]; then
    /usr/bin/logger -t logrotate "ALERT exited abnormally with [$EXITVALUE]"
fi
exit 0

Thanks again. Ian
Ian, Tomas: OK, I will change the rotate script /etc/cron.daily/logrotate (which is owned by the logrotate package) to redirect logrotate output to /dev/null. This will fix the bug.
fixed in rawhide: logrotate-3.7.8-4.fc12
Well, it will work around the problem in one of the daemons being restarted by logrotate. However, I agree that the fix/workaround should be applied.
oops sorry for accidental reopen.
Created attachment 361467 [details] config file patch

BTW, the patch is here.
Sorry for opening this bug again, but I would really like to try to find out which daemon exactly is causing this and eventually fix it. I presume you don't have that machine still running, right? If you do, could you send the output of 'ls /etc/logrotate.d/' please?
I've hit this due to a stale NFS handle blocking df:

init-+-anacron---run-parts-+-0logwatch---logwatch-+-perl---df
The question is what should be done in such a situation - should the stalled job be killed (if that is possible at all)?
I think that's a question for the anacron maintainer; I will add the proper NEEDINFO.
Tomas is working with me on anacron/cronie; he's a co-maintainer. He has a valid point. The question is: what would you suggest doing with stalled jobs? Cron or anacron executes the job, but if the job is "wrong", how should I decide what to do next?
I recently ran into this bug on a CentOS 6.3 webserver. The symptom was that logs were not rotating on the server, causing the disk to fill up. Luckily we caught it before the disk filled up too much. In checking the cron logs, I found that on Feb 25th, anacron stopped running correctly:

(09:18:20) ○ [jcuzella.com] ~ → cat /etc/redhat-release
CentOS release 6.3 (Final)
(09:18:21) ○ [jcuzella.com] ~ → sudo zgrep 'anacron' -C3 -in /var/log/cron*
/var/log/cron-20150317:Feb 24 20:01:01 web-00 run-parts(/etc/cron.hourly)[8551]: finished 0anacron
/var/log/cron-20150317:Feb 24 21:01:01 web-00 run-parts(/etc/cron.hourly)[10133]: starting 0anacron
/var/log/cron-20150317:Feb 24 21:01:01 web-00 run-parts(/etc/cron.hourly)[10142]: finished 0anacron
/var/log/cron-20150317:Feb 24 22:01:01 web-00 run-parts(/etc/cron.hourly)[11733]: starting 0anacron
/var/log/cron-20150317:Feb 24 22:01:01 web-00 run-parts(/etc/cron.hourly)[11742]: finished 0anacron
/var/log/cron-20150317:Feb 24 23:01:01 web-00 run-parts(/etc/cron.hourly)[13304]: starting 0anacron
/var/log/cron-20150317:Feb 24 23:01:01 web-00 run-parts(/etc/cron.hourly)[13313]: finished 0anacron
/var/log/cron-20150317:Feb 25 00:01:01 web-00 run-parts(/etc/cron.hourly)[14885]: starting 0anacron
/var/log/cron-20150317:Feb 25 00:01:01 web-00 anacron[14896]: Anacron started on 2015-02-25
/var/log/cron-20150317:Feb 25 00:01:01 web-00 anacron[14896]: Will run job `cron.daily' in 0 min.
/var/log/cron-20150317:Feb 25 00:01:01 web-00 anacron[14896]: Jobs will be executed sequentially
/var/log/cron-20150317:Feb 25 00:01:01 web-00 run-parts(/etc/cron.hourly)[14898]: finished 0anacron
/var/log/cron-20150317:Feb 25 00:01:01 web-00 anacron[14896]: Job `cron.daily' started
/var/log/cron-20150317:Feb 25 01:01:01 web-00 run-parts(/etc/cron.hourly)[16696]: starting 0anacron
/var/log/cron-20150317:Feb 25 01:01:01 web-00 anacron[16707]: Anacron started on 2015-02-25
/var/log/cron-20150317:Feb 25 01:01:01 web-00 anacron[16707]: Job `cron.daily' locked by another anacron - skipping
/var/log/cron-20150317:Feb 25 01:01:01 web-00 anacron[16707]: Normal exit (0 jobs run)
/var/log/cron-20150317:Feb 25 01:01:01 web-00 run-parts(/etc/cron.hourly)[16709]: finished 0anacron
/var/log/cron-20150317:Feb 25 02:01:02 web-00 run-parts(/etc/cron.hourly)[18305]: starting 0anacron
/var/log/cron-20150317:Feb 25 02:01:02 web-00 anacron[18316]: Anacron started on 2015-02-25
/var/log/cron-20150317:Feb 25 02:01:02 web-00 anacron[18316]: Job `cron.daily' locked by another anacron - skipping
/var/log/cron-20150317:Feb 25 02:01:02 web-00 anacron[18316]: Normal exit (0 jobs run)
[...SNIP...]
/var/log/cron-20150317:Mar 17 08:01:01 web-00 run-parts(/etc/cron.hourly)[22854]: starting 0anacron
/var/log/cron-20150317:Mar 17 08:01:01 web-00 anacron[22865]: Anacron started on 2015-03-17
/var/log/cron-20150317:Mar 17 08:01:01 web-00 anacron[22865]: Job `cron.daily' locked by another anacron - skipping
/var/log/cron-20150317:Mar 17 08:01:01 web-00 anacron[22865]: Normal exit (0 jobs run)
/var/log/cron-20150317:Mar 17 08:01:01 web-00 run-parts(/etc/cron.hourly)[22867]: finished 0anacron

(09:23:19) ○ [jcuzella.com] ~ → ps auxww | grep cron | grep -v grep
root  1749 0.0 0.0 115348  768 ? Ss 2013  4:09 crond
root 14896 0.0 0.0  17508  712 ? Ss Feb25 0:00 /usr/sbin/anacron -s
root 14899 0.0 0.0   9228 1188 ? SN Feb25 0:00 /bin/bash /usr/bin/run-parts /etc/cron.daily
root 15071 0.0 0.0   9228 1112 ? SN Feb25 0:00 /bin/sh /etc/cron.daily/mlocate.cron
root 15072 0.0 0.0   9084  776 ? SN Feb25 0:00 awk -v progname=/etc/cron.daily/mlocate.cron progname { print progname ":\n" progname=""; } { print; }

(09:45:05) ○ [jcuzella.com] ~ → pstree -p $(pidof anacron)
anacron(14896)───run-parts(14899)─┬─awk(15072)
                                  └─mlocate.cron(15071)───updatedb(15077)

(09:57:45) ○ [jcuzella.com] ~ → sudo lsof -p 15072
[sudo] password for jcuzella:
lsof: WARNING: can't stat() nfs file system /kickstart
      Output information may be incomplete.
^C
(09:58:09) ○ [jcuzella.com] ~ → sudo lsof -p 15071
lsof: WARNING: can't stat() nfs file system /kickstart
      Output information may be incomplete.
^C
(10:05:49) ○ [jcuzella.com] ~ → sudo lsof -n
lsof: WARNING: can't stat() nfs file system /kickstart
      Output information may be incomplete.
^C
(10:18:17) ○ [jcuzella.com] ~ → cat /proc/mounts | grep kickstart
10.123.213.123:/kickstart /kickstart nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.123.213.123,mountvers=3,mountport=684,mountproto=udp,local_lock=none,addr=10.123.213.123 0 0
(10:18:41) ○ [jcuzella.com] ~ → ls -l /kickstart
ls: cannot access /kickstart: Stale NFS file handle
(10:18:19) ○ [jcuzella.com] ~ → cat /etc/cron.daily/mlocate.cron
#!/bin/sh
nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')
renice +19 -p $$ >/dev/null 2>&1
ionice -c2 -n7 -p $$ >/dev/null 2>&1
/usr/bin/updatedb -f "$nodevs"

I believe that anacron was blocked by a stale NFS file handle. The NFS file handle for /kickstart must have become stale, which caused the script /etc/cron.daily/mlocate.cron to block, which caused anacron to stay running, which blocked the other cron.daily jobs from running, which caused the apache logs to not rotate since Feb 25th, which caused the disk to start filling up. (What a Rube Goldberg-esque chain of events!) Any tips on how to stop mlocate.cron from blocking when an NFS mount becomes stale?
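One defensive option (a sketch; it assumes coreutils timeout(1), which returns exit status 124 when the limit is hit) is to bound the long-running part of the cron.daily script, so a wedged NFS mount fails the script instead of wedging anacron indefinitely:

```shell
#!/bin/sh
# Sketch: wrap updatedb (as invoked by mlocate.cron) in a hard time
# limit. The 3600-second bound is an arbitrary illustrative value.
nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')
timeout 3600 /usr/bin/updatedb -f "$nodevs"
if [ $? -eq 124 ]; then
    logger -t mlocate.cron "updatedb timed out (stale mount?)"
fi
```

This does not fix the stale handle itself, but it turns a silent multi-week hang into a logged daily failure.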
I think the proper solution for these issues lies in the kernel. There should be a timeout for stale NFS handles so that applications touching them do not block indefinitely but receive an error.
I was able to fix the immediate problem with a manual unmount/remount, then killing off the blocked processes:

sudo umount /kickstart
sudo mount /kickstart
sudo kill $(pidof anacron)
sudo kill 15072
sudo kill 15071
sudo kill 14899
sudo /etc/init.d/crond restart

Then a manual kickoff of anacron seemed to work and the logs were rotated:

bash -x /etc/cron.hourly/0anacron
+ test -r /var/spool/anacron/cron.daily
++ date +%Y%m%d
+ '[' 20150317 = '' ']'
+ test -x /usr/bin/on_ac_power
+ /usr/bin/on_ac_power
+ test 0 -eq 1
+ /usr/sbin/anacron -s
I think there is nothing I can do from the logrotate point of view right now. I'm going to close this bug again; if you think something in logrotate should be fixed, please reopen with an explanation.
See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=379645
I'm seeing this on Fedora 26 (4.12.8-300.fc26.x86_64). It seems to be with logwatch. Here are some of the requested process details and gdb output.

lslocks -p 27094
COMMAND   PID  TYPE SIZE MODE  M START END PATH
anacron 27094 POSIX   9B WRITE 0     0   0 /var/spool/anacron/cron.daily

pstree 27094
anacron---run-parts-+-0logwatch---logwatch-+-perl---df
                    |                      `-sendmail
                    `-awk

Reading symbols from anacron...Reading symbols from /usr/lib/debug/usr/sbin/anacron.debug...done.
done.
Attaching to program: /usr/sbin/anacron, process 27094
Reading symbols from /lib64/libselinux.so.1...Reading symbols from /usr/lib/debug/usr/lib64/libselinux.so.1.debug...done.
done.
Reading symbols from /lib64/libpam.so.0...Reading symbols from /usr/lib/debug/usr/lib64/libpam.so.0.84.2.debug...done.
done.
Reading symbols from /lib64/libdl.so.2...Reading symbols from /usr/lib/debug/lib64/libdl-2.25.so.debug...done.
done.
Reading symbols from /lib64/libaudit.so.1...Reading symbols from /usr/lib/debug/lib64/libaudit.so.1.0.0.debug...done.
done.
Reading symbols from /lib64/libc.so.6...Reading symbols from /usr/lib/debug/lib64/libc-2.25.so.debug...done.
done.
Reading symbols from /lib64/libpcre.so.1...Reading symbols from /usr/lib/debug/usr/lib64/libpcre.so.1.2.9.debug...done.
done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib64/ld-2.25.so.debug...done.
done.
Reading symbols from /lib64/libcap-ng.so.0...Reading symbols from /usr/lib/debug/lib64/libcap-ng.so.0.0.0.debug...done.
done.
Reading symbols from /lib64/libpthread.so.0...Reading symbols from /usr/lib/debug/lib64/libpthread-2.25.so.debug...done.
done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f8f2a0cad66 in __GI___sigsuspend (set=0x7fff9d54be90) at ../sysdeps/unix/sysv/linux/sigsuspend.c:30
30        return SYSCALL_CANCEL (rt_sigsuspend, set, _NSIG / 8);
Please report this against logwatch. Also, you need to find out why the sendmail or df is stuck.
(In reply to Tomas Mraz from comment #55)
> Please report this against logwatch. Also you need to find out why the
> sendmail or df is stuck.

I opened a new bug, but how do I figure out why those are stuck? sendmail is running just fine.
You show above that logwatch spawned sendmail and df and is waiting for them to finish - you need to look up these processes and if they are stuck then this is the real problem.
(In reply to Tomas Mraz from comment #57)
> You show above that logwatch spawned sendmail and df and is waiting for them
> to finish - you need to look up these processes and if they are stuck then
> this is the real problem.

Apologies for turning this into a how-to, but does any of the below help? Any other commands to check?

pstree | grep df
 |-anacron---run-parts-+-0logwatch---logwatch-+-perl---df
 |-logwatch-+-perl---df
 |-perl---df

pstree | grep sendmail
 |       `-sendmail
 |-fail2ban-server-+-sh---sendmail
 |                 `-sendmail
 |-sendmail---sendmail
 |-sendmail

ps -auwx | grep df
dbus   5410 0.0 0.0  45512 5308 ? Ss Nov29 0:36 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root   5910 0.0 0.0   4484  752 ? DN Nov17 0:20 df -h -x tmpfs -x udf -x iso9660
root   8379 0.0 0.0   4484  776 ? D  Nov28 0:03 df -h -x tmpfs -x udf -x iso9660
root  31576 0.0 0.0   4484  744 ? DN Nov29 0:02 df -h -x tmpfs -x udf -x iso9660

ps -auwx | grep sendmail
smmsp  4867 0.0 0.0  83176 5632 ? SN Nov17 0:00 /usr/sbin/sendmail -t
root  14643 0.0 0.0 111600 4688 ? Ss Nov29 0:04 sendmail: accepting connections
smmsp 14653 0.0 0.0  83172 3388 ? Ss Nov29 0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
smmsp 30299 0.0 0.0  83388 8096 ? S  10:50 0:00 sendmail: ./vAUFor3s030299 [127.0.0.1]: client DATA status
root  30318 0.0 0.0 111732 7244 ? D  10:50 0:00 sendmail: vAUFosMc030318 localhost [127.0.0.1]: DATA
smmsp 30657 0.0 0.0  83176 8216 ? SN Nov29 0:00 /usr/sbin/sendmail -t

cat /proc/27094/task/27094/status
Name:   anacron
Umask:  0022
State:  S (sleeping)
Tgid:   27094
Ngid:   0
Pid:    27094
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups: 0
NStgid: 27094
NSpid:  27094
NSpgid: 27094
NSsid:  27094
VmPeak:   129680 kB
VmSize:   129680 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      2176 kB
VmRSS:      2176 kB
RssAnon:     232 kB
RssFile:    1944 kB
RssShmem:      0 kB
VmData:      284 kB
VmStk:       132 kB
VmExe:        32 kB
VmLib:      2900 kB
VmPTE:        68 kB
VmPMD:        12 kB
VmSwap:        0 kB
HugetlbPages:  0 kB
Threads:        1
SigQ:   3/96296
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000180012200
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Cpus_allowed:   ffffff
Cpus_allowed_list:      0-23
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        36
nonvoluntary_ctxt_switches:     0
(In reply to Tomas Mraz from comment #57)
> You show above that logwatch spawned sendmail and df and is waiting for them
> to finish - you need to look up these processes and if they are stuck then
> this is the real problem.

Indeed, the df command from logwatch ("df -h -x tmpfs -x udf -x iso9660") just hangs when run on its own. Tried on another server, it responds right away. I can't reboot at the moment, but is there any reason why a df command would hang for a long time? Any service(s) I can restart? The load is a bit high, around 4, but no processes stick out.
It's most probably blocked on some device. Stracing the process can show which operation on which descriptor is blocked. Anyway, this bugzilla is not a support platform, and this is not a problem of anacron, or even logwatch.