Bug 517321

Summary:

Job 'cron.daily' locked by another anacron - skipping

Product:

[Fedora] Fedora

Reporter:

Ian <ian.tighe>

Component:

logrotate

Assignee:

Jan Kaluža <jkaluza>

Status:

CLOSED RAWHIDE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

low

Version:

CC:

bug, dnovotny, ian.tighe, james.cuzella, jkaluza, mmaslano, moshiro, redhat-e27, rkudyba, rmarko, tmraz, tsmetana

Target Milestone:

---

Keywords:

Reopened

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

742206 (view as bug list)

Environment:

Last Closed:

2015-07-22 14:45:14 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
cron log (part)	none
cron.daily from /var/spool/	none
cron.monthly	none
cron.weekly	none
lslk, ps aux and 3 log files	none
config file patch	none

Description Ian 2009-08-13 13:53:04 UTC

Description of problem:

"Job 'cron.daily' locked by another anacron - skipping". As a result logwatch cannot deliver. This has been reported before and closed without remedy but remains a problem.

Version-Release number of selected component (if applicable):

anacron 2.3
F11

How reproducible:


Steps to Reproduce:
1. always
2.
3.
  
Actual results:

failure to run cron.daily

Expected results:


Additional info:

Comment 1 Ian 2009-08-13 14:17:17 UTC

Log entries:

Aug 13 15:00:01 mercury CROND[4811]: (root) CMD (/usr/lib/sa/sa1 -d 1 1)
Aug 13 15:01:01 mercury CROND[6926]: (root) CMD (run-parts /etc/cron.hourly)
Aug 13 15:01:01 mercury run-parts(/etc/cron.hourly)[6926]: starting 0anacron
Aug 13 15:01:01 mercury anacron[6937]: Anacron 2.3 started on 2009-08-13
Aug 13 15:01:01 mercury run-parts(/etc/cron.hourly)[6939]: finished 0anacron
Aug 13 15:01:01 mercury anacron[6937]: Job `cron.daily' locked by another anacron - skipping
Aug 13 15:01:01 mercury anacron[6937]: Normal exit (0 jobs run)

Comment 2 Marcela Mašláňová 2009-08-14 12:47:05 UTC

I suppose you have still regular jobs in /etc/crontab which should be empty.

Comment 3 Ian 2009-08-14 13:23:55 UTC

The /etc/crontab contains only this: 

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
# .---------------- minute (0 - 59) 
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ... 
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7)  OR
#sun,mon,tue,wed,thu,fri,sat 
# |  |  |  |  |
# *  *  *  *  *  command to be executed

Comment 4 Marcela Mašláňová 2009-08-17 08:35:51 UTC

I can't reproduce it.
Could you give me more information?

Locks:
cat /var/spool/anacron/cron.*

/var/log/cron at least part where anacron is trying run.

Does the job finish sometimes or does this message show everyday? Is it your computer always on, do you suspend/hibernate or switching off?

Comment 5 Ian 2009-08-17 09:38:00 UTC

Created attachment 357618 [details]
cron log (part)

Comment 6 Ian 2009-08-17 09:39:55 UTC

Created attachment 357619 [details]
cron.daily from /var/spool/

Comment 7 Ian 2009-08-17 09:40:53 UTC

Created attachment 357620 [details]
cron.monthly

Comment 8 Ian 2009-08-17 09:41:30 UTC

Created attachment 357621 [details]
cron.weekly

Comment 9 Marcela Mašláňová 2009-08-17 10:09:03 UTC

Hm, I don't see any reason why shouldn't it work.
cat /etc/cron.d/0hourly
cat /etc/anacrontab

Comment 10 Ian 2009-08-17 11:31:03 UTC

cron.d/0hourly
---------------

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
01 * * * * root run-parts /etc/cron.hourly

anacrontab
-----------

# /etc/anacrontab: configuration file for anacron

# See anacron(8) and anacrontab(5) for details.

SHELL=/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# the maximal random delay added to the base delay of the jobs
RANDOM_DELAY=45
# the jobs will be started during the following hours only
START_HOURS_RANGE=3-22

#period in days   delay in minutes   job-identifier   command
1	5	cron.daily		nice run-parts /etc/cron.daily
7	25	cron.weekly		nice run-parts /etc/cron.weekly
@monthly 45	cron.monthly		nice run-parts /etc/cron.monthly

Comment 11 Ian 2009-08-17 20:24:23 UTC

For information. 

If I yum reinstall cronie it all works fine again.......but only for a few weeks then it starts to fail. Just thought you should know that.

Comment 12 Tomas Mraz 2009-08-18 08:46:59 UTC

Could you provide output of:

ps axu | grep anacron

lslk /var/spool/anacron/*

Comment 13 Ian 2009-08-18 14:40:13 UTC

This information may not be useful to you as I have reinstalled cronie to get past these problems.


[root@mercury ~]# ps axu|grep anacron
root     13342  0.0  0.0   4196   704 pts/0    R+   15:36   0:00 grep anacron
[root@mercury ~]# 


there is no lslk command. Did you mean ls -l /var/spool/anacron/*  ? If so result below.

-rw------- 1 root root 9 2009-08-18 03:28 /var/spool/anacron/cron.daily
-rw------- 1 root root 9 2009-08-14 03:53 /var/spool/anacron/cron.monthly
-rw------- 1 root root 9 2009-08-18 03:48 /var/spool/anacron/cron.weekly

Comment 14 Marcela Mašláňová 2009-08-18 15:00:55 UTC

lslk is a package which list local locks. We are interested in current locks which exist at the moment on your system.

Comment 15 Ian 2009-08-18 15:09:56 UTC

Ah Ok thanks. I got the package. There was no output from the command that was requested.

Comment 16 Ian 2009-08-18 15:12:57 UTC

This was the output from using lslk with no parameters. The system has been rebooting in the last hour after a new kernel and updates were installed.

SRC        PID   DEV     INUM      SZ TY M ST WH END LEN NAME
zebra     1875 253,0   131332       5  w 0  0  0   0   0 /var/run/quagga/zebra.pid
ripd      1883 253,0   131334       5  w 0  0  0   0   0 /var/run/quagga/ripd.pid
mysqld    2364  8,33 12288002 5242880  w 0  0  0   0   0 /disc-1/MySQL_live/ib_logfile1
mysqld    2364  8,33 12288003 5242880  w 0  0  0   0   0 /disc-1/MySQL_live/ib_logfile0
mysqld    2364  8,33 12288004          w 0  0  0   0   0 /disc-1 (/dev/sdc1)
postgrey  2408 253,0   118693       0  w 0  0  0   0   0 /var/spool/postfix/postgrey/postgrey.lock
master    2471 253,0   124393      17  w 0  0  0   0   0 /var/lib/postfix/master.lock
master    2471 253,0   124392      17  w 0  0  0   0   0 /var/spool/postfix/pid/master.pid
nmbd      2592 253,0   131199   32768  r 0  4  0   4   0 /var/lib/samba/unexpected.tdb
nmbd      2592 253,0   131178  110592  r 0  4  0   4   0 /var/lib/samba/connections.tdb
nmbd      2592 253,0    13920       5  w 0  0  0   0   0 /var/run/nmbd.pid
smbd      2600 253,0   131180   98304  r 0  4  0   4   0 /var/lib/samba/locking.tdb
smbd      2600 253,0   131179   40200  r 0  4  0   4   0 /var/lib/samba/brlock.tdb
smbd      2600 253,0   131178  110592  r 0  4  0   4   0 /var/lib/samba/connections.tdb
smbd      2600 253,0   131177  200704  r 0  4  0   4   0 /var/lib/samba/sessionid.tdb
smbd      2600 253,0    14036       5  w 0  0  0   0   0 /var/run/smbd.pid
atd       2612 253,0    14163       5  w 0  0  0   0   0 /var/run/atd.pid
amavisd   2808 253,0   131362       0  w 0  0  0   0   0 /var/run/amavisd/amavisd.lock
smbd     11007 253,0   131221   16384  r 0  4  0   4   0 /var/lib/samba/notify.tdb
smbd     13360 253,0   131221   16384  r 0  4  0   4   0 /var/lib/samba/notify.tdb

Comment 17 Tomas Mraz 2009-08-18 15:35:17 UTC

Can you run the lslk command above as soon as the system gets to the state when the jobs are not getting run? And also attach the /var/log/cron from the time the jobs run OK for the last time and when the first 'Job 'cron.daily' locked by another anacron - skipping' appeared?

Comment 18 Ian 2009-08-24 10:20:04 UTC

Hi. I will send the data you ask for I just need to collect it up. 

Of interest? I have observed the same behaviour 3 times in the logs - this may help you. It would appear that all is well until there is a log rotation after which the very next time anacron starts it reports  this error every time it runs. Perhaps the restart in logrotate needs attention? Not clean?

Comment 19 Ian 2009-08-24 10:38:46 UTC

Created attachment 358434 [details]
lslk, ps aux and 3 log files

zipped files incl: lslk, ps aux and 3 log files each of which show the problem at rotation and thereafter. Thanks. Ian

Comment 20 Marcela Mašláňová 2009-08-28 11:43:26 UTC

Thanks for logs. Now we saw where it failed but we have no clue why.

Could you try:
strace -o logfile -p <pid-of-locked-anacron>

Comment 21 Tomas Mraz 2009-08-28 12:13:54 UTC

You could also try to obtain a backtrace from the hang process with gdb to see the exact place where the hang process is waiting. Use 'gdb anacron <pid-of-the-hang-anacron>' and command bt full for that and please install anacron-debuginfo and glibc-debuginfo packages before that.

Comment 22 Ian 2009-08-28 14:57:55 UTC

I will try this later. New kernel and reboot has undone the environment. I will post later.

Comment 23 Ian 2009-09-07 19:17:33 UTC

Here are the results:

backtrace:

gdb anacron 953
GNU gdb (GDB) Fedora (6.8.50.20090302-38.fc11)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i586-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Attaching to program: /usr/sbin/anacron, process 953
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...Reading symbols from /usr/lib/debug/lib/libnss_files-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libnss_files.so.2
0x00e38424 in __kernel_vsyscall ()
(gdb) bt full
#0  0x00e38424 in __kernel_vsyscall ()
No symbol table info available.
#1  0x00bcab97 in do_sigsuspend (set=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
        resultvar = <value optimized out>
#2  *__GI___sigsuspend (set=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:74
        oldtype = <value optimized out>
        result = <value optimized out>
#3  0x0804a2f8 in wait_signal () at main.c:296
        ss = {__val = {0, 0, 3217548600, 134521592, 3217548464, 10, 
            3217548464, 0 <repeats 25 times>}}
#4  0x0804a36d in wait_children () at main.c:305
No locals.
#5  0x0804a9cc in main (argc=2, argv=0xbfc7e654) at main.c:525
        j = 1
        cwd = <value optimized out>
        dropped_jobs = 0
        tv = {tv_sec = 1252202461, tv_usec = 617594}
        tz = {tz_minuteswest = -60, tz_dsttime = 0}
(gdb) 


strace:

rt_sigsuspend([]


--end--

Comment 24 Tomas Mraz 2009-09-08 09:26:51 UTC

Do you still have the hang anacron there? Do you see that it has any child processes?

pstree $(pidof anacron)

This is still really weird - looking at the anacron code I don't see any race condition although the code is playing with sigsuspends() + waitpid() calls rather unconventionally. And if the race is there it's strange that nobody else reports it.

Comment 25 Ian 2009-09-08 13:27:38 UTC

No I'm sorry but the process has gone after a new kernel 2.6.30.5-43 this morning followed by a re boot. Should it happen again I will rty agin to get the information you sought last time plus the pstree. I can say it was the only anacron process present so no children of the same name existed. Thanks

Comment 26 Marcela Mašláňová 2009-09-14 14:27:26 UTC

Thank you for your cooperation. I've also wondered what kind of mail are you using?

Comment 27 Ian 2009-09-14 18:48:26 UTC

Mail? I use postfix and dovecot on two linux servers with MS Outlook clients or evolution or if away (on iphone or web ) squirrelmail. The usual postgrey, clamav and amavis also of course. I run my own domains - very small scale hoster for me and a few buddies/family. Does that help??

Comment 28 Ian 2009-09-14 19:02:33 UTC

I have the condition again. The output from pstree is:

anacron───run-parts───awk


That's all. I will try and leave the anacron there in case you wish to try anything else.

Thanks

Comment 29 Tomas Mraz 2009-09-14 19:35:56 UTC

So from the comment 28 it is obvious that some script in the /etc/cron.daily causes that. Can you find the execution of scripts in /etc/cron.daily in /var/log/cron* which fully complete and the run which got stuck which was run by the anacron PID from the comment above? Can you compare the parts of the logs to find out which script is stuck? Or attach the logs here. You could also strace the awk process from above to find on which syscall it got stuck however I am not sure it will give us much information.

Comment 30 Ian 2009-09-14 23:02:31 UTC

After a cronie reinstall and up to 13th Sept anacron runs fine. On 13th at 03:39:09 anacron starts, runs through some scripts and gets as far as starting logrotate. There is no entry to say logrotate finshes in the log or next log  (if the syslog gets rotated) nor do any of the scripts after logrotate start. From this point on anacron says it cannot run until I reinstall cronie. 

The logs I have sent you already show this. This is as I described in comment 18 above. 

My only other observation is that from the point of failure onwards after cron hourly starts cron daily attempts to start before cron hourly completes. This differs from other behaviour where during a cron hourly run cron daily gets scheduled for the future rather than run. Well thats how it looks as I have not seen the coode to establish exactly what goes on there.


Hope that helps

Comment 31 Ian 2009-09-14 23:13:08 UTC

Here is the awk trace.


This GDB was configured as "i586-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Attaching to program: /usr/bin/awk, process 7389
Reading symbols from /lib/libdl.so.2...Reading symbols from /usr/lib/debug/lib/libdl-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...Reading symbols from /usr/lib/debug/lib/libm-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.10.1.so.debug...done.
done.
Loaded symbols for /lib/ld-linux.so.2
0x00b65424 in __kernel_vsyscall ()
(gdb) bt full
#0  0x00b65424 in __kernel_vsyscall ()
No symbol table info available.
#1  0x00538cf3 in __read_nocancel () from /lib/libc.so.6
No symbol table info available.
#2  0x0806c91f in read (__fd=<value optimized out>, 
    __buf=<value optimized out>, __nbytes=<value optimized out>)
    at /usr/include/bits/unistd.h:45
No locals.
#3  get_a_record (__fd=<value optimized out>, __buf=<value optimized out>, 
    __nbytes=<value optimized out>) at io.c:3007
        recm = {start = 0x8069d3b "\241", len = 142786112, 
          rt_start = 0x882c2c0 "Starting up pp_dns daemon: [  OK  ]\r\n\n0: echo: write error: Broken pipe\n\n", rt_len = 36}
        state = <value optimized out>
        ret = <value optimized out>
        retval = <value optimized out>
        rtval = <value optimized out>
        lastmatchrec = 0x806da00 <rs1scan>
        __PRETTY_FUNCTION__ = "get_a_record"
#4  0x0806cd8e in inrec (iop=<value optimized out>) at io.c:388
        begin = 0x809be00 "\tP\t\b"
        cnt = <value optimized out>
        retval = <value optimized out>
#5  0x08070c9c in do_input () at io.c:498
        iop = 0x809be00
        rval1 = <value optimized out>
        rval2 = <value optimized out>
        rval3 = -512
#6  0x08072ecf in main (argc=4, argv=0xbf9506c4) at main.c:630
        c = <value optimized out>
        scan = <value optimized out>
        stopped_early = 4
        old_optind = <value optimized out>
        i = <value optimized out>
        stdio_problem = 0
(gdb) 

Thanks again

Comment 32 Tomas Mraz 2009-09-15 07:16:14 UTC

Reassigning to logrotate. It should not get stuck waiting on read of something.

If we did not wait for the logrotate finish in anacron it would mean that the hung logrotate jobs would be increasing on the machine.

On the other hand we should discuss the possibility to remove run-parts from the call sequence when running the .daily/.weekly etc. jobs. If anacron tracked each job on its own one hung job in cron.daily would not prevent the remaining jobs in cron.daily from running appropriately.

Comment 33 Daniel Novotny 2009-09-15 13:17:30 UTC

Ian, is there a logrotate process stuck anywhere?

ps ax | grep logrotate

Comment 34 Ian 2009-09-15 13:29:14 UTC

Daniel, 

There is the awk process.

 7389 ?        SN     0:00 awk -v progname /etc/cron.daily/logrotate progname {?????   print progname ":\n"?????   progname="";????       }????       { print; }

Looks suspicious!

Thanks again for you help. Ian

Comment 35 Daniel Novotny 2009-09-15 14:13:55 UTC

/usr/bin/run-parts contains:

for i in $(LC_ALL=C; echo $1/*[^~,]) ; do
#...
 $i 2>&1 | awk -v "progname=$i" \
                              'progname {
                                   print progname ":\n"
                                   progname="";
                               }
                               { print; }'

the "awk" is there just to print the program's name before the program's output
when "$i" gets stuck, poor awk on the other side of the pipe just waits...

"$i" is logrotate, in this case, can you show me your /etc/logrotate.conf ?

Comment 36 Ian 2009-09-15 14:34:48 UTC

Here it is:


# see "man logrotate" for details
# rotate log files weekly
weekly

# keep 4 weeks worth of backlogs
rotate 4

# create new (empty) log files after rotating old ones
create

# use date as a suffix of the rotated file
dateext

# uncomment this if you want your log files compressed
#compress

# RPM packages drop log rotation information into this directory
include /etc/logrotate.d

# no packages own wtmp and btmp -- we'll rotate them here
/var/log/wtmp {
    monthly
    create 0664 root utmp
	minsize 1M
    rotate 1
}

/var/log/btmp {
    missingok
    monthly
    create 0600 root utmp
    rotate 1
}

# system-specific logs may be also be configured here.

Comment 37 Tomas Mraz 2009-09-15 14:43:07 UTC

The source of the problem is probably the following. The write end of the pipe is still opened by some previously child process of the logrotate. This child process is probably a daemon which was spawned by the logrotate when it is restarting the rotated services. logrotate then exits but the awk is not getting EOF on the read from the stdin because the pipe is still open by the daemon.

The solution is either to find out which daemon it is and daemonize it properly - so it opens /dev/null for its standard descriptors.
http://www-theorie.physik.unizh.ch/~dpotter/howto/daemonize

The other possibility is to change logrotate (or the rotate script of the daemon) to always redirect the output and stderr when restarting the rotated services to /dev/null.

Comment 38 Ian 2009-09-15 15:07:05 UTC

I shall change the logrotate script to:

#!/bin/sh

/usr/sbin/logrotate /etc/logrotate.conf > /dev/null 2>&1 
EXITVALUE=$?
if [ $EXITVALUE != 0 ]; then
    /usr/bin/logger -t logrotate "ALERT exited abnormally with [$EXITVALUE]"
fi
exit 0

Thanks again.
Ian

Comment 39 Daniel Novotny 2009-09-17 09:27:32 UTC

Ian, Tomas: OK, I will change the rotate script /etc/cron.daily/logrotate (which is owned by the logrotate package) to redirect logrotate output to /dev/null, this will fix the bug

Comment 40 Daniel Novotny 2009-09-17 09:46:14 UTC

fixed in rawhide: logrotate-3.7.8-4.fc12

Comment 41 Tomas Mraz 2009-09-17 09:54:46 UTC

Well, it will workaround the problem in one of the daemons which are being restarted by logrotate. However I agree that the fix/workaround should be applied.

Comment 42 Tomas Mraz 2009-09-17 09:56:03 UTC

oops sorry for accidental reopen.

Comment 43 Daniel Novotny 2009-09-17 10:00:06 UTC

Created attachment 361467 [details]
config file patch

btw the patch is here

Comment 44 Jan Kaluža 2011-12-09 09:41:28 UTC

Sorry for opening this bug again, but I would really like to try to find out which daemon exactly is causing that and eventually fix it.

I presume you don't have that machine still running, right? If you do, could you send the "ls /etc/logrotate.d/" output please?

Comment 45 Richard Marko 2013-01-03 19:24:46 UTC

I've hit this due to stall NFS handle blocking df -

init-+-anacron---run-parts-+-0logwatch---logwatch-+-perl---df

Comment 46 Tomas Mraz 2013-01-03 22:31:05 UTC

The question is what should be done in such situation - should the stalled job be killed (if possible at all)?

Comment 47 Jan Kaluža 2013-02-04 11:35:40 UTC

I think that's question for anacron maintainer, I will add proper NEEDINFO.

Comment 48 Marcela Mašláňová 2013-02-04 13:48:31 UTC

Tomas is working with me on anacron/cronie, he's comaintainer. He has valid point. The question is what you would suggest to do with stalled jobs? Cron or anacron execute the job, but if the job is "wrong", how should I decide what to do next?

Comment 49 James 2015-03-17 16:38:13 UTC

I recently ran into this bug on a CentOS 6.3 webserver.  The symptom was that logs were not rotating on the server, causing disk to fill up.  Luckily we caught it before disk filled up too much.

In checking the cron logs, I found that on Feb 25th, anacron stopped running correctly:

(09:18:20) ○ [jcuzella.com] ~
→ cat /etc/redhat-release
CentOS release 6.3 (Final)

(09:18:21) ○ [jcuzella.com] ~
→ sudo zgrep 'anacron' -C3 -in /var/log/cron*
/var/log/cron-20150317:Feb 24 20:01:01 web-00 run-parts(/etc/cron.hourly)[8551]: finished 0anacron
/var/log/cron-20150317:Feb 24 21:01:01 web-00 run-parts(/etc/cron.hourly)[10133]: starting 0anacron
/var/log/cron-20150317:Feb 24 21:01:01 web-00 run-parts(/etc/cron.hourly)[10142]: finished 0anacron
/var/log/cron-20150317:Feb 24 22:01:01 web-00 run-parts(/etc/cron.hourly)[11733]: starting 0anacron
/var/log/cron-20150317:Feb 24 22:01:01 web-00 run-parts(/etc/cron.hourly)[11742]: finished 0anacron
/var/log/cron-20150317:Feb 24 23:01:01 web-00 run-parts(/etc/cron.hourly)[13304]: starting 0anacron
/var/log/cron-20150317:Feb 24 23:01:01 web-00 run-parts(/etc/cron.hourly)[13313]: finished 0anacron
/var/log/cron-20150317:Feb 25 00:01:01 web-00 run-parts(/etc/cron.hourly)[14885]: starting 0anacron
/var/log/cron-20150317:Feb 25 00:01:01 web-00 anacron[14896]: Anacron started on 2015-02-25
/var/log/cron-20150317:Feb 25 00:01:01 web-00 anacron[14896]: Will run job `cron.daily' in 0 min.
/var/log/cron-20150317:Feb 25 00:01:01 web-00 anacron[14896]: Jobs will be executed sequentially
/var/log/cron-20150317:Feb 25 00:01:01 web-00 run-parts(/etc/cron.hourly)[14898]: finished 0anacron
/var/log/cron-20150317:Feb 25 00:01:01 web-00 anacron[14896]: Job `cron.daily' started
/var/log/cron-20150317:Feb 25 01:01:01 web-00 run-parts(/etc/cron.hourly)[16696]: starting 0anacron
/var/log/cron-20150317:Feb 25 01:01:01 web-00 anacron[16707]: Anacron started on 2015-02-25
/var/log/cron-20150317:Feb 25 01:01:01 web-00 anacron[16707]: Job `cron.daily' locked by another anacron - skipping
/var/log/cron-20150317:Feb 25 01:01:01 web-00 anacron[16707]: Normal exit (0 jobs run)
/var/log/cron-20150317:Feb 25 01:01:01 web-00 run-parts(/etc/cron.hourly)[16709]: finished 0anacron
/var/log/cron-20150317:Feb 25 02:01:02 web-00 run-parts(/etc/cron.hourly)[18305]: starting 0anacron
/var/log/cron-20150317:Feb 25 02:01:02 web-00 anacron[18316]: Anacron started on 2015-02-25
/var/log/cron-20150317:Feb 25 02:01:02 web-00 anacron[18316]: Job `cron.daily' locked by another anacron - skipping
/var/log/cron-20150317:Feb 25 02:01:02 web-00 anacron[18316]: Normal exit (0 jobs run)
[...SNIP...]
/var/log/cron-20150317:Mar 17 08:01:01 web-00 run-parts(/etc/cron.hourly)[22854]: starting 0anacron
/var/log/cron-20150317:Mar 17 08:01:01 web-00 anacron[22865]: Anacron started on 2015-03-17
/var/log/cron-20150317:Mar 17 08:01:01 web-00 anacron[22865]: Job `cron.daily' locked by another anacron - skipping
/var/log/cron-20150317:Mar 17 08:01:01 web-00 anacron[22865]: Normal exit (0 jobs run)
/var/log/cron-20150317:Mar 17 08:01:01 web-00 run-parts(/etc/cron.hourly)[22867]: finished 0anacron

(09:23:19) ○ [jcuzella.com] ~
→ ps auxww | grep cron | grep -v grep
root      1749  0.0  0.0 115348   768 ?        Ss    2013   4:09 crond
root     14896  0.0  0.0  17508   712 ?        Ss   Feb25   0:00 /usr/sbin/anacron -s
root     14899  0.0  0.0   9228  1188 ?        SN   Feb25   0:00 /bin/bash /usr/bin/run-parts /etc/cron.daily
root     15071  0.0  0.0   9228  1112 ?        SN   Feb25   0:00 /bin/sh /etc/cron.daily/mlocate.cron
root     15072  0.0  0.0   9084   776 ?        SN   Feb25   0:00 awk -v progname=/etc/cron.daily/mlocate.cron progname {?????   print progname ":\n"?????   progname="";????       }????       { print; }

(09:45:05) ○ [jcuzella.com] ~
→ pstree -p $(pidof anacron)
anacron(14896)───run-parts(14899)─┬─awk(15072)
                                  └─mlocate.cron(15071)───updatedb(15077)

(09:57:45) ○ [jcuzella.com] ~
→ sudo lsof -p 15072
[sudo] password for jcuzella:
lsof: WARNING: can't stat() nfs file system /kickstart
      Output information may be incomplete.
^C

(09:58:09) ○ [jcuzella.com] ~
→ sudo lsof -p 15071
lsof: WARNING: can't stat() nfs file system /kickstart
      Output information may be incomplete.

^C

(10:05:49) ○ [jcuzella.com] ~
→ sudo lsof -n
lsof: WARNING: can't stat() nfs file system /kickstart
      Output information may be incomplete.

^C

(10:18:17) ○ [jcuzella.com] ~
→ cat /proc/mounts  | grep kickstart
10.123.213.123:/kickstart /kickstart nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.123.213.123,mountvers=3,mountport=684,mountproto=udp,local_lock=none,addr=10.123.213.123 0 0
(10:18:41) ○ [jcuzella.com] ~
→ ls -l /kickstart
ls: cannot access /kickstart: Stale NFS file handle

(10:18:19) ○ [jcuzella.com] ~
→ cat /etc/cron.daily/mlocate.cron
#!/bin/sh
nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')
renice +19 -p $$ >/dev/null 2>&1
ionice -c2 -n7 -p $$ >/dev/null 2>&1
/usr/bin/updatedb -f "$nodevs"


I believe that anacron was blocked by a stale NFS file handle.  The NFS file handle for /kickstart must have become stale which caused the script: /etc/cron.daily/mlocate.cron to block, which caused anacron to stay running, which blocked the other cron.daily jobs from running, which caused the apache logs to not rotate since Feb 25th, which caused disk to start filling up.  ( What a Rube Goldberg-esque chain of events! )

Any tips on how to stop mlocate.cron from blocking when an NFS mount becomes stale?

Comment 50 Tomas Mraz 2015-03-17 16:46:44 UTC

I think the proper solution for these issues lies in the kernel. There should be a timeout for the NFS stale handles so the applications that touch them do not block indefinitely but receive an error.

Comment 51 James 2015-03-18 01:33:58 UTC

Was able to fix the immediate problem by manual unmount/remount, then killing off blocked processes:

sudo umount /kickstart
sudo mount /kickstart



sudo kill $(pidof anacron)
sudo kill 15072
sudo kill 15071
sudo kill 14899

sudo /etc/init.d/crond restart


Then manual kickoff of anacron seemed to work & logs were rotated:

bash -x /etc/cron.hourly/0anacron
+ test -r /var/spool/anacron/cron.daily
++ date +%Y%m%d
+ '[' 20150317 = '' ']'
+ test -x /usr/bin/on_ac_power
+ /usr/bin/on_ac_power
+ test 0 -eq 1
+ /usr/sbin/anacron -s

Comment 52 Jan Kaluža 2015-07-22 14:45:14 UTC

I think there is nothing I can do from logrotate point of view right now. I'm going to close this bug again, if you think something in logrotate should be fixed, please reopen with some explanation.

Comment 53 David Zambonini 2015-10-27 14:13:53 UTC

See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=379645

Comment 54 RobbieTheK 2017-11-29 17:18:44 UTC

I'm seeing this on Fedora 26 4.12.8-300.fc26.x86_64. Seems to be with logwatch. Here are some of the requested process details and gdb.

lslocks -p 27094
COMMAND PID TYPE SIZE MODE M START END PATH
anacron 27094 POSIX 9B WRITE 0 0 0 /var/spool/anacron/cron.daily

pstree 27094
anacron---run-parts-+-0logwatch---logwatch-+-perl---df
| `-sendmail
`-awk

Reading symbols from anacron...Reading symbols from /usr/lib/debug/usr/sbin/anacron.debug...done.
done.
Attaching to program: /usr/sbin/anacron, process 27094
Reading symbols from /lib64/libselinux.so.1...Reading symbols from /usr/lib/debug/usr/lib64/libselinux.so.1.debug...done.
done.
Reading symbols from /lib64/libpam.so.0...Reading symbols from /usr/lib/debug/usr/lib64/libpam.so.0.84.2.debug...done.
done.
Reading symbols from /lib64/libdl.so.2...Reading symbols from /usr/lib/debug/lib64/libdl-2.25.so.debug...done.
done.
Reading symbols from /lib64/libaudit.so.1...Reading symbols from /usr/lib/debug/lib64/libaudit.so.1.0.0.debug...done.
done.
Reading symbols from /lib64/libc.so.6...Reading symbols from /usr/lib/debug/lib64/libc-2.25.so.debug...done.
done.
Reading symbols from /lib64/libpcre.so.1...Reading symbols from /usr/lib/debug/usr/lib64/libpcre.so.1.2.9.debug...done.
done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib64/ld-2.25.so.debug...done.
done.
Reading symbols from /lib64/libcap-ng.so.0...Reading symbols from /usr/lib/debug/lib64/libcap-ng.so.0.0.0.debug...done.
done.
Reading symbols from /lib64/libpthread.so.0...Reading symbols from /usr/lib/debug/lib64/libpthread-2.25.so.debug...done.
done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f8f2a0cad66 in __GI___sigsuspend (set=0x7fff9d54be90) at ../sysdeps/unix/sysv/linux/sigsuspend.c:30
30 return SYSCALL_CANCEL (rt_sigsuspend, set, _NSIG / 8);

Comment 55 Tomas Mraz 2017-11-30 08:36:00 UTC

Please report this against logwatch. Also you need to find out why the sendmail or df is stuck.

Comment 56 RobbieTheK 2017-11-30 15:39:25 UTC

(In reply to Tomas Mraz from comment #55)
> Please report this against logwatch. Also you need to find out why the
> sendmail or df is stuck.

I opened a new bug but how do I figure how why those are stuck? sendmail is running just fine.

Comment 57 Tomas Mraz 2017-11-30 15:41:50 UTC

You show above that logwatch spawned sendmail and df and is waiting for them to finish - you need to look up these processes and if they are stuck then this is the real problem.

Comment 58 RobbieTheK 2017-11-30 15:53:36 UTC

(In reply to Tomas Mraz from comment #57)
> You show above that logwatch spawned sendmail and df and is waiting for them
> to finish - you need to look up these processes and if they are stuck then
> this is the real problem.

Apologies for turning this into a how-to but does any of the below help? Any other commands to check?

pstree| grep df
        |-anacron---run-parts-+-0logwatch---logwatch-+-perl---df
        |-logwatch-+-perl---df
        |-perl---df

pstree| grep sendmail
        |                     |                      `-sendmail
        |-fail2ban-server-+-sh---sendmail
        |          `-sendmail
        |-sendmail---sendmail
        |-sendmail

ps -auwx | grep df
dbus      5410  0.0  0.0  45512  5308 ?        Ss   Nov29   0:36 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root      5910  0.0  0.0   4484   752 ?        DN   Nov17   0:20 df -h -x tmpfs -x udf -x iso9660
root      8379  0.0  0.0   4484   776 ?        D    Nov28   0:03 df -h -x tmpfs -x udf -x iso9660
df
root     31576  0.0  0.0   4484   744 ?        DN   Nov29   0:02 df -h -x tmpfs -x udf -x iso9660

ps -auwx | grep sendmail
smmsp     4867  0.0  0.0  83176  5632 ?        SN   Nov17   0:00 /usr/sbin/sendmail -t
root     14643  0.0  0.0 111600  4688 ?        Ss   Nov29   0:04 sendmail: accepting connections
smmsp    14653  0.0  0.0  83172  3388 ?        Ss   Nov29   0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
smmsp    30299  0.0  0.0  83388  8096 ?        S    10:50   0:00 sendmail: ./vAUFor3s030299 [127.0.0.1]: client DATA status
root     30318  0.0  0.0 111732  7244 ?        D    10:50   0:00 sendmail: vAUFosMc030318 localhost [127.0.0.1]: DATA
sendmail
smmsp    30657  0.0  0.0  83176  8216 ?        SN   Nov29   0:00 /usr/sbin/sendmail -t

cat /proc/27094/task/27094/status
Name:	anacron
Umask:	0022
State:	S (sleeping)
Tgid:	27094
Ngid:	0
Pid:	27094
PPid:	1
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	64
Groups:	0 
NStgid:	27094
NSpid:	27094
NSpgid:	27094
NSsid:	27094
VmPeak:	  129680 kB
VmSize:	  129680 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	    2176 kB
VmRSS:	    2176 kB
RssAnon:	     232 kB
RssFile:	    1944 kB
RssShmem:	       0 kB
VmData:	     284 kB
VmStk:	     132 kB
VmExe:	      32 kB
VmLib:	    2900 kB
VmPTE:	      68 kB
VmPMD:	      12 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
Threads:	1
SigQ:	3/96296
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	0000000180012200
CapInh:	0000000000000000
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	0
Cpus_allowed:	ffffff
Cpus_allowed_list:	0-23
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:	0-1
voluntary_ctxt_switches:	36
nonvoluntary_ctxt_switches:	0

Comment 59 RobbieTheK 2017-11-30 16:58:54 UTC

(In reply to Tomas Mraz from comment #57)
> You show above that logwatch spawned sendmail and df and is waiting for them
> to finish - you need to look up these processes and if they are stuck then
> this is the real problem.

Indeed the df command from logwatch "df -h -x tmpfs -x udf -x iso9660", just hangs when run on its own. Tried on another server it responds right away. I can't reboot at the moment but any reason why a df command would hang for a long time? Any service(s) I can restart? The load is a bit high, around 4, but no process(es) stick out.

Comment 60 Tomas Mraz 2017-11-30 17:05:12 UTC

It's most probably blocked on some device. Stracing the process can show which operation on which descriptor is blocked. Anyway this bugzilla is not a support platform and this is not a problem of anacron, or even logwatch.