Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1226958

Summary: race condition causes many syslog messages: File a[[:xdigit:]]{13} is in wrong format - aborting
Product: Red Hat Enterprise Linux 6 Reporter: Bryan Totty <btotty>
Component: atAssignee: Tomas Mraz <tmraz>
Status: CLOSED ERRATA QA Contact: Radka Brychtova <rskvaril>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.6CC: btotty, cww, mdshaikh, psklenar, pvrabec, rsawhill, rskvaril
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: at-3.1.10-49.el6 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1319315 (view as bug list) Environment:
Last Closed: 2017-03-21 11:01:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1002711, 1269194, 1319315, 1343211    

Description Bryan Totty 2015-06-01 14:15:26 UTC
Description of problem:

Please refer to the following URL: 

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=638256

--------------------

"In my experience, when a file system fills up, it often seems that there's room
in the inode table when there's no room for file bodies.  Unintentionally empty
files become common.  If /var/spool/cron/atjobs ends up containing a zero-length
job file for a job due over an hour ago, then atd sometimes goes nuts, logging
an unbounded number of messages like:

2011-08-09T16:11:12-07:00 merc55rm atd[4194]: File a00afd014dc579 is in wrong format - aborting

Even when the original reason for the full file system is removed, atd can go
on to fill it again with gigabytes of syslog entries.

Google finds many similar complaints over the years:

http://goo.gl/rvdP6
http://ubuntuforums.org/archive/index.php/t-1575261.html
https://bugzilla.redhat.com/show_bug.cgi?id=718422
http://forums.fedoraforum.org/showthread.php?t=252412

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=101919 may have been the cause
of an apparently corrupted at job.  What I'm focusing on here is the manifold
repetition of the error message, assuming a corrupted at job.

On one particular machine, I was able to reproduce this, most of the time, with
this test:

sudo su -
cd /var/spool/cron/atjobs
/etc/init.d/atd stop
touch a0000000000000
chmod +x a0000000000000
grep -w atd /var/log/syslog | wc -l
/etc/init.d/atd start
sleep 1
grep -w atd /var/log/syslog | wc -l
sleep 1
grep -w atd /var/log/syslog | wc -l

The number of lines of atd message rises for a while, then stabilizes.

I wasn't able to reliably reproduce the issue on a couple of other machines,
including the one from which I'm reporting the bug, unless I started atd from
within valgrind:

valgrind --trace-children=yes -q /etc/init.d/atd start

An strace extract, available on request, from the system where I could easily
reproduce the problem, showed that atd stabilized when the child that logs
the "aborting" message exits before the main atd process goes to sleep.
If the parent wins the race, then the sleep is interrupted by SIGCHLD.  The
parent then calls run_loop again and the cycle repeats.

My suggested fix sleeps again if we're woken prematurely.

-- System Information:
Debian Release: 5.0.8
  APT prefers oldstable
  APT policy: (500, 'oldstable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.26-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages at depends on:
ii  exim4                     4.69-9+lenny4  metapackage to ease Exim MTA (v4) 
ii  exim4-daemon-heavy [mail- 4.69-9+lenny4  Exim MTA (v4) daemon with extended
ii  libc6                     2.7-18lenny7   GNU C Library: Shared libraries
ii  libpam0g                  1.0.1-5+lenny1 Pluggable Authentication Modules l
ii  lsb-base                  3.2-20         Linux Standard Base 3.2 init scrip

at recommends no packages.

at suggests no packages.

-- no debconf information"

--------------------


atd.c.patch:

--- /tmp/at-3.1.10.2/atd.c.orig	2011-08-16 19:14:41.000000000 -0700
+++ /tmp/at-3.1.10.2/atd.c	2011-08-16 19:15:54.000000000 -0700
@@ -792,11 +792,12 @@
 
     daemon_setup();
 
+    now = time(NULL);
     do {
-	now = time(NULL);
 	next_invocation = run_loop();
-	if (next_invocation > now) {
+	while (!term_signal && next_invocation > now) {
 	   sleep(next_invocation - now);
+	   now = time(NULL);
 	}
     } while (!term_signal);
     daemon_cleanup();

Comment 5 Tomas Mraz 2015-06-01 14:54:24 UTC
Note that at command should not create the empty job files anymore with at-3.1.10-48.el6 so this makes the problem of very low severity.

Also the corruption due to long user name was fixed in upstream 3.1.9 so this should not be the case either.

I am not saying that we should not patch this race condition anyway, just that it would be very interesting to know why the customer hits it.

Comment 10 Tomas Mraz 2015-09-09 08:51:31 UTC
Note that the patch proposed is not correct because it breaks scheduling of new jobs by atd to short times after 'now'.

Comment 11 Tomas Mraz 2015-09-09 12:06:56 UTC
The proper solution is to delete the bogus file in the atd.

Comment 19 Radka Brychtova 2016-10-19 13:35:33 UTC
OLD version:
at-3.1.10-48.el6.i686
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: Test
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [  BEGIN   ] :: Running 'cd /var/spool/at'
:: [   PASS   ] :: Command 'cd /var/spool/at' (Expected 0, got 0)
atd (pid  5267) is running...
Stopping atd: [  OK  ]
:: [  BEGIN   ] :: Create empty file  :: actually running 'touch a0000000000000'
:: [   PASS   ] :: Create empty file  (Expected 0, got 0)
:: [  BEGIN   ] :: Running 'chmod +x a0000000000000'
:: [   PASS   ] :: Command 'chmod +x a0000000000000' (Expected 0, got 0)
:: [   PASS   ] :: There should be no atd message in logfiles (Assert: expected 0, got 0)
atd is stopped
Starting atd: [  OK  ]
:: [ 09:30:18 ] :: rlServiceStart: Service atd started successfully
:: [   FAIL   ] :: In log should be only one warning (Assert: '1' should equal '69')
:: [   FAIL   ] :: File a0000000000000 should not exist 
'ba094dcf-1022-4715-a4cc-1ead20892747'
Test result: FAIL


NEW version:
 at-3.1.10-49.el6.i686
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: Test
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [  BEGIN   ] :: Running 'cd /var/spool/at'
:: [   PASS   ] :: Command 'cd /var/spool/at' (Expected 0, got 0)
atd (pid  15964) is running...
Stopping atd: [  OK  ]
:: [  BEGIN   ] :: Create empty file  :: actually running 'touch a0000000000000'
:: [   PASS   ] :: Create empty file  (Expected 0, got 0)
:: [  BEGIN   ] :: Running 'chmod +x a0000000000000'
:: [   PASS   ] :: Command 'chmod +x a0000000000000' (Expected 0, got 0)
:: [   PASS   ] :: There should be no atd message in logfiles (Assert: expected 0, got 0)
atd is stopped
Starting atd: [  OK  ]
:: [ 09:55:56 ] :: rlServiceStart: Service atd started successfully
:: [   PASS   ] :: In log should be only one warning (Assert: '1' should equal '1')
:: [   PASS   ] :: File a0000000000000 should not exist 
'73675477-aec4-4d7d-a2c8-3bddfb5074d5'
Test result: PASS

Since the test passed on all architectures => VERIFIED

Comment 21 errata-xmlrpc 2017-03-21 11:01:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0703.html