Bug 1587995

Summary: auditd sometimes in failed state after boot
Product: Red Hat Enterprise Linux 7 Reporter: redhat-airlock
Component: auditAssignee: Steve Grubb <sgrubb>
Status: CLOSED ERRATA QA Contact: BaseOS QE Security Team <qe-baseos-security>
Severity: high Docs Contact: Ioanna Gkioka <igkioka>
Priority: urgent    
Version: 7.5CC: dpal, fkrska, igkioka, mmatsuya, omoris, rahranja, sgrubb
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: audit-2.8.4-3.el7 Doc Type: Bug Fix
Doc Text:
Previously, systemd in certain cases terminated the audit daemon unexpectedly during startup. As a consequence, the audit.service was in a failed state after the boot process. With this update, an unnecessary function call has been removed, and the described problem no longer occurs.
Story Points: ---
Clone Of:
: 1607298 (view as bug list) Environment:
Last Closed: 2018-10-30 11:28:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1607298    
Attachments:
Description Flags
Proposed patch none

Description redhat-airlock 2018-06-06 12:44:47 UTC
Description of problem:
Since the update to 7.5 we experience that auditd is sometimes in failed state after boot. From the description it looks very similar to #1470810: Systemd tries to start auditd, tries to terminate it after 1min without success, then kills it after further 2 min:

root@myhost:~[26] # systemctl status auditd
-------------------------
● auditd.service - Security Auditing Service
   Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Wed 2018-06-06 09:12:03 CEST; 50min ago
     Docs: man:auditd(8)
           https://github.com/linux-audit/audit-documentation
  Process: 491 ExecStart=/sbin/auditd (code=killed, signal=KILL)

Jun 06 09:09:03 myhost systemd[1]: Starting Security Auditing Service...
Jun 06 09:09:03 myhost auditd[501]: Started dispatcher: /sbin/audispd pid: 503
Jun 06 09:10:33 myhost systemd[1]: auditd.service start operation timed out. Terminating.
Jun 06 09:12:03 myhost systemd[1]: auditd.service stop-final-sigterm timed out. Killing.
Jun 06 09:12:03 myhost systemd[1]: auditd.service: control process exited, code=killed status=9
Jun 06 09:12:03 myhost systemd[1]: Failed to start Security Auditing Service.
Jun 06 09:12:03 myhost systemd[1]: Unit auditd.service entered failed state.
Jun 06 09:12:03 myhost systemd[1]: auditd.service failed.
root@myhost:~[27] #
-------------------------

In #1470810 a "netlink communication" bug was mentioned, which was fixed in Fedora with kernel 4.12, which might be the root of this problem. Unfortunately there is no reference which bug that is, and so I could not find out if this bug does exist in 7.5 (and not in 7.4?).


Version-Release number of selected component (if applicable): 
audit-2.8.1-3.el7.x86_64
kernel-3.10.0-862.2.3.el7.x86_64


How reproducible:
Unfortunately it does not happen on all Systems. What sets the Systems apart we are able to reproduce it is, that it is a Virtualbox VM with an single partition installation (no LVM).

Steps to Reproduce:
1. reboot system 
2. log in check for failed services with "systemcl --failed"
3. repeat if no failed auditd is reported (2-10 iterations needed)



Actual results:
# systemctl --failed
  UNIT           LOAD   ACTIVE SUB    DESCRIPTION
● auditd.service loaded failed failed Security Auditing Service

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.


Expected results:
# systemctl --failed
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.


Additional info:

Comment 2 Steve Grubb 2018-06-06 16:55:50 UTC
Hello.By any chance did you try to run with the debug version of auditd mentioned in bug 1470810? It might be the same thing or maybe something else.

Comment 3 redhat-airlock 2018-06-07 06:45:06 UTC
Hi, 
The debug RPM on [1] seems to be no longer available. Even if it was, it is a different version (audit-2.7.7-2.fc26 vs audit-2.8.1-3.el7.x86_64). 

If it's possible for you to provide a debug version of the el7 RPM I can test it with that

[1] https://koji.fedoraproject.org/koji/buildinfo?buildID=918802

Comment 4 redhat-airlock 2018-06-27 08:06:21 UTC
The debug patch is in this repo: git://pkgs.fedoraproject.org/rpms/audit , commit f37518040e617f75b7f8bc3e225b6c90fb9e4602 .

It was difficult to find the bug, but we have found it :-)

This is the stack of a hanging auditd process (pstack):

Thread 2 (Thread 0x7f7f0c4c0700 (LWP 4999)):
#0  0x00007f7f0e3c7995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055d600e8efc3 in flush_thread_main ()
#2  0x00007f7f0e3c3e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7f0d6aebad in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f7f0f25a880 (LWP 4998)):
#0  0x00007f7f0d6bc89c in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f7f0d6669bc in _L_lock_2546 () from /lib64/libc.so.6
#2  0x00007f7f0d6667f7 in __tz_convert () from /lib64/libc.so.6
#3  0x00007f7f0d6a873a in __vsyslog_chk () from /lib64/libc.so.6
#4  0x00007f7f0e7e4025 in audit_msg () from /lib64/libaudit.so.1
#5  0x000055d600e94e6d in dispatcher_reaped ()
#6  0x000055d600e8e0a5 in child_handler ()
#7  <signal handler called>
#8  0x00007f7f0d6a906a in mmap64 () from /lib64/libc.so.6
#9  0x00007f7f0d61de21 in __GI__IO_file_doallocate () from /lib64/libc.so.6
#10 0x00007f7f0d62cc27 in __GI__IO_doallocbuf () from /lib64/libc.so.6
#11 0x00007f7f0d62a61c in __GI__IO_file_xsgetn () from /lib64/libc.so.6
#12 0x00007f7f0d62945e in fread_unlocked () from /lib64/libc.so.6
#13 0x00007f7f0d666c8d in __tzfile_read () from /lib64/libc.so.6
#14 0x00007f7f0d665e64 in tzset_internal () from /lib64/libc.so.6
#15 0x00007f7f0d666813 in __tz_convert () from /lib64/libc.so.6
#16 0x00007f7f0d6a873a in __vsyslog_chk () from /lib64/libc.so.6
#17 0x00007f7f0e7e4025 in audit_msg () from /lib64/libaudit.so.1
#18 0x000055d600e94d45 in init_dispatcher ()
#19 0x000055d600e8d05e in main ()

This means that syslog() was called in a signal handler. syslog() is not async-signal-safe, see the "signal-safety" man page. The process hangs when it tries to call syslog() from the signal handler.

We found some more bugs with a code review: FD_CLOEXEC is not set on init_pipe[1] (a simple typo) and EINTR handling is missing for a read().

With all 3 bugs fixed, auditd starts reliably without hanging.

It would be great if you could integrate the bugfixes into an update.

Comment 5 redhat-airlock 2018-06-27 08:07:36 UTC
Created attachment 1454967 [details]
Proposed patch

Comment 6 Steve Grubb 2018-06-27 12:38:34 UTC
Thanks for the patch and troubleshooting! I am truly surprised by this. The libev documentation says that signals are serialized with other events and not delivered asynchronously (from a signal handler). In any event, the message being logged is not important and is more like a debug level message.

I should be able to fit this into the next release. Thanks!

Comment 7 Steve Grubb 2018-06-27 15:02:37 UTC
audit-2.8.4-3.el7 was built to address this issue.

Comment 9 redhat-airlock 2018-06-27 15:13:42 UTC
> I should be able to fit this into the next release. Thanks!

Great!


> The libev documentation says that signals are serialized with other events and not delivered asynchronously (from a signal handler).

It happens during startup, probably before libev is in control. Probably it's this signal handler:

/* This signal handler gets replaced later. Its here in case
 * the dispatcher exits before libev is in control */
sa.sa_handler = child_handler2;
sigaction(SIGCHLD, &sa, NULL);


child_handler2() calls child_handler(), and child_handler() calls dispatcher_reaped() which calls syslog().

Comment 16 errata-xmlrpc 2018-10-30 11:28:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3237