1587995 – auditd sometimes in failed state after boot

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1587995 - auditd sometimes in failed state after boot

Summary: auditd sometimes in failed state after boot

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	audit
Sub Component:
Version:	7.5
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Steve Grubb
QA Contact:	BaseOS QE Security Team
Docs Contact:	Ioanna Gkioka
URL:
Whiteboard:
Depends On:
Blocks:	1607298
TreeView+	depends on / blocked

Reported:	2018-06-06 12:44 UTC by redhat-airlock
Modified:	2021-09-09 14:25 UTC (History)
CC List:	7 users (show)
Fixed In Version:	audit-2.8.4-3.el7
Doc Type:	Bug Fix
Doc Text:	Previously, systemd in certain cases terminated the audit daemon unexpectedly during startup. As a consequence, the audit.service was in a failed state after the boot process. With this update, an unnecessary function call has been removed, and the described problem no longer occurs.
Clone Of:
Clones:	1607298 (view as bug list)
Environment:
Last Closed:	2018-10-30 11:28:31 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Proposed patch (864 bytes, patch) 2018-06-27 08:07 UTC, redhat-airlock	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3568691	0	None	None	None	2018-08-17 09:12:19 UTC
Red Hat Product Errata	RHBA-2018:3237	0	None	None	None	2018-10-30 11:29:01 UTC

Description redhat-airlock 2018-06-06 12:44:47 UTC

Description of problem:
Since the update to 7.5 we experience that auditd is sometimes in failed state after boot. From the description it looks very similar to #1470810: Systemd tries to start auditd, tries to terminate it after 1min without success, then kills it after further 2 min:

root@myhost:~[26] # systemctl status auditd
-------------------------
● auditd.service - Security Auditing Service
   Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Wed 2018-06-06 09:12:03 CEST; 50min ago
     Docs: man:auditd(8)
           https://github.com/linux-audit/audit-documentation
  Process: 491 ExecStart=/sbin/auditd (code=killed, signal=KILL)

Jun 06 09:09:03 myhost systemd[1]: Starting Security Auditing Service...
Jun 06 09:09:03 myhost auditd[501]: Started dispatcher: /sbin/audispd pid: 503
Jun 06 09:10:33 myhost systemd[1]: auditd.service start operation timed out. Terminating.
Jun 06 09:12:03 myhost systemd[1]: auditd.service stop-final-sigterm timed out. Killing.
Jun 06 09:12:03 myhost systemd[1]: auditd.service: control process exited, code=killed status=9
Jun 06 09:12:03 myhost systemd[1]: Failed to start Security Auditing Service.
Jun 06 09:12:03 myhost systemd[1]: Unit auditd.service entered failed state.
Jun 06 09:12:03 myhost systemd[1]: auditd.service failed.
root@myhost:~[27] #
-------------------------

In #1470810 a "netlink communication" bug was mentioned, which was fixed in Fedora with kernel 4.12, which might be the root of this problem. Unfortunately there is no reference which bug that is, and so I could not find out if this bug does exist in 7.5 (and not in 7.4?).


Version-Release number of selected component (if applicable): 
audit-2.8.1-3.el7.x86_64
kernel-3.10.0-862.2.3.el7.x86_64


How reproducible:
Unfortunately it does not happen on all Systems. What sets the Systems apart we are able to reproduce it is, that it is a Virtualbox VM with an single partition installation (no LVM).

Steps to Reproduce:
1. reboot system 
2. log in check for failed services with "systemcl --failed"
3. repeat if no failed auditd is reported (2-10 iterations needed)



Actual results:
# systemctl --failed
  UNIT           LOAD   ACTIVE SUB    DESCRIPTION
● auditd.service loaded failed failed Security Auditing Service

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.


Expected results:
# systemctl --failed
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.


Additional info:

Comment 2 Steve Grubb 2018-06-06 16:55:50 UTC

Hello.By any chance did you try to run with the debug version of auditd mentioned in bug 1470810? It might be the same thing or maybe something else.

Comment 3 redhat-airlock 2018-06-07 06:45:06 UTC

Hi, 
The debug RPM on [1] seems to be no longer available. Even if it was, it is a different version (audit-2.7.7-2.fc26 vs audit-2.8.1-3.el7.x86_64). 

If it's possible for you to provide a debug version of the el7 RPM I can test it with that

[1] https://koji.fedoraproject.org/koji/buildinfo?buildID=918802

Comment 4 redhat-airlock 2018-06-27 08:06:21 UTC

The debug patch is in this repo: git://pkgs.fedoraproject.org/rpms/audit , commit f37518040e617f75b7f8bc3e225b6c90fb9e4602 .

It was difficult to find the bug, but we have found it :-)

This is the stack of a hanging auditd process (pstack):

Thread 2 (Thread 0x7f7f0c4c0700 (LWP 4999)):
#0  0x00007f7f0e3c7995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055d600e8efc3 in flush_thread_main ()
#2  0x00007f7f0e3c3e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f7f0d6aebad in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f7f0f25a880 (LWP 4998)):
#0  0x00007f7f0d6bc89c in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f7f0d6669bc in _L_lock_2546 () from /lib64/libc.so.6
#2  0x00007f7f0d6667f7 in __tz_convert () from /lib64/libc.so.6
#3  0x00007f7f0d6a873a in __vsyslog_chk () from /lib64/libc.so.6
#4  0x00007f7f0e7e4025 in audit_msg () from /lib64/libaudit.so.1
#5  0x000055d600e94e6d in dispatcher_reaped ()
#6  0x000055d600e8e0a5 in child_handler ()
#7  <signal handler called>
#8  0x00007f7f0d6a906a in mmap64 () from /lib64/libc.so.6
#9  0x00007f7f0d61de21 in __GI__IO_file_doallocate () from /lib64/libc.so.6
#10 0x00007f7f0d62cc27 in __GI__IO_doallocbuf () from /lib64/libc.so.6
#11 0x00007f7f0d62a61c in __GI__IO_file_xsgetn () from /lib64/libc.so.6
#12 0x00007f7f0d62945e in fread_unlocked () from /lib64/libc.so.6
#13 0x00007f7f0d666c8d in __tzfile_read () from /lib64/libc.so.6
#14 0x00007f7f0d665e64 in tzset_internal () from /lib64/libc.so.6
#15 0x00007f7f0d666813 in __tz_convert () from /lib64/libc.so.6
#16 0x00007f7f0d6a873a in __vsyslog_chk () from /lib64/libc.so.6
#17 0x00007f7f0e7e4025 in audit_msg () from /lib64/libaudit.so.1
#18 0x000055d600e94d45 in init_dispatcher ()
#19 0x000055d600e8d05e in main ()

This means that syslog() was called in a signal handler. syslog() is not async-signal-safe, see the "signal-safety" man page. The process hangs when it tries to call syslog() from the signal handler.

We found some more bugs with a code review: FD_CLOEXEC is not set on init_pipe[1] (a simple typo) and EINTR handling is missing for a read().

With all 3 bugs fixed, auditd starts reliably without hanging.

It would be great if you could integrate the bugfixes into an update.

Comment 5 redhat-airlock 2018-06-27 08:07:36 UTC

Created attachment 1454967 [details]
Proposed patch

Comment 6 Steve Grubb 2018-06-27 12:38:34 UTC

Thanks for the patch and troubleshooting! I am truly surprised by this. The libev documentation says that signals are serialized with other events and not delivered asynchronously (from a signal handler). In any event, the message being logged is not important and is more like a debug level message.

I should be able to fit this into the next release. Thanks!

Comment 7 Steve Grubb 2018-06-27 15:02:37 UTC

audit-2.8.4-3.el7 was built to address this issue.

Comment 9 redhat-airlock 2018-06-27 15:13:42 UTC

> I should be able to fit this into the next release. Thanks!

Great!


> The libev documentation says that signals are serialized with other events and not delivered asynchronously (from a signal handler).

It happens during startup, probably before libev is in control. Probably it's this signal handler:

/* This signal handler gets replaced later. Its here in case
 * the dispatcher exits before libev is in control */
sa.sa_handler = child_handler2;
sigaction(SIGCHLD, &sa, NULL);


child_handler2() calls child_handler(), and child_handler() calls dispatcher_reaped() which calls syslog().

Comment 16 errata-xmlrpc 2018-10-30 11:28:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3237

Note You need to log in before you can comment on or make changes to this bug.