Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Previously, systemd in certain cases terminated the audit daemon unexpectedly during startup. As a consequence, the audit.service was in a failed state after the boot process. With this update, an unnecessary function call has been removed, and the described problem no longer occurs.
Description of problem:
Since the update to 7.5 we experience that auditd is sometimes in failed state after boot. From the description it looks very similar to #1470810: Systemd tries to start auditd, tries to terminate it after 1min without success, then kills it after further 2 min:
root@myhost:~[26] # systemctl status auditd
-------------------------
● auditd.service - Security Auditing Service
Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; vendor preset: enabled)
Active: failed (Result: signal) since Wed 2018-06-06 09:12:03 CEST; 50min ago
Docs: man:auditd(8)
https://github.com/linux-audit/audit-documentation
Process: 491 ExecStart=/sbin/auditd (code=killed, signal=KILL)
Jun 06 09:09:03 myhost systemd[1]: Starting Security Auditing Service...
Jun 06 09:09:03 myhost auditd[501]: Started dispatcher: /sbin/audispd pid: 503
Jun 06 09:10:33 myhost systemd[1]: auditd.service start operation timed out. Terminating.
Jun 06 09:12:03 myhost systemd[1]: auditd.service stop-final-sigterm timed out. Killing.
Jun 06 09:12:03 myhost systemd[1]: auditd.service: control process exited, code=killed status=9
Jun 06 09:12:03 myhost systemd[1]: Failed to start Security Auditing Service.
Jun 06 09:12:03 myhost systemd[1]: Unit auditd.service entered failed state.
Jun 06 09:12:03 myhost systemd[1]: auditd.service failed.
root@myhost:~[27] #
-------------------------
In #1470810 a "netlink communication" bug was mentioned, which was fixed in Fedora with kernel 4.12, which might be the root of this problem. Unfortunately there is no reference which bug that is, and so I could not find out if this bug does exist in 7.5 (and not in 7.4?).
Version-Release number of selected component (if applicable):
audit-2.8.1-3.el7.x86_64
kernel-3.10.0-862.2.3.el7.x86_64
How reproducible:
Unfortunately it does not happen on all Systems. What sets the Systems apart we are able to reproduce it is, that it is a Virtualbox VM with an single partition installation (no LVM).
Steps to Reproduce:
1. reboot system
2. log in check for failed services with "systemcl --failed"
3. repeat if no failed auditd is reported (2-10 iterations needed)
Actual results:
# systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● auditd.service loaded failed failed Security Auditing Service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
Expected results:
# systemctl --failed
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
Additional info:
Hi,
The debug RPM on [1] seems to be no longer available. Even if it was, it is a different version (audit-2.7.7-2.fc26 vs audit-2.8.1-3.el7.x86_64).
If it's possible for you to provide a debug version of the el7 RPM I can test it with that
[1] https://koji.fedoraproject.org/koji/buildinfo?buildID=918802
The debug patch is in this repo: git://pkgs.fedoraproject.org/rpms/audit , commit f37518040e617f75b7f8bc3e225b6c90fb9e4602 .
It was difficult to find the bug, but we have found it :-)
This is the stack of a hanging auditd process (pstack):
Thread 2 (Thread 0x7f7f0c4c0700 (LWP 4999)):
#0 0x00007f7f0e3c7995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000055d600e8efc3 in flush_thread_main ()
#2 0x00007f7f0e3c3e25 in start_thread () from /lib64/libpthread.so.0
#3 0x00007f7f0d6aebad in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f7f0f25a880 (LWP 4998)):
#0 0x00007f7f0d6bc89c in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f7f0d6669bc in _L_lock_2546 () from /lib64/libc.so.6
#2 0x00007f7f0d6667f7 in __tz_convert () from /lib64/libc.so.6
#3 0x00007f7f0d6a873a in __vsyslog_chk () from /lib64/libc.so.6
#4 0x00007f7f0e7e4025 in audit_msg () from /lib64/libaudit.so.1
#5 0x000055d600e94e6d in dispatcher_reaped ()
#6 0x000055d600e8e0a5 in child_handler ()
#7 <signal handler called>
#8 0x00007f7f0d6a906a in mmap64 () from /lib64/libc.so.6
#9 0x00007f7f0d61de21 in __GI__IO_file_doallocate () from /lib64/libc.so.6
#10 0x00007f7f0d62cc27 in __GI__IO_doallocbuf () from /lib64/libc.so.6
#11 0x00007f7f0d62a61c in __GI__IO_file_xsgetn () from /lib64/libc.so.6
#12 0x00007f7f0d62945e in fread_unlocked () from /lib64/libc.so.6
#13 0x00007f7f0d666c8d in __tzfile_read () from /lib64/libc.so.6
#14 0x00007f7f0d665e64 in tzset_internal () from /lib64/libc.so.6
#15 0x00007f7f0d666813 in __tz_convert () from /lib64/libc.so.6
#16 0x00007f7f0d6a873a in __vsyslog_chk () from /lib64/libc.so.6
#17 0x00007f7f0e7e4025 in audit_msg () from /lib64/libaudit.so.1
#18 0x000055d600e94d45 in init_dispatcher ()
#19 0x000055d600e8d05e in main ()
This means that syslog() was called in a signal handler. syslog() is not async-signal-safe, see the "signal-safety" man page. The process hangs when it tries to call syslog() from the signal handler.
We found some more bugs with a code review: FD_CLOEXEC is not set on init_pipe[1] (a simple typo) and EINTR handling is missing for a read().
With all 3 bugs fixed, auditd starts reliably without hanging.
It would be great if you could integrate the bugfixes into an update.
Thanks for the patch and troubleshooting! I am truly surprised by this. The libev documentation says that signals are serialized with other events and not delivered asynchronously (from a signal handler). In any event, the message being logged is not important and is more like a debug level message.
I should be able to fit this into the next release. Thanks!
> I should be able to fit this into the next release. Thanks!
Great!
> The libev documentation says that signals are serialized with other events and not delivered asynchronously (from a signal handler).
It happens during startup, probably before libev is in control. Probably it's this signal handler:
/* This signal handler gets replaced later. Its here in case
* the dispatcher exits before libev is in control */
sa.sa_handler = child_handler2;
sigaction(SIGCHLD, &sa, NULL);
child_handler2() calls child_handler(), and child_handler() calls dispatcher_reaped() which calls syslog().
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:3237
Description of problem: Since the update to 7.5 we experience that auditd is sometimes in failed state after boot. From the description it looks very similar to #1470810: Systemd tries to start auditd, tries to terminate it after 1min without success, then kills it after further 2 min: root@myhost:~[26] # systemctl status auditd ------------------------- ● auditd.service - Security Auditing Service Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; vendor preset: enabled) Active: failed (Result: signal) since Wed 2018-06-06 09:12:03 CEST; 50min ago Docs: man:auditd(8) https://github.com/linux-audit/audit-documentation Process: 491 ExecStart=/sbin/auditd (code=killed, signal=KILL) Jun 06 09:09:03 myhost systemd[1]: Starting Security Auditing Service... Jun 06 09:09:03 myhost auditd[501]: Started dispatcher: /sbin/audispd pid: 503 Jun 06 09:10:33 myhost systemd[1]: auditd.service start operation timed out. Terminating. Jun 06 09:12:03 myhost systemd[1]: auditd.service stop-final-sigterm timed out. Killing. Jun 06 09:12:03 myhost systemd[1]: auditd.service: control process exited, code=killed status=9 Jun 06 09:12:03 myhost systemd[1]: Failed to start Security Auditing Service. Jun 06 09:12:03 myhost systemd[1]: Unit auditd.service entered failed state. Jun 06 09:12:03 myhost systemd[1]: auditd.service failed. root@myhost:~[27] # ------------------------- In #1470810 a "netlink communication" bug was mentioned, which was fixed in Fedora with kernel 4.12, which might be the root of this problem. Unfortunately there is no reference which bug that is, and so I could not find out if this bug does exist in 7.5 (and not in 7.4?). Version-Release number of selected component (if applicable): audit-2.8.1-3.el7.x86_64 kernel-3.10.0-862.2.3.el7.x86_64 How reproducible: Unfortunately it does not happen on all Systems. What sets the Systems apart we are able to reproduce it is, that it is a Virtualbox VM with an single partition installation (no LVM). Steps to Reproduce: 1. reboot system 2. log in check for failed services with "systemcl --failed" 3. repeat if no failed auditd is reported (2-10 iterations needed) Actual results: # systemctl --failed UNIT LOAD ACTIVE SUB DESCRIPTION ● auditd.service loaded failed failed Security Auditing Service LOAD = Reflects whether the unit definition was properly loaded. ACTIVE = The high-level unit activation state, i.e. generalization of SUB. SUB = The low-level unit activation state, values depend on unit type. 1 loaded units listed. Pass --all to see loaded but inactive units, too. To show all installed unit files use 'systemctl list-unit-files'. Expected results: # systemctl --failed 0 loaded units listed. Pass --all to see loaded but inactive units, too. To show all installed unit files use 'systemctl list-unit-files'. Additional info: