Red Hat Bugzilla – Bug 429941
[RHEL5 U2] Audit fails to shutdown properly
Last modified: 2009-01-20 15:22:19 EST
Description of problem:
While running the audit-test-2088 test suite the system will hang.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install RHEL5-U1. Update with kernel 2.6.18-71.el5
2. Install the audit-test-2088 RHTS rpm
3. While running the test restart audit daemon
INIT: Sending processes the TERM signal
Shutting down Avahi daemon: [ OK ]
Stopping HAL daemon: [ OK ]
Stopping yum-updatesd: [ OK ]
Stopping atd: [ OK ]
Stopping cups: [ OK ]
Shutting down xfs: [ OK ]
Shutting down console mouse services: [ OK ]
Stopping sshd: [ OK ]
Shutting down sm-client: [ OK ]
Shutting down sendmail: [ OK ]
Shutting down vsftpd: [ OK ]
Stopping acpi daemon: [ OK ]
Stopping crond: [ OK ]
Stopping autofs: [ OK ]
Stopping automount: [ OK ]
Shutting down ntpd: [ OK ]
Killing mdmonitor: [ OK ]
Stopping system message bus: [ OK ]
Stopping RPC idmapd: [ OK ]
Stopping NFS statd: [ OK ]
Stopping mcstransd: [ OK ]
Stopping portmap: [ OK ]
Shutting down restorecond: [ OK ]
Stopping auditd: [FAILED]
System should reboot properly.
Here is a link to the log with some Alt+SysRq data
Note: After several days the above logs will be zipped to save on space. You may
need to add a .gz to the end of the urls.
Here's my first thought... notice kauditd and auditd are both hung up on a
waitqueue inside the netlink code. We either get on this waitqueue because the
netlink socket state is no longer 'NETLINK_CONNECTED' or because we ran out of
room to add the skb to the socket. see netlink_attachskb()
I'm going to assume it is the 'too much stuff in the socket buffer to take
another skb' problem. the auditd daemon is what is supposed to be draining that
socket but it is blocked trying to send a netlink message back to itself...
Would it be possible to instead of having audit_send_reply() directly call into
the netlink_unicast() to instead put the skbs on the queue for kauditd to send
later? This way auditd could be freed up to drain the socket and kauditd could
start making progress again.
audit_send_reply() is only sent in response to a AUDIT_GET, AUDIT_SIGNAL_INFO,
or AUDIT_TTY_GET, request. I assume this isn't a problem since there is
currently no gaurentee that the next skb auditd will pop will actually be this
reply, so making it even more likely it won't be this reply doesn't seem like it
would be a problem.
sgrubb am I correct in my belief that auditd will be just fine if these replies
come after some number of other messages which have already been placed on the
queue to be send by kauditd?
Basic summary of what I think happened:
1) auditd got busy doing something, not draining the socket
2) we generated enough audit messages to fill the socket, kauditd had to sleep
3) auditd sent a AUDIT_GET, AUDIT_SIGNAL_INFO, or AUDIT_TTY_GET request
4) auditd went to sleep waiting for its own reply
If we can put the reply in step 4 on the audit_skb_queue I think we wouldn't be
deadlocking. If we can't, I think we need to fix userspace so we can :)
Or we could do what audit_send_list() does and just run in a whole new kthread
sgrubb suggested a second netlink socket where we could have sync communication,
that could be used for this and for the audit_send_list() stuff.
(admittedly I still like my first answer)
In reply to comment #1, auditd might be ok. We'd have to test it to see. During
shutdown, auditd would be trying to get signal info. Its not doing anything with
tty and it probably is not doing anything with a get command. I'd be 99% sure
its trying to query signal info.
In the case of hup, usr1, & 2, all its looking for is an ACK that the command
was received. The actual data can arrive later. In the case of sigterm, the
audit daemon breaks out of its loop reading events and sends a set auditd pid 0
to stop the flow of events. Maybe its not getting an ack on setting pid 0. If we
set auditd_pid 0, the whole queue can be flushed (maybe to syslog) since we
don't care anymore.
It might be worthwhile patching a kernel to collect additional data.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.