Bug 429941 - [RHEL5 U2] Audit fails to shutdown properly
Summary: [RHEL5 U2] Audit fails to shutdown properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
low
low
Target Milestone: rc
: ---
Assignee: Eric Paris
QA Contact: Martin Jenner
URL: http://rhts.lab.boston.redhat.com/tes...
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2008-01-23 21:42 UTC by Jeff Burke
Modified: 2009-01-20 20:22 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 20:22:19 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:0225 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update 2009-01-20 16:06:24 UTC

Description Jeff Burke 2008-01-23 21:42:25 UTC
Description of problem:
 While running the audit-test-2088 test suite the system will hang.

Version-Release number of selected component (if applicable):
 2.6.18-71.el5

How reproducible:
 Rare

Steps to Reproduce:
1. Install RHEL5-U1. Update with kernel 2.6.18-71.el5
2. Install the audit-test-2088 RHTS rpm
3. While running the test restart audit daemon
  
Actual results:

INIT: Sending processes the TERM signal
Shutting down Avahi daemon: [  OK  ]
Stopping HAL daemon: [  OK  ]
Stopping yum-updatesd: [  OK  ]
Stopping atd: [  OK  ]
Stopping cups: [  OK  ]
Shutting down xfs: [  OK  ]
Shutting down console mouse services: [  OK  ]
Stopping sshd: [  OK  ]
Shutting down sm-client: [  OK  ]
Shutting down sendmail: [  OK  ]
Shutting down vsftpd: [  OK  ]
Stopping acpi daemon: [  OK  ]
Stopping crond: [  OK  ]
Stopping autofs:  [  OK  ]
Stopping automount: [  OK  ]
Shutting down ntpd: [  OK  ]
Killing mdmonitor: [  OK  ]
Stopping system message bus: [  OK  ]
Stopping RPC idmapd: [  OK  ]
Stopping NFS statd: [  OK  ]
Stopping mcstransd: [  OK  ]
Stopping portmap: [  OK  ]
Shutting down restorecond: [  OK  ]
Stopping auditd: [FAILED]

Expected results:
 System should reboot properly.

Additional info:
Here is a link to the log with some Alt+SysRq data
http://rhts.lab.boston.redhat.com/testlogs/14060/49430/403056/1734497-test_log--kernel-security-audit-audit-test-2088-EXTERNALWATCHDOG
http://rhts.lab.boston.redhat.com/testlogs/14060/49430/403056/TESTOUT.log

Note: After several days the above logs will be zipped to save on space. You may
need to add a .gz to the end of the urls.

Comment 1 Eric Paris 2008-01-24 19:55:44 UTC
Here's my first thought...  notice kauditd and auditd are both hung up on a
waitqueue inside the netlink code.  We either get on this waitqueue because the
netlink socket state is no longer 'NETLINK_CONNECTED' or because we ran out of
room to add the skb to the socket.  see netlink_attachskb()

I'm going to assume it is the 'too much stuff in the socket buffer to take
another skb' problem.  the auditd daemon is what is supposed to be draining that
socket but it is blocked trying to send a netlink message back to itself...

Would it be possible to instead of having audit_send_reply() directly call into
the netlink_unicast() to instead put the skbs on the queue for kauditd to send
later?  This way auditd could be freed up to drain the socket and kauditd could
start making progress again.

audit_send_reply() is only sent in response to a AUDIT_GET, AUDIT_SIGNAL_INFO,
or AUDIT_TTY_GET, request.  I assume this isn't a problem since there is
currently no gaurentee that the next skb auditd will pop will actually be this
reply, so making it even more likely it won't be this reply doesn't seem like it
would be a problem.

sgrubb am I correct in my belief that auditd will be just fine if these replies
come after some number of other messages which have already been placed on the
queue to be send by kauditd?

Comment 2 Eric Paris 2008-01-24 20:12:45 UTC
Basic summary of what I think happened:

1) auditd got busy doing something, not draining the socket
2) we generated enough audit messages to fill the socket, kauditd had to sleep
3) auditd sent a AUDIT_GET, AUDIT_SIGNAL_INFO, or AUDIT_TTY_GET request
4) auditd went to sleep waiting for its own reply

If we can put the reply in step 4 on the audit_skb_queue I think we wouldn't be
deadlocking.  If we can't, I think we need to fix userspace so we can   :)

Comment 3 Eric Paris 2008-01-24 20:21:43 UTC
Or we could do what audit_send_list() does and just run in a whole new kthread

Comment 4 Eric Paris 2008-01-24 20:23:14 UTC
sgrubb suggested a second netlink socket where we could have sync communication,
that could be used for this and for the audit_send_list() stuff.

(admittedly I still like my first answer)

Comment 5 Steve Grubb 2008-01-24 20:51:21 UTC
In reply to comment #1, auditd might be ok. We'd have to test it to see. During
shutdown, auditd would be trying to get signal info. Its not doing anything with
tty and it probably is not doing anything with a get command. I'd be 99% sure
its trying to query signal info. 

In the case of hup, usr1, & 2, all its looking for is an ACK that the command
was received. The actual data can arrive later. In the case of sigterm, the
audit daemon breaks out of its loop reading events and sends a set auditd pid 0
to stop the flow of events. Maybe its not getting an ack on setting pid 0. If we
set auditd_pid 0, the whole queue can be flushed (maybe to syslog) since we
don't care anymore.

It might be worthwhile patching a kernel to collect additional data.

Comment 7 RHEL Program Management 2008-04-04 18:39:54 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Don Zickus 2008-07-23 18:54:39 UTC
in kernel-2.6.18-99.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 errata-xmlrpc 2009-01-20 20:22:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html


Note You need to log in before you can comment on or make changes to this bug.