Bug 429941

Summary: [RHEL5 U2] Audit fails to shutdown properly
Product: Red Hat Enterprise Linux 5 Reporter: Jeff Burke <jburke>
Component: kernelAssignee: Eric Paris <eparis>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: low Docs Contact:
Priority: low    
Version: 5.2CC: aviro, duck, dzickus, eparis, mgahagan, sgrubb
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
URL: http://rhts.lab.boston.redhat.com/testlogs/14060/49430/403056/1734497-test_log--kernel-security-audit-audit-test-2088-EXTERNALWATCHDOG
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-01-20 20:22:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jeff Burke 2008-01-23 21:42:25 UTC
Description of problem:
 While running the audit-test-2088 test suite the system will hang.

Version-Release number of selected component (if applicable):
 2.6.18-71.el5

How reproducible:
 Rare

Steps to Reproduce:
1. Install RHEL5-U1. Update with kernel 2.6.18-71.el5
2. Install the audit-test-2088 RHTS rpm
3. While running the test restart audit daemon
  
Actual results:

INIT: Sending processes the TERM signal
Shutting down Avahi daemon: [  OK  ]
Stopping HAL daemon: [  OK  ]
Stopping yum-updatesd: [  OK  ]
Stopping atd: [  OK  ]
Stopping cups: [  OK  ]
Shutting down xfs: [  OK  ]
Shutting down console mouse services: [  OK  ]
Stopping sshd: [  OK  ]
Shutting down sm-client: [  OK  ]
Shutting down sendmail: [  OK  ]
Shutting down vsftpd: [  OK  ]
Stopping acpi daemon: [  OK  ]
Stopping crond: [  OK  ]
Stopping autofs:  [  OK  ]
Stopping automount: [  OK  ]
Shutting down ntpd: [  OK  ]
Killing mdmonitor: [  OK  ]
Stopping system message bus: [  OK  ]
Stopping RPC idmapd: [  OK  ]
Stopping NFS statd: [  OK  ]
Stopping mcstransd: [  OK  ]
Stopping portmap: [  OK  ]
Shutting down restorecond: [  OK  ]
Stopping auditd: [FAILED]

Expected results:
 System should reboot properly.

Additional info:
Here is a link to the log with some Alt+SysRq data
http://rhts.lab.boston.redhat.com/testlogs/14060/49430/403056/1734497-test_log--kernel-security-audit-audit-test-2088-EXTERNALWATCHDOG
http://rhts.lab.boston.redhat.com/testlogs/14060/49430/403056/TESTOUT.log

Note: After several days the above logs will be zipped to save on space. You may
need to add a .gz to the end of the urls.

Comment 1 Eric Paris 2008-01-24 19:55:44 UTC
Here's my first thought...  notice kauditd and auditd are both hung up on a
waitqueue inside the netlink code.  We either get on this waitqueue because the
netlink socket state is no longer 'NETLINK_CONNECTED' or because we ran out of
room to add the skb to the socket.  see netlink_attachskb()

I'm going to assume it is the 'too much stuff in the socket buffer to take
another skb' problem.  the auditd daemon is what is supposed to be draining that
socket but it is blocked trying to send a netlink message back to itself...

Would it be possible to instead of having audit_send_reply() directly call into
the netlink_unicast() to instead put the skbs on the queue for kauditd to send
later?  This way auditd could be freed up to drain the socket and kauditd could
start making progress again.

audit_send_reply() is only sent in response to a AUDIT_GET, AUDIT_SIGNAL_INFO,
or AUDIT_TTY_GET, request.  I assume this isn't a problem since there is
currently no gaurentee that the next skb auditd will pop will actually be this
reply, so making it even more likely it won't be this reply doesn't seem like it
would be a problem.

sgrubb am I correct in my belief that auditd will be just fine if these replies
come after some number of other messages which have already been placed on the
queue to be send by kauditd?

Comment 2 Eric Paris 2008-01-24 20:12:45 UTC
Basic summary of what I think happened:

1) auditd got busy doing something, not draining the socket
2) we generated enough audit messages to fill the socket, kauditd had to sleep
3) auditd sent a AUDIT_GET, AUDIT_SIGNAL_INFO, or AUDIT_TTY_GET request
4) auditd went to sleep waiting for its own reply

If we can put the reply in step 4 on the audit_skb_queue I think we wouldn't be
deadlocking.  If we can't, I think we need to fix userspace so we can   :)

Comment 3 Eric Paris 2008-01-24 20:21:43 UTC
Or we could do what audit_send_list() does and just run in a whole new kthread

Comment 4 Eric Paris 2008-01-24 20:23:14 UTC
sgrubb suggested a second netlink socket where we could have sync communication,
that could be used for this and for the audit_send_list() stuff.

(admittedly I still like my first answer)

Comment 5 Steve Grubb 2008-01-24 20:51:21 UTC
In reply to comment #1, auditd might be ok. We'd have to test it to see. During
shutdown, auditd would be trying to get signal info. Its not doing anything with
tty and it probably is not doing anything with a get command. I'd be 99% sure
its trying to query signal info. 

In the case of hup, usr1, & 2, all its looking for is an ACK that the command
was received. The actual data can arrive later. In the case of sigterm, the
audit daemon breaks out of its loop reading events and sends a set auditd pid 0
to stop the flow of events. Maybe its not getting an ack on setting pid 0. If we
set auditd_pid 0, the whole queue can be flushed (maybe to syslog) since we
don't care anymore.

It might be worthwhile patching a kernel to collect additional data.

Comment 7 RHEL Program Management 2008-04-04 18:39:54 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Don Zickus 2008-07-23 18:54:39 UTC
in kernel-2.6.18-99.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 errata-xmlrpc 2009-01-20 20:22:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html