Bug 680126

Summary: kernel: BUG: warning at drivers/char/tty_audit.c:55/tty_audit_buf_free()
Product: Red Hat Enterprise Linux 6 Reporter: Steve Grubb <sgrubb>
Component: kernelAssignee: Danny Feng <dfeng>
Status: CLOSED ERRATA QA Contact: Boris Ranto <branto>
Severity: high Docs Contact:
Priority: urgent    
Version: 6.0CC: aviro, branto, caiqian, csf1dsh, dhoward, donhoover, mitr, sgrubb
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.32-128.el6 Doc Type: Bug Fix
Doc Text:
Using the pam_tty_audit.so module (which enables or disables TTY auditing for specified users) in the /etc/pam.d/sudo file and in the /etc/pam.d/system-auth file when the audit package is not installed resulted in soft lock-ups on CPUs. As a result, the kernel became unresponsive. This was due to the kernel exiting immediately after TTY auditing was disabled, without emptying the buffer, which caused the kernel to spin in a loop, copying 0 bytes at each iteration and attempting to push each time without any effect. With this update, a locking mechanism is introduced to prevent the aforementioned behavior.
Story Points: ---
Clone Of: 679563 Environment:
Last Closed: 2011-05-19 08:45:29 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On: 679563    
Bug Blocks: 584498, 684275, 846801, 846802    
Attachments:
Description Flags
Proposed patch to tty_audit.c none

Description Steve Grubb 2011-02-24 07:46:48 EST
+++ This bug was initially created as a clone of Bug #679563 +++

Description of problem:
Using pam_tty_audit.so in /etc/pam.d/sudo and system-auth, but not having the audit rpm installed results in soft lockups on CPUs ultimately leading to a system kernel hang.  The kernel stack traces reference tty_audit_buf_free and tty_audit_add_data 

Version-Release number of selected component (if applicable):
Linux uskyarpds03c0 2.6.18-238.el5 #1 SMP Sun Dec 19 14:22:44 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
configure PAM to use pam_tty_audit.so.  If installed, stop the auditd daemon, disable auditing in the kernel with "auditctl -e 0", then remove the audit rpm, and wait for CPU lockups (we consitently get them when users were logging onto the system as user oracle then upgrading Oracle from 10 to 11g).

Steps to Reproduce:
1. configure pam_tty_audit.so in sudo and system-auth:
/etc/pam.d/sudo:session    optional    pam_tty_audit.so open_only enable=*
/etc/pam.d/system-auth:session     optional      pam_tty_audit.so enable=*

2. disable auditing in the kernel:
auditctl -e 0

3. stop the auditd daemon and remove the rpm:
service auditd stop 
yum remove audit-1.7.18-2.el5.x86_64

4. sudo to a different user or ssh into the system then logoff.  When you type 'exit' at the prompt you will see this in the messages log:

Feb 15 15:53:38 uskyarpds01g0 kernel: BUG: warning at drivers/char/tty_audit.c:55/tty_audit_buf_free() (Not tainted)
Feb 15 15:53:38 uskyarpds01g0 kernel:
Feb 15 15:53:38 uskyarpds01g0 kernel: Call Trace:
Feb 15 15:53:38 uskyarpds01g0 kernel:  [<ffffffff801b4f2a>] tty_audit_buf_free+0x39/0x4d
Feb 15 15:53:38 uskyarpds01g0 kernel:  [<ffffffff8001577a>] do_exit+0x26d/0x911
Feb 15 15:53:38 uskyarpds01g0 kernel:  [<ffffffff800496a1>] cpuset_exit+0x0/0x88
Feb 15 15:53:38 uskyarpds01g0 kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Feb 15 15:53:38 uskyarpds01g0 kernel:

If you wait long enough, the above error then is followed by a more ominous one that eventualy hangs the system:

Feb 15 16:07:57 uskyarpds01g0 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ksh:13737]
Feb 15 16:07:57 uskyarpds01g0 kernel: CPU 7:
Feb 15 16:07:57 uskyarpds01g0 kernel: Modules linked in: nfs fscache nfs_acl mptctl mptbase ipmi_watchdog ipmi_devintf ipmi_si ipmi_msghandler autofs4 lockd sunrpc bonding ipv6 xfrm_nalgo crypto_api dm_round_robin scsi_dh_alua dm_multipath scsi_dh video backlight sbs power_meter i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ide_cd sg cdrom pcspkr e1000e k8temp k8_edac serio_raw bnx2 shpchp hwmon hpilo edac_mc dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Feb 15 16:07:57 uskyarpds01g0 kernel: Pid: 13737, comm: ksh Not tainted 2.6.18-194.el5 #1
Feb 15 16:07:57 uskyarpds01g0 kernel: RIP: 0010:[<ffffffff801b51d1>]  [<ffffffff801b51d1>] tty_audit_add_data+0x283/0x325
Feb 15 16:07:57 uskyarpds01g0 kernel: RSP: 0018:ffff810560e4bd78  EFLAGS: 00000206
Feb 15 16:07:57 uskyarpds01g0 kernel: RAX: 0000000000ae8f94 RBX: ffff81087ed39280 RCX: 0000000000000000
Feb 15 16:07:57 uskyarpds01g0 kernel: RDX: ffff81087ed39280 RSI: 0000000000ae8f94 RDI: 0000000000001000
Feb 15 16:07:57 uskyarpds01g0 kernel: RBP: 0000000000000246 R08: 0000000000000000 R09: ffff8103eea33380
Feb 15 16:07:57 uskyarpds01g0 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: ffff81087ed39280
Feb 15 16:07:57 uskyarpds01g0 kernel: R13: 0000000000ae8f94 R14: 0000000000000000 R15: ffff8103eea33380
Feb 15 16:07:57 uskyarpds01g0 kernel: FS:  00002ad107b57dd0(0000) GS:ffff81087fe0f6c0(0000) knlGS:00000000f6a26b90
Feb 15 16:07:57 uskyarpds01g0 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 15 16:07:57 uskyarpds01g0 kernel: CR2: 00002b9430722000 CR3: 0000000702a73000 CR4: 00000000000006e0
Feb 15 16:07:57 uskyarpds01g0 kernel:
Feb 15 16:07:57 uskyarpds01g0 kernel: Call Trace:
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff801b5221>] tty_audit_add_data+0x2d3/0x325
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff800e9c88>] core_sys_select+0x1f9/0x265
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff80033f4d>] copy_from_read_buf+0x99/0xdc
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff80029914>] read_chan+0x52f/0x694
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff80046ece>] tty_ldisc_ref_wait+0xe/0x9b
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff8003277f>] tty_read+0x72/0xc8
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff8000b6b0>] vfs_read+0xcb/0x171
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff80011c01>] sys_read+0x45/0x6e
Feb 15 16:07:57 uskyarpds01g0 kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Feb 15 16:07:57 uskyarpds01g0 kernel:

Actual results:
The system becomes unresponsive and requires a reboot

Expected results:
It is my position that pam_tty_audit.so should be able to detect that auditing is disabled in the kernel and not continue manipulating audit data until such a time that auditing is enabled in the kernel. 

Additional info:
From analyzing the source code for pam_tty_audit, my theory is that the trouble begins with the first time tty_audit_add_data() is called.

pseudo-code:
tty_audit_add_data()
 mutex lock
 if buf is full
    tty_audit_buf_push_current()  <<-- this appears to lead to a blocking call via the audit API, possibly because auditing in the kernel is disabled and there is nothing to handle this call
    emtpy buf
 release lock

I think that after tty_audit_add_data is called again for the same process/tty, it then blocks on the first mutex lock, causing the system hang.

Granted, most users should not run into this issue if they have the audit rpm installed and the auditd daemon running, and therefore auditing enabled in the kernel as displayed by the command "auditctl -s" - however, it is my belief that I should not be able to hang the kernel in such a way.

--- Additional comment from csf1dsh@ups.com on 2011-02-22 16:34:48 EST ---

As an extra clarification, we found this because our configuration management system enabled PAM tty auditing on all our boxes and a few that didn't have the audit rpm installed started crashing.  So this issue only occurs when tty auditing is configured in PAM, but NO audit rpm is installed (auditd doesn't exist and auditing is not enabled in kernel which is the default state with no audit rpm installed).

--- Additional comment from aviro@redhat.com on 2011-02-23 09:20:13 EST ---

Not quite...  What happens is that tty_audit_buf_push() returns immediately if audit_enabled is zero.  Without emptying the buffer.  With obvious effect on tty_audit_add_data() that ends up spinning in that loop, copying 0 bytes at each iteration and attempting to push each time without any effect.  Holding the lock all along...

The question is, why do we try to retain the buffer contents until audit_enabled becomes non-zero (if ever)?  The obvious way to deal with the livelocks would be to replace

        if (audit_enabled == 0)
                return;

with

        if (audit_enabled == 0) {
                buf->valid = 0;
                return;
        }

in tty_audit_buf_push()...

--- Additional comment from csf1dsh@ups.com on 2011-02-23 10:31:51 EST ---

If we then choose to accept the premise that when audit_enabled returns 0, we empty the buffer contents, then why wait all the way until tty_audit_buf_push() to check audit_enabled?  I would say check it much earlier on, perhaps in tty_audit_add_data(), before setting any mutexes or wasting any resources manipulating buffers that will be thrown away anyways.

--- Additional comment from aviro@redhat.com on 2011-02-23 11:38:22 EST ---

Because it might have been disabled when we are halfway through tty_audit_add_data().  IOW, the check in tty_audit_buf_push() is needed anyway, or we'll be back to the same problem.  As to whether we want it done earlier as well... possible, but I'm not sure it's that much of a win.

--- Additional comment from mitr@redhat.com on 2011-02-23 14:02:46 EST ---

I suppose, to be pedantically correct, after
* audit_enabled = 1
* tty_audit_add_data('A')
* audit_enabled = 0
* tty_audit_add_data('B')
* audit_enabled = 1
* tty_audit_add_data('C')

both A and C should appear in the audit log.  But that's not really practical in any case, because the audit buffers are per-process - we can't very well save the audit buffer of each process upon _exit() in case auditing were later reenabled.

The "hardened" configurations run with (auditctl -e 2), in which case these problems are unimportant.


I think the simple fix from comment#2 is the best solution here; I don't think the combination of (auditctl -e 0) with tty_audit enabled is likely to be common enough that we need to specifically optimize tty_audit_add_data() for it.
Comment 3 Jeremy Miller 2011-03-05 08:37:28 EST
Created attachment 482435 [details]
Proposed patch to tty_audit.c

This is a proposed patch to tty_audit.c  Exact same patch as proposed for Bug 679563
Comment 6 Aristeu Rozanski 2011-03-30 10:23:12 EDT
Patch(es) available on kernel-2.6.32-128.el6
Comment 11 Martin Prpič 2011-05-05 04:58:27 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Using the pam_tty_audit.so module (which enables or disables TTY auditing for specified users) in the /etc/pam.d/sudo file and in the /etc/pam.d/system-auth file when the audit package is not installed resulted in soft lock-ups on CPUs. As a result, the kernel became unresponsive. This was due to the kernel exiting immediately after TTY auditing was disabled, without emptying the buffer, which caused the kernel to spin in a loop, copying 0 bytes at each iteration and attempting to push each time without any effect. With this update, a locking mechanism is introduced to prevent the aforementioned behavior.
Comment 12 errata-xmlrpc 2011-05-19 08:45:29 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html