Bug 1459326 - BUG: audit records being sent to the console even when auditd is running
BUG: audit records being sent to the console even when auditd is running
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
26
x86_64 Linux
unspecified Severity low
: ---
: ---
Assigned To: Paul Moore
Fedora Extras Quality Assurance
: Reopened
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-06-06 15:40 EDT by Adam Williamson
Modified: 2017-10-02 02:21 EDT (History)
14 users (show)

See Also:
Fixed In Version: kernel-4.11.6-101.fc24
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-06-23 15:51:08 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
01-XXX-audit-reset_fix.patch (4.97 KB, patch)
2017-06-12 17:17 EDT, Paul Moore
no flags Details | Diff
04-XXX-audit-reset_fix.patch (5.04 KB, patch)
2017-06-13 15:23 EDT, Paul Moore
no flags Details | Diff

  None (edit)
Description Adam Williamson 2017-06-06 15:40:39 EDT
Dusty and I are looking into why openQA's testing of the Atomic installer image shows a lot of audit logs being spewed to tty1 in some cases. One thing we've found is quite interesting. It seems that if you boot without the 'quiet' kernel arg on UEFI, audit logs get sent to tty1 - but the same is not true if you boot on BIOS. We're not sure why this is, but the inconsistency seemed odd to us, and worth reporting as a potential bug (not sure which behaviour is correct).
Comment 1 Adam Williamson 2017-06-06 16:03:57 EDT
So I was a bit off in my initial description here: in fact it's not UEFI vs. BIOS, it's just some kind of race. Sometimes this happens, sometimes it doesn't.

For e.g., on production openQA, I see kernel message spammed to tty1 on the UEFI test for the last two days, but not for several days before that. On staging it happened to the *BIOS* test twice in the last few days, but not the UEFI test at all.

halfline also thinks there's something wrong with the default kernel log level. It seems that if you boot without 'quiet' or 'debug', /proc/sys/kernel/printk shows:

7	4	1	7

i.e. the kernel log level is 'debug'. halfline reckons the default level should be 4 ('warning').
Comment 2 Adam Williamson 2017-06-06 16:45:28 EDT
So the theory we've developed here is that this happens when the getty is started before auditd starts running, and pjones suggested this change to resolve it:

<pjones> see if adding getty.target to auditd.service's Before= makes it go away
<pjones> that should at least cause them to get a) cleared when the getty grabs that tty, and b) not written to the tty after that
<pjones> if that theory is correct, anyway
<dustymabe> pjones: adamw - that works
<dustymabe> https://da.gd/GYt6 

so, re-assigning to audit for now.
Comment 3 Steve Grubb 2017-06-06 17:32:40 EDT
The audit daemon does not use any TTY's. It closes its descriptors and reopens them to /dev/null. Are you sure this is not journald which enables auditing regardless of whether the user wants auditing or not? There's a whole lot of unhappy people on bug 1227379 wanting less audit events in their syslog. Maybe this is related?
Comment 4 Adam Williamson 2017-06-06 17:39:09 EDT
It's not about the audit events *happening*, it's about them being printed to tty1, which seems undesirable.
Comment 5 Dusty Mabe 2017-06-06 17:48:55 EDT
so this looks like a race condition that *can* be solved by putting `Before=getty.target` into auditd.service. Not saying that is the solution to this BZ, but that is one way to *fix* it on a system that is showing the problem.

It is worth noting that a system that is showing this problem is not receiving the messages from the kernel's audit.c. I straced auditd on a running system and I see this:

```
[root@localhost ~]# strace -F -s 512 -p 723 2>&1 | tee straceout.txt 
strace: Process 723 attached with 2 threads
[pid   723] epoll_wait(6,  <unfinished ...>
[pid   724] futex(0x55fe3d049288, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid   723] <... epoll_wait resumed> [], 64, 59743) = 0
[pid   723] epoll_wait(6,

```

all the while messages audit messages are printing to tty1

I looked at auditd's fds and this is what I see:

```
[root@localhost ~]# lsof -p 723
COMMAND PID USER   FD      TYPE             DEVICE SIZE/OFF  NODE NAME
auditd  723 root  cwd       DIR              253,0     4096    70 /
auditd  723 root  rtd       DIR              253,0     4096    70 /
auditd  723 root  txt       REG              253,0   120248  2087 /usr/sbin/auditd
auditd  723 root  mem       REG              253,0    56784  1814 /usr/lib64/libnss_files-2.25.so
auditd  723 root  mem       REG              253,0    36512  1819 /usr/lib64/libnss_sss.so.2
auditd  723 root  mem       REG              253,0   467600  1854 /usr/lib64/libpcre.so.1.2.8
auditd  723 root  mem       REG              253,0   158320  1929 /usr/lib64/libselinux.so.1
auditd  723 root  mem       REG              253,0   109800  1907 /usr/lib64/libresolv-2.25.so
auditd  723 root  mem       REG              253,0    15288  1691 /usr/lib64/libkeyutils.so.1.6
auditd  723 root  mem       REG              253,0    19496  1534 /usr/lib64/libdl-2.25.so
auditd  723 root  mem       REG              253,0    61680  1699 /usr/lib64/libkrb5support.so.0.1
auditd  723 root  mem       REG              253,0    15344  1499 /usr/lib64/libcom_err.so.2.1
auditd  723 root  mem       REG              253,0   209904  1687 /usr/lib64/libk5crypto.so.3.1
auditd  723 root  mem       REG              253,0  2163088  1484 /usr/lib64/libc-2.25.so
auditd  723 root  mem       REG              253,0   958040  1697 /usr/lib64/libkrb5.so.3.3
auditd  723 root  mem       REG              253,0   316928  1636 /usr/lib64/libgssapi_krb5.so.2.2
auditd  723 root  mem       REG              253,0  1226376  1741 /usr/lib64/libm-2.25.so
auditd  723 root  mem       REG              253,0    43656  1921 /usr/lib64/librt-2.25.so
auditd  723 root  mem       REG              253,0   149800  1883 /usr/lib64/libpthread-2.25.so
auditd  723 root  mem       REG              253,0    19352  1487 /usr/lib64/libcap-ng.so.0.0.0
auditd  723 root  mem       REG              253,0   122080  1454 /usr/lib64/libaudit.so.1.0.0
auditd  723 root  mem       REG              253,0   123184  1456 /usr/lib64/libauparse.so.0.0.0
auditd  723 root  mem       REG              253,0   115472  1807 /usr/lib64/libnsl-2.25.so
auditd  723 root  mem       REG              253,0    42560  2040 /usr/lib64/libwrap.so.0.7.6
auditd  723 root  mem       REG              253,0   187368  1423 /usr/lib64/ld-2.25.so
auditd  723 root    0u      CHR                1,3      0t0  1028 /dev/null
auditd  723 root    1u      CHR                1,3      0t0  1028 /dev/null
auditd  723 root    2u      CHR                1,3      0t0  1028 /dev/null
auditd  723 root    3u  netlink                         0t0 17852 AUDIT
auditd  723 root    5w      REG              253,0      199 31448 /var/log/audit/audit.log
auditd  723 root    6u  a_inode               0,12        0  9775 [eventpoll]
auditd  723 root    7u  a_inode               0,12        0  9775 [eventfd]
auditd  723 root    8u     unix 0xffff92a8f9857000      0t0 17856 type=DGRAM
auditd  723 root    9u     unix 0xffff92a8f9855800      0t0 17865 type=STREAM
auditd  723 root   10u     unix 0xffff92a8f9856c00      0t0 17866 type=STREAM
[root@localhost ~]# 
[root@localhost ~]# 
[root@localhost ~]# ls -l /proc/723/fd/ 
total 0
lrwx------. 1 root root 64 Jun  6 17:12 0 -> /dev/null
lrwx------. 1 root root 64 Jun  6 17:12 1 -> /dev/null
lrwx------. 1 root root 64 Jun  6 17:12 10 -> socket:[17866]
lrwx------. 1 root root 64 Jun  6 17:12 2 -> /dev/null
lrwx------. 1 root root 64 Jun  6 17:12 3 -> socket:[17852]
l-wx------. 1 root root 64 Jun  6 17:12 5 -> /var/log/audit/audit.log
lrwx------. 1 root root 64 Jun  6 17:12 6 -> anon_inode:[eventpoll]
lrwx------. 1 root root 64 Jun  6 17:12 7 -> anon_inode:[eventfd]
lrwx------. 1 root root 64 Jun  6 17:12 8 -> socket:[17856]
lrwx------. 1 root root 64 Jun  6 17:12 9 -> socket:[17865]
``` 


I tried to start auditd at system start with strace enabled but that did not yield a system that showed the problem, so the race condition was not observed and the strace output is probably not useful. Peter Jones will have some other notes/comments to add most likely.
Comment 6 Steve Grubb 2017-06-06 18:05:47 EDT
During boot systemd is sending an event for each service that starts up. This is required for common criteria. The problem is that journald enables auditing and connects to a netlink multicast socket for the audit system and starts collecting audit events. This can't be turned off by configuration. I think these events gets passed to rsyslog which then puts it on the screen. This is most likely what you are seeing. I have no idea how the audit daemon could write to TTY1.
Comment 7 Dusty Mabe 2017-06-06 18:30:24 EDT
the system boots up with console=tty1 so the kernel messages are coming to tty1. my understanding is that once auditd is started the audit kernel messages shouldn't come to the console of the machine. If the message doesn't make it to the audit daemon then i think it defaults back to printing via printk

https://github.com/vathpela/linux/blob/master/kernel/audit.c#L500

peter jones has much more context on this topic. This is way below my level of knowledge :)
Comment 8 Steve Grubb 2017-06-06 20:47:10 EDT
OK. Then this is about the kernel doing printk and not anything to do with the audit daemon? If audit is not enabled (audit=1 on kernel boot command line and auditd has not started), there is not supposed to be any events except AVC's. So, this may be a case where systemd/journal is enabling audit on its own. One way to check this is to uninstall the audit daemon and then see if you still get events during boot. If you get even just 1 event, this bz should be transferred to systemd.
Comment 9 Dusty Mabe 2017-06-06 22:35:04 EDT
so you are saying that if audit isn't started then i should never see an audit message other than an avc?

in our case auditd is started and running, but we think it is not properly receiving messages from the kernel, so it falls back to printk. 

I guess we'll try to debug this more tomorrow. Thanks for the discussion so far.
Comment 10 Steve Grubb 2017-06-07 08:37:46 EDT
That is correct. When auditd starts, it sends an audit enable command to the kernel to start the flow of events. If you did not boot with audit=1, then you should see no events. Rather than uninstall it, you can "systemctl disable auditd".

Auditd appears to be working fine. Looking at the data collected in comment 5, I see stding, stdout, and stderr all pointing /dev/null. And auditd is properly receiving events on descriptor 3 which is the netlink socket.
Comment 11 Dusty Mabe 2017-06-07 08:58:26 EDT
hey steve, is it possible for you to join us on IRC? There are several of us in #fedora-desktop on gimpnet https://wiki.gnome.org/Sysadmin/IRC

alternatively you can propose a different channel for us to join.
Comment 12 Ray Strode [halfline] 2017-06-07 09:38:25 EDT
so my reading of the link in comment 7 says the messages get unconditionaly printk as a notice. See this code:

	/* at this point it is uncertain if we will ever send this to auditd so
	 * try to send the message via printk before we go any further */
        kauditd_printk_skb(skb);

That suggests there isn't some fall back if auditd is unavailable.

So this is really about the console log level, I guess.  The messages are printed at level 5.  The kernel log level starts at 7 when quiet is removed. That means the audit messages should always go to the console if there's no quiet on the kernel command line, unless i'm missing something.  If the messages sometimes don't show up, I guess that means something is doing the equivalent of dmesg -n4 or the equivalent of dmesg -D before the messages are printed (or something).  in the cases when the messages do show up something isn't doing that, I guess. maybe that's the race?

Note there isn't just audit messages in the output, if you look at

https://openqa.fedoraproject.org/tests/106013#step/_console_wait_login/8

There's also "random: crng done".  Would be interesting to see how f25 behaves.
Comment 13 Ray Strode [halfline] 2017-06-07 09:48:56 EDT
(In reply to Ray Strode [halfline] from comment #12)
> so my reading of the link in comment 7 says the messages get unconditionaly
> printk as a notice. See this code:
> 
> 	/* at this point it is uncertain if we will ever send this to auditd so
> 	 * try to send the message via printk before we go any further */
>         kauditd_printk_skb(skb);
> 
> That suggests there isn't some fall back if auditd is unavailable.
So I was wrong.  that code is called from kaudit_hold_skb which only gets called in error conditions.

So this could indeed be an auditd bug or selinux preventing auditd from getting audit messages or something like that.
Comment 14 Ray Strode [halfline] 2017-06-07 11:34:22 EDT
I don't think this is an auditd bug.  dustymabe did some gdb debugging:

<dustymabe> (gdb) call (struct epoll_event *) malloc(sizeof(struct epoll_event))
<dustymabe> $1 = (struct epoll_event *) 0x7f1a2504dc20
<dustymabe> (gdb) call epoll_ctl(6, 1, 3, $1)
<dustymabe> $2 = -1
<dustymabe> (gdb)  p errno
<dustymabe> $3 = 17

so the netlink socket (fd 6) is already in the epoll set and and auditd is waiting in epoll_wait. it's not getting woken up by the kernel though.

kernel probably forgot about it.

sgrubb thinks the bug may have been introduced by this commit:

https://www.redhat.com/archives/linux-audit/2017-March/msg00114.html
Comment 15 Paul Moore 2017-06-07 11:41:16 EDT
Two questions:

* What kernel version?

* Does restarting auditd quiet the records to the console?
Comment 16 Paul Moore 2017-06-07 11:43:56 EDT
(In reply to Ray Strode [halfline] from comment #14)
> sgrubb thinks the bug may have been introduced by this commit:
> 
> https://www.redhat.com/archives/linux-audit/2017-March/msg00114.html

There were a number of big changes to the kernel's audit daemon tracking code to fix a lot of really bad, and long standing problems; I wouldn't attribute it to that single patch just yet.
Comment 17 Dusty Mabe 2017-06-07 11:48:17 EDT
(In reply to Paul Moore from comment #15)
> Two questions:
> 
> * What kernel version?

kernel-4.11.0-2.fc26.x86_64

> 
> * Does restarting auditd quiet the records to the console?

[fedora@cloudhost ~]$ sudo systemctl restart auditd 
Failed to restart auditd.service: Operation refused, unit auditd.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status auditd.service' for details.

??
Comment 18 Paul Moore 2017-06-07 11:53:05 EDT
(In reply to Dusty Mabe from comment #17)
> (In reply to Paul Moore from comment #15)
> > Two questions:
> > 
> > * What kernel version?
> 
> kernel-4.11.0-2.fc26.x86_64

I'm guessing that is missing the commit below; it needed to be backported by hand and resubmitted to stable (https://www.redhat.com/archives/linux-audit/2017-May/msg00068.html).

  commit 48d0e023af9799cd7220335baf8e3ba61eeafbeb
  Author: Paul Moore <paul@paul-moore.com>
  Date:   Tue May 2 10:16:05 2017 -0400                                           
                                                                                
    audit: fix the RCU locking for the auditd_connection structure              
                                                                                
    Cong Wang correctly pointed out that the RCU read locking of the            
    auditd_connection struct was wrong, this patch correct this by              
    adopting a more traditional, and correct RCU locking model.                 
                                                                                
    This patch is heavily based on an earlier prototype by Cong Wang.           
                                                                                
    Cc: <stable@vger.kernel.org> # 4.11.x-                                      
    Reported-by: Cong Wang <xiyou.wangcong@gmail.com>                           
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>                         
    Signed-off-by: Paul Moore <paul@paul-moore.com> 

Are you able to reproduce this enough such that if I built a scratch kernel for testing you would be able to verify it?

> > * Does restarting auditd quiet the records to the console?
> 
> [fedora@cloudhost ~]$ sudo systemctl restart auditd 
> Failed to restart auditd.service: Operation refused, unit auditd.service may
> be requested by dependency only (it is configured to refuse manual
> start/stop).
> See system logs and 'systemctl status auditd.service' for details.

The auditd systemd unit file can be very cranky, you probably need to just kill auditd and restart it.
Comment 19 Steve Grubb 2017-06-07 11:54:51 EDT
Correction, I didn't say
https://www.redhat.com/archives/linux-audit/2017-March/msg00114.html
caused the problem, I said maybe it fixes the problem. :-)

To restart the audit daemon, use the service command.
Comment 20 Dusty Mabe 2017-06-07 11:59:09 EDT
(In reply to Paul Moore from comment #18)

> 
> Are you able to reproduce this enough such that if I built a scratch kernel
> for testing you would be able to verify it?
> 

It's doesn't happen every time but seems to be reliable enough that I can get it to happen by just rebooting a box < 5 times. Scratch build might be useful.

> 
> The auditd systemd unit file can be very cranky, you probably need to just
> kill auditd and restart it.

killed auditd and restarted it - cleared up the problem, no more messages coming to the console.
Comment 21 Paul Moore 2017-06-07 12:03:56 EDT
(In reply to Dusty Mabe from comment #20)
> (In reply to Paul Moore from comment #18)
> > Are you able to reproduce this enough such that if I built a scratch kernel
> > for testing you would be able to verify it?
> 
> It's doesn't happen every time but seems to be reliable enough that I can
> get it to happen by just rebooting a box < 5 times. Scratch build might be
> useful.

FWIW, I've been restarting my Rawhide test system with my test kernels (v4.12 based) while we've been discussing this BZ and I haven't seen the problem.

I'll do a 4.11 stable kernel build right now, but if you are feeling adventurous you can try the latest build from here:

* https://copr.fedorainfracloud.org/coprs/pcmoore/kernel-secnext
 
> > The auditd systemd unit file can be very cranky, you probably need to just
> > kill auditd and restart it.
> 
> killed auditd and restarted it - cleared up the problem, no more messages
> coming to the console.

Okay, that sorta confirms what I'm thinking; the kernel is somehow missing the auditd registration.
Comment 22 Dusty Mabe 2017-06-07 14:29:51 EDT
(In reply to Paul Moore from comment #21)

> I'll do a 4.11 stable kernel build right now, 

Let me know when you have a link to that and i'll try it out.
Comment 23 Paul Moore 2017-06-07 14:46:22 EDT
Sorry for the delay, I got a bit distracted, but here is the build:

* https://koji.fedoraproject.org/koji/taskinfo?taskID=19902321

It's the standard Fedora kernel-4.11.4-300.fc26.src.rpm kernel with the fix mentioned above in comment #18.  Let me know if it works, or doesn't, for you.
Comment 24 Dusty Mabe 2017-06-07 21:08:20 EDT
I first tried to reproduce with the latest kernel in updates-testing:

kernel-core-4.11.3-302.fc26.x86_64

I *was* able to reproduce with this kernel. I then tried with the kernel from your scratch build:

kernel-core-4.11.4-300.1.testing.fc26.x86_64

With initial testing I have not been able to reproduce this. I'll give it some more tries, but preliminary results look like the issue is fixed with your scratch build.
Comment 25 Dusty Mabe 2017-06-07 21:13:21 EDT
sigh, spoke just a little too soon. I got it to reproduce at least once with the scratch build kernel :(
Comment 26 Paul Moore 2017-06-08 11:30:41 EDT
I was stuck in meetings this morning so I didn't get to make any progress on this, but I'm building a dev kernel right now to play with, I'll update the BZ when I've got something to report.

In the meantime, if you wanted to give the 4.12-rcX kernels a shot that might be an interesting data point, but it isn't a requirement.

* https://copr.fedorainfracloud.org/coprs/pcmoore/kernel-secnext
Comment 27 Paul Moore 2017-06-08 13:40:58 EDT
Quick update: after rebooting my test machine using a locally built kernel based on kernel-4.11.4-300.fc26 with the RCU fix applied I was finally able to see the problem ... on the 15th reboot :/

To add a bit more information, the kernel definitely doesn't think auditd is connected:

  # rpm -q audit
  audit-2.7.6-1.fc27.x86_64
  # uname -r
  4.11.4+
  # auditctl -s
  enabled 1
  failure 1
  pid 0
  rate_limit 0
  backlog_limit 64
  lost 0
  backlog 0
  backlog_wait_time 60000
  loginuid_immutable 0 unlocked
Comment 28 Paul Moore 2017-06-08 13:42:05 EDT
(In reply to Paul Moore from comment #27)
> Quick update: after rebooting my test machine using a locally built kernel
> based on kernel-4.11.4-300.fc26 with the RCU fix applied I was finally able
> to see the problem ... on the 15th reboot :/
> 
> To add a bit more information, the kernel definitely doesn't think auditd is
> connected:
> 
>   # rpm -q audit
>   audit-2.7.6-1.fc27.x86_64
>   # uname -r
>   4.11.4+
>   # auditctl -s
>   enabled 1
>   failure 1
>   pid 0
>   rate_limit 0
>   backlog_limit 64
>   lost 0
>   backlog 0
>   backlog_wait_time 60000
>   loginuid_immutable 0 unlocked

Additional:

  # pidof auditd
  358
Comment 29 Paul Moore 2017-06-08 14:22:39 EDT
I added some additional instrumentation to try and determine the root cause, but after 30 reboots I had not triggered the problem again.  I added a small bit of scripting to rc.local and which should keep automatically rebooting the system until the problem is triggered, hopefully this will help ...
Comment 30 Paul Moore 2017-06-08 15:22:50 EDT
After almost on hour of continuous reboots I'm still not seeing the problem, clearly the solution is to introduce extra instrumentation ;)

Time to look a bit closer.
Comment 31 Paul Moore 2017-06-08 19:35:43 EDT
Last update for today: I backed off on the instrumentation and was able to reproduce the problem, some notes:

* I only see one AUDIT_SET/AUDIT_STATUS_PID message to set register the running auditd instance, I'm not seeing any indication that auditd is unregistering itself.

* I need to remove the audit printk rate limiter as that is causing messages to be dropped in this case, but I'm not seeing the AUDIT_CONFIG_CHANGE record one would expect from an auditd registration.

I'm going to continue testing, but my current theory is that there is some odd situation where the kauditd_thread attempts to send queued records up to auditd before everything is "OK" with registration, fails, and resets the connection (as it should on failure).

More updates when I know more ...
Comment 32 Dusty Mabe 2017-06-09 10:01:51 EDT
Never really paid any attention to this before, but I think I notice this same problem on my Fedora 25 desktop at home. I switched to a different tty (to get to a login prompt) and I notice audit messages there. auditd is running.

This might be an interesting data point.
Comment 33 Paul Moore 2017-06-09 11:54:51 EDT
(In reply to Dusty Mabe from comment #32)
> Never really paid any attention to this before, but I think I notice this
> same problem on my Fedora 25 desktop at home. I switched to a different tty
> (to get to a login prompt) and I notice audit messages there. auditd is
> running.
> 
> This might be an interesting data point.

Possibly, but I'm not sure how helpful it will be at this point; there have been a *lot* of changes upstream.
Comment 34 Paul Moore 2017-06-09 13:01:47 EDT
One quick update: defeating the audit printk rate limit would appear to make the problem disappear on my test system.  Considering that only kauditd_thread should be the only one calling into that function here it tends to lend credibility to the theory in comment #31.

Resuming testing with the rate limiter restored and some instrumentation in auditd_reset().
Comment 35 Paul Moore 2017-06-09 14:27:47 EDT
I need to preface this by saying that I don't have any confirmation that this is indeed the problem, however, looking at the code I think I may have found the problem ...

This issue is partly due to the read-copy nature of RCU, and partly due to how we sync the auditd_connection state across kauditd_thread and the audit control channel.  The kauditd_thread thread is always running so it can service the record queues and emit the multicast messages, if it happens to be just past the "main_queue" label, but before the "if (sk == NULL || ...)" if-statement which calls auditd_reset() when the new auditd connection is registered it could end up resetting the auditd connection, regardless of if it is valid or not.  This is a rather small window and the variable nature of multi-core scheduling explains why this is proving rather difficult to reproduce.

I just had this thought, so I haven't fully fleshed out the possible solutions, but possible solutions include one, or more, of the following:

* Don't call auditd_reset() from kauditd_thread if the local copy of the connection state is disconnected

* Pass an optional auditd_connection pointer to auditd_reset() and only do the reset if the pointer is current

* Other/TBD

I'll follow up with patches/test-builds once I have a reasonable solution (the above may or may not be it).
Comment 36 Paul Moore 2017-06-12 17:16:34 EDT
Today's update: I believe I have a fix, my dev kernel has been running for a few hours now in a rapid reboot cycle and I've yet to trigger the problem.  I'll attach the current patch in a moment, but here is a COPR build (currently building) for a F26 kernel with the patch:

* https://copr.fedorainfracloud.org/coprs/build/564483
Comment 37 Paul Moore 2017-06-12 17:17 EDT
Created attachment 1287102 [details]
01-XXX-audit-reset_fix.patch
Comment 38 Paul Moore 2017-06-12 17:21:58 EDT
Nevermind, scratch all that about today's update ... while attaching the patch I noticed a small problem with auditd unregistering itself.
Comment 39 Paul Moore 2017-06-13 15:22:13 EDT
Today's update: Let's try this again.  I've updated the patch and the test kernel has been chugging away for several hours now with no sign of the problem.  Dusty, any chance you could try the kernel below to see if it solves the problem on your system?

* https://copr.fedorainfracloud.org/coprs/pcmoore/kernel-testing/build/564499
Comment 40 Paul Moore 2017-06-13 15:23 EDT
Created attachment 1287438 [details]
04-XXX-audit-reset_fix.patch
Comment 41 Paul Moore 2017-06-16 11:57:06 EDT
I merged the following patch upstream and will be sending it to Linus during the next merge window (it doesn't seem significant enough to warrant a -stable backport upstream):

  commit c81be52a3ac0267aa830a2c4cb769030ea3483c9
  Author: Paul Moore <paul@paul-moore.com>
  Date:   Mon Jun 12 09:35:24 2017 -0400

    audit: fix a race condition with the auditd tracking code
    
    Originally reported by Adam and Dusty, it appears we have a small
    race window in kauditd_thread(), as documented in the Fedora BZ:
    
     * https://bugzilla.redhat.com/show_bug.cgi?id=1459326#c35
    
     "This issue is partly due to the read-copy nature of RCU, and
      partly due to how we sync the auditd_connection state across
      kauditd_thread and the audit control channel.  The kauditd_thread
      thread is always running so it can service the record queues and
      emit the multicast messages, if it happens to be just past the
      "main_queue" label, but before the "if (sk == NULL || ...)"
      if-statement which calls auditd_reset() when the new auditd
      connection is registered it could end up resetting the auditd
      connection, regardless of if it is valid or not.  This is a rather
      small window and the variable nature of multi-core scheduling
      explains why this is proving rather difficult to reproduce."
    
    The fix is to have functions only call auditd_reset() when they
    believe that the kernel/auditd connection is still valid, e.g.
    non-NULL, and to have these callers pass their local copy of the
    auditd_connection pointer to auditd_reset() where it can be compared
    with the current connection state before resetting.  If the caller
    has a stale state tracking pointer then the reset is ignored.
    
    We also make a small change to kauditd_thread() so that if the
    kernel/auditd connection is dead we skip the retry queue and send the
    records straight to the hold queue.  This is necessary as we used to
    rely on auditd_reset() to occasionally purge the retry queue but we
    are going to be calling the reset function much less now and we want
    to make sure the retry queue doesn't grow unbounded.
    
    Reported-by: Adam Williamson <awilliam@redhat.com>
    Reported-by: Dusty Mabe <dustymabe@redhat.com>
    Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
    Signed-off-by: Paul Moore <paul@paul-moore.com>
Comment 42 Paul Moore 2017-06-16 11:58:53 EDT
A link to the commit from comment #41 in the audit/next repo in case we want to backport this to Fedora before it gets picked up in 4.13.

http://git.infradead.org/users/pcmoore/audit/commit/c81be52a3ac0267aa830a2c4cb769030ea3483c9
Comment 43 Paul Moore 2017-06-16 12:30:52 EDT
Notified the Fedora Kernel list about the bug and the fix:

https://lists.fedoraproject.org/archives/list/kernel@lists.fedoraproject.org/thread/FGSKQW5KEYTELI33ELRAD43RXDV62ECR

I'm going to go ahead and put this BZ in POST as I'm not sure there is much more for me to do with this BZ from a Fedora perspective.
Comment 44 Paul Moore 2017-06-16 13:05:05 EDT
From Laura Abbott:

> Given it was important enough to file a bugzilla, I went ahead and
> applied it to F24/F25/F26 since it went cleanly. It should show up
> in the next kernel version.

See link to mailing list thread in comment #43.

Marking this as CLOSED/NEXTRELEASE.
Comment 45 Fedora Update System 2017-06-19 19:06:16 EDT
kernel-4.11.6-300.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-1225995344
Comment 46 Fedora Update System 2017-06-19 19:08:05 EDT
kernel-4.11.6-200.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2017-b93e6de389
Comment 47 Fedora Update System 2017-06-19 19:10:23 EDT
kernel-4.11.6-100.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2017-79f099cbba
Comment 48 David H. Gutteridge 2017-06-19 23:00:19 EDT
(In reply to Dusty Mabe from comment #32)
> Never really paid any attention to this before, but I think I notice this
> same problem on my Fedora 25 desktop at home. I switched to a different tty
> (to get to a login prompt) and I notice audit messages there. auditd is
> running.
> 
> This might be an interesting data point.

FWIW, I reported this as https://bugzilla.redhat.com/show_bug.cgi?id=1449831, but didn't have success reproducing it on demand.
Comment 49 Adam Williamson 2017-06-20 13:03:23 EDT
Thanks a lot, Paul and Laura!
Comment 50 Fedora Update System 2017-06-21 00:28:19 EDT
kernel-4.11.6-200.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-b93e6de389
Comment 51 Fedora Update System 2017-06-21 02:22:49 EDT
kernel-4.11.6-300.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-1225995344
Comment 52 Fedora Update System 2017-06-21 10:01:05 EDT
kernel-4.11.6-301.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-d3ed702fe4
Comment 53 Fedora Update System 2017-06-21 10:03:38 EDT
kernel-4.11.6-201.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2017-d7bc1b3056
Comment 54 Fedora Update System 2017-06-21 10:05:34 EDT
kernel-4.11.6-101.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2017-05f10e29f4
Comment 55 Fedora Update System 2017-06-22 23:25:09 EDT
kernel-4.11.6-101.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-05f10e29f4
Comment 56 Fedora Update System 2017-06-22 23:27:18 EDT
kernel-4.11.6-201.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-d7bc1b3056
Comment 57 Fedora Update System 2017-06-23 02:23:32 EDT
kernel-4.11.6-301.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-d3ed702fe4
Comment 58 Fedora Update System 2017-06-23 15:51:08 EDT
kernel-4.11.6-101.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.
Comment 59 Fedora Update System 2017-06-23 16:52:31 EDT
kernel-4.11.6-201.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.
Comment 60 Fedora Update System 2017-06-23 23:06:21 EDT
kernel-4.11.6-301.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.
Comment 61 Dusty Mabe 2017-06-28 15:12:14 EDT
i think needinfo is no longer needed

Note You need to log in before you can comment on or make changes to this bug.