Bug 1324833 - unlimited memory consumption in audit with small number of cpu cores
Summary: unlimited memory consumption in audit with small number of cpu cores
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: audit
Version: 7.2
Hardware: All
OS: Linux
medium
medium
Target Milestone: rc
: ---
Assignee: Steve Grubb
QA Contact: Ondrej Moriš
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-04-07 11:57 UTC by Ondrej Moriš
Modified: 2020-04-22 08:09 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-11 15:52:09 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description Ondrej Moriš 2016-04-07 11:57:22 UTC
Description of problem:

While testing Common Criteria requirements we observed a huge memory consumption of audit in a specific circumstances. First let me describe our testing scenario. 

We are simulating a disk error for a partition containing audit log, the machine under test is a target for remote audit logging. 

1. Disk error is simulated by setting specific selinux type to audit log (games_data_t) so that selinux prevents access to auditd to the log file - this is recognized as a disk error by auditd. 

2. A simple audit event is then sent via audit-remote binary to see that a syslog reports disk error message.

On physical hardware we never experienced memory issues. With virtual hardware using either RHEV or PowerKVM with several CPU cores (8 in our case) we did not see any issues either. Recently we used a virtual machine with just a single core and after step (1) memory consumption of auditd starts to grow rapidly:

%CPU RSS
========
...
0.0  6272
0.0  6272     # Step 1 completed.
1.2  270784   
2.9  647040
4.6  945216
6.2  1024384
7.8  1135232
9.1  1272000
10.5 1449344
11.9 1626112
13.3 1649920
14.6 1716672
15.8 1756416  # Step 2 started.
17.1 1756608
18.3 1782656
19.5 1921984
20.5 1987840
21.6 1997568
22.6 2079744
22.9 2233600
23.1 2233664  # Step 2 finished.
0.0  6208     # Context of audit log fixed and auditd restarted.
0.0  6336
...

Dmesg is full of these messages

[74767.758663] audit: audit_backlog=8194 > audit_backlog_limit=8192
[74767.758664] audit: audit_lost=1985157 audit_rate_limit=0 audit_backlog_limit=8192
[74767.758664] audit: backlog limit exceeded
[74767.758742] audit: audit_backlog=8195 > audit_backlog_limit=8192
[74773.117450] audit_log_start: 672200 callbacks suppressed
[74773.117453] audit: audit_backlog=8194 > audit_backlog_limit=8192
[74773.117455] audit: audit_lost=2209225 audit_rate_limit=0 audit_backlog_limit=8192
[74773.117456] audit: backlog limit exceeded

In step (2) memory continues to grow but audit-remote logging seems not to play any significant role (memory grows even without it). In some cases auditd consumes more than 5GB of memory and system becomes unusable eventually.

We tried to experiment with various setting of virtual machine and (probably) the only thing that makes difference is the number of CPU cores. Increasing it slightly - to 2 does not help at all, increasing it significantly - to 8 completely removes the issue and memory consumption remains almost constant during the test cases (~5-6MB), only a fixed number of lost audit messages is reported by kernel with relatively low numbers.

We also experienced scheduler issues during the memory growing.

Version-Release number of selected component (if applicable):

audit-2.4.1-5.el7
kernel-3.10.0-229.7.2.el7
glibc-2.17-78.el7

How reproducible:

100% on a virtual machine with a single CPU core 

Steps to Reproduce:

1. See above.

Actual results:

Unlimited memory consumption of auditd.

Expected results:

Limited and low memory consumption of auditd.

Additional info:

* Selinux policy (targeted or mls) makes no difference.

* Updating kernel and glibc to the latest z-stream versions does not change anything.

* Hypervisor (RHEV or PowerKVM) makes no difference.

* We can provide you a testing machine with a reproducer.

Comment 1 Ondrej Moriš 2016-04-07 11:58:03 UTC
Jiri, please free to add any technical details.

Comment 2 Ondrej Moriš 2016-04-07 11:58:47 UTC
This bug is most likely two sided - both kernel and auditd contain a bug.

Comment 3 Steve Grubb 2016-04-07 12:07:27 UTC
When this happens, it might be good to get an audit status. Also, does running "auditctl --backlog_wait_time 0" fix the problem?

Comment 4 Steve Grubb 2016-04-07 12:18:00 UTC
Meaning that if you set the backlog_wait_time to 0 before running the tests, do you still see unlimited memory consumption?

Comment 5 Ondrej Moriš 2016-04-07 12:28:35 UTC
Is backlog_wait_time supported on RHEL kernels? I forgot to mention that we see this on ppc64le architecture (I do not know yet if it is arch specific or not). 

# auditctl --backlog_wait_time 0
backlog_wait_time is not supported on your kernel

Comment 6 Jiri Jaburek 2016-04-07 12:32:46 UTC
(In reply to Ondrej Moriš from comment #0)
> We also experienced scheduler issues during the memory growing.

We really haven't, the latencies were well below extreme - mostly at 10ms, going up to 50ms and occasionally around 100ms, even during memory pressure.

My observation was simply (from watching "cat /proc/<auditd>/syscall" in a loop, to not cause too much overhead when tracing) that, when idle, auditd spends most of the time in epoll_wait(2) and when the apparent memory growth is happenning, it does mostly futex(2). The CPU (sys) time consumed also seems to grow with higher memory usage, indicating that auditd has simply "more work to do" / queued.

The kernel side of the question would be - is the amount of events disproportional to what should be happening? IOW is the (huge?) amount of events expected?

Sorry, not much I can add here.

Comment 7 Steve Grubb 2016-04-07 12:35:12 UTC
I have this feeling that its related to bz1296189. In it, they mention the problem on RHEL6/7 and some other patches that start to address the issue. I think some of them are scheduled for 7.2 or 7.3.

Comment 8 Ondrej Moriš 2016-04-07 12:50:17 UTC
In that case I will check if 7.2 is affected or not.

Comment 9 Ondrej Moriš 2016-04-07 13:53:19 UTC
Unfortunately, with RHEL-7.2 kernel 3.10.0-327.el7 the problem is still reproducible.

Comment 10 Steve Grubb 2016-04-20 13:29:06 UTC
Could you try testing against kernel-3.10.0-372.el7 or later?  I think Bug #1253123 is related.

Comment 11 Ondrej Moriš 2016-04-28 11:11:25 UTC
(In reply to Steve Grubb from comment #10)
> Could you try testing against kernel-3.10.0-372.el7 or later?  I think Bug
> #1253123 is related.

I tried the reproducer with 3.10.0-383.el7.ppc64le and the problem is still there, no difference between 327 and 383 unfortunately.

Comment 14 Ondrej Moriš 2016-07-11 15:52:09 UTC
With audit-2.6.2-3 I cannot reproduce the bug. I see the following memory consumption during the test procedure from the description:

RSS
===
...
6976
6912 # Step 1 completed.
6912  
6912
6912 # Step 2 started
6912
...  < 6912 all the time >
6912 
6912 # Step 2 finished.
6720
6720
6976
6976

Tested with kernel-3.10.0-383, downgrading to audit-2.4.1-5 re-introduces the problem:

RSS
===
...
6208
6208
222272
279296
316608
348736
332544
335808
337792
323392
749504
778560
787904
824704
787840
820160
806016
788608
831616
827712
774912
794368
6208
6208

Therefore I will CLOSE this bug as CURRENTRELEASE.


Note You need to log in before you can comment on or make changes to this bug.