Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1324833

Summary:	unlimited memory consumption in audit with small number of cpu cores
Product:	Red Hat Enterprise Linux 7	Reporter:	Ondrej Moriš <omoris>
Component:	audit	Assignee:	Steve Grubb <sgrubb>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Ondrej Moriš <omoris>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.2	CC:	jjaburek, mvadkert, omoris, pmoore
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-07-11 15:52:09 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ondrej Moriš 2016-04-07 11:57:22 UTC

Description of problem:

While testing Common Criteria requirements we observed a huge memory consumption of audit in a specific circumstances. First let me describe our testing scenario. 

We are simulating a disk error for a partition containing audit log, the machine under test is a target for remote audit logging. 

1. Disk error is simulated by setting specific selinux type to audit log (games_data_t) so that selinux prevents access to auditd to the log file - this is recognized as a disk error by auditd. 

2. A simple audit event is then sent via audit-remote binary to see that a syslog reports disk error message.

On physical hardware we never experienced memory issues. With virtual hardware using either RHEV or PowerKVM with several CPU cores (8 in our case) we did not see any issues either. Recently we used a virtual machine with just a single core and after step (1) memory consumption of auditd starts to grow rapidly:

%CPU RSS
========
...
0.0  6272
0.0  6272     # Step 1 completed.
1.2  270784   
2.9  647040
4.6  945216
6.2  1024384
7.8  1135232
9.1  1272000
10.5 1449344
11.9 1626112
13.3 1649920
14.6 1716672
15.8 1756416  # Step 2 started.
17.1 1756608
18.3 1782656
19.5 1921984
20.5 1987840
21.6 1997568
22.6 2079744
22.9 2233600
23.1 2233664  # Step 2 finished.
0.0  6208     # Context of audit log fixed and auditd restarted.
0.0  6336
...

Dmesg is full of these messages

[74767.758663] audit: audit_backlog=8194 > audit_backlog_limit=8192
[74767.758664] audit: audit_lost=1985157 audit_rate_limit=0 audit_backlog_limit=8192
[74767.758664] audit: backlog limit exceeded
[74767.758742] audit: audit_backlog=8195 > audit_backlog_limit=8192
[74773.117450] audit_log_start: 672200 callbacks suppressed
[74773.117453] audit: audit_backlog=8194 > audit_backlog_limit=8192
[74773.117455] audit: audit_lost=2209225 audit_rate_limit=0 audit_backlog_limit=8192
[74773.117456] audit: backlog limit exceeded

In step (2) memory continues to grow but audit-remote logging seems not to play any significant role (memory grows even without it). In some cases auditd consumes more than 5GB of memory and system becomes unusable eventually.

We tried to experiment with various setting of virtual machine and (probably) the only thing that makes difference is the number of CPU cores. Increasing it slightly - to 2 does not help at all, increasing it significantly - to 8 completely removes the issue and memory consumption remains almost constant during the test cases (~5-6MB), only a fixed number of lost audit messages is reported by kernel with relatively low numbers.

We also experienced scheduler issues during the memory growing.

Version-Release number of selected component (if applicable):

audit-2.4.1-5.el7
kernel-3.10.0-229.7.2.el7
glibc-2.17-78.el7

How reproducible:

100% on a virtual machine with a single CPU core 

Steps to Reproduce:

1. See above.

Actual results:

Unlimited memory consumption of auditd.

Expected results:

Limited and low memory consumption of auditd.

Additional info:

* Selinux policy (targeted or mls) makes no difference.

* Updating kernel and glibc to the latest z-stream versions does not change anything.

* Hypervisor (RHEV or PowerKVM) makes no difference.

* We can provide you a testing machine with a reproducer.

Comment 1 Ondrej Moriš 2016-04-07 11:58:03 UTC

Jiri, please free to add any technical details.

Comment 2 Ondrej Moriš 2016-04-07 11:58:47 UTC

This bug is most likely two sided - both kernel and auditd contain a bug.

Comment 3 Steve Grubb 2016-04-07 12:07:27 UTC

When this happens, it might be good to get an audit status. Also, does running "auditctl --backlog_wait_time 0" fix the problem?

Comment 4 Steve Grubb 2016-04-07 12:18:00 UTC

Meaning that if you set the backlog_wait_time to 0 before running the tests, do you still see unlimited memory consumption?

Comment 5 Ondrej Moriš 2016-04-07 12:28:35 UTC

Is backlog_wait_time supported on RHEL kernels? I forgot to mention that we see this on ppc64le architecture (I do not know yet if it is arch specific or not). 

# auditctl --backlog_wait_time 0
backlog_wait_time is not supported on your kernel

Comment 6 Jiri Jaburek 2016-04-07 12:32:46 UTC

(In reply to Ondrej Moriš from comment #0)
> We also experienced scheduler issues during the memory growing.

We really haven't, the latencies were well below extreme - mostly at 10ms, going up to 50ms and occasionally around 100ms, even during memory pressure.

My observation was simply (from watching "cat /proc/<auditd>/syscall" in a loop, to not cause too much overhead when tracing) that, when idle, auditd spends most of the time in epoll_wait(2) and when the apparent memory growth is happenning, it does mostly futex(2). The CPU (sys) time consumed also seems to grow with higher memory usage, indicating that auditd has simply "more work to do" / queued.

The kernel side of the question would be - is the amount of events disproportional to what should be happening? IOW is the (huge?) amount of events expected?

Sorry, not much I can add here.

Comment 7 Steve Grubb 2016-04-07 12:35:12 UTC

I have this feeling that its related to bz1296189. In it, they mention the problem on RHEL6/7 and some other patches that start to address the issue. I think some of them are scheduled for 7.2 or 7.3.

Comment 8 Ondrej Moriš 2016-04-07 12:50:17 UTC

In that case I will check if 7.2 is affected or not.

Comment 9 Ondrej Moriš 2016-04-07 13:53:19 UTC

Unfortunately, with RHEL-7.2 kernel 3.10.0-327.el7 the problem is still reproducible.

Comment 10 Steve Grubb 2016-04-20 13:29:06 UTC

Could you try testing against kernel-3.10.0-372.el7 or later?  I think Bug #1253123 is related.

Comment 11 Ondrej Moriš 2016-04-28 11:11:25 UTC

(In reply to Steve Grubb from comment #10)
> Could you try testing against kernel-3.10.0-372.el7 or later?  I think Bug
> #1253123 is related.

I tried the reproducer with 3.10.0-383.el7.ppc64le and the problem is still there, no difference between 327 and 383 unfortunately.

Comment 14 Ondrej Moriš 2016-07-11 15:52:09 UTC

With audit-2.6.2-3 I cannot reproduce the bug. I see the following memory consumption during the test procedure from the description:

RSS
===
...
6976
6912 # Step 1 completed.
6912  
6912
6912 # Step 2 started
6912
...  < 6912 all the time >
6912 
6912 # Step 2 finished.
6720
6720
6976
6976

Tested with kernel-3.10.0-383, downgrading to audit-2.4.1-5 re-introduces the problem:

RSS
===
...
6208
6208
222272
279296
316608
348736
332544
335808
337792
323392
749504
778560
787904
824704
787840
820160
806016
788608
831616
827712
774912
794368
6208
6208

Therefore I will CLOSE this bug as CURRENTRELEASE.