Hide Forgot
Description of problem: While testing Common Criteria requirements we observed a huge memory consumption of audit in a specific circumstances. First let me describe our testing scenario. We are simulating a disk error for a partition containing audit log, the machine under test is a target for remote audit logging. 1. Disk error is simulated by setting specific selinux type to audit log (games_data_t) so that selinux prevents access to auditd to the log file - this is recognized as a disk error by auditd. 2. A simple audit event is then sent via audit-remote binary to see that a syslog reports disk error message. On physical hardware we never experienced memory issues. With virtual hardware using either RHEV or PowerKVM with several CPU cores (8 in our case) we did not see any issues either. Recently we used a virtual machine with just a single core and after step (1) memory consumption of auditd starts to grow rapidly: %CPU RSS ======== ... 0.0 6272 0.0 6272 # Step 1 completed. 1.2 270784 2.9 647040 4.6 945216 6.2 1024384 7.8 1135232 9.1 1272000 10.5 1449344 11.9 1626112 13.3 1649920 14.6 1716672 15.8 1756416 # Step 2 started. 17.1 1756608 18.3 1782656 19.5 1921984 20.5 1987840 21.6 1997568 22.6 2079744 22.9 2233600 23.1 2233664 # Step 2 finished. 0.0 6208 # Context of audit log fixed and auditd restarted. 0.0 6336 ... Dmesg is full of these messages [74767.758663] audit: audit_backlog=8194 > audit_backlog_limit=8192 [74767.758664] audit: audit_lost=1985157 audit_rate_limit=0 audit_backlog_limit=8192 [74767.758664] audit: backlog limit exceeded [74767.758742] audit: audit_backlog=8195 > audit_backlog_limit=8192 [74773.117450] audit_log_start: 672200 callbacks suppressed [74773.117453] audit: audit_backlog=8194 > audit_backlog_limit=8192 [74773.117455] audit: audit_lost=2209225 audit_rate_limit=0 audit_backlog_limit=8192 [74773.117456] audit: backlog limit exceeded In step (2) memory continues to grow but audit-remote logging seems not to play any significant role (memory grows even without it). In some cases auditd consumes more than 5GB of memory and system becomes unusable eventually. We tried to experiment with various setting of virtual machine and (probably) the only thing that makes difference is the number of CPU cores. Increasing it slightly - to 2 does not help at all, increasing it significantly - to 8 completely removes the issue and memory consumption remains almost constant during the test cases (~5-6MB), only a fixed number of lost audit messages is reported by kernel with relatively low numbers. We also experienced scheduler issues during the memory growing. Version-Release number of selected component (if applicable): audit-2.4.1-5.el7 kernel-3.10.0-229.7.2.el7 glibc-2.17-78.el7 How reproducible: 100% on a virtual machine with a single CPU core Steps to Reproduce: 1. See above. Actual results: Unlimited memory consumption of auditd. Expected results: Limited and low memory consumption of auditd. Additional info: * Selinux policy (targeted or mls) makes no difference. * Updating kernel and glibc to the latest z-stream versions does not change anything. * Hypervisor (RHEV or PowerKVM) makes no difference. * We can provide you a testing machine with a reproducer.
Jiri, please free to add any technical details.
This bug is most likely two sided - both kernel and auditd contain a bug.
When this happens, it might be good to get an audit status. Also, does running "auditctl --backlog_wait_time 0" fix the problem?
Meaning that if you set the backlog_wait_time to 0 before running the tests, do you still see unlimited memory consumption?
Is backlog_wait_time supported on RHEL kernels? I forgot to mention that we see this on ppc64le architecture (I do not know yet if it is arch specific or not). # auditctl --backlog_wait_time 0 backlog_wait_time is not supported on your kernel
(In reply to Ondrej Moriš from comment #0) > We also experienced scheduler issues during the memory growing. We really haven't, the latencies were well below extreme - mostly at 10ms, going up to 50ms and occasionally around 100ms, even during memory pressure. My observation was simply (from watching "cat /proc/<auditd>/syscall" in a loop, to not cause too much overhead when tracing) that, when idle, auditd spends most of the time in epoll_wait(2) and when the apparent memory growth is happenning, it does mostly futex(2). The CPU (sys) time consumed also seems to grow with higher memory usage, indicating that auditd has simply "more work to do" / queued. The kernel side of the question would be - is the amount of events disproportional to what should be happening? IOW is the (huge?) amount of events expected? Sorry, not much I can add here.
I have this feeling that its related to bz1296189. In it, they mention the problem on RHEL6/7 and some other patches that start to address the issue. I think some of them are scheduled for 7.2 or 7.3.
In that case I will check if 7.2 is affected or not.
Unfortunately, with RHEL-7.2 kernel 3.10.0-327.el7 the problem is still reproducible.
Could you try testing against kernel-3.10.0-372.el7 or later? I think Bug #1253123 is related.
(In reply to Steve Grubb from comment #10) > Could you try testing against kernel-3.10.0-372.el7 or later? I think Bug > #1253123 is related. I tried the reproducer with 3.10.0-383.el7.ppc64le and the problem is still there, no difference between 327 and 383 unfortunately.
With audit-2.6.2-3 I cannot reproduce the bug. I see the following memory consumption during the test procedure from the description: RSS === ... 6976 6912 # Step 1 completed. 6912 6912 6912 # Step 2 started 6912 ... < 6912 all the time > 6912 6912 # Step 2 finished. 6720 6720 6976 6976 Tested with kernel-3.10.0-383, downgrading to audit-2.4.1-5 re-introduces the problem: RSS === ... 6208 6208 222272 279296 316608 348736 332544 335808 337792 323392 749504 778560 787904 824704 787840 820160 806016 788608 831616 827712 774912 794368 6208 6208 Therefore I will CLOSE this bug as CURRENTRELEASE.