From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461)

Description of problem:
We recently applied the updated kernel and laus RPMs suggested as a response to our bug 121459. Now that the system is stable, we've begun to do some testing. When we run a regression test for some of our software with auditing turned off, the test completes in approximately 4 minutes. When we run the same test with auditing enabled, the test takes well over an hour to complete. The system is spending a lot of time in iowait. The top command shows that the state of auditd is "D" - uninterruptible sleep (usually I/O wait). Is this a known problem?

We plan to update our systems with RHEL 3 Update 2 shortly after the official release of the product because our NASA security policies require the audit function, but the current performance impact of LAuS is not acceptable. My audit configuration was included in bug 121459. I will attach the same group of files to this bug report.

The regression test that we run creates and deletes a large number of files. It is a real test of software, not intentionally designed to stress auditing. We can probably provide the regression test, or perhaps a similar test if needed.

Version-Release number of selected component (if applicable):
laus-0.1-54RHEL3, kernel-smp-2.4.21-14.EL

How reproducible:
Always

Steps to Reproduce:
We can provide a test if required. It currently runs under a specific user path.

Additional info:
Created attachment 99772 [details] gzipped tar of audit configuration files The relevant files in this attachment are the audit configuration files (audit.conf, filter.conf, filesets.conf).
Created attachment 99774 [details] Sample testcase and output of test with and without audit As a simpler test than our original test case, I ran the script "testit" with auditing enabled and again with auditing disabled. The attachment shows the script, as well as the resulting run time with audit enabled, then disabled.
As an example, for a tenfold increase you would issue the following command from the bash prompt:

    echo 10240 > /proc/sys/dev/audit/max-messages

After you issue the command you can cat the contents, i.e.

    cat /proc/sys/dev/audit/max-messages

to verify your change before you run your test(s).
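For anyone scripting this across several test machines, the write-then-verify step can be wrapped in a small helper. This is only a sketch: `set_tunable` is a hypothetical name, the path is the one quoted in this comment (it exists only on kernels carrying the LAuS patch), and writing to it requires root.

```python
# Path quoted in the comment above; assumed to exist only on LAuS-patched kernels.
MAX_MESSAGES_PATH = "/proc/sys/dev/audit/max-messages"


def set_tunable(path, value):
    """Write an integer to a sysctl-style file, then read it back to verify.

    Hypothetical helper: mirrors `echo VALUE > path` followed by `cat path`.
    Returns the value read back so the caller can compare it with the request.
    """
    with open(path, "w") as f:
        f.write(f"{value}\n")
    with open(path) as f:
        return int(f.read().strip())
```

Run as root, `set_tunable(MAX_MESSAGES_PATH, 10240)` should return 10240 if the kernel accepted the new value.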
*** Bug 123372 has been marked as a duplicate of this bug. ***
I've run the testit script in a number of scenarios. There is some improvement using larger kernel message buffers but that clearly won't scale on a busy, busy system. One thing that I've noticed is that auditd does an fsync() after every single audit record is written. This is very safe and conservative. It's also possible to turn this behavior off in /etc/audit/audit.conf. When I do so I get dramatically better real-time results on the test load:

          fsync = yes   fsync = no   audit disabled
    real: 3m12.907s     0m10.132s    0m09.420s
    user: 0m00.870s     0m01.810s    0m00.250s
    sys:  0m03.020s     0m06.080s    0m01.640s

There is a risk of losing records left in the buffer cache on a system crash. Is this a risk you're willing to take?

ccb
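The cost of an fsync() per record can be seen outside of auditd with a rough timing sketch like the one below. It is not auditd's code: the function names, record contents, and file names are made up for illustration, and absolute timings will depend heavily on the disk and filesystem.

```python
import os
import tempfile
import time


def write_records(path, records, sync_each):
    """Append records to path and return the elapsed wall-clock seconds.

    When sync_each is true, every record is flushed and fsync()ed before the
    next one is written, which is the per-record behaviour described above.
    """
    start = time.perf_counter()
    with open(path, "ab") as f:
        for rec in records:
            f.write(rec)
            if sync_each:
                f.flush()              # push Python's buffer to the kernel
                os.fsync(f.fileno())   # force the kernel to write to disk
    return time.perf_counter() - start


def compare(n=200):
    """Time n illustrative records with and without per-record fsync."""
    records = [b"audit record %d\n" % i for i in range(n)]
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for label, sync_each in (("sync", True), ("nosync", False)):
            log = os.path.join(d, label + ".log")
            results[label] = write_records(log, records, sync_each)
    return results


if __name__ == "__main__":
    for label, seconds in compare().items():
        print(f"{label}: {seconds:.4f}s")
```

On a disk-backed filesystem the "sync" run is normally much slower, since each record waits for the disk; on tmpfs the two may be nearly identical.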
I've retested one of our regression tests with sync set to no. The performance is significantly better for our test case as well. If the audit daemon does not do an fsync after each record is written, roughly how many records might be left in the buffer cache if a system crashes? I'd like to better understand the risk before we decide whether to change the sync setting. Could we talk about the sync option and its impacts during the conference call tomorrow?
One thing it might help to review is the section on bdflush parameters in section 2.4 of /usr/src/linux-2.4/Documentation/filesystems/proc.txt. These control the execution of the bdflush and kupdated kernel threads, allowing you to specify how much of the buffer cache can be dirty before flushing buffers out, how old buffers have to be to be automatic candidates for flushing, etc. The long and short of it is that a system crash will likely cause the loss of audit records. When the sync parameter is ON, you can lose as many as max-messages records waiting for auditd to copy them to user space and push them back into the kernel for write. When the sync parameter is OFF, they could be either in the kernel audit record buffer or in the filesystem buffer cache. Perhaps we'll get suitable performance by turning off sync and tuning the buffer cache so that the lossage is bounded by some predetermined upper limit.
Could you provide some documentation about the /etc/pam.d file changes required to enable audit? Also, have you received approval to modify the sync parameter to support intermittent syncs to disk (based on record count)?
Let me answer the second of these first. I have approval to modify the sync parameter *and* approval to set the default to "no". It is not necessary to have synchronous writes to the audit log to pass EAL3/CAPP certification. That being said, intermittent sync is in our EAL3 certification copy of laus. The Evaluated Configuration Guide states that synchronous writes are available at a substantial penalty to performance. I'm running with "sync-after = 20" and it's great. I've opened a bugzilla (123955) to cover the documentation inadequacies.
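To make the trade-off concrete, here is a minimal sketch of record-count-based intermittent sync. It assumes that "sync-after = N" means "fsync once every N records", which is my reading of the option name and of the earlier question about syncing by record count, not something confirmed from the laus source. The class name and its fields are hypothetical.

```python
import os
import tempfile


class BatchedSyncWriter:
    """Append records to a log and fsync once every `sync_after` records.

    Illustrative only; assumes sync-after = N means "fsync every N records".
    """

    def __init__(self, path, sync_after=20):
        self._f = open(path, "ab")
        self.sync_after = sync_after
        self.pending = 0      # records written since the last fsync
        self.sync_count = 0   # number of fsync calls issued so far

    def write(self, record):
        self._f.write(record)
        self.pending += 1
        if self.pending >= self.sync_after:
            self.sync()

    def sync(self):
        self._f.flush()
        os.fsync(self._f.fileno())
        self.pending = 0
        self.sync_count += 1

    def close(self):
        if self.pending:
            self.sync()       # do not leave a partial batch unsynced
        self._f.close()


def demo(n=45, sync_after=20):
    """Write n records and report (fsync calls, records still unsynced)."""
    with tempfile.TemporaryDirectory() as d:
        w = BatchedSyncWriter(os.path.join(d, "audit.log"), sync_after)
        for i in range(n):
            w.write(b"audit record %d\n" % i)
        state = (w.sync_count, w.pending)
        w.close()
    return state
```

Under that assumption, at most `sync_after - 1` written records are waiting on an fsync at any moment, in addition to whatever is still queued in the kernel audit buffer.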