From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041217

Description of problem:
My server has been crippled after each of the past few weekends, with the following symptoms:
- New login attempts hang forever, although existing shells continue to function properly.
- NFS serving still works OK.
- Various cronjobs and CROND tasks that ran over the weekend are hung and unkillable, even with kill -9.
- strace of the hung processes also hangs and cannot be killed.
- Running up2date hangs forever.
- From /var/log/messages:

Aug 28 16:00:00 mtlsrvr1 audbin[12168]: saving binary audit log /var/log/audit.d/bin.3
Aug 28 16:00:00 mtlsrvr1 audbin[12168]: threshold 20.00 exceeded for filesystem /var/log/audit.d/. - free blocks down to 17.61%
Aug 28 16:00:00 mtlsrvr1 auditd[6064]: Notify command /usr/sbin/audbin -S /var/log/audit.d/save.%u -C -T 20% exited with status 1
Aug 28 16:00:00 mtlsrvr1 auditd[6064]: output error
Aug 28 16:00:00 mtlsrvr1 auditd[6064]: output error
Aug 28 16:00:00 mtlsrvr1 auditd[6064]: output error; suspending execution

Last week, I just rebooted the system to recover. This week I eventually tried "service audit stop", and everything unblocked, the cronjobs completed, and the system came back to normal again.

auditd should not crowbar the system in this manner.

Version-Release number of selected component (if applicable):
laus-0.1-70RHEL3

How reproducible:
Always

Steps to Reproduce:
Have your disk full beyond the configured threshold. Go home on Friday and come back in on Monday morning to find the system in a bad state.

Actual Results:
System has been messed up on Mondays for the past 3 weeks.

Expected Results:
System should not be messed up.
Additional info:

# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
# uname -a
Linux mtlsrvr1 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686 i686 i386 GNU/Linux
# rpm -qf /sbin/auditd
laus-0.1-70RHEL3

audit.conf is the default version that comes with RHEL3:

# ls -l /etc/audit/audit.conf
-rw------- 1 root root 2375 Apr 6 22:24 /etc/audit/audit.conf
When auditd detects that saving a log file to be rotated would make the disk free space fall below a configurable threshold (the audbin '-T' option - see 'man audbin' - by default, 20%), it will enter "Suspend Mode". Suspend Mode blocks the current audited operation and any subsequent audited operation until the disk free space is equal to or above the threshold (>= 20% by default); when sufficient disk free space exists, the rotated audit log is saved and all suspended operations are allowed to proceed.

You can also specify an audbin '-N <cmd>' option, to specify that a command should be run on the oldest saved audit log when saving an audit log would cause the threshold to be exceeded; this could move the old saved audit log to a different partition, or process and remove the old saved audit log to free up space (see 'man audbin').

The default action for auditd to take when it detects that free space is exhausted is to enter suspend mode - this is configured with the audit.conf statement:

'... error { action { type = suspend; }; ... '

- see 'man audit.conf' for details. Thus the default configuration for auditd is not to lose audit data if the disk usage thresholds are exceeded.

By default, on a clean install of RHEL-3-U5+, the audit service is NOT enabled after installation of the laus package; however, if it was enabled pre-U5, it would remain enabled after the U5 upgrade.

You should only enable the audit system if you require audit data, have a mechanism in place to deal with the saved audit logs auditd creates, and have configured the audit system to suit your requirements. The audit system default configuration guarantees that, if enabled, no loss of audit data will occur, even when the disk usage threshold is exceeded; this is an essential requirement of the audit subsystem.
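As a rough illustration of the threshold described above, the following Python sketch reports whether a filesystem's free space is under a given percentage. It is not how audbin itself computes the figure: the function names are made up for this example, and using `f_bavail` (blocks available to unprivileged users) rather than `f_bfree` is an assumption.

```python
import os


def free_percent(path):
    """Return the percentage of blocks on path's filesystem that are free.

    Assumption: uses f_bavail (space available to unprivileged users);
    audbin may count free blocks differently.
    """
    st = os.statvfs(path)
    if st.f_blocks == 0:
        # Some virtual filesystems report zero total blocks.
        return 0.0
    return 100.0 * st.f_bavail / st.f_blocks


def below_threshold(free_pct, threshold_pct=20.0):
    """True when free space is under the threshold (default mirrors '-T 20%')."""
    return free_pct < threshold_pct


if __name__ == "__main__":
    # /var/log/audit.d is the saved-log directory from the report above.
    path = "/var/log/audit.d"
    if os.path.isdir(path):
        pct = free_percent(path)
        state = "BELOW" if below_threshold(pct) else "above or at"
        print("%s: %.2f%% free, %s threshold" % (path, pct, state))
```

A check like this could be run from cron to warn before the weekend that the filesystem is close to the point where auditd would suspend.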
So, if you do not require audit data, disable the audit system:

# service audit stop; chkconfig --del audit

If you do require audit data, then use the audbin '-N' and '-T' arguments to implement your preferred audit log rotation mechanism - e.g., this setting in audit.conf:

notify = "/usr/sbin/audbin -S /var/log/audit.d/save.%u -C -T 10% \
    -N '/etc/my_audit_log_analyser %f'";

would, when disk free space on the /var/log/audit.d filesystem falls below 10%, repeatedly run the "/etc/my_audit_log_analyser" program with an argument of the oldest saved audit log in /var/log/audit.d/ until the free space is greater than or equal to 10% of the total disk space; the "my_audit_log_analyser" program would be expected to process and then delete or move each old saved audit log off the saved audit log filesystem.

If you do not mind loss of old audit data, then you could specify a notify command in audit.conf like:

notify = "/usr/sbin/audbin -S /var/log/audit.d/save.%u -C -T 10% \
    -N '/bin/rm -f %f'";

The default configuration of the audit system must be to avoid loss of audit data, by entering Suspend Mode when disk usage thresholds are exceeded for a saved audit log; audit is not enabled by default. There are many ways to configure the audit system to process and/or remove saved audit logs. You should not enable the audit system unless you require audit data and have configured it to implement a saved audit log rotation mechanism.
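For illustration only, a program in the role of "/etc/my_audit_log_analyser" might look like the minimal Python sketch below, which moves the saved log it is given to an archive directory. The archive path and function name are hypothetical, the archive is assumed to live on a different filesystem (otherwise no space is freed), and the code targets a current Python 3 rather than the much older interpreter shipped with RHEL 3.

```python
import os
import shutil
import sys

# Hypothetical location; assumed to be on a different filesystem than
# /var/log/audit.d so that moving a log actually frees space there.
ARCHIVE_DIR = "/srv/audit-archive"


def archive_saved_log(saved_log, archive_dir=ARCHIVE_DIR):
    """Move one saved audit log into archive_dir and return its new path."""
    os.makedirs(archive_dir, exist_ok=True)
    target = os.path.join(archive_dir, os.path.basename(saved_log))
    if os.path.exists(target):
        # Refuse to overwrite an archived log: losing audit data is
        # exactly what this setup is meant to avoid.
        raise FileExistsError(target)
    # shutil.move copies and then removes the source when the
    # destination is on another filesystem.
    shutil.move(saved_log, target)
    return target


# Intended to be called with the oldest saved log as its only argument,
# i.e. the value substituted for %f in the '-N' command.
if __name__ == "__main__" and len(sys.argv) == 2:
    archive_saved_log(sys.argv[1])
```

If the move fails, the script exits with an error and the log stays where it is, so no audit data is lost; how audbin reacts to a failing '-N' command is not covered here, so check 'man audbin' before relying on this.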
OK, I understand all that. But the system gets into such a messed-up state that you can't even log in to it!!! Is this the intent??? Surely the goal of not losing the audit data can be achieved without hanging the system in this manner? It seems to me that even an audit data consumption task could hang at this point, thereby deadlocking the system.

I consider this to be a bug. At the very least, it's a bug that was fixed in U5 when you turned it off by default.

In any case, thanks for the info. I will disable auditd on my systems to avoid this problem. The system was installed at U3, so it came enabled by default.
Yes, if audit finds it does not have sufficient disk space (as configured) to save an audit log file, the only reasonable action it can take, by default, is to enter suspend mode. Any other action would result in loss of audit data.

People who care about auditing are quite happy that the system CANNOT BE USED AT ALL unless auditing is enabled and audit data is being saved successfully - this is what the audit system was designed to do. For those who do not care about auditing, I agree that this is a major pain - that is why audit is no longer enabled by default.

We cannot ship an audit system that, by default, when enabled, has the potential to lose audit data; when enabled, suspend mode must be the default until users configure audit to implement a site-specific audit log rotation mechanism.