Bug 167035 - auditd cripples system when disk space threshold exceeded
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: laus
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jason Vas Dias
QA Contact: Jay Turner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-08-29 18:17 UTC by Stephen Malowany
Modified: 2015-01-08 00:10 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-08-29 19:17:29 UTC
Target Upstream Version:
Embargoed:



Description Stephen Malowany 2005-08-29 18:17:38 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041217

Description of problem:
My server has been crippled after the past few weekends with the
following symptoms.

- New login attempts hang forever, although existing shells continue
  to function properly.
- NFS serving still works OK.
- Various cronjobs and CROND tasks that ran over the weekend are hung
  and unkillable even with kill -9
- strace of the hung processes hangs also, unable to be killed.
- Running up2date hangs forever
- From /var/log/messages:

Aug 28 16:00:00 mtlsrvr1 audbin[12168]: saving binary audit log /var/log/audit.d/bin.3
Aug 28 16:00:00 mtlsrvr1 audbin[12168]: threshold 20.00 exceeded for filesystem /var/log/audit.d/. - free blocks down to 17.61%
Aug 28 16:00:00 mtlsrvr1 auditd[6064]: Notify command /usr/sbin/audbin -S /var/log/audit.d/save.%u -C -T 20% exited with status 1
Aug 28 16:00:00 mtlsrvr1 auditd[6064]: output error
Aug 28 16:00:00 mtlsrvr1 auditd[6064]: output error
Aug 28 16:00:00 mtlsrvr1 auditd[6064]: output error; suspending execution

Last week, I just rebooted the system to recover. This week I eventually
tried "service audit stop", and everything unblocked, the cronjobs completed,
and the system came back to normal again. auditd should not crowbar the system
in this manner.



Version-Release number of selected component (if applicable):
laus-0.1-70RHEL3

How reproducible:
Always

Steps to Reproduce:
Have your disk full beyond the configured threshold.
Go home on Friday and come back in on Monday morning to find
the system in a bad state.


Actual Results:  System has been messed up on Mondays for the past 3 weeks.


Expected Results:  System should not be messed up.


Additional info:

# cat /etc/redhat-release 
Red Hat Enterprise Linux AS release 3 (Taroon Update 5)

# uname -a
Linux mtlsrvr1 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686 i686 i386 GNU/Linux

# rpm -qf /sbin/auditd
laus-0.1-70RHEL3

audit.conf is the default version that comes with RHEL3
# ls -l /etc/audit/audit.conf
-rw-------    1 root     root         2375 Apr  6 22:24 /etc/audit/audit.conf

Comment 1 Jason Vas Dias 2005-08-29 19:17:29 UTC
When auditd detects that saving a log file to be rotated would make the disk
free space fall below a configurable threshold (the audbin '-T' option -
see 'man audbin' - by default, 20%) it will enter "Suspend Mode".

Suspend Mode blocks the current audited operation and any subsequent audited
operation until the disk free space is equal to or above the threshold (>= 20%
by default); when sufficient disk free space exists, the rotated audit log is
saved and all suspended operations are allowed to proceed.
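
The threshold comparison described above can be watched for from outside
auditd, for example from a cron job that warns before Suspend Mode is reached.
The following is only a minimal sketch: the function names free_pct and
below_threshold are hypothetical, and the exact calculation audbin performs
internally is an assumption here, not something taken from its source.

```shell
#!/bin/sh
# Sketch only: approximates the "free blocks down to N%" check seen in the
# audbin log messages. Function names are hypothetical, not part of laus.

# free_pct FREE_BLOCKS TOTAL_BLOCKS
# Prints the free space as a percentage with two decimals.
free_pct() {
    awk -v f="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * f / t }'
}

# below_threshold PERCENT THRESHOLD
# Succeeds (exit 0) when PERCENT is strictly below THRESHOLD.
below_threshold() {
    awk -v p="$1" -v th="$2" 'BEGIN { exit !(p < th) }'
}
```

A possible use, reading the available and total block counts from the
POSIX-format output of df:

```shell
set -- $(df -P /var/log/audit.d | awk 'NR == 2 { print $4, $2 }')
pct=$(free_pct "$1" "$2")
if below_threshold "$pct" 20; then
    echo "audit log filesystem free space is ${pct}%, below 20%" >&2
fi
```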

You can also specify an audbin '-N <cmd>' option, to specify that a command
should be run on the oldest saved audit log when saving an audit log would 
cause the threshold to be exceeded; this could move the old saved audit log
to a different partition, or process and remove the old saved audit log to
free up space (see 'man audbin').

The default action for auditd to take when it detects that free space is
exhausted is to enter suspend mode - this is configured with the audit.conf
statement: '... error { action { type = suspend; }; ... ' - see 'man audit.conf'
for details.

Thus the default configuration for auditd is not to lose audit data if 
the disk usage thresholds are exceeded.

By default, on a clean install of RHEL-3-U5+, the audit service is NOT enabled
after installation of the laus package; however, if it was enabled pre-U5, it 
would remain enabled after U5 upgrade.

You should only enable the audit system if you require audit data and have 
a mechanism in place to deal with the saved audit logs auditd creates, and
have configured the audit system to suit your requirements.

The audit system default configuration guarantees that, if enabled, no loss
of audit data will occur, even when the disk usage threshold is exceeded; 
this is an essential requirement of the audit subsystem.
 
So, if you do not require audit data, disable the audit system:
  # service audit stop; chkconfig --del audit

If you do require audit data, then use the audbin '-N' and '-T' arguments
to implement your preferred audit log rotation mechanism - e.g., this setting
in audit.conf:
'
  notify = "/usr/sbin/audbin -S /var/log/audit.d/save.%u -C -T 10% \
           -N '/etc/my_audit_log_analyser %f'";
'
When disk free space on the /var/log/audit.d filesystem falls below 10%, this
would repeatedly run the "/etc/my_audit_log_analyser" program, passing it the
oldest saved audit log in /var/log/audit.d/ as its argument, until the free
space is greater than or equal to 10% of the total disk space; the
"my_audit_log_analyser" program would be expected to process and then delete
or move each old saved audit log off the saved audit log filesystem.
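
As an illustration of what such a '-N' handler might look like, here is a
minimal sketch. It assumes the handler receives the saved log path (the %f
substitution) as its first argument, and it compresses that log into an
archive directory on another partition before removing the original. The
function name archive_audit_log, the AUDIT_ARCHIVE_DIR variable and the
/archive/audit default are all hypothetical.

```shell
#!/bin/sh
# Hypothetical /etc/my_audit_log_analyser: a sketch, not shipped with laus.

# archive_audit_log SRC DEST_DIR
# Compresses SRC into DEST_DIR, then removes SRC to free space.
# Returns non-zero, leaving SRC in place, if any step fails.
archive_audit_log() {
    src="$1"
    dest_dir="$2"
    [ -f "$src" ] || return 1
    mkdir -p "$dest_dir" || return 1
    gzip -c "$src" > "$dest_dir/$(basename "$src").gz" || return 1
    rm -f "$src"
}

# When called with a log file argument, archive that file.
if [ "$#" -ge 1 ]; then
    archive_audit_log "$1" "${AUDIT_ARCHIVE_DIR:-/archive/audit}"
fi
```

The original is removed only after the compressed copy has been written, so a
failure (for example, a full archive partition) leaves the saved log untouched.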

If you do not mind loss of old audit data, then you could specify a notify 
command in audit.conf like:
'  
notify = "/usr/sbin/audbin -S /var/log/audit.d/save.%u -C -T 10% \
           -N '/bin/rm -f %f'";
'

The default configuration of the audit system must be to avoid loss of audit
data, by entering Suspend Mode when disk usage thresholds are exceeded for 
a saved audit log; audit is not enabled by default.

There are many ways to configure the audit system to process and/or remove
saved audit logs.

You should not enable the audit system unless you require audit data and have
configured it to implement a saved audit log rotation mechanism.

Comment 2 Stephen Malowany 2005-08-29 19:42:55 UTC
OK, I understand all that. But the system gets into such
a messed up state that you can't even log in to it!!!
Is this the intent??? Surely, the goal of not losing the audit
data can be achieved without hanging the system in this manner?
It seems to me that even an audit data consumption task
could hang at this point, thereby deadlocking the system.
I consider this to be a bug. At the very least, it's a bug
that was fixed in U5 when you turned it off by default.

In any case, thanks for the info. I will disable auditd
on my systems to avoid this problem. The system was installed
at U3, so it came in enabled by default.


Comment 3 Jason Vas Dias 2005-08-29 19:58:53 UTC
Yes, if audit finds it does not have sufficient disk space (as configured) to
save an audit log file, the only reasonable action it can take, by default, is
to enter suspend mode. Any other action would result in loss of audit data.
People who care about auditing are quite happy that the system CANNOT BE USED
AT ALL unless auditing is enabled and audit data is being saved successfully -
this is what the audit system was designed to do. For those who do not care
about auditing, I agree that this is a major pain - that is why audit is no
longer enabled by default. We cannot ship an audit system that, by default, when
enabled, has the potential to lose audit data; when enabled, suspend mode must
be the default until users configure audit to implement a site specific audit
log rotation mechanism.

