Bug 1345854

Summary:	non-deterministic calls of check_space_left()
Product:	Red Hat Enterprise Linux 7	Reporter:	Ondrej Moriš <omoris>
Component:	audit	Assignee:	Steve Grubb <sgrubb>
Status:	CLOSED ERRATA	QA Contact:	Ondrej Moriš <omoris>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.3	Keywords:	Regression
Target Milestone:	rc
Target Release:	7.3
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-11-04 06:13:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1256920

Description Ondrej Moriš 2016-06-13 10:35:11 UTC

Description of problem:

With audit-2.5 we observed that space_left and admin_space_left actions are _sometimes_ performed only after auditd restart and not immediately when system is (nearly) running low of space. Behaviour was oddly non-deterministic, sometimes it worked just fine for hours and sometimes it was broken for hours. It was a regression from 2.4.

Version-Release number of selected component (if applicable):

audit-2.5.2-1.el7

How reproducible:

Sometimes. Depends on size of audit events.

Steps to Reproduce:

1. Configure admin_space to X and admin_space_action (syslog).

2. Fill partition containing /var/log/audit by audit events so that it has less than X free space.

3. Check /var/log/messages to see that action was triggered.

Actual results:

Action is sometimes not triggered.

Expected results:

Action is always triggered. 

Additional info:

We tracked the issue to auditd-event.c write_to_log() function, its relevant part calling check_space_left() changed between 2.5 and 2.4 as follows:

(2.5)

397	                if (config->daemonize == D_BACKGROUND) {
...
403	                        log_size += rc;
404	                        check_log_file_size();
405	                        // Keep loose tabs on the free space
406	                        if (rc%2 == 0)
407	                                check_space_left();
408	                }

(2.4)

427	                /* check log file size & space left on partition */
428	                if (config->daemonize == D_BACKGROUND) {
...
434	                        log_size += rc;
435	                        check_log_file_size(data);
436	                        check_space_left(data->log_fd, data);
437	                }

Where rc = fprintf(log_file, "%s\n", buf). I am not sure if it is the safe approach to condition check_space_left() calls based either on rc or log_size. It might happen that rc/log_size is odd number during a long sequence of write_to_log() calls. In that case a system might be already well below (admin_)space limit before (admin_)space_action is triggered.

Comment 2 Steve Grubb 2016-06-13 14:03:10 UTC

Is (rc%3 >  2) more deterministic? The idea here is to try to avoid calling fstatfs all the time.

Comment 3 Ondrej Moriš 2016-06-15 11:57:11 UTC

(In reply to Steve Grubb from comment #2)
> Is (rc%3 >  2) more deterministic? The idea here is to try to avoid calling
> fstatfs all the time.

Well, conditioning check_space_left() calls by event length might go wrong when, for instance, a bunch of similar events are generated (with the same length). I understand your point of saving fstatfs calls. The best would be to condition by serial number of an event to make 100% sure that check_space_left() is called at least once during a sequence of events.

Comment 4 Steve Grubb 2016-06-15 16:07:19 UTC

This is the proposed fix:
https://fedorahosted.org/audit/changeset/1266

Comment 5 Ondrej Moriš 2016-06-29 08:35:56 UTC

(In reply to Steve Grubb from comment #4)
> This is the proposed fix:
> https://fedorahosted.org/audit/changeset/1266

Thanks Steve, it looks better now. There is still a very rare case when the function might not be called but a probability it might be hit in the production environment is very close to zero (FTR: I noticed that a typo in the patch was already corrected upstream).

Comment 8 Ondrej Moriš 2016-08-10 10:10:37 UTC

Verified by Common Criteria audit-test on all architectures except ppc64 where results are currently not available, results from x86_64 are as follow:

NEW (audit-2.6.5-2.el7)
=======================
Bucket:                          fail-safe
Started:                         Thu Aug  4 18:19:57 CEST 2016
Kernel:                          3.10.0-483.el7.x86_64
Architecture:                    x86_64
Mode:                            64
Hostname:                        cc-v0b.lab.eng.brq.redhat.com
Profile:                         capp
selinux-policy version:          selinux-policy-3.13.1-92.el7.noarch

SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      28

[0] pam_loginuid success PASS
[1] pam_loginuid fail PASS
[2] max_log_file keep_logs PASS
[3] max_log_file rotate PASS
[4] max_log_file suspend PASS
[5] max_log_file syslog PASS
[6] admin_space_left halt PASS
[7] admin_space_left single PASS
[8] admin_space_left suspend PASS
[9] admin_space_left syslog PASS
[10] space_left halt PASS
[11] space_left single PASS
[12] space_left suspend PASS
[13] space_left syslog PASS
[14] disk_full_action halt PASS
[15] disk_full_action single PASS
[16] disk_full_action suspend PASS
[17] disk_full_action syslog PASS
[18] admin_space_left email PASS
[19] space_left email PASS

  20 pass (100%)
   0 fail (0%)
   0 error (0%)
------------------
  20 total (in 26s)

Comment 10 errata-xmlrpc 2016-11-04 06:13:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2418.html