Bug 2190166

Summary: Auditd.service "queue overflow" messages observed in compute nodes after node-provision (or) openstack-deployment
Product: Red Hat OpenStack
Reporter: Vijayalakshmi Candappa <vcandapp>
Component: openstack-tripleo-common
Assignee: Julia Kreger <jkreger>
Status: CLOSED ERRATA
QA Contact: James E. LaBarre <jlabarre>
Severity: low
Priority: high
Version: 17.1 (Wallaby)
CC: hjensas, jkreger, jlabarre, mburns, nobody, rjarry, sbaker, slinaber
Target Milestone: ga
Keywords: Triaged
Target Release: 17.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-common-15.4.1-1.20230505011956.00bc21d.el9ost
Doc Type: No Doc Update
Last Closed: 2023-08-16 01:14:48 UTC
Type: Bug

Description Vijayalakshmi Candappa 2023-04-27 11:15:53 UTC
Description of problem:
After node provisioning or overcloud deployment, the error message "audit: kauditd hold queue overflow" appears in dmesg on the bare-metal (BM) compute nodes. In addition, the auditd.service status shows an error:

[tripleo-admin@computehwoffload-r730 ~]$ sudo systemctl status auditd.service
● auditd.service - Security Auditing Service
     Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; preset: enabled)
     Active: active (running) since Thu 2023-04-27 08:24:50 UTC; 2h 25min ago
       Docs: man:auditd(8)
             https://github.com/linux-audit/audit-documentation
   Main PID: 1351 (auditd)
      Tasks: 2 (limit: 838860)
     Memory: 8.2M
        CPU: 376ms
     CGroup: /system.slice/auditd.service
             └─1351 /sbin/auditd

Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: pid 1351
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: rate_limit 0
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_limit 8192
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: lost 135
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog 0
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_wait_time 60000
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_wait_time_actual 0
Apr 27 08:24:50 computehwoffload-r730 systemd[1]: Started Security Auditing Service.
Apr 27 08:28:25 computehwoffload-r730 auditd[1351]: Error receiving audit netlink packet (No buffer space available)
Apr 27 10:35:20 computehwoffload-r730 auditd[1351]: Audit daemon rotating log files
[tripleo-admin@computehwoffload-r730 ~]$ 
[tripleo-admin@computehwoffload-r730 ~]$ sudo systemctl restart auditd.service
Failed to restart auditd.service: Operation refused, unit auditd.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status auditd.service' for details.


Version-Release number of selected component (if applicable):
RHEL: Red Hat Enterprise Linux release 9.2 Beta (Plow)
Puddle: RHOS-17.1-RHEL-9-20230419.n.1

How reproducible:
Always

Steps to Reproduce:
1. "openstack overcloud node provision ...."
2. Log in to the compute node and check the dmesg logs

Actual results:
Observed message: "audit: kauditd hold queue overflow"

Expected results:
The message should not appear. It is unclear whether it has any impact, but auditd.service cannot be restarted manually. It is also unclear whether the audit_backlog_limit buffer size can be increased.

Additional info:
This behaviour didn't seem to be observed in OSP17.0
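
For triage, the counters that augenrules logged above (lost, backlog_limit, backlog) can also be read on demand with "auditctl -s". The following is a minimal sketch, not part of the fix; audit_field is a hypothetical helper name, and it assumes the same "name value" line format shown in the journal output above.

```shell
# Print the value of one field from "auditctl -s" style output read on stdin.
# audit_field is a hypothetical helper, not part of the audit package.
audit_field() {
    # Exact match on the first column, so "backlog" does not match "backlog_limit".
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Example on a live node (needs root):
#   sudo auditctl -s | audit_field lost
#   sudo auditctl -s | audit_field backlog_limit
```

A non-zero and growing "lost" value would suggest that audit records are still being dropped.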

Comment 3 Harald Jensås 2023-05-02 09:42:00 UTC
https://access.redhat.com/solutions/5736961


It looks like 8192 is the value that auditd sets after the service starts, which is substantially more than the kernel default of 64.
We could also increase the limit by adding audit_backlog_limit=8192 to the kernel command line.

File: /etc/audit/rules.d/audit.rules - contains:

## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192


If we don't set audit=1 some processes will not be auditable without a reboot.
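
To see which of these parameters a node actually booted with, the kernel command line can be inspected. This is a minimal sketch under assumptions: cmdline_param is a hypothetical helper, and the grubby command in the comments assumes grubby manages the boot entries on the node; it is not necessarily how the fix was implemented.

```shell
# Print the value of a key=value parameter from a kernel command line string.
# cmdline_param is a hypothetical helper; it prints nothing if the key is absent.
cmdline_param() {
    # $1 = command line text, $2 = parameter name
    printf '%s\n' "$1" | tr ' ' '\n' |
        awk -F= -v key="$2" '$1 == key { print $2; exit }'
}

# Example on a live node:
#   cmdline_param "$(cat /proc/cmdline)" audit
#   cmdline_param "$(cat /proc/cmdline)" audit_backlog_limit
#
# To add the parameter persistently (assumption: grubby is in use), then reboot:
#   sudo grubby --update-kernel=ALL --args="audit_backlog_limit=8192"
```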

Comment 4 Harald Jensås 2023-05-02 10:23:16 UTC
Hi Vijayalakshmi,

Do you have this issue reproduced in some lab?

Can you try:
1. Remove audit=1 from the kernel command line and reboot.
   Does it fix the issue with auditd.service not starting?
2. Keep audit=1 but add audit_backlog_limit=8192.
   Does it fix the issue with auditd.service not starting?


Can you grab the file: /etc/audit/auditd.conf and add the content to this BZ?
Also what is the space left on the partition where logs are written?

I wonder if we should send a SIGHUP to the auditd daemon after growvols? (AFAICT we don't use a percentage, so we should not need that.)

       space_left
              If the free space in the filesystem containing log_file drops below this value,
              the audit daemon takes the action specified by space_left_action. If the value of
              space_left is specified as a whole number, it is interpreted as an absolute size
              in megabytes (MiB). If the value is specified as a number between 1 and 99 followed
              by a percentage sign (e.g., 5%), the audit daemon calculates the absolute size in
              megabytes based on the size of the filesystem containing log_file. (E.g., if the
              filesystem containing log_file is 2 gigabytes in size, and space_left is set to 25%,
              then the audit daemon sets space_left to approximately 500 megabytes.)
              Note that this calculation is performed when the audit daemon starts, so if you resize
              the filesystem containing log_file while the audit daemon is running, you should send
              the audit daemon SIGHUP to re-read the configuration file and recalculate the correct
              percentage.
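
The percentage arithmetic in the man page excerpt can be sketched as follows. This is an approximation for illustration only: space_left_mib is a hypothetical helper using integer arithmetic, and auditd's own rounding may differ. The commented commands assume the log directory is /var/log/audit.

```shell
# Approximate the absolute space_left threshold (in MiB) that a percentage implies.
# space_left_mib is a hypothetical helper, not an auditd tool.
space_left_mib() {
    # $1 = filesystem size in MiB, $2 = percentage (1-99)
    echo $(( $1 * $2 / 100 ))
}

# Man page example: a 2 GiB filesystem with space_left = 25%
#   space_left_mib 2048 25
#
# On a live node (assumption: logs live under /var/log/audit):
#   size=$(df -BM --output=size /var/log/audit | tail -n 1 | tr -dc '0-9')
#   space_left_mib "$size" 25
#
# After resizing the filesystem, ask auditd to recalculate:
#   sudo kill -HUP "$(pidof auditd)"
```

For the 2 GiB example this gives 512 MiB, which matches the "approximately 500 megabytes" in the man page.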

Comment 16 James E. LaBarre 2023-07-14 19:19:15 UTC
Sorry, modified the wrong field (the tabs opened in reverse order from my checklist...).  Couldn't see what I would be able to validate on it.  I presume I'm setting this correctly.

Comment 17 James E. LaBarre 2023-07-18 15:29:30 UTC
Verify error/audit messages no longer show up in dmesg.

From Undercloud, run dmesg for each compute and check output.

==========================================

(undercloud) [stack@undercloud-0 ~]$ for z in 0 1 2 3 4 5
> do
> echo "Host: compute-${z}"
> ssh tripleo-admin@compute-${z}.ctlplane "sudo dmesg | grep -E 'kauditd|overflow|Hostname'"
> done
Host: compute-0
Warning: Permanently added 'compute-0.ctlplane' (ED25519) to the list of known hosts.
Host: compute-1
Warning: Permanently added 'compute-1.ctlplane' (ED25519) to the list of known hosts.
Host: compute-2
Warning: Permanently added 'compute-2.ctlplane' (ED25519) to the list of known hosts.
Host: compute-3
Warning: Permanently added 'compute-3.ctlplane' (ED25519) to the list of known hosts.
Host: compute-4
Warning: Permanently added 'compute-4.ctlplane' (ED25519) to the list of known hosts.
Host: compute-5
Warning: Permanently added 'compute-5.ctlplane' (ED25519) to the list of known hosts.
(undercloud) [stack@undercloud-0 ~]$
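
The same check can be turned into an explicit pass/fail per host. This is a minimal sketch, assuming the hostnames used in the loop above; has_kauditd_overflow is a hypothetical helper that only tests dmesg text read on stdin.

```shell
# Return 0 if dmesg-style text on stdin contains a kauditd overflow message.
# has_kauditd_overflow is a hypothetical helper name.
has_kauditd_overflow() {
    grep -Eq 'kauditd.*overflow'
}

# Example from the undercloud (hostnames as in the loop above):
#   for z in 0 1 2 3 4 5; do
#       if ssh tripleo-admin@compute-${z}.ctlplane "sudo dmesg" | has_kauditd_overflow; then
#           echo "compute-${z}: FAIL (overflow message present)"
#       else
#           echo "compute-${z}: OK"
#       fi
#   done
```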

Comment 23 errata-xmlrpc 2023-08-16 01:14:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577