2190166 – Auditd.service "queue overflow" messages observed in compute nodes after node-provision (or) openstack-deployment


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-common
Sub Component:
Version:	17.1 (Wallaby)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	low
Target Milestone:	ga
Target Release:	17.1
Assignee:	Julia Kreger
QA Contact:	James E. LaBarre
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-04-27 11:15 UTC by Vijayalakshmi Candappa
Modified:	2023-08-16 01:15 UTC
CC List:	8 users
Fixed In Version:	openstack-tripleo-common-15.4.1-1.20230505011956.00bc21d.el9ost
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-08-16 01:14:48 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	881737	None	MERGED	Remove audit=1 as it can overflow before the service starts	2023-05-19 05:52:57 UTC
Red Hat Issue Tracker	OSP-24616	None	None	None	2023-04-27 11:17:32 UTC
Red Hat Product Errata	RHEA-2023:4577	None	None	None	2023-08-16 01:15:18 UTC

Description Vijayalakshmi Candappa 2023-04-27 11:15:53 UTC

Description of problem:
After node provisioning or overcloud deployment, the error message "audit: kauditd hold queue overflow" is observed in dmesg on bare-metal (BM) compute nodes. The auditd.service status also shows an error:

[tripleo-admin@computehwoffload-r730 ~]$ sudo systemctl status auditd.service
● auditd.service - Security Auditing Service
     Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; preset: enabled)
     Active: active (running) since Thu 2023-04-27 08:24:50 UTC; 2h 25min ago
       Docs: man:auditd(8)
             https://github.com/linux-audit/audit-documentation
   Main PID: 1351 (auditd)
      Tasks: 2 (limit: 838860)
     Memory: 8.2M
        CPU: 376ms
     CGroup: /system.slice/auditd.service
             └─1351 /sbin/auditd

Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: pid 1351
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: rate_limit 0
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_limit 8192
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: lost 135
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog 0
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_wait_time 60000
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_wait_time_actual 0
Apr 27 08:24:50 computehwoffload-r730 systemd[1]: Started Security Auditing Service.
Apr 27 08:28:25 computehwoffload-r730 auditd[1351]: Error receiving audit netlink packet (No buffer space available)
Apr 27 10:35:20 computehwoffload-r730 auditd[1351]: Audit daemon rotating log files
[tripleo-admin@computehwoffload-r730 ~]$ 
[tripleo-admin@computehwoffload-r730 ~]$ sudo systemctl restart auditd.service
Failed to restart auditd.service: Operation refused, unit auditd.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status auditd.service' for details.
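
The "lost" value in the status lines above is the counter of audit records the kernel dropped. A minimal sketch for pulling one field out of that status text follows, assuming input in the `auditctl -s` format of one "name value" pair per line. The function name audit_status_field is hypothetical, not part of the audit tooling.

```shell
#!/bin/sh
# Print the value of one field (default: "lost") from `auditctl -s`-style
# status text supplied on stdin. Prints nothing if the field is absent.
# audit_status_field is a hypothetical helper name used for illustration.
audit_status_field() {
    field="${1:-lost}"
    # Exact match on the first column, so "backlog" does not match
    # "backlog_limit".
    awk -v f="$field" '$1 == f { print $2; exit }'
}

# Example on the status values quoted in this report:
printf 'backlog_limit 8192\nlost 135\nbacklog 0\n' | audit_status_field lost
# prints 135

# On a live node (needs root):
#   sudo auditctl -s | audit_status_field lost
```

A non-zero value that keeps growing after boot would suggest records are still being dropped, whereas a fixed value is consistent with an overflow that happened only before auditd started.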


Version-Release number of selected component (if applicable):
RHEL: Red Hat Enterprise Linux release 9.2 Beta (Plow)
Puddle: RHOS-17.1-RHEL-9-20230419.n.1

How reproducible:
Always

Steps to Reproduce:
1. "openstack overcloud node provision ...."
2. Log in to a compute node and check the dmesg logs

Actual results:
Observed message: "audit: kauditd hold queue overflow"

Expected results:
Not sure if this has any impact, but I am not able to restart auditd.service. Also not sure whether we can increase the audit backlog buffer size (audit_backlog_limit).

Additional info:
This behaviour didn't seem to be observed in OSP17.0

Comment 3 Harald Jensås 2023-05-02 09:42:00 UTC

https://access.redhat.com/solutions/5736961


Looks like 8192 is the default that would be set by auditd after the service starts - it is substantially more than the default 64 ...
We could increase by adding audit_backlog_limit=8192 on the kernel command line as well.

File: /etc/audit/rules.d/audit.rules - contains:

## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192
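
To check which backlog limit a rules file would request, something like the following sketch could be used. It assumes the rule is written as "-b <number>" on its own line, and that a later "-b" overrides an earlier one because rules are applied in order. The function name rules_backlog_limit is hypothetical.

```shell
#!/bin/sh
# Print the last "-b <n>" backlog limit found in audit rules text on stdin.
# Lines whose first word is not "-b" (including "##" comments) are ignored.
# Prints nothing if no -b rule is present.
# rules_backlog_limit is a hypothetical helper name used for illustration.
rules_backlog_limit() {
    awk '$1 == "-b" && $2 ~ /^[0-9]+$/ { v = $2 } END { if (v != "") print v }'
}

# Example on the file content quoted above:
printf '%s\n' '## Increase the buffers to survive stress events.' '-b 8192' \
    | rules_backlog_limit
# prints 8192

# On a node (needs root to read the file):
#   sudo cat /etc/audit/rules.d/audit.rules | rules_backlog_limit
```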


If we don't set audit=1 some processes will not be auditable without a reboot.

Comment 4 Harald Jensås 2023-05-02 10:23:16 UTC

Hi Vijayalakshmi,

Do you have this issue reproduced in some lab?

Can you try:
1. to remove the audit=1 on a reboot.
Does it fix the issue with auditd.service not starting?
2. Keep audit=1 but add audit_backlog_limit=8192.
Does it fix the issue with auditd.service not starting?
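
To confirm which of the two variants a node actually booted with, the kernel command line can be inspected. This is a minimal read-only sketch, assuming "name=value" words separated by whitespace as in /proc/cmdline. The function name cmdline_param and the sample string are hypothetical.

```shell
#!/bin/sh
# Print the value of a "name=value" kernel command-line parameter.
#   $1 = parameter name
#   $2 = command-line string (for example the contents of /proc/cmdline)
# If the parameter appears more than once, the last occurrence is printed.
# Prints nothing and returns 1 if the parameter is absent.
# cmdline_param is a hypothetical helper name used for illustration.
cmdline_param() {
    name="$1"
    value=""
    found=1
    for word in $2; do
        case "$word" in
            "$name="*) value="${word#*=}"; found=0 ;;
        esac
    done
    if [ "$found" -eq 0 ]; then
        printf '%s\n' "$value"
    fi
    return "$found"
}

# Hypothetical sample command line for variant 2 (audit=1 kept, limit added):
sample='root=/dev/vda1 ro audit=1 audit_backlog_limit=8192'
cmdline_param audit "$sample"                 # prints 1
cmdline_param audit_backlog_limit "$sample"   # prints 8192

# On a node:
#   cmdline_param audit "$(cat /proc/cmdline)"
```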

Can you grab the file: /etc/audit/auditd.conf and add the content to this BZ?
Also what is the space left on the partition where logs are written?

I wonder if we should do a SIGHUP to the auditd daemon after growvols? (AFAICT we don't use a percentage, so we should not need that)

space_left
If the free space in the filesystem containing log_file drops below this value,
the audit daemon takes the action specified by space_left_action. If the value of
space_left is specified as a whole number, it is interpreted as an absolute size
in megabytes (MiB). If the value is specified as a number between 1 and 99 followed
by a percentage sign (e.g., 5%), the audit daemon calculates the absolute size in
megabytes based on the size of the filesystem containing log_file. (E.g., if the
filesystem containing log_file is 2 gigabytes in size, and space_left is set to 25%,
then the audit daemon sets space_left to approximately 500 megabytes.)
Note that this calculation is performed when the audit daemon starts, so if you resize
the filesystem containing log_file while the audit daemon is running, you should send
the audit daemon SIGHUP to re-read the configuration file and recalculate the correct
percentage.
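
For reference, the percentage case described in the excerpt above can be sketched with integer arithmetic. This only illustrates why the value goes stale after a resize. The exact rounding auditd applies is an assumption here, and the function name space_left_mib is hypothetical.

```shell
#!/bin/sh
# Convert a space_left percentage into MiB for a filesystem of a given size.
#   $1 = filesystem size in MiB
#   $2 = percentage (1-99)
# Uses integer arithmetic; auditd's own rounding may differ (assumption).
# space_left_mib is a hypothetical helper name used for illustration.
space_left_mib() {
    echo $(( $1 * $2 / 100 ))
}

space_left_mib 2048 25   # 2 GiB filesystem at 25%: prints 512
space_left_mib 8192 25   # same percentage after growing to 8 GiB: prints 2048

# If a percentage were in use, the daemon would keep the value calculated at
# startup until it re-reads its configuration, for example (needs root):
#   sudo kill -HUP "$(pidof auditd)"
```

The first result matches the "approximately 500 megabytes" example in the excerpt. With an absolute space_left value, as noted in the comment above, no recalculation is needed after growvols.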

Comment 16 James E. LaBarre 2023-07-14 19:19:15 UTC

Sorry, modified the wrong field (the tabs opened in reverse order from my checklist...).  Couldn't see what I would be able to validate on it.  I presume I'm setting this correctly.

Comment 17 James E. LaBarre 2023-07-18 15:29:30 UTC

Verify error/audit messages no longer show up in dmesg.

From Undercloud, run dmesg for each compute and check output.

==========================================

(undercloud) [stack@undercloud-0 ~]$ for z in 0 1 2 3 4 5
> do
> echo "Host: compute-${z}"
> ssh tripleo-admin@compute-${z}.ctlplane "sudo dmesg | grep -E 'kauditd|overflow|Hostname'"
> done
Host: compute-0
Warning: Permanently added 'compute-0.ctlplane' (ED25519) to the list of known hosts.
Host: compute-1
Warning: Permanently added 'compute-1.ctlplane' (ED25519) to the list of known hosts.
Host: compute-2
Warning: Permanently added 'compute-2.ctlplane' (ED25519) to the list of known hosts.
Host: compute-3
Warning: Permanently added 'compute-3.ctlplane' (ED25519) to the list of known hosts.
Host: compute-4
Warning: Permanently added 'compute-4.ctlplane' (ED25519) to the list of known hosts.
Host: compute-5
Warning: Permanently added 'compute-5.ctlplane' (ED25519) to the list of known hosts.
(undercloud) [stack@undercloud-0 ~]$

Comment 23 errata-xmlrpc 2023-08-16 01:14:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577
