Description of problem:
After node-provision (or) overcloud deployment, error message(s) "audit: kauditd hold queue overflow" are observed in dmesg of BM compute nodes. Also, the auditd.service status shows error:

[tripleo-admin@computehwoffload-r730 ~]$ sudo systemctl status auditd.service
● auditd.service - Security Auditing Service
     Loaded: loaded (/usr/lib/systemd/system/auditd.service; enabled; preset: enabled)
     Active: active (running) since Thu 2023-04-27 08:24:50 UTC; 2h 25min ago
       Docs: man:auditd(8)
             https://github.com/linux-audit/audit-documentation
   Main PID: 1351 (auditd)
      Tasks: 2 (limit: 838860)
     Memory: 8.2M
        CPU: 376ms
     CGroup: /system.slice/auditd.service
             └─1351 /sbin/auditd

Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: pid 1351
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: rate_limit 0
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_limit 8192
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: lost 135
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog 0
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_wait_time 60000
Apr 27 08:24:50 computehwoffload-r730 augenrules[1365]: backlog_wait_time_actual 0
Apr 27 08:24:50 computehwoffload-r730 systemd[1]: Started Security Auditing Service.
Apr 27 08:28:25 computehwoffload-r730 auditd[1351]: Error receiving audit netlink packet (No buffer space available)
Apr 27 10:35:20 computehwoffload-r730 auditd[1351]: Audit daemon rotating log files
[tripleo-admin@computehwoffload-r730 ~]$

[tripleo-admin@computehwoffload-r730 ~]$ sudo systemctl restart auditd.service
Failed to restart auditd.service: Operation refused, unit auditd.service may be requested by dependency only (it is configured to refuse manual start/stop).
See system logs and 'systemctl status auditd.service' for details.
Version-Release number of selected component (if applicable):
RHEL: Red Hat Enterprise Linux release 9.2 Beta (Plow)
Puddle: RHOS-17.1-RHEL-9-20230419.n.1

How reproducible:
Always

Steps to Reproduce:
1. "openstack overcloud node provision ...."
2. Log in to the compute node and check the dmesg logs
3.

Actual results:
Observed message: "audit: kauditd hold queue overflow"

Expected results:
Not sure if this has any impact, but I am not able to restart auditd.service. Also, not sure if we can increase the buffer size via audit_backlog_limit.

Additional info:
This behaviour was not observed in OSP17.0.
https://access.redhat.com/solutions/5736961

Looks like 8192 is the default that would be set by auditd after the service starts - it is substantially more than the default 64 ...

We could increase by adding audit_backlog_limit=8192 on the kernel command line as well.

File: /etc/audit/rules.d/audit.rules - contains:
## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192

If we don't set audit=1 some processes will not be auditable without a reboot.
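To see which of these settings a node actually booted with, something like the following could be used. It is only a sketch: the helper name cmdline_value is made up for illustration, and it assumes the parameters appear as plain key=value words on the kernel command line.

```shell
# Hypothetical helper: print the value of a key=value parameter found in a
# kernel command line string. Prints nothing and returns 1 if it is absent.
cmdline_value() {
    key="$1"
    cmdline="$2"
    # Unquoted on purpose so the command line is split into words.
    for word in $cmdline; do
        case "$word" in
            "$key"=*)
                printf '%s\n' "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Usage on a live node (reads the running kernel's command line):
#   cmdline_value audit "$(cat /proc/cmdline)" || echo "audit= not set"
#   cmdline_value audit_backlog_limit "$(cat /proc/cmdline)" || echo "audit_backlog_limit= not set"
```

Running both checks on an affected compute node would show whether audit=1 is set without a matching audit_backlog_limit, which is the combination discussed above.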
Hi Vijayalakshmi,

Do you have this issue reproduced in some lab? Can you try:
1. Remove audit=1 and reboot. Does it fix the issue with auditd.service not starting?
2. Keep audit=1 but add audit_backlog_limit=8192. Does it fix the issue with auditd.service not starting?

Can you grab the file /etc/audit/auditd.conf and add the content to this BZ? Also, how much space is left on the partition where logs are written?

I wonder if we should send a SIGHUP to the auditd daemon after growvols? (AFAICT we don't use a percentage, so we should not need that.) From the auditd.conf man page:

space_left
    If the free space in the filesystem containing log_file drops below this value, the audit daemon takes the action specified by space_left_action. If the value of space_left is specified as a whole number, it is interpreted as an absolute size in megabytes (MiB). If the value is specified as a number between 1 and 99 followed by a percentage sign (e.g., 5%), the audit daemon calculates the absolute size in megabytes based on the size of the filesystem containing log_file. (E.g., if the filesystem containing log_file is 2 gigabytes in size, and space_left is set to 25%, then the audit daemon sets space_left to approximately 500 megabytes. Note that this calculation is performed when the audit daemon starts, so if you resize the filesystem containing log_file while the audit daemon is running, you should send the audit daemon SIGHUP to re-read the configuration file and recalculate the correct percentage.)
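For reference, the percentage arithmetic from the man page excerpt can be sketched as below. This only illustrates the calculation and is not auditd's actual code; the function name space_left_mib is made up, and the /var/log/audit path in the comments is an assumption about where log_file lives.

```shell
# Hypothetical helper: convert a percentage space_left value into an absolute
# size in MiB, given the filesystem size in MiB. Integer arithmetic only.
space_left_mib() {
    fs_size_mib="$1"
    percent="$2"
    echo $(( fs_size_mib * percent / 100 ))
}

# Man page example: a 2 GiB (2048 MiB) filesystem with space_left = 25%.
space_left_mib 2048 25

# On a live node the filesystem size could be read with GNU df, assuming the
# audit log is under /var/log/audit:
#   df -BM --output=size /var/log/audit | tail -n 1 | tr -dc '0-9'
# If a percentage were in use and growvols resized the filesystem, the daemon
# would need a SIGHUP to recalculate:
#   sudo kill -HUP "$(pidof auditd)"
```

The example call prints 512, in line with the "approximately 500 megabytes" quoted above. With an absolute space_left value there is nothing to recalculate after a resize.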
Sorry, modified the wrong field (the tabs opened in reverse order from my checklist...). Couldn't see what I would be able to validate on it. I presume I'm setting this correctly.
Verify error/audit messages no longer show up in dmesg.

From Undercloud, run dmesg for each compute and check output.
==========================================
(undercloud) [stack@undercloud-0 ~]$ for z in 0 1 2 3 4 5
> do
> echo "Host: compute-${z}"
> ssh tripleo-admin@compute-${z}.ctlplane "sudo dmesg | grep -E 'kauditd|overflow|Hostname'"
> done
Host: compute-0
Warning: Permanently added 'compute-0.ctlplane' (ED25519) to the list of known hosts.
Host: compute-1
Warning: Permanently added 'compute-1.ctlplane' (ED25519) to the list of known hosts.
Host: compute-2
Warning: Permanently added 'compute-2.ctlplane' (ED25519) to the list of known hosts.
Host: compute-3
Warning: Permanently added 'compute-3.ctlplane' (ED25519) to the list of known hosts.
Host: compute-4
Warning: Permanently added 'compute-4.ctlplane' (ED25519) to the list of known hosts.
Host: compute-5
Warning: Permanently added 'compute-5.ctlplane' (ED25519) to the list of known hosts.
(undercloud) [stack@undercloud-0 ~]$
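The same check could be made to report a pass/fail result per host instead of relying on empty grep output. A minimal sketch follows; the function name dmesg_is_clean is made up for illustration, and the host names in the usage comment are the ones from the loop above.

```shell
# Hypothetical helper: read dmesg text on stdin. Returns 0 when no
# "kauditd hold queue overflow" message is present, 1 otherwise.
# Matching lines are printed so a failure shows what was found.
dmesg_is_clean() {
    if grep -F 'kauditd hold queue overflow'; then
        return 1
    fi
    return 0
}

# Usage from the undercloud:
#   for z in 0 1 2 3 4 5; do
#       ssh tripleo-admin@compute-${z}.ctlplane "sudo dmesg" | dmesg_is_clean \
#           && echo "compute-${z}: clean" \
#           || echo "compute-${z}: overflow seen"
#   done
```

This narrows the match to the exact message from the original report, whereas the grep pattern used in the verification above is broader.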
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2023:4577