This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to receive updates on this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2218449 - auditd fails to start with "Error setting audit daemon pid (No buffer space available)"
Summary: auditd fails to start with "Error setting audit daemon pid (No buffer space available)"
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Richard Guy Briggs
QA Contact: Kernel General QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-29 07:57 UTC by Renaud Métrich
Modified: 2023-09-25 21:03 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-25 21:03:57 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments (Terms of Use)
Video recording showing how to reproduce (2.36 MB, video/webm)
2023-06-29 08:08 UTC, Renaud Métrich
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   RHEL-9096 0 None Migrated None 2023-09-25 21:03:53 UTC
Red Hat Issue Tracker RHELPLAN-161159 0 None None None 2023-06-29 08:21:43 UTC

Description Renaud Métrich 2023-06-29 07:57:59 UTC
Description of problem:

We have a customer whose auditd service repeatedly fails to start at boot with the following messages:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
... auditd[987]: Error receiving audit netlink packet (No buffer space available)
... auditd[987]: Error setting audit daemon pid (No buffer space available)
... auditd[987]: Unable to set audit pid, exiting
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This happens even when setting audit_backlog_limit=16384 on the kernel command line.
Because the service is configured to restart automatically, the next start succeeds and auditd then appears to be stable.

ENOBUFS is expected when the netlink socket gets out of sync, which should not happen here with such a large backlog, especially since no rules are loaded at all yet (we booted the initramfs and are now just after switch-root).

A few seconds before auditd starts (and fails), we can see message suppressions:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
... kernel: kauditd_printk_skb: 161 callbacks suppressed
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

This looks closely related to me: the customer always sees suppressions, but the issue does not reproduce when there are fewer than 100 suppressions.


ANALYSIS AT USERSPACE LEVEL:

strace shows that the AUDIT_SET/AUDIT_STATUS_PID message is sent, then the reply is ENOBUFS:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
08:14:30.025976 sendto(3<NETLINK:[AUDIT:1274]>, [{nlmsg_len=60, nlmsg_type=AUDIT_SET, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK, nlmsg_seq=3, nlmsg_pid=0},
"\x04\x00\x00...\x00\xfa\x04\x00..."], 60, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 60 <0.000110>
 ^^^^               ^^^^^^^^
 AUDIT_STATUS_PID   PID of auditd
 :
08:14:30.026106 recvfrom(3<NETLINK:[AUDIT:1274]>, 0x7ffc23e7cd90, 8988, MSG_PEEK|MSG_DONTWAIT, 0x7ffc23e7a9dc, [12]) = -1 ENOBUFS (No buffer space available) <0.000001>
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

audit_set_pid() calls audit_send(), which in turn calls __audit_send(); the latter waits for the acknowledgment by calling check_ack():

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
 435 int audit_set_pid(int fd, uint32_t pid, rep_wait_t wmode)
 436 {
 437         struct audit_status s;
 438         struct audit_reply rep;
 439         struct pollfd pfd[1];
 440         int rc;
 441 
 442         memset(&s, 0, sizeof(s));
 443         s.mask    = AUDIT_STATUS_PID;
 444         s.pid     = pid;
 445         rc = audit_send(fd, AUDIT_SET, &s, sizeof(s));
 446         if (rc < 0) {
 447                 audit_msg(audit_priority(errno),
 448                         "Error setting audit daemon pid (%s)",
 449                         strerror(-rc));
 450                 return rc;
 451         }
  :

255 int audit_send(int fd, int type, const void *data, unsigned int size)
256 {
 :
260         rc = __audit_send(fd, type, data, size, &seq);
 :

204 int __audit_send(int fd, int type, const void *data, unsigned int size, int *seq)
205 {
 :
239         do {
240                 retval = sendto(fd, &req, req.nlh.nlmsg_len, 0,
241                         (struct sockaddr*)&addr, sizeof(addr));
242         } while (retval < 0 && errno == EINTR);
243         if (retval == (int)req.nlh.nlmsg_len)
244                 return check_ack(fd);
 :

271 static int check_ack(int fd)
272 {
 :
288         rc = audit_get_reply(fd, &rep, GET_REPLY_NONBLOCKING, MSG_PEEK);
 :
292         } else if (rc < 0)
293                 return rc;
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

check_ack() fails with ENOBUFS, which is returned by the kernel.
At the auditd level, I'm convinced auditd is just a victim of the kernel's behavior, assuming the reply is expected to be first in the netlink socket receive buffer.


(EARLY) ANALYSIS AT KERNEL LEVEL:

We have the following function being entered to process the message:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
1204 static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
1205 {
 :
1243         case AUDIT_SET: {
1244                 struct audit_status     s;
 :
1258                 if (s.mask & AUDIT_STATUS_PID) {
 :
1293                         if (new_pid) {
1294                                 /* register a new auditd connection */
1295                                 err = auditd_set(req_pid,
1296                                                  NETLINK_CB(skb).portid,
1297                                                  sock_net(NETLINK_CB(skb).sk));
1298                                 if (audit_enabled != AUDIT_OFF)
1299                                         audit_log_config_change("audit_pid",
1300                                                                 new_pid,
1301                                                                 auditd_pid,
1302                                                                 err ? 0 : 1);
1303                                 if (err)
1304                                         return err;
1305 
1306                                 /* try to process any backlog */
1307                                 wake_up_interruptible(&kauditd_wait);
1308                         } else {
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

On line 1295, auditd_set() is called, then later on line 1307 the backlog thread is woken up.
It's unclear to me who sends the reply back; I don't see this in auditd_set() at all.
audit_send_reply() doesn't seem to be called either.

I believe there is a race between waking up the backlog thread and sending the reply:
because the backlog would fill the socket buffer, ENOBUFS would somehow be returned before the AUDIT_SET reply has been queued.

This is just a hypothesis for now; I'm not able to find out how and by whom the reply is actually crafted and sent.

Version-Release number of selected component (if applicable):

RHEL9.2 kernels including latest 5.14.0-284.18.1.el9_2.x86_64

How reproducible:

Frequently on the customer's system.
I can reproduce the issue on my QEMU/KVM VM quite reliably with audit_backlog_limit=320, but definitely not with the customer's settings.

Steps to Reproduce:
1. Install a QEMU/KVM with minimal profile

   1 CPU / 2GB memory

2. Disable automatic startup and automatic restart of the auditd service, and set some kernel command line parameters

   # systemctl disable auditd
   # mkdir -p /etc/systemd/system/auditd.service.d
   # echo -e "[Service]\nRestart=no" > /etc/systemd/system/auditd.service.d/no_restart.conf
   # grubby --update-kernel ALL --args="audit=1 audit_rate_limit=0 audit_backlog_limit=320 log_buf_len=15M ignore_loglevel printk.devkmsg=on"
   # reboot

3. Start the auditd service by hand just after suppressions are seen on the console at the login prompt (~10 seconds after the login prompt appears)

   $ ssh vm-rhel9 systemctl start auditd
   Job for auditd.service failed because the control process exited with error code.
   See "systemctl status auditd.service" and "journalctl -xeu auditd.service" for details.

   $ ssh vm-rhel9 journalctl -b -u auditd
   Jun 29 09:52:20 vm-rhel9 systemd[1]: Starting Security Auditing Service...
   Jun 29 09:52:20 vm-rhel9 auditd[1147]: No plugins found, not dispatching events
   Jun 29 09:52:20 vm-rhel9 auditd[1147]: Error receiving audit netlink packet (No buffer space available)
   Jun 29 09:52:20 vm-rhel9 auditd[1147]: Error setting audit daemon pid (No buffer space available)
   Jun 29 09:52:20 vm-rhel9 auditd[1147]: Unable to set audit pid, exiting
   Jun 29 09:52:20 vm-rhel9 auditd[1146]: Cannot daemonize (Success)
   Jun 29 09:52:20 vm-rhel9 auditd[1146]: The audit daemon is exiting.
   Jun 29 09:52:20 vm-rhel9 systemd[1]: auditd.service: Control process exited, code=exited, status=1/FAILURE

Actual results:

See above

Expected results:

auditd started properly

Comment 1 Renaud Métrich 2023-06-29 08:07:59 UTC
EDIT:

I can reproduce even with a large audit buffer (audit_backlog_limit=16384); the size of the buffer has no impact, which confirms the issue is not related to the buffer being full.

The important thing for reproducing is to start the auditd service just after seeing the suppression messages on the console, e.g.:

-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
vm-rhel9 login: [   14.622648] kauditd_printk_skb: 190 callbacks suppressed
[   14.622650] audit: type=1131 audit(1688025723.078:143): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Immediately after seeing them, start auditd and it will fail 100% of the time for me:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
$ ssh vm-rhel9 systemctl start auditd
Job for auditd.service failed because the control process exited with error code.
See "systemctl status auditd.service" and "journalctl -xeu auditd.service" for details.
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

See the attached video recording (VM serial console on the top, prepared command on the bottom).

Comment 2 Renaud Métrich 2023-06-29 08:08:55 UTC
Created attachment 1973132 [details]
Video recording showing how to reproduce

Comment 3 Renaud Métrich 2023-06-29 12:03:08 UTC
The code ack'ing the AUDIT_SET was right in front of my blurry eyes:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
1542 static void audit_receive(struct sk_buff  *skb)
1543 {
 :
1556         while (nlmsg_ok(nlh, len)) {
1557                 err = audit_receive_msg(skb, nlh);
1558                 /* if err or if this message says it wants a response */
1559                 if (err || (nlh->nlmsg_flags & NLM_F_ACK))
1560                         netlink_ack(skb, nlh, err, NULL);
1561 
1562                 nlh = nlmsg_next(nlh, &len);
1563         }
 :
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------

Comment 4 RHEL Program Management 2023-09-25 20:07:35 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 5 RHEL Program Management 2023-09-25 21:03:57 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

