Bug 2140163 - error 524 from seccomp(2) when trying to load filter
Summary: error 524 from seccomp(2) when trying to load filter
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel
Version: 8.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
: ---
Assignee: Yauheni Kaliuta
QA Contact: Eirik Fuller
URL:
Whiteboard:
Depends On:
Blocks: 2152138 2152139
 
Reported: 2022-11-04 17:05 UTC by Kir Kolyshkin
Modified: 2023-09-19 08:17 UTC (History)
CC List: 12 users

Fixed In Version: kernel-4.18.0-452.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2152138 2152139 (view as bug list)
Environment:
Last Closed: 2023-05-16 08:55:50 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
payload to reproduce on ocp 4.11.0-aarch64 (3.04 KB, text/plain)
2022-11-15 20:00 UTC, aleskandro
no flags Details
packages diff (rhcos in ocp 4.10.41 vs rhcos in 4.11.0) (14.78 KB, text/plain)
2022-11-15 20:10 UTC, aleskandro
no flags Details
dmesg (142.73 KB, text/plain)
2022-11-17 13:18 UTC, aleskandro
no flags Details
The runtime/default seccomp profile (19.41 KB, text/plain)
2022-11-22 15:05 UTC, aleskandro
no flags Details
The allow_by_default bpf program (212 bytes, text/plain)
2022-11-22 15:06 UTC, aleskandro
no flags Details
C snippet to reproduce (1.21 KB, text/x-csrc)
2022-11-22 15:14 UTC, aleskandro
no flags Details
C snippet to reproduce (1.56 KB, text/x-csrc)
2022-11-23 11:21 UTC, aleskandro
no flags Details
dmesg output for an arm64 m6g.2xlarge machine running the code in comment #16 with the script in comment#15 (55.79 KB, text/plain)
2022-11-23 11:22 UTC, aleskandro
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/rhel/src/kernel rhel-8 merge_requests 3907 0 None None None 2022-12-06 18:43:10 UTC
Red Hat Issue Tracker RHELPLAN-138386 0 None None None 2022-11-04 17:09:12 UTC
Red Hat Knowledge Base (Solution) 7030968 0 None None None 2023-09-19 08:17:46 UTC
Red Hat Product Errata RHSA-2023:2951 0 None None None 2023-05-16 08:56:32 UTC

Description Kir Kolyshkin 2022-11-04 17:05:28 UTC
> Description of problem:

We have observed that, during the upgrade from OCP 4.11 to 4.12, a container sometimes fails to start with the following error from runc:

runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524"
> Version-Release number of selected component (if applicable):

This only happens on arm64; we haven't seen it before.

> How reproducible:

Sometimes.

> Steps to Reproduce:

Alas, I don't have a simple repro. I hope the multiarch and QE teams will help.

> Actual results:

runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524"

> Expected results:

No error (seccomp rules are loaded successfully, the container is started, etc.)

> Version-Release number of selected component (if applicable):

I think that the kernel used is 4.18.0-372.19.1.el8_6.aarch64

runc version 1.1.3 or 1.1.4.

> Additional info:

See the following bugs: 
https://issues.redhat.com/browse/OCPBUGS-1882
https://issues.redhat.com/browse/OCPBUGS-2302
https://issues.redhat.com/browse/OCPBUGS-2637
https://issues.redhat.com/browse/OCPBUGS-708

> runc notes:

What runc does is create an eBPF program to allow specific syscalls (using libseccomp-golang, which uses libseccomp), then patch it to make sure -ENOSYS (rather than the default -EPERM) is returned for unknown syscalls (where sysno > the last known syscall), then load the program into the kernel using either prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog) or, if flags != 0, seccomp(SECCOMP_SET_MODE_FILTER, flags, prog).

This is where it gets errno 524, which is ENOTSUPP, a Linux kernel internal error code (not to be confused with ENOTSUP or EOPNOTSUPP), defined in include/linux/errno.h (i.e. it is not in the uapi). This code is not supposed to be returned to userspace, and yet it is.

Since this only happens on arm64, we think it's a kernel issue.

The runc code in question is in https://github.com/opencontainers/runc/tree/main/libcontainer/seccomp (in particular, ENOSYS patching is done in https://github.com/opencontainers/runc/blob/main/libcontainer/seccomp/patchbpf/enosys_linux.go)
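
To make the failure mode concrete, here is a minimal, self-contained sketch of that flow (my illustration, not runc's actual code; LAST_KNOWN_SYSCALL is a placeholder for the value runc derives via libseccomp). On an affected arm64 node, it is the seccomp(2)/prctl(2) call at the end that comes back with errno 524:

```
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LAST_KNOWN_SYSCALL 451 /* placeholder; runc derives the real value via libseccomp */

static struct sock_filter insns[] = {
	/* A = seccomp_data.nr */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* if (nr > LAST_KNOWN_SYSCALL) return ENOSYS to the caller */
	BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, LAST_KNOWN_SYSCALL, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | ENOSYS),
	/* otherwise allow (a real profile checks individual syscalls here) */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog prog = {
	.len = sizeof(insns) / sizeof(insns[0]),
	.filter = insns,
};

int main(void)
{
	/* Required so an unprivileged process may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		perror("prctl(PR_SET_NO_NEW_PRIVS)");

	/* runc uses seccomp(2) when filter flags are requested, prctl(2) otherwise. */
	if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
		/* On the affected nodes this reports errno 524 (ENOTSUPP). */
		fprintf(stderr, "seccomp: errno %d (%s)\n", errno, strerror(errno));
		return 1;
	}
	printf("filter installed\n");
	return 0;
}
```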

Comment 1 Kir Kolyshkin 2022-11-04 17:13:46 UTC
@rphillips pointed out that BPF selftests on arm64, when enabled initially, revealed a number of cases which fail with -524. See https://lore.kernel.org/bpf/20221021210701.728135-1-chantr4@gmail.com/

Comment 2 aleskandro 2022-11-15 19:59:36 UTC
On OCP, the issue is always reproducible in the following scenario:

- 4.11.13-aarch64
- 3 masters m6g.xlarge
- 2 workers
- 1 tainted worker with either m6g.xlarge, m6g.2xlarge, or m6g.4xlarge as instanceType.
- Using the payload that I'm attaching herewith, consisting of a namespace, an ImageStream, and a deployment with a pod made of 10 containers that sleep.

Steps to reproduce:

1. oc apply -f deployment.yaml
2. oc project my-project
3. oc scale deployment/my-deployment --replicas=45 # or more

Change the replicas parameter so that the tainted worker gets up to 472 containers regardless of the chosen instance type (sometimes I got more containers, but still around that number, +- 10): I counted them by using

oc debug node/my-worker
chroot /host
watch 'echo $(( $(crictl ps | wc -l) - 1 )) - $(find /var/run/crio -type l ! -readable | wc -l)'

After reaching the 472nd (?) created container, the new ones never get created; the pending pods and the journal in the node report:

Error: container create failed: 
time="2022-11-11T20:30:20Z" level=error msg="runc create failed: unable 
to start container process: unable to init seccomp: error loading 
seccomp filter into kernel: error loading seccomp filter: errno 524"


Other info:

- It seems that, when the issue appears, the number of available inodes in the /var/run FS starts to decrease linearly, down to 0. That's because of symbolic links created in /var/run/crio, most of which I found to be broken; they are probably created and not removed when the exception occurs.
- With an old runc version (1.0.3), the issue changes to pods going into CrashLoopBackOff and their logs printing

standard_init_linux.go:216: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: errno 524


- With runc 1.0.3 and only one pod in CrashLoopBackOff, the number of restarts due to the CrashLoopBackOff status is equal to the number of broken symlinks in /var/run/crio.


- The above steps are just one way to reproduce: I'm seeing the linear increase of broken symlinks on some masters as well (with far fewer total containers than this "472").

- The issue doesn't happen on the latest arm64 4.10 nightly: I'm attaching a diff of the installed packages.

- The following ostree overrides were not enough to overcome the issue:
   - kernel-modules-extra kernel kernel-core kernel-modules 4.18.0-372.19.1.el8_6 -> 4.18.0-305.25.1.el8_4
   - libseccomp 2.5.2-1.el8 -> 2.5.1-1.el8
   - runc 3:1.1.2-1.rhaos4.11.el8 -> 1:1.0.3-2.module+el8.6.0+14877+f643d2d6

Comment 3 aleskandro 2022-11-15 20:00:22 UTC
Created attachment 1924530 [details]
payload to reproduce on ocp 4.11.0-aarch64

Comment 4 aleskandro 2022-11-15 20:10:12 UTC
Created attachment 1924532 [details]
packages diff (rhcos in ocp 4.10.41 vs rhcos in 4.11.0)

Comment 5 Yauheni Kaliuta 2022-11-17 09:46:37 UTC
Only on arm? Do you have any unsupported opcode report in dmesg?

Comment 6 Yauheni Kaliuta 2022-11-17 09:49:12 UTC
Ah, it's RHEL8, bpf for non-x86 is in tech preview status there afair. I'll try to help as much as I can, but if it's due to missing implementation, there is not a lot of room for improvement.

Comment 7 aleskandro 2022-11-17 12:19:45 UTC
> Only on arm? Do you have any unsupported opcode report in dmesg?

Hi Yauheni Kaliuta, yes; we haven't seen it on x86_64 OCP yet.

Today I could reproduce the issue on the previous RHCOS version used by OCP 4.10, the one whose packages diff against the RHCOS in OCP 4.11 I posted above.

I added the following stanza to the SecurityContextConstraint (SCC) "restricted", taken from the default SCC in OCP 4.11:

```
seccompProfiles:
- runtime/default 
```

I tried the following custom seccomp profile:

sh-4.4# cat /var/lib/kubelet/seccomp/allowed.json 
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": [
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
      ]
    },
    {
      "architecture": "SCMP_ARCH_AARCH64",
      "subArchitectures": [
        "SCMP_ARCH_ARM"
      ]
    },
    {
      "architecture": "SCMP_ARCH_MIPS64",
      "subArchitectures": [
        "SCMP_ARCH_MIPS",
        "SCMP_ARCH_MIPS64N32"
      ]
    },
    {
      "architecture": "SCMP_ARCH_MIPS64N32",
      "subArchitectures": [
        "SCMP_ARCH_MIPS",
        "SCMP_ARCH_MIPS64"
      ]
    },
    {
      "architecture": "SCMP_ARCH_MIPSEL64",
      "subArchitectures": [
        "SCMP_ARCH_MIPSEL",
        "SCMP_ARCH_MIPSEL64N32"
      ]
    },
    {
      "architecture": "SCMP_ARCH_MIPSEL64N32",
      "subArchitectures": [
        "SCMP_ARCH_MIPSEL",
        "SCMP_ARCH_MIPSEL64"
      ]
    },
    {
      "architecture": "SCMP_ARCH_S390X",
      "subArchitectures": [
        "SCMP_ARCH_S390"
      ]
    }
  ],
  "syscalls": [
  ]
}
sh-4.4# 


I also tried the default profile from the rhaos-4.11-rhel-8 branch of git://pkgs.devel.redhat.com/rpms/cri-o, but nothing changed.


Finally, running a container with Podman doesn't fail either:

podman --runtime /usr/bin/runc run --security-opt seccomp=/var/lib/kubelet/seccomp/rpmrio.json fedora echo "hello"

Comment 8 aleskandro 2022-11-17 13:16:47 UTC
> Do you have any unsupported opcode report in dmesg?

No errors from dmesg. Attaching it as well.

Comment 9 aleskandro 2022-11-17 13:18:04 UTC
Created attachment 1925006 [details]
dmesg

Comment 10 Yauheni Kaliuta 2022-11-17 15:31:27 UTC
Ok, I know barely anything about OCP :( Can you get me the actual bpf program so that I can load it on my setup?

Comment 11 aleskandro 2022-11-22 15:05:28 UTC
Created attachment 1926409 [details]
The runtime/default seccomp profile

Comment 12 aleskandro 2022-11-22 15:06:12 UTC
Created attachment 1926410 [details]
The allow_by_default bpf program

Comment 13 aleskandro 2022-11-22 15:14:45 UTC
Created attachment 1926412 [details]
C snippet to reproduce

I attached the two bpf programs extracted from the runs on OCP. The first one (runtime/default) is the one set by default by OCP 4.11+. The second one is the output of the above allowed.json seccomp profile.

I'm able to reproduce the error on a RHCOS box by using the C snippet I'm also attaching (failure.c).

```
gcc failure.c
./a.out
```

On a RHCOS box, it fails after the 512th installation. On a RHEL box, it failed after the 959th.

You might need to adjust the for loop bounds according to your env.

Comment 14 Yauheni Kaliuta 2022-11-22 20:01:07 UTC
I get an out-of-memory error with the C snippet at `Filter n. 957 installed: retcode: 1`:

[ 1034.778424] vmap allocation for size 131072 failed: use vmalloc=<size> to increase size
[ 1034.779043] a.out: vmalloc: allocation failure: 65536 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
[ 1034.779786] CPU: 3 PID: 16753 Comm: a.out Kdump: loaded Not tainted 4.18.0-438.el8.kpq0.ge543.aarch64 #1
[ 1034.780396] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 1034.780835] Call trace:
[ 1034.780993]  dump_backtrace+0x0/0x178
[ 1034.781233]  show_stack+0x28/0x38
[ 1034.781456]  dump_stack+0x5c/0x74
[ 1034.781660]  warn_alloc+0x10c/0x190
[ 1034.781871]  __vmalloc_node_range+0x218/0x2e0
[ 1034.782144]  bpf_jit_alloc_exec+0x90/0xb0

which is probably expected if the programs are not freed.

From the original report I actually understood that the program is rejected on the first try (so it's not supported).

Is it a valid use case?

Comment 15 aleskandro 2022-11-23 11:18:17 UTC
Hi, I'm uploading a new snippet.

> From the original report I actually got that the program is rejected at the first try (so it's not supported).

Yes, it fails on the first try, but only after a given number of processes (containers) have been spawned. All the processes must be considered concurrent and long-running, and each loads a new instance of the same seccomp profile using a bpf program like the ones I sent.

> Is it a valid use case?

I'm not sure about the code I wrote to reproduce. Testing multiple processes with 1 call each should be closer to what kubelet does via conmon and runc.

On OCP, we don't currently set a maximum limit of containers per host.
Instead, we have a maximum number of pods per host, which is 250 by default; users can raise it up to 500.

In the default case, it's enough to run 250 pods with 2 containers each.
If a user sets the maximum to 500 pods per node, the arm64 clusters will fail with ~500 pods, each with only one container.

The updated code can run on both x86 and arm64 and includes a parameter to set the number of iterations to run.
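
As a rough single-file approximation of the multi-process case (a sketch of my own, not the attached main.c), one can fork N long-lived children that each install one allow-all filter and then sleep, so every child keeps its own JIT'd copy of the program alive:

```
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int nproc = argc > 1 ? atoi(argv[1]) : 1000;

	/* One trivial allow-all instruction; each child gets its own JIT'd copy. */
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = { .len = 1, .filter = insns };

	for (int i = 0; i < nproc; i++) {
		pid_t pid = fork();
		if (pid == 0) {
			prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
			if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
				fprintf(stderr, "child %d: errno %d\n", i, errno);
			pause();   /* stay alive, like a running container */
			_exit(0);
		}
	}
	pause();                   /* parent idles; stop everything with Ctrl+C */
	return 0;
}
```

On the affected kernels the later children should start reporting errno 524 once the JIT region is exhausted, matching the outputs below.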

I'm executing on 2xlarge AWS VMs running RHEL 8.6 (arm64 m6g.2xlarge, x86_64 m5.2xlarge).

They both provide 8 vCPU and 32 GB RAM.

arm64 kernel:
Linux ip-10-0-233-251.us-east-2.compute.internal 4.18.0-369.el8.aarch64 #1 SMP Mon Feb 21 11:02:03 EST 2022 aarch64 aarch64 aarch64 GNU/Linux

amd64 kernel:
Linux ip-10-0-163-230.us-east-2.compute.internal 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

As root, I'm executing the following on both nodes:

gcc main.c # the latest one
FILTERS=1000 # 2500 on amd64
echo "[DEBUG] Single process, ${FILTERS} installs" > /dev/kmsg
./my_binary ${FILTERS}
[CTRL+c]
echo "[DEBUG] ${FILTERS} processes, 1 install" > /dev/kmsg
for i in `seq 1 ${FILTERS}`; do ./my_binary 1 & done
killall my_binary

dmesg on arm64: https://termbin.com/iycm2 (uploading here as well)

output on arm64, single process running 1000 calls to install_filter:
...
Filter n. 960 installed: retcode: 0
Filter n. 961 installed: retcode: 0
Filter n. 962 installed: retcode: 0
Value of errno: 524
prctl(PR_SET_SECCOMP): Unknown error 524
Value of errnum: Unknown error 524
Filter n. 963 installed: retcode: 1
Value of errno: 524
prctl(PR_SET_SECCOMP): Unknown error 524
Value of errnum: Unknown error 524
Filter n. 964 installed: retcode: 1
Value of errno: 524

output on arm64, 1000 processes running 1 call to install_filter:

Value of errno: 524
prctl(PR_SET_SECCOMP): Unknown error 524
Value of errnum: Unknown error 524
Filter n. 0 installed: retcode: 1
Value of errno: 524
prctl(PR_SET_SECCOMP): Unknown error 524
Value of errnum: Unknown error 524
Filter n. 0 installed: retcode: 1
Value of errno: 524
prctl(PR_SET_SECCOMP): Unknown error 524
Value of errnum: Unknown error 524
Filter n. 0 installed: retcode: 1

output on amd64, single process running 2500 calls to install_filter

Filter n. 2182 installed: retcode: 0
Filter n. 2183 installed: retcode: 0
Value of errno: 12
prctl(PR_SET_SECCOMP): Cannot allocate memory
Value of errnum: Cannot allocate memory
Filter n. 2184 installed: retcode: 1
Value of errno: 12
prctl(PR_SET_SECCOMP): Cannot allocate memory
Value of errnum: Cannot allocate memory
Filter n. 2185 installed: retcode: 1
Value of errno: 12
prctl(PR_SET_SECCOMP): Cannot allocate memory
Value of errnum: Cannot allocate memory
Filter n. 2186 installed: retcode: 1
Value of errno: 12

output on amd64, 2500 processes running 1 call to install_filter

All the filters are installed successfully, even with a greater FILTERS value.

dmesg on x86_64:

[root@ip-10-0-163-230 ec2-user]# dmesg
....
[ 3454.110203] [DEBUG] Single process, 2500 installs
[ 3472.846369] [DEBUG] 2500 processes, 1 install
[root@ip-10-0-163-230 ec2-user]#

Comment 16 aleskandro 2022-11-23 11:21:12 UTC
Created attachment 1926629 [details]
C snippet to reproduce

Comment 17 aleskandro 2022-11-23 11:22:36 UTC
Created attachment 1926630 [details]
dmesg output for an arm64 m6g.2xlarge machine running the code in comment #16 with the script in comment#15

Comment 18 Yauheni Kaliuta 2022-11-23 13:51:22 UTC
Ok, thanks, it's a valid use case. I'll think about it and discuss at the bpf meeting what can be done.

Comment 19 Kir Kolyshkin 2022-11-30 01:34:06 UTC
I have some good and bad news.

The bad news is that the RHEL 9 kernel might also be affected; see https://issues.redhat.com/browse/OCPBUGS-2637?focusedCommentId=21274886&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21274886

The good news is that the issue was also reported to the kernel mailing list:

https://lore.kernel.org/all/20221026160051.5340-1-mdecandia@gmail.com/

and it seems there is a patch that fixes it:

https://lore.kernel.org/lkml/20221124001123.3248571-1-risbhat@amazon.com/T/#t

Comment 20 Jiri Benc 2022-12-02 11:16:55 UTC
I guess we need this commit?

commit b89ddf4cca43f1269093942cf5c4e457fd45c335
Author: Russell King <russell.king>
Date:   Fri Nov 5 16:50:45 2021 +0000

    arm64/bpf: Remove 128MB limit for BPF JIT programs

    Commit 91fc957c9b1d ("arm64/bpf: don't allocate BPF JIT programs in module
    memory") restricts BPF JIT program allocation to a 128MB region to ensure
    BPF programs are still in branching range of each other. However this
    restriction should not apply to the aarch64 JIT, since BPF_JMP | BPF_CALL
    are implemented as a 64-bit move into a register and then a BLR instruction -
    which has the effect of being able to call anything without proximity
    limitation.

    The practical reason to relax this restriction on JIT memory is that 128MB of
    JIT memory can be quickly exhausted, especially where PAGE_SIZE is 64KB - one
    page is needed per program. In cases where seccomp filters are applied to
    multiple VMs on VM launch - such filters are classic BPF but converted to
    BPF - this can severely limit the number of VMs that can be launched. In a
    world where we support BPF JIT always on, turning off the JIT isn't always an
    option either.

    Fixes: 91fc957c9b1d ("arm64/bpf: don't allocate BPF JIT programs in module memory")
    Suggested-by: Ard Biesheuvel <ard.biesheuvel>
    Signed-off-by: Russell King <russell.king>
    Signed-off-by: Daniel Borkmann <daniel>
    Tested-by: Alan Maguire <alan.maguire>
    Link: https://lore.kernel.org/bpf/1636131046-5982-2-git-send-email-alan.maguire@oracle.com

Comment 21 Jiri Benc 2022-12-02 11:26:56 UTC
(In reply to Kir Kolyshkin from comment #0)
> This is where it gets errno 524, which is ENOTSUPP, a Linux kernel internal
> error code (not to be confused with ENOTSUP or EOPNOTSUPP), defined in
> include/linux/errno.h (i.e. it is not in the uapi). This code is not
> supposed to be returned to userspace, and yet it is.

You're 100% right.

This is orthogonal to the real issue, though. The root problem is that there's not enough memory available to bpf on aarch64. I suggest we continue with the real underlying bug here.

It would be great to clean up the incorrect ENOTSUPP usage in the bpf code in the kernel. However, the code is plagued with ENOTSUPP and it would not be a small task. The hardest part would likely be convincing the bpf maintainer that it is incorrect. Feel free to file a separate bug for that but don't expect much, interactions with bpf upstream on things like this tend to be difficult (to put it mildly).

Comment 22 Jiri Benc 2022-12-02 11:31:15 UTC
(In reply to Yauheni Kaliuta from comment #6)
> Ah, it's RHEL8, bpf for non-x86 is in tech preview status there afair.

Note that this is seccomp, i.e. cBPF, i.e. fully supported. The fact that, internally, cBPF is translated to eBPF is a kernel implementation detail from the user-space point of view.

Comment 23 Jiri Benc 2022-12-02 11:38:29 UTC
(In reply to Jiri Benc from comment #20)
>     [...] However this
>     restriction should not apply to the aarch64 JIT, since BPF_JMP | BPF_CALL
>     are implemented as a 64-bit move into a register and then a BLR instruction -
>     which has the effect of being able to call anything without proximity
>     limitation.

It seems this is also the case in the RHEL 8.6 code, meaning we should be able to safely apply the upstream fix.

Comment 24 Yauheni Kaliuta 2022-12-05 09:09:17 UTC
brew build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=49388763
But kABI is the question

Comment 26 aleskandro 2022-12-07 14:44:06 UTC
With the kernel provided above, the seccomp issue is gone on OCP arm64 clusters: in particular, I'm able to spawn ~2k containers (until memory is full) with the containers running fine.

Comment 27 Scott Dodson 2022-12-08 15:21:32 UTC
@ykaliuta Can you help me understand what a reasonable timeline for getting this into 8.6.z looks like? Is there a chance that this will make 8.6 Batch 6?

Also in case it's not clear we'll need this fixed in 9.2 and potentially 9.0 as well.

Comment 28 Jiri Benc 2022-12-08 16:05:27 UTC
(In reply to Scott Dodson from comment #27)
> Also in case it's not clear we'll need this fixed in 9.2

The fix is already in RHEL 9.2 via bug 2120352.

> and potentially 9.0 as well.

Please request a 9.0 z-stream in bug 2120352 if you want that.

Comment 32 Scott Dodson 2022-12-13 15:51:43 UTC
(In reply to Jiri Benc from comment #28)
> (In reply to Scott Dodson from comment #27)
> > Also in case it's not clear we'll need this fixed in 9.2
> 
> The fix is already in RHEL 9.2 via bug 2120352.
> 
> > and potentially 9.0 as well.
> 
> Please request a 9.0 z-stream in bug 2120352 if you want that.

Jiri,

Thanks, I've requested that in the referenced bug, but I've left a comment that I'm only seeking fixes for this bug in the z-stream request, as it appears that bug tracks a significantly larger MM rebase.

Comment 40 Eirik Fuller 2023-01-25 02:51:11 UTC
A test which repeatedly sets a simple seccomp filter reported the following output with kernel 4.18.0-452.el8


cc     seccomp.c   -o seccomp
sysctl net.core.bpf_jit_limit
net.core.bpf_jit_limit = 34050467168256
dmesg -C
./seccomp
Iteration 2520 errno 12
dmesg


That output is from make, using the following Makefile.


test:	seccomp
	sysctl net.core.bpf_jit_limit
	dmesg -C
	./$<
	dmesg


The source for that seccomp binary follows.


#include <errno.h>
#include <linux/filter.h>
#include <stdio.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <seccomp.h>

static struct sock_filter filter[] = {
	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, __NR_syscalls, 0, 1),
	BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ERRNO|ENOSYS),
	BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
};

static struct sock_fprog prog = {
	.len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
	.filter = filter,
};

int main(int argc, char **argv) {
	int iterations = 8192;
	for (int i = 0; i < iterations; i++) {
		prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
		if (errno) {
			fprintf(stderr, "Iteration %d errno %d\n", i, errno);
			return 0;
		}
	}
	return 0;
}


That seccomp binary exits with an ENOMEM failure comparable to the amd64 test result mentioned in comment 15. No kernel messages were logged as a result of the seccomp binary invocation, and the bpf_jit_limit value shows the effect of the patch (and was not the likely cause of the ENOMEM).

It doesn't much matter what seccomp filter is used for the test, as long as it compiles, and doesn't interfere with the prctl call.

The same test reported the following output with kernel 4.18.0-451.el8


cc     seccomp.c   -o seccomp
sysctl net.core.bpf_jit_limit
net.core.bpf_jit_limit = 33554432
dmesg -C
./seccomp
Iteration 958 errno 524
dmesg
[   94.490934] vmap allocation for size 131072 failed: use vmalloc=<size> to increase size
[   94.498974] seccomp: vmalloc: allocation failure: 65536 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0-3
[   94.510826] CPU: 154 PID: 10092 Comm: seccomp Kdump: loaded Not tainted 4.18.0-451.el8.aarch64 #1
[   94.519688] Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 2.10.20220810 (SCP: 2.10.20220810) 2022/08/10
[   94.532458] Call trace:
[   94.534893]  dump_backtrace+0x0/0x178
[   94.538557]  show_stack+0x28/0x38
[   94.541861]  dump_stack+0x5c/0x74
[   94.545175]  warn_alloc+0x10c/0x190
[   94.548657]  __vmalloc_node_range+0x218/0x2e0
[   94.553006]  bpf_jit_alloc_exec+0x90/0xb0
[   94.557009]  bpf_jit_binary_alloc+0x6c/0xf0
[   94.561185]  bpf_int_jit_compile+0x3c8/0x4c8
[   94.565443]  bpf_prog_select_runtime+0xe4/0x130
[   94.569963]  bpf_prepare_filter+0x4c8/0x530
[   94.574138]  bpf_prog_create_from_user+0x104/0x1a8
[   94.578917]  seccomp_set_mode_filter+0x110/0x4e8
[   94.583526]  do_seccomp+0x1c8/0x238
[   94.587003]  prctl_set_seccomp+0x44/0x60
[   94.590913]  __se_sys_prctl+0x444/0x5e0
[   94.594742]  __arm64_sys_prctl+0x2c/0x38
[   94.598652]  do_el0_svc+0xb4/0x188
[   94.602046]  el0_sync_handler+0x88/0xac
[   94.605873]  el0_sync+0x140/0x180
[   94.609200] Mem-Info:
[   94.611505] active_anon:201 inactive_anon:3404 isolated_anon:0
                active_file:4242 inactive_file:6938 isolated_file:0
                unevictable:0 dirty:74 writeback:0
                slab_reclaimable:3025 slab_unreclaimable:24557
                mapped:1246 shmem:1033 pagetables:156 bounce:0
                free:8258865 free_pcp:3969 free_cma:0
[   94.643737] Node 0 active_anon:8640kB inactive_anon:152000kB active_file:117120kB inactive_file:131456kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:57984kB dirty:896kB writeback:0kB shmem:47744kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:62016kB pagetables:5184kB all_unreclaimable? no
[   94.673880] Node 1 active_anon:4224kB inactive_anon:67904kB active_file:158464kB inactive_file:312000kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:21824kB dirty:4032kB writeback:0kB shmem:18368kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:50432kB pagetables:4096kB all_unreclaimable? no
[   94.704023] Node 2 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:0kB pagetables:64kB all_unreclaimable? no
[   94.731216] Node 3 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:64kB pagetables:64kB all_unreclaimable? no
[   94.758507] Node 0 Normal free:263298752kB min:11511488kB low:14389312kB high:17267136kB active_anon:8640kB inactive_anon:150464kB active_file:120704kB inactive_file:132032kB unevictable:0kB writepending:896kB present:267386880kB managed:267025856kB mlocked:0kB bounce:0kB free_pcp:135616kB local_pcp:1024kB free_cma:0kB
[   94.786743] lowmem_reserve[]: 0 0 0
[   94.790230] Node 1 Normal free:264328640kB min:11510464kB low:14388032kB high:17265600kB active_anon:4160kB inactive_anon:67776kB active_file:158592kB inactive_file:311872kB unevictable:0kB writepending:0kB present:267386880kB managed:266996352kB mlocked:0kB bounce:0kB free_pcp:116160kB local_pcp:576kB free_cma:0kB
[   94.818121] lowmem_reserve[]: 0 0 0
[   94.821620] Node 2 DMA32 free:586112kB min:26880kB low:33600kB high:40320kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:787456kB managed:654464kB mlocked:0kB bounce:0kB free_pcp:128kB local_pcp:0kB free_cma:0kB
[   94.845949] lowmem_reserve[]: 0 0 0
[   94.849438] Node 3 DMA32 free:313088kB min:19712kB low:24640kB high:29568kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1015808kB managed:458688kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   94.873678] lowmem_reserve[]: 0 0 0
[   94.877157] Node 0 Normal: 14*64kB (M) 57*128kB (UM) 543*256kB (ME) 118*512kB (UM) 3*1024kB (M) 29*2048kB (UME) 14*4096kB (ME) 4*8192kB (M) 1*16384kB (U) 2*32768kB (UM) 1*65536kB (U) 1*131072kB (E) 3*262144kB (UME) 499*524288kB (M) = 263044864kB
[   94.898896] Node 1 Normal: 500*64kB (ME) 353*128kB (UM) 114*256kB (M) 168*512kB (UME) 9*1024kB (UM) 4*2048kB (M) 20*4096kB (UM) 7*8192kB (UM) 2*16384kB (UM) 1*32768kB (U) 3*65536kB (UME) 2*131072kB (ME) 5*262144kB (ME) 500*524288kB (M) = 264328064kB
[   94.920976] Node 2 DMA32: 20*64kB (UME) 19*128kB (UM) 17*256kB (UME) 15*512kB (UM) 13*1024kB (UM) 8*2048kB (UME) 8*4096kB (UME) 12*8192kB (UME) 7*16384kB (ME) 5*32768kB (ME) 2*65536kB (ME) 0*131072kB 0*262144kB 0*524288kB = 586112kB
[   94.941584] Node 3 DMA32: 12*64kB (U) 20*128kB (U) 12*256kB (U) 11*512kB (U) 12*1024kB (U) 3*2048kB (U) 3*4096kB (U) 1*8192kB (U) 0*16384kB 0*32768kB 0*65536kB 2*131072kB (U) 0*262144kB 0*524288kB = 313088kB
[   94.960018] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
[   94.968805] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
[   94.977416] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   94.985854] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
[   94.994638] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
[   95.003249] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   95.011686] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
[   95.020459] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
[   95.029074] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   95.037511] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=16777216kB
[   95.046299] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=524288kB
[   95.054910] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   95.063351] 12630 total pagecache pages
[   95.067177] 0 pages in swap cache
[   95.070481] Swap cache stats: add 0, delete 0, find 0/0
[   95.075709] Free swap  = 4194240kB
[   95.079099] Total swap = 4194240kB
[   95.082505] 8384016 pages RAM
[   95.085460] 0 pages HighMem/MovableOnly
[   95.089285] 22526 pages reserved
[   95.092513] 0 pages hwpoisoned


As with prior test results reported here, the errno was ENOTSUPP upon the prctl failure, with a lower iteration count, and the bpf_jit_limit value is markedly lower without the patch.
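
(A rough sanity check, my own arithmetic rather than anything stated in the patch: before the fix the arm64 BPF JIT region is 128 MB, and the default net.core.bpf_jit_limit works out to a quarter of the JIT exec region, 128 MB / 4 = 33554432 bytes, which matches the pre-patch sysctl value above. With 64 KB pages and at least one page per JIT'd program, that region can hold at most a couple of thousand concurrent filters, so the ~500-1000 filter failure points seen in comments 13 and 15 are plausible.)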

Moving to VERIFIED based on these results.

Comment 44 errata-xmlrpc 2023-05-16 08:55:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2951

