Description of problem:
Sibling to https://issues.redhat.com/browse/OCPBUGS-15043 and similar to https://bugzilla.redhat.com/show_bug.cgi?id=2140163, except that the issue is hit again in OpenShift 4.13.3 (which uses kernel 5.14.0-284.16.1.el9_2) as well as OpenShift 4.12.15 (which uses kernel 4.18.0-372.52.1.el8_6).

I can't find a bug for the 9.2 version, so I'm not sure whether it's fixed already, but I'd like one to track regardless. I also wonder whether a comparatively newer patch (https://lore.kernel.org/bpf/20230321170925.74358-1-kuniyu@amazon.com/) is relevant, and what the status of its backport is.

Version-Release number of selected component (if applicable):
5.14.0-284.16.1.el9_2
4.18.0-372.52.1.el8_6

How reproducible:
From the issue above:
```
Currently, they are unable to spin up the pods on this specific worker. Although they do not see any resource overcommitment in the node describe.
```

Steps to Reproduce:
1. This worker node has 228 running pods
2.
3.

Actual results:
```
When customers tried to start deployment, and that deployments tried to run pods on worker number 6, those pods entered "CreateContainerError". Upon checking the events of these pods, all presented the error: "runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524".
```

Expected results:
From the Jira:
```
Container should not be stuck on this error since this issue was addressed in 4.12.2 errata earlier
https://issues.redhat.com/browse/OCPBUGS-2637
https://issues.redhat.com/browse/RUN-1668
https://access.redhat.com/errata/RHBA-2023:0568 -> OCPBUGS-6981 - error 524 from seccomp(2) when trying to load filter [rhel-8.6.0.z]
```

Interestingly, the kernel version listed in the bug above is newer than the one in 4.12.15, though the bug apparently went through errata. It may be an RHCOS packaging problem, but I wanted to open this here to track the el9 version as well.

Additional info:
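For triage, it may help to decode the errno 524 in the runc error quoted above. In the kernel's include/linux/errno.h, 524 is ENOTSUPP, a kernel-internal code that userspace errno tables do not define, which is why runc reports it only as a bare number. The following is a minimal sketch for pulling the value out of an event message and naming it; the helper names and the lookup table are hypothetical and not part of runc or any OpenShift tooling.

```python
import errno
import re

# Kernel-internal error codes (include/linux/errno.h) that userspace errno
# tables do not define. Only the code seen in this report is listed;
# this table is an illustrative assumption, not an exhaustive mapping.
KERNEL_INTERNAL_ERRNOS = {524: "ENOTSUPP"}


def errno_name(code):
    """Return a symbolic name for an errno value, or 'UNKNOWN' if none is known."""
    if code in KERNEL_INTERNAL_ERRNOS:
        return KERNEL_INTERNAL_ERRNOS[code]
    return errno.errorcode.get(code, "UNKNOWN")


def errno_from_message(message):
    """Extract the first 'errno N' value from a runc-style error string."""
    match = re.search(r"errno (\d+)", message)
    return int(match.group(1)) if match else None


# Error text copied from the pod events in this report.
msg = (
    "runc create failed: unable to start container process: "
    "unable to init seccomp: error loading seccomp filter into kernel: "
    "error loading seccomp filter: errno 524"
)

code = errno_from_message(msg)
print(code, errno_name(code))
```

The same helper can be run over the events of every pod in "CreateContainerError" to confirm they all fail with the same code before testing a patched kernel.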
Hi Peter,

the above patch [1] that you mention looks like it could resolve the issue. It has been recently backported to CentOS Stream 9 as a part of our regular BPF subsystem rebase and will appear in RHEL 9.3. So, in case we confirm that it is the necessary fix, we will need to backport it to 8.6 and 9.2 z-streams.

I crafted a Brew build for 9.2z with [1] included, so that we can test it:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=53691694

@Peter if you or someone could use this to check if it resolves the problem, that would be great. I'll post it to the original OCP Jira issue, too.

FWIW, the 5.14.0-284.16.1.el9_2 kernel also suffers from a memleak introduced by upstream commit [2]. There is a fix for it already [3], so we should backport that one to 9.2 z-stream, too. But since this issue appears on 4.18.0-372.52.1.el8_6, too, which doesn't have [2], I'm fairly confident that we'll need to backport [1] anyways.

[1] https://github.com/torvalds/linux/commit/10ec8ca8ec1a2f04c4ed90897225231c58c124a7
[2] https://github.com/torvalds/linux/commit/3a15fb6ed92cb32b0a83f406aa4a96f28c9adbc3
[3] https://github.com/torvalds/linux/commit/a1140cb215fa13dcec06d12ba0c3ee105633b7c4
I also crafted a Brew build for 8.6z with the mentioned fix included, in case it helps with testing: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=53692906
Since the same issue likely affects RHEL 8, too, I created a copy of this bug for it: bz#2219567.
Since this bug is for RHEL9, I'm going to use it to backport the memleak fix a1140cb215fa ("seccomp: Move copy_seccomp() to no failure path.") into 9.3.