Description of problem:
The pod ocp4-pci-dss-modified-api-checks-pod is OOMKilled and remains in the CrashLoopBackOff state.

Version-Release number of selected component (if applicable):
0.1.52

The incident appears to be triggered when the amount of data in the MachineConfig objects is too high: fetching the URI '/apis/machineconfiguration.openshift.io/v1/machineconfigs' makes the pod run out of memory.

The customer has worked around it temporarily by editing the pod spec and raising the memory limit to more than 600Mi. Until the operator resets the limit, the pod works fine.

The way to reproduce this is simply to create large MachineConfig resource definitions. For instance, on the customer's cluster:

oc get mc -o json | pv -b >/dev/null
96.9MiB

while on a recently installed cluster:

oc get mc -o yaml | pv -b > /dev/null
351KiB

This could be related to Bug 2042235 - Compliance Operator default memory limits cause OOMKilled CrashLoopBackOff
https://bugzilla.redhat.com/show_bug.cgi?id=2042235
which is currently under investigation. If the root cause turns out to be the same, feel free to close this bug.
Thanks for the hint with the MCs! Summarizing the discussion we had on Slack with the other CO developers:
- we will fetch MCs using paging/continue. Page size tbd, but probably something small
- for each MC, we'll strip the file contents because as far as CO is concerned, those are not interesting during the API checks (files are checked using a different kind of rules after they are rendered to the nodes)
- we'll reconstruct the list of MCs w/o the file contents
- run the filters and check on those
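For readers following along, the plan above can be sketched roughly as below. This is only an illustration in Python, not the operator's actual code (the operator is not written in Python). It assumes the standard Kubernetes list semantics (a `limit` per request and a `metadata.continue` token in each response) and the usual MachineConfig layout where Ignition files live under `spec.config.storage.files`. The names `strip_file_contents`, `list_machineconfigs_stripped`, `fetch_page` and `make_fake_fetcher` are hypothetical and exist only for this example; the fake fetcher stands in for the real API call.

```python
import copy


def strip_file_contents(mc):
    """Drop the 'contents' of every Ignition file in one MachineConfig, in place.

    Paths and other metadata are kept; only the (potentially huge) payload goes.
    """
    spec = mc.get("spec") or {}
    config = spec.get("config") or {}
    storage = config.get("storage") or {}
    for f in storage.get("files") or []:
        f.pop("contents", None)
    return mc


def list_machineconfigs_stripped(fetch_page, page_size=5):
    """Fetch MachineConfigs page by page and rebuild the list without file contents.

    fetch_page is a hypothetical callable taking (limit, continue_token) and
    returning a dict shaped like a Kubernetes list response.
    """
    items = []
    token = None
    while True:
        page = fetch_page(limit=page_size, continue_token=token)
        for mc in page.get("items") or []:
            # Only the stripped object is retained; the full page can be freed.
            items.append(strip_file_contents(mc))
        token = (page.get("metadata") or {}).get("continue")
        if not token:
            break
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfigList",
        "items": items,
    }


def make_fake_fetcher(all_items):
    """In-memory stand-in for the API server, used only to exercise the sketch."""
    def fetch_page(limit, continue_token=None):
        start = int(continue_token) if continue_token else 0
        end = start + limit
        # deepcopy mimics deserializing a fresh response for every request
        chunk = copy.deepcopy(all_items[start:end])
        meta = {"continue": str(end)} if end < len(all_items) else {}
        return {"items": chunk, "metadata": meta}
    return fetch_page
```

The point of the design is that, at any moment, only one small page holds full file contents in memory, while the reconstructed list that the filters run on carries just the stripped objects.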
Quick update: the local test builds seem to work with up to 200MB of MCs. Only tests and code cleanup remain to be done.
*** Bug 2070118 has been marked as a duplicate of this bug. ***
Without the patch, the bug was reproduced with compliance-operator.v0.1.52 + 115MiB MC.

Verified pass with PR https://github.com/ComplianceAsCode/compliance-operator/pull/48, 190MiB MC and payload 4.11.0-0.nightly-2022-06-15-222801

# git log | head
commit 8355d05f5f394f6ac582073517e3977e172d1a28
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 16 13:26:10 2022 +0200

    scan: Bump the memory limit of the api-resource collector to 200Mi

    Even with memory optimizations, the 100Mi limit might be too strict to list all objects in a cluster. Let's bump the limit to 200Mi.

    Jira: OCPBUGSM-45245

# oc get mc -o json | pv -b >/dev/null
245MiB

# oc apply -f -<<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ScanSettingBinding
> metadata:
>   name: my-ssb-r
> profiles:
>   - name: ocp4-pci-dss
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
>   - name: ocp4-pci-dss-node
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
> settingsRef:
>   name: default
>   kind: ScanSetting
>   apiGroup: compliance.openshift.io/v1alpha1
> EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created

# oc get suite -w
NAME       PHASE         RESULT
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
^C

# oc get pod
NAME                                              READY   STATUS    RESTARTS       AGE
compliance-operator-86795c6dc6-xdvmh              1/1     Running   1 (100m ago)   101m
ocp4-openshift-compliance-pp-56f48b69d5-m4qx4     1/1     Running   0              100m
rhcos4-openshift-compliance-pp-5d95675dfc-zv6x2   1/1     Running   0              100m

# oc get ccr | head
NAME                                                                  STATUS   SEVERITY
ocp4-pci-dss-accounts-restrict-service-account-tokens                 MANUAL   medium
ocp4-pci-dss-accounts-unique-service-account                          MANUAL   medium
ocp4-pci-dss-api-server-admission-control-plugin-alwaysadmit          PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-alwayspullimages     PASS     high
ocp4-pci-dss-api-server-admission-control-plugin-namespacelifecycle   PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-noderestriction      PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-scc                  PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-securitycontextdeny  PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-serviceaccount       PASS     medium

# oc get cr | head
NAME                                                                                 STATE
ocp4-pci-dss-api-server-encryption-provider-cipher                                   NotApplied
ocp4-pci-dss-api-server-encryption-provider-config                                   NotApplied
ocp4-pci-dss-node-master-kubelet-configure-event-creation                            NotApplied
ocp4-pci-dss-node-master-kubelet-configure-tls-cipher-suites                         NotApplied
ocp4-pci-dss-node-master-kubelet-enable-iptables-util-chains                         NotApplied
ocp4-pci-dss-node-master-kubelet-enable-protect-kernel-defaults                      NotApplied
ocp4-pci-dss-node-master-kubelet-enable-protect-kernel-sysctl                        NotApplied
ocp4-pci-dss-node-master-kubelet-eviction-thresholds-set-hard-imagefs-available      NotApplied
ocp4-pci-dss-node-master-kubelet-eviction-thresholds-set-hard-imagefs-available-1    NotApplied

# oc get pod
NAME                                              READY   STATUS    RESTARTS       AGE
compliance-operator-86795c6dc6-xdvmh              1/1     Running   1 (113m ago)   114m
ocp4-openshift-compliance-pp-56f48b69d5-m4qx4     1/1     Running   0              112m
rhcos4-openshift-compliance-pp-5d95675dfc-zv6x2   1/1     Running   0              112m
Retest pass with latest code and payload 4.11.0-0.nightly-2022-06-25-081133

# oc get mc -o yaml | pv -b > /dev/null
228MiB

# git log | head
commit f891251c8c0d65a8240b1d90867b396778fcc003
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 23 16:13:37 2022 +0200

    tests/contrib: Add a helper script that populatest the cluster with many MCs

commit 120271a1902c975e5893e561a66032a81dd850d9
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 16 13:26:10 2022 +0200

# oc apply -f -<<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-ssb-r
profiles:
  - name: ocp4-pci-dss
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
  - name: ocp4-pci-dss-node
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: default
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1
EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created

# oc get suite -w
NAME       PHASE         RESULT
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
^C
Verification pass with compliance-operator.v0.1.53 and 4.11.0-rc.1

$ oc get mc -o json | pv -b >/dev/null
245MiB

$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-hksfh   compliance-operator.v0.1.53   Automatic   true

$ oc get csv
NAME                            DISPLAY                            VERSION   REPLACES   PHASE
compliance-operator.v0.1.53     Compliance Operator                0.1.53               Succeeded
elasticsearch-operator.v5.5.0   OpenShift Elasticsearch Operator   5.5.0                Succeeded

$ oc apply -f -<<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ScanSettingBinding
> metadata:
>   name: my-ssb-r
> profiles:
>   - name: ocp4-pci-dss
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
>   - name: ocp4-pci-dss-node
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
> settingsRef:
>   name: default
>   kind: ScanSetting
>   apiGroup: compliance.openshift.io/v1alpha1
> EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created

$ oc get suite -w
NAME       PHASE         RESULT
my-ssb-r   LAUNCHING     NOT-AVAILABLE
my-ssb-r   LAUNCHING     NOT-AVAILABLE
my-ssb-r   LAUNCHING     NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
Sorry, wrong operation; this bug should move to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Compliance Operator bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:5537