Bug 2094854
Summary: | ocp4-pci-dss-modified-api-checks-pod in a CrashLoopBackOff state because of OOM | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | German Parente <gparente> |
Component: | Compliance Operator | Assignee: | Jakub Hrozek <jhrozek> |
Status: | CLOSED ERRATA | QA Contact: | xiyuan |
Severity: | high | Priority: | high |
Version: | 4.10 | Target Release: | 4.11.0 |
Hardware: | Unspecified | OS: | Unspecified |
CC: | agawand, igreen, jhrozek, lbragsta, mrogers, suprs, vyoganan, wenshen, xiyuan | ||
Doc Type: | Bug Fix | ||
Doc Text: |
Cause: The Compliance Operator held references to machine configuration data, which significantly increased its memory usage.
Consequence: The Compliance Operator failed with CrashLoopBackOff errors because of out-of-memory (OOM) conditions.
Fix: Use the updated Compliance Operator version 0.1.53, which handles large machine configuration data sets in memory more efficiently.
Result: The Compliance Operator should continue to run when handling large machine configuration data sets.
|
Last Closed: | 2022-07-14 12:40:58 UTC | Type: | Bug |
Description
German Parente
2022-06-08 12:59:01 UTC
Thanks for the hint with the MCs! Summarizing the discussion we had on Slack with the other CO developers:
- we will fetch MCs using paging/continue. Page size TBD, but probably something small
- for each MC, we'll strip the file contents because, as far as CO is concerned, those are not interesting during the API checks (files are checked using a different kind of rule after they are rendered to the nodes)
- we'll reconstruct the list of MCs without the file contents
- run the filters and checks on those

Quick update: the local test builds seem to work with up to 200MB of MCs. Only tests and code prettifying remain to be done.

*** Bug 2070118 has been marked as a duplicate of this bug. ***

Without the patch, the bug was reproduced with compliance-operator.v0.1.52 and 115MiB of MCs.
Verification passed with PR https://github.com/ComplianceAsCode/compliance-operator/pull/48, 190MiB of MCs, and payload 4.11.0-0.nightly-2022-06-15-222801:
# git log | head
commit 8355d05f5f394f6ac582073517e3977e172d1a28
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 16 13:26:10 2022 +0200

    scan: Bump the memory limit of the api-resource collector to 200Mi

    Even with memory optimizations, the 100Mi limit might be too strict to list all objects in a cluster. Let's bump the limit to 200Mi.
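The plan summarized above can be sketched in Go, the language the Compliance Operator is written in. The plan is to page through MachineConfigs with a continue token, strip file contents, and then run the checks on the rebuilt list. This is a minimal, self-contained illustration under stated assumptions, not the operator's actual code. The `File`, `MachineConfig`, `Page` and `Lister` types and the `stripFileContents`, `fetchStripped` and `fakeLister` functions are hypothetical stand-ins. Against a real cluster, the page size and token would correspond to the `Limit` and `Continue` fields of `metav1.ListOptions`.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// File and MachineConfig are hypothetical, trimmed-down stand-ins for the
// MachineConfig fields that matter here; they are NOT the real MCO/Ignition types.
type File struct {
	Path     string
	Contents string // stands in for the (often large) embedded file payload
}

type MachineConfig struct {
	Name  string
	Files []File
}

// Page is a hypothetical list response; Continue mimics the Kubernetes
// metadata.continue token ("" means there are no more pages).
type Page struct {
	Items    []MachineConfig
	Continue string
}

// Lister is a hypothetical function returning at most limit items,
// starting from the position encoded by cont ("" means the first page).
type Lister func(limit int, cont string) (Page, error)

// stripFileContents copies mc, keeping file paths but dropping contents.
func stripFileContents(mc MachineConfig) MachineConfig {
	out := MachineConfig{Name: mc.Name, Files: make([]File, len(mc.Files))}
	for i, f := range mc.Files {
		out.Files[i] = File{Path: f.Path}
	}
	return out
}

// fetchStripped pages through all MachineConfigs and keeps only stripped
// copies, so full objects only need to be held one page at a time.
func fetchStripped(list Lister, pageSize int) ([]MachineConfig, error) {
	if pageSize <= 0 {
		return nil, errors.New("pageSize must be positive")
	}
	var result []MachineConfig
	cont := ""
	for {
		page, err := list(pageSize, cont)
		if err != nil {
			return nil, err
		}
		for _, mc := range page.Items {
			result = append(result, stripFileContents(mc))
		}
		if page.Continue == "" {
			return result, nil
		}
		cont = page.Continue
	}
}

// fakeLister serves items from memory, using the next offset as the continue token.
func fakeLister(all []MachineConfig) Lister {
	return func(limit int, cont string) (Page, error) {
		start := 0
		if cont != "" {
			n, err := strconv.Atoi(cont)
			if err != nil {
				return Page{}, err
			}
			start = n
		}
		end := start + limit
		if end > len(all) {
			end = len(all)
		}
		p := Page{Items: all[start:end]}
		if end < len(all) {
			p.Continue = strconv.Itoa(end)
		}
		return p, nil
	}
}

func main() {
	var all []MachineConfig
	for i := 0; i < 5; i++ {
		big := File{Path: "/etc/example.conf", Contents: strings.Repeat("x", 1<<20)}
		all = append(all, MachineConfig{Name: fmt.Sprintf("mc-%d", i), Files: []File{big}})
	}
	mcs, err := fetchStripped(fakeLister(all), 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(mcs), mcs[0].Name, len(mcs[0].Files[0].Contents)) // 5 mc-0 0
}
```

With a real paged API client, each page of full objects can be released once its stripped copies are appended. Peak memory is then roughly one page of complete MachineConfigs plus the stripped list, rather than the whole MC payload. The in-memory `fakeLister` here keeps everything alive only because it is a test double.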
Jira: OCPBUGSM-45245

# oc get mc -o json | pv -b >/dev/null
245MiB
# oc apply -f -<<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ScanSettingBinding
> metadata:
>   name: my-ssb-r
> profiles:
>   - name: ocp4-pci-dss
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
>   - name: ocp4-pci-dss-node
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
> settingsRef:
>   name: default
>   kind: ScanSetting
>   apiGroup: compliance.openshift.io/v1alpha1
> EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created
# oc get suite -w
NAME       PHASE         RESULT
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
^C
# oc get pod
NAME                                              READY   STATUS    RESTARTS       AGE
compliance-operator-86795c6dc6-xdvmh              1/1     Running   1 (100m ago)   101m
ocp4-openshift-compliance-pp-56f48b69d5-m4qx4     1/1     Running   0              100m
rhcos4-openshift-compliance-pp-5d95675dfc-zv6x2   1/1     Running   0              100m
# oc get ccr | head
NAME                                                                   STATUS   SEVERITY
ocp4-pci-dss-accounts-restrict-service-account-tokens                  MANUAL   medium
ocp4-pci-dss-accounts-unique-service-account                           MANUAL   medium
ocp4-pci-dss-api-server-admission-control-plugin-alwaysadmit           PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-alwayspullimages      PASS     high
ocp4-pci-dss-api-server-admission-control-plugin-namespacelifecycle    PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-noderestriction       PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-scc                   PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-securitycontextdeny   PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-serviceaccount        PASS     medium
# oc get cr | head
NAME                                                                                STATE
ocp4-pci-dss-api-server-encryption-provider-cipher                                  NotApplied
ocp4-pci-dss-api-server-encryption-provider-config                                  NotApplied
ocp4-pci-dss-node-master-kubelet-configure-event-creation                           NotApplied
ocp4-pci-dss-node-master-kubelet-configure-tls-cipher-suites                        NotApplied
ocp4-pci-dss-node-master-kubelet-enable-iptables-util-chains                        NotApplied
ocp4-pci-dss-node-master-kubelet-enable-protect-kernel-defaults                     NotApplied
ocp4-pci-dss-node-master-kubelet-enable-protect-kernel-sysctl                       NotApplied
ocp4-pci-dss-node-master-kubelet-eviction-thresholds-set-hard-imagefs-available     NotApplied
ocp4-pci-dss-node-master-kubelet-eviction-thresholds-set-hard-imagefs-available-1   NotApplied
# oc get pod
NAME                                              READY   STATUS    RESTARTS       AGE
compliance-operator-86795c6dc6-xdvmh              1/1     Running   1 (113m ago)   114m
ocp4-openshift-compliance-pp-56f48b69d5-m4qx4     1/1     Running   0              112m
rhcos4-openshift-compliance-pp-5d95675dfc-zv6x2   1/1     Running   0              112m

Retest passed with the latest code and payload 4.11.0-0.nightly-2022-06-25-081133:
# oc get mc -o yaml | pv -b > /dev/null
228MiB
# git log | head
commit f891251c8c0d65a8240b1d90867b396778fcc003
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 23 16:13:37 2022 +0200

    tests/contrib: Add a helper script that populatest the cluster with many MCs

commit 120271a1902c975e5893e561a66032a81dd850d9
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 16 13:26:10 2022 +0200
# oc apply -f -<<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-ssb-r
profiles:
  - name: ocp4-pci-dss
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
  - name: ocp4-pci-dss-node
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: default
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1
EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created
# oc get suite -w
NAME       PHASE         RESULT
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
^C

Verification passed with compliance-operator.v0.1.53 and 4.11.0-rc.1:
$ oc get mc -o json | pv -b >/dev/null
245MiB
$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-hksfh   compliance-operator.v0.1.53   Automatic   true
$ oc get csv
NAME                            DISPLAY                            VERSION   REPLACES   PHASE
compliance-operator.v0.1.53     Compliance Operator                0.1.53               Succeeded
elasticsearch-operator.v5.5.0   OpenShift Elasticsearch Operator   5.5.0                Succeeded
$ oc apply -f -<<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ScanSettingBinding
> metadata:
>   name: my-ssb-r
> profiles:
>   - name: ocp4-pci-dss
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
>   - name: ocp4-pci-dss-node
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
> settingsRef:
>   name: default
>   kind: ScanSetting
>   apiGroup: compliance.openshift.io/v1alpha1
> EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created
$ oc get suite -w
NAME       PHASE         RESULT
my-ssb-r   LAUNCHING     NOT-AVAILABLE
my-ssb-r   LAUNCHING     NOT-AVAILABLE
my-ssb-r   LAUNCHING     NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
Sorry, wrong operation; this should have been moved to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Compliance Operator bug fix update) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5537