Bug 2094854 - ocp4-pci-dss-modified-api-checks-pod in a CrashLoopBackoff state because OOM.
Summary: ocp4-pci-dss-modified-api-checks-pod in a CrashLoopBackoff state because OOM.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Compliance Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jakub Hrozek
QA Contact: xiyuan
URL:
Whiteboard:
Duplicates: 2070118
Depends On:
Blocks:
Reported: 2022-06-08 12:59 UTC by German Parente
Modified: 2022-07-14 12:41 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The compliance-operator held references to machine configuration data, significantly increasing memory usage.
Consequence: The compliance operator would fail with CrashLoopBackOffs because of out-of-memory exceptions.
Fix: Use an updated version of compliance-operator (0.1.53), which includes better handling of large machine configuration data sets in memory.
Result: The compliance operator should continue to run when dealing with large machine configuration data sets.
Clone Of:
Environment:
Last Closed: 2022-07-14 12:40:58 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ComplianceAsCode compliance-operator pull 48 | 0 | None | open | Bug 2094854: api-resource-collector: Page MachineConfigs when listing them and throw away the file contents | 2022-06-16 11:45:45 UTC
Red Hat Product Errata | RHBA-2022:5537 | 0 | None | None | None | 2022-07-14 12:41:05 UTC

Description German Parente 2022-06-08 12:59:01 UTC
Description of problem:

The pod ocp4-pci-dss-modified-api-checks-pod is OOMKilled and remains in the CrashLoopBackOff state.

Version-Release number of selected component (if applicable):  0.1.52

It seems that when the amount of data in the MachineConfig objects is too high, fetching the URI '/apis/machineconfiguration.openshift.io/v1/machineconfigs' triggers this failure.

The customer managed to work around it temporarily by editing the pod spec and raising the memory limit above 600Mi. Until the operator resets the limit, the pod works fine.

To reproduce this, create large MachineConfig resource definitions. For instance, on the customer's cluster:

oc get mc -o json | pv -b >/dev/null
96.9MiB

while in a recently installed cluster it's:

oc get mc  -o yaml | pv -b > /dev/null
351KiB
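
To see which MachineConfigs account for most of that total, the JSON dump can be broken down per object. The following is only a rough sketch: the helper name mc_sizes is made up for illustration, and it measures compactly re-serialized JSON, so the byte counts will not match the pv figures above exactly.

```python
import json


def mc_sizes(mc_list: dict) -> list:
    """Return (name, size_in_bytes) pairs, largest first.

    mc_list is assumed to be the parsed output of `oc get mc -o json`,
    i.e. a List object with an "items" array. mc_sizes is a hypothetical
    helper, not part of oc or the Compliance Operator.
    """
    sizes = []
    for item in mc_list.get("items", []):
        name = (item.get("metadata") or {}).get("name", "<unnamed>")
        # Compact re-serialization: an approximation of the on-the-wire size.
        size = len(json.dumps(item, separators=(",", ":")).encode("utf-8"))
        sizes.append((name, size))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)
```

One way to use it would be to save the dump with `oc get mc -o json > mc.json`, load it with json.load, and print the first few entries of mc_sizes(...).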

This could be related to 

Bug 2042235 - Compliance Operator default memory limits cause OOMKilled CrashLoopBackOff
https://bugzilla.redhat.com/show_bug.cgi?id=2042235

which is currently under investigation. If the root cause turns out to be the same, feel free to close this bug.

Comment 1 Jakub Hrozek 2022-06-10 17:31:43 UTC
Thanks for the hint with the MCs!

Summarizing the discussion we had on Slack with the other CO developers:
 - we will fetch MCs using paging/continue. Page size tbd, but probably something small
 - for each MC, we'll strip the file contents because as far as CO is concerned, those are not interesting during the API checks (files are checked using a different kind of rules after they are rendered to the nodes)
 - we'll reconstruct the list of MCs w/o the file contents
 - run the filters and check on those

Comment 2 Jakub Hrozek 2022-06-14 14:10:09 UTC
Quick update: the local test builds seem to work with up to 200MB of MCs. Only tests and code cleanup remain to be done.

Comment 6 Jakub Hrozek 2022-06-16 12:54:47 UTC
*** Bug 2070118 has been marked as a duplicate of this bug. ***

Comment 7 xiyuan 2022-06-20 05:35:34 UTC
Without the patch, the bug was reproduced with compliance-operator.v0.1.52 + 115MiB MC.
Verification passed with PR https://github.com/ComplianceAsCode/compliance-operator/pull/48, 190MiB MC and payload 4.11.0-0.nightly-2022-06-15-222801
# git log | head
commit 8355d05f5f394f6ac582073517e3977e172d1a28
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 16 13:26:10 2022 +0200

    scan: Bump the memory limit of the api-resource collector to 200Mi
    
    Even with memory optimizations, the 100Mi limit might be too strict to
    list all objects in a cluster. Let's bump the limit to 200Mi.
    
    Jira: OCPBUGSM-45245

#  oc get mc -o json | pv -b >/dev/null
 245MiB


# oc apply -f -<<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ScanSettingBinding
> metadata:
>   name: my-ssb-r
> profiles:
>   - name: ocp4-pci-dss
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
>   - name: ocp4-pci-dss-node
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
> settingsRef:
>   name: default
>   kind: ScanSetting
>   apiGroup: compliance.openshift.io/v1alpha1
> EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created
# oc get suite -w
NAME       PHASE     RESULT
my-ssb-r   RUNNING   NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
^C
# oc get pod
NAME                                              READY   STATUS    RESTARTS       AGE
compliance-operator-86795c6dc6-xdvmh              1/1     Running   1 (100m ago)   101m
ocp4-openshift-compliance-pp-56f48b69d5-m4qx4     1/1     Running   0              100m
rhcos4-openshift-compliance-pp-5d95675dfc-zv6x2   1/1     Running   0              100m
# oc get ccr | head
NAME                                                                               STATUS   SEVERITY
ocp4-pci-dss-accounts-restrict-service-account-tokens                              MANUAL   medium
ocp4-pci-dss-accounts-unique-service-account                                       MANUAL   medium
ocp4-pci-dss-api-server-admission-control-plugin-alwaysadmit                       PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-alwayspullimages                  PASS     high
ocp4-pci-dss-api-server-admission-control-plugin-namespacelifecycle                PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-noderestriction                   PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-scc                               PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-securitycontextdeny               PASS     medium
ocp4-pci-dss-api-server-admission-control-plugin-serviceaccount                    PASS     medium

# oc get cr | head
NAME                                                                                 STATE
ocp4-pci-dss-api-server-encryption-provider-cipher                                   NotApplied
ocp4-pci-dss-api-server-encryption-provider-config                                   NotApplied
ocp4-pci-dss-node-master-kubelet-configure-event-creation                            NotApplied
ocp4-pci-dss-node-master-kubelet-configure-tls-cipher-suites                         NotApplied
ocp4-pci-dss-node-master-kubelet-enable-iptables-util-chains                         NotApplied
ocp4-pci-dss-node-master-kubelet-enable-protect-kernel-defaults                      NotApplied
ocp4-pci-dss-node-master-kubelet-enable-protect-kernel-sysctl                        NotApplied
ocp4-pci-dss-node-master-kubelet-eviction-thresholds-set-hard-imagefs-available      NotApplied
ocp4-pci-dss-node-master-kubelet-eviction-thresholds-set-hard-imagefs-available-1    NotApplied

# oc get pod
NAME                                              READY   STATUS    RESTARTS       AGE
compliance-operator-86795c6dc6-xdvmh              1/1     Running   1 (113m ago)   114m
ocp4-openshift-compliance-pp-56f48b69d5-m4qx4     1/1     Running   0              112m
rhcos4-openshift-compliance-pp-5d95675dfc-zv6x2   1/1     Running   0              112m

Comment 10 xiyuan 2022-06-27 05:21:46 UTC
Retest passed with the latest code and payload 4.11.0-0.nightly-2022-06-25-081133
# oc get mc  -o yaml | pv -b > /dev/null
 228MiB
# git log | head
commit f891251c8c0d65a8240b1d90867b396778fcc003
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 23 16:13:37 2022 +0200

    tests/contrib: Add a helper script that populatest the cluster with many MCs

commit 120271a1902c975e5893e561a66032a81dd850d9
Author: Jakub Hrozek <jhrozek>
Date:   Thu Jun 16 13:26:10 2022 +0200
# oc apply -f -<<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-ssb-r
profiles:
  - name: ocp4-pci-dss
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
  - name: ocp4-pci-dss-node
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: default
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1
EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created
# oc get suite -w
NAME       PHASE     RESULT
my-ssb-r   RUNNING   NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
^C

Comment 13 xiyuan 2022-07-08 06:44:01 UTC
Verification passed with compliance-operator.v0.1.53 and 4.11.0-rc.1

$ oc get mc -o json | pv -b >/dev/null
 245MiB
$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-hksfh   compliance-operator.v0.1.53   Automatic   true
$ oc get csv
NAME                            DISPLAY                            VERSION   REPLACES   PHASE
compliance-operator.v0.1.53     Compliance Operator                0.1.53               Succeeded
elasticsearch-operator.v5.5.0   OpenShift Elasticsearch Operator   5.5.0                Succeeded
$  oc apply -f -<<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ScanSettingBinding
> metadata:
>   name: my-ssb-r
> profiles:
>   - name: ocp4-pci-dss
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
>   - name: ocp4-pci-dss-node
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
> settingsRef:
>   name: default
>   kind: ScanSetting
>   apiGroup: compliance.openshift.io/v1alpha1
> EOF

scansettingbinding.compliance.openshift.io/my-ssb-r created
$ oc get suite -w
NAME       PHASE       RESULT
my-ssb-r   LAUNCHING   NOT-AVAILABLE
my-ssb-r   LAUNCHING   NOT-AVAILABLE
my-ssb-r   LAUNCHING   NOT-AVAILABLE
my-ssb-r   RUNNING     NOT-AVAILABLE
my-ssb-r   RUNNING     NOT-AVAILABLE
my-ssb-r   RUNNING     NOT-AVAILABLE
my-ssb-r   RUNNING     NOT-AVAILABLE
my-ssb-r   RUNNING     NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT

Comment 14 xiyuan 2022-07-08 06:45:01 UTC
Sorry, wrong operation; this should have been moved to VERIFIED.

Comment 16 errata-xmlrpc 2022-07-14 12:40:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Compliance Operator bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5537

