Bug 2009233

Summary: ACM policy object generated by PolicyGen conflicting with OLM Operator
Product: OpenShift Container Platform
Reporter: Juan Manuel Parrilla Madrid <jparrill>
Component: Telco Edge
Sub component: ZTP
Assignee: Ian Miller <imiller>
QA Contact: yliu1
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: mcornea, tmulquee
Version: 4.8
Target Release: 4.10.0
Doc Type: Bug Fix
Doc Text:
Cause: The generated policy has complianceType "mustonlyhave", so OLM updates to metadata are reverted when the policy engine restores the "desired" state of the CR. Consequence: OLM and the policy engine continuously overwrite the metadata of the CR under conflict, resulting in high CPU use. Fix: Change the default complianceType to "musthave". Result: OLM and the policy engine no longer conflict, and CPU use returns to baseline.
Last Closed: 2022-03-10 16:14:44 UTC
Type: Bug
Bug Blocks: 2025082

Description Juan Manuel Parrilla Madrid 2021-09-30 08:35:30 UTC
Description of problem:

Using the ZTP flow and the repo https://github.com/openshift-kni/cnf-features-deploy, we deploy an SNO cluster and the policies associated with the environment. When the ACM policies created in the hub cluster use the "MustOnlyHave" compliance behaviour, this leads to a conflict over labels on namespaces:

- The policy creates a namespace with specific labels (e.g. monitoring).
- The OLM operator tries to patch that namespace with an automatically generated label (olm.operatorgroup.uid/bb373fcd-1a63-4b7a-83cd-011226dc71ad: "").
- The policy enters the NonCompliant state.
- The policy gets applied again, reverting the label.
- The loop goes on.

I'm using the hooks for PolicyGen.

Version-Release number of selected component (if applicable):

ACM 2.3.3
Hub 4.8.5
SNO 4.8.11

How reproducible:

Always


Steps to Reproduce:
1. Deploy ACM and the gitops-operator
2. Fill the code repo as exists on the cnf-feature-deploy repo
3. git push to the repo and then let the hooks deploy the SNO and the ACM Policies
4. Wait until it starts flapping

Actual results:

- Policy flapping between NonCompliant and Compliant states
- Many errors in the OLM operator logs

Expected results:

No errors
Additional info:

- Logs on the OLM Operator:
time="2021-09-29T10:30:12Z" level=info msg="checking ptp-operator.4.8.0-202108312109"
time="2021-09-29T10:30:12Z" level=info msg="checking performance-addon-operator.v4.8.1"
{"level":"error","ts":1632911412.328675,"logger":"controllers.operator","msg":"Could not update Operator status","request":"/ptp-operator.openshift-ptp","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"ptp-operator.openshift-ptp\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:293\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:248\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
time="2021-09-29T10:30:12Z" level=info msg="checking ptp-operator.4.8.0-202108312109"
{"level":"error","ts":1632911412.4046497,"logger":"controllers.operator","msg":"Could not update Operator status","request":"/local-storage-operator.openshift-local-storage","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"local-storage-operator.openshift-local-storage\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:293\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:248\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1632911412.4187012,"logger":"controllers.operator","msg":"Could not update Operator status","request":"/local-storage-operator.openshift-local-storage","error":"Operation cannot be fulfilled on operators.operators.coreos.com \"local-storage-operator.openshift-local-storage\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:293\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:248\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99"}
time="2021-09-29T10:30:12Z" level=info msg="checking sriov-fec.v1.3.0"
E0929 10:30:12.680671       1 queueinformer_operator.go:290] sync {"update" "openshift-performance-addon-operator"} failed: Operation cannot be fulfilled on namespaces "openshift-performance-addon-operator": the object has been modified; please apply your changes to the latest version and try again
E0929 10:30:12.730499       1 queueinformer_operator.go:290] sync {"update" "openshift-sriov-network-operator"} failed: Operation cannot be fulfilled on namespaces "openshift-sriov-network-operator": the object has been modified; please apply your changes to the latest version and try again
time="2021-09-29T10:30:13Z" level=info msg="checking ptp-operator.4.8.0-202108312109"
time="2021-09-29T10:30:13Z" level=info msg="checking performance-addon-operator.v4.8.1"
time="2021-09-29T10:30:14Z" level=info msg="checking performance-addon-operator.v4.8.1"

Comment 1 Juan Manuel Parrilla Madrid 2021-09-30 12:10:32 UTC
Patching the ACM policy with "complianceType: musthave" is a temporary workaround that you can apply, but if you modify the repo, the patch will be overridden by the hooks.
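For reference, a minimal sketch of what the patched object template could look like inside the ConfigurationPolicy embedded in the ACM Policy (the policy name, namespace, and label below are illustrative placeholders, not taken from this environment):

```yaml
apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
  name: example-namespace-policy   # illustrative name
spec:
  remediationAction: enforce
  severity: low
  object-templates:
    # Workaround: "musthave" tolerates extra fields (such as the
    # olm.operatorgroup.uid/... label OLM adds), while the generated
    # default "mustonlyhave" reverts them and triggers the loop.
    - complianceType: musthave
      objectDefinition:
        apiVersion: v1
        kind: Namespace
        metadata:
          name: openshift-ptp      # illustrative namespace
          labels:
            app: monitoring        # illustrative label
```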

Comment 2 Juan Manuel Parrilla Madrid 2021-10-07 07:58:37 UTC
Hey folks, this is also happening with PVCs:


Error on the policy:

    - eventName: vz-wc-lab-policies.vz-wc-lab-image-registry-policy.16ab6a5510dab516
      lastTimestamp: "2021-10-07T05:40:23Z"
      message: 'NonCompliant; violation - Error updating the object `registry-storage`,
        the error is `Operation cannot be fulfilled on persistentvolumeclaims "registry-storage":
        the object has been modified; please apply your changes to the latest version
        and try again`; notification - configs [cluster] found as specified, therefore
        this Object template is compliant'
    - eventName: vz-wc-lab-policies.vz-wc-lab-image-registry-policy.16aba20db2e6167c
      lastTimestamp: "2021-10-07T05:35:06Z"
      message: "NonCompliant; violation - Error updating the object `registry-storage`,
        the error is `PersistentVolumeClaim \"registry-storage\" is invalid: spec:
        Forbidden: spec is immutable after creation except resources.requests for
        bound claims\n  core.PersistentVolumeClaimSpec{\n  \tAccessModes:      {\"ReadWriteOnce\"},\n
        \ \tSelector:         nil,\n  \tResources:        {Requests: {s\"storage\":
        {i: {...}, s: \"100Gi\", Format: \"BinarySI\"}}},\n- \tVolumeName:       \"\",\n+
        \tVolumeName:       \"local-pv-b908200e\",\n  \tStorageClassName: nil,\n  \tVolumeMode:
        \      &\"Filesystem\",\n  \tDataSource:       nil,\n  }\n`; notification
        - configs [cluster] found as specified, therefore this Object template is
        compliant"
    - eventName: vz-wc-lab-policies.vz-wc-lab-image-registry-policy.16ab6a5510dab516
      lastTimestamp: "2021-10-07T05:10:08Z"
      message: 'NonCompliant; violation - Error updating the object `registry-storage`,
        the error is `Operation cannot be fulfilled on persistentvolumeclaims "registry-storage":
        the object has been modified; please apply your changes to the latest version
        and try again`; notification - configs [cluster] found as specified, therefore
        this Object template is compliant'



This is the object that the policy wants to enforce:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-class: fs-lso
  creationTimestamp: "2021-10-06T10:04:09Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: registry-storage
  namespace: openshift-image-registry
  resourceVersion: "1623757"
  uid: 36578f14-2c57-4e46-b116-8aabedf759ed
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  volumeMode: Filesystem
  volumeName: local-pv-b908200e
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  phase: Bound


This is the object that the other operator wants to apply:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    volume.beta.kubernetes.io/storage-class: fs-lso
  creationTimestamp: "2021-10-06T10:04:09Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: registry-storage
  namespace: openshift-image-registry
  resourceVersion: "1623757"
  uid: 36578f14-2c57-4e46-b116-8aabedf759ed
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  volumeMode: Filesystem
  volumeName: local-pv-b908200e
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  phase: Bound

Comment 5 yliu1 2021-11-19 20:28:17 UTC
We currently don't have a formal test env to test ZTP for 4.10 nightly at the moment. Marking it as verified to unblock the merge to 4.9; we will verify this change in 4.9.

Comment 6 Ian Miller 2021-11-24 12:36:07 UTC
Reopening. Further testing showed there is still excess CPU use.

Comment 9 Tony Mulqueen 2022-01-10 15:39:51 UTC
Doc Text would be helpful in documenting this in the 4.10 release notes. Please supply.

Comment 12 errata-xmlrpc 2022-03-10 16:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056