2002667 – Compliance state doesn't get updated after fixing the issue causing initially the policy not being able to update the managed object

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Advanced Cluster Management for Kubernetes
Classification:	Red Hat
Component:	GRC & Policy
Sub Component:
Version:	rhacm-2.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	rhacm-2.3.3
Assignee:	Yu Cao
QA Contact:	Derek Ho
Docs Contact:	Mikela Dockery
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-09-09 13:14 UTC by Marius Cornea
Modified:	2021-10-19 22:07 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-06 21:53:22 UTC
Target Upstream Version:
Embargoed:
Flags:	ming: rhacm-2.3.z+

Attachments
policy-describe.log (16.94 KB, text/plain) 2021-09-09 13:14 UTC, Marius Cornea

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	open-cluster-management backlog issues 16002	0	None	None	None	2021-09-09 15:40:37 UTC

Description Marius Cornea 2021-09-09 13:14:49 UTC

Created attachment 1821810
policy-describe.log

Description of problem:

The compliance state is not updated after fixing the issue that initially prevented the policy from updating the managed object.

Initial policy:

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  annotations:
    policy.open-cluster-management.io/categories: CM Configuration Management
    policy.open-cluster-management.io/controls: CM-2 Baseline Configuration
    policy.open-cluster-management.io/standards: NIST SP 800-53
  creationTimestamp: "2021-09-09T12:16:20Z"
  generation: 3
  labels:
    policy.open-cluster-management.io/cluster-name: kni-qe-1
    policy.open-cluster-management.io/cluster-namespace: kni-qe-1
    policy.open-cluster-management.io/root-policy: kni-qe-1-policies.kni-qe-1-perfprofile-policy
  name: kni-qe-1-policies.kni-qe-1-perfprofile-policy
  namespace: kni-qe-1
  resourceVersion: "95012"
  uid: 4fe28ba1-a9c0-4ada-9732-2a4af22c0109
spec:
  disabled: false
  policy-templates:
  - objectDefinition:
      apiVersion: policy.open-cluster-management.io/v1
      kind: ConfigurationPolicy
      metadata:
        name: kni-qe-1-perfprofile-policy-config
      spec:
        namespaceselector:
          exclude:
          - kube-*
          include:
          - '*'
        object-templates:
        - complianceType: mustonlyhave
          objectDefinition:
            apiVersion: performance.openshift.io/v1
            kind: PerformanceProfile
            metadata:
              name: openshift-node-performance-profile
            spec:
              additionalKernelArgs:
              - idle=poll
              cpu:
                isolated: 2-23,25-47
                reserved: 0-1,24-25
              hugepages:
                defaultHugepagesSize: 1G
                pages:
                - count: 32
                  size: 1G
              machineConfigPoolSelector:
                pools.operator.machineconfiguration.openshift.io/master: ""
              net:
                userLevelNetworking: true
              nodeSelector:
                node-role.kubernetes.io/master: ""
              numa:
                topologyPolicy: restricted
              realTimeKernel:
                enabled: false
        remediationAction: enforce
        severity: low
  remediationAction: enforce


The policy cannot be applied because the reserved and isolated CPU sets overlap:

status:
  compliant: NonCompliant
  details:
  - compliant: NonCompliant
    history:
    - eventName: kni-qe-1-policies.kni-qe-1-perfprofile-policy.16a3282cbb0b2f52
      lastTimestamp: "2021-09-09T12:51:18Z"
      message: 'NonCompliant; violation - Error updating the object `openshift-node-performance-profile`,
        the error is `admission webhook "vwb.performance.openshift.io" denied the
        request: PerformanceProfile.performance.openshift.io "openshift-node-performance-profile"
        is invalid: spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc001364ad0),
        Isolated:(*v2.CPUSet)(0xc001364ac0), BalanceIsolated:(*bool)(nil)}: reserved
        and isolated cpus overlap: [25]`'


After fixing the policy so that the reserved and isolated CPUs no longer overlap in the PerformanceProfile, the policy remains in the NonCompliant state and does not reflect the actual status of the performance profile.
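
For reference, the overlap reported by the webhook can be reproduced from the two CPU list strings in the policy. This is only a minimal sketch: `parse_cpu_list` and `overlap` are hypothetical helpers written for this note, not part of the webhook or of RHACM, and the corrected `reserved` value in the second call is an assumed example, not the value actually used in the fix.

```python
def parse_cpu_list(spec: str) -> set:
    """Expand a CPU list such as "0-1,24-25" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range "a-b" is inclusive on both ends.
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def overlap(reserved: str, isolated: str) -> list:
    """Return the sorted CPU ids present in both lists."""
    return sorted(parse_cpu_list(reserved) & parse_cpu_list(isolated))


# Values taken from the policy in this report.
print(overlap("0-1,24-25", "2-23,25-47"))  # [25]

# Assumed corrected value for `reserved` (illustration only).
print(overlap("0-1,24", "2-23,25-47"))  # []
```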

Version-Release number of selected component (if applicable):
2.3.2-DOWNSTREAM-2021-08-26-01-04-22

How reproducible:
100%

Steps to Reproduce:
1. Create a policy that manages a PerformanceProfile object with overlapping reserved and isolated CPUs
2. Fix the user error in the policy configuration

Actual results:
Policy remains in NonCompliant state.

Expected results:
Policy should be in Compliant state, reflecting the actual performance profile configuration.

Additional info:

Attaching output of the policy describe.

After deleting and recreating the policy, it reports the Compliant state.

Comment 1 Mike Ng 2021-09-14 13:14:48 UTC

G2Bsync 918492362 comment 
 gparvin Mon, 13 Sep 2021 19:07:06 UTC 
 I was able to reproduce this issue and found what appears to be the basic root cause. I have assigned it to an engineer to take a deeper look into this problem. Thanks for reporting this to us!

Comment 2 Mike Ng 2021-09-15 17:54:04 UTC

G2Bsync 920072429 comment 
 gparvin Wed, 15 Sep 2021 14:29:57 UTC 
Although we have been able to reproduce this problem, it looks like the behavior of the vwb performance webhook is what causes it. Note the messages below:

```
Error updating the object `openshift-node-performance-profile`, the error is `admission webhook "vwb.performance.openshift.io" denied the request: PerformanceProfile.performance.openshift.io "openshift-node-performance-profile" is invalid: spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc000519fc0), Isolated:(*v2.CPUSet)(0xc000519fb0), BalanceIsolated:(*bool)(nil)}: reserved and isolated cpus overlap: [25]`
Error updating the object `openshift-node-performance-profile`, the error is `admission webhook "vwb.performance.openshift.io" denied the request: PerformanceProfile.performance.openshift.io "openshift-node-performance-profile" is invalid: spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc000510670), Isolated:(*v2.CPUSet)(0xc000510660), BalanceIsolated:(*bool)(nil)}: reserved and isolated cpus overlap: [25]`
```

Each time the configuration policy controller attempts to apply the changes, it gets back a different message, apparently because memory addresses are included in the message text. This prevents our controller from recognizing that it is really the same failure each time the configuration merge is attempted. As a result, our controller reports NonCompliance frequently enough that the events are merged as being correlated.
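
To illustrate why the two messages compare as different, and how a comparison that ignores the addresses would treat them as the same failure, here is a minimal sketch. It is not the controller's code: `normalize` is a hypothetical helper, and the `<addr>` placeholder is an arbitrary choice. The two strings are excerpts of the messages quoted above.

```python
import re

# Matches hexadecimal addresses such as 0xc000519fc0, which Go prints
# when a struct containing pointers is formatted into the message.
HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")


def normalize(message: str) -> str:
    """Replace every hex address with a fixed placeholder."""
    return HEX_ADDR.sub("<addr>", message)


first = (
    "spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc000519fc0), "
    "Isolated:(*v2.CPUSet)(0xc000519fb0), BalanceIsolated:(*bool)(nil)}: "
    "reserved and isolated cpus overlap: [25]"
)
second = (
    "spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc000510670), "
    "Isolated:(*v2.CPUSet)(0xc000510660), BalanceIsolated:(*bool)(nil)}: "
    "reserved and isolated cpus overlap: [25]"
)

print(first == second)                        # False
print(normalize(first) == normalize(second))  # True
```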

At this point we view this as an issue in the performance webhook. It is important to note, however, that the configuration failure is visible immediately: the policy becomes NonCompliant as soon as it is changed. To resolve the discrepancy between the policy and the actual state, you can "Disable" the policy and then re-enable it instead of deleting and recreating it, although deleting and recreating the policy also works.
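
For anyone who prefers the command line to the console, the disable and re-enable steps could be sketched as two merge patches on the `spec.disabled` field shown in the policy above. This is a sketch under assumptions: `toggle_commands` is a hypothetical helper that only builds the `kubectl patch` commands and does not run them, and the namespace and name passed in below are inferred from the `root-policy` label in this report and may not match the actual hub resources.

```python
import json
import shlex

# Fully qualified resource name for the Policy kind used in this report.
POLICY_RESOURCE = "policies.policy.open-cluster-management.io"


def toggle_commands(namespace: str, name: str) -> list:
    """Build the two kubectl commands: disable the policy, then re-enable it."""

    def patch(disabled: bool) -> list:
        body = json.dumps({"spec": {"disabled": disabled}})
        return [
            "kubectl", "patch", POLICY_RESOURCE, name,
            "-n", namespace, "--type", "merge", "-p", body,
        ]

    return [patch(True), patch(False)]


# Assumed root policy location (namespace and name inferred from the label).
for cmd in toggle_commands("kni-qe-1-policies", "kni-qe-1-perfprofile-policy"):
    print(shlex.join(cmd))
```

Run the first command, give the controller a moment to pick up the change, and then run the second one.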

Comment 5 Mike Ng 2021-10-07 01:50:00 UTC

G2Bsync 936584332 comment 
 willkutler Wed, 06 Oct 2021 16:11:38 UTC 
 Closing this issue as resolved by the workaround @gparvin posted, since we have not received any feedback over the last 3 weeks.
