Created attachment 1821810 [details]
policy-describe.log

Description of problem:

The compliance state does not get updated after fixing the issue that initially prevented the policy from updating the managed object.

Initial policy:

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  annotations:
    policy.open-cluster-management.io/categories: CM Configuration Management
    policy.open-cluster-management.io/controls: CM-2 Baseline Configuration
    policy.open-cluster-management.io/standards: NIST SP 800-53
  creationTimestamp: "2021-09-09T12:16:20Z"
  generation: 3
  labels:
    policy.open-cluster-management.io/cluster-name: kni-qe-1
    policy.open-cluster-management.io/cluster-namespace: kni-qe-1
    policy.open-cluster-management.io/root-policy: kni-qe-1-policies.kni-qe-1-perfprofile-policy
  name: kni-qe-1-policies.kni-qe-1-perfprofile-policy
  namespace: kni-qe-1
  resourceVersion: "95012"
  uid: 4fe28ba1-a9c0-4ada-9732-2a4af22c0109
spec:
  disabled: false
  policy-templates:
  - objectDefinition:
      apiVersion: policy.open-cluster-management.io/v1
      kind: ConfigurationPolicy
      metadata:
        name: kni-qe-1-perfprofile-policy-config
      spec:
        namespaceselector:
          exclude:
          - kube-*
          include:
          - '*'
        object-templates:
        - complianceType: mustonlyhave
          objectDefinition:
            apiVersion: performance.openshift.io/v1
            kind: PerformanceProfile
            metadata:
              name: openshift-node-performance-profile
            spec:
              additionalKernelArgs:
              - idle=poll
              cpu:
                isolated: 2-23,25-47
                reserved: 0-1,24-25
              hugepages:
                defaultHugepagesSize: 1G
                pages:
                - count: 32
                  size: 1G
              machineConfigPoolSelector:
                pools.operator.machineconfiguration.openshift.io/master: ""
              net:
                userLevelNetworking: true
              nodeSelector:
                node-role.kubernetes.io/master: ""
              numa:
                topologyPolicy: restricted
              realTimeKernel:
                enabled: false
        remediationAction: enforce
        severity: low
  remediationAction: enforce

The policy cannot be applied because the reserved and isolated CPU sets overlap:

status:
  compliant: NonCompliant
  details:
  - compliant: NonCompliant
    history:
    - eventName: kni-qe-1-policies.kni-qe-1-perfprofile-policy.16a3282cbb0b2f52
      lastTimestamp: "2021-09-09T12:51:18Z"
      message: 'NonCompliant; violation - Error updating the object `openshift-node-performance-profile`, the error is `admission webhook "vwb.performance.openshift.io" denied the request: PerformanceProfile.performance.openshift.io "openshift-node-performance-profile" is invalid: spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc001364ad0), Isolated:(*v2.CPUSet)(0xc001364ac0), BalanceIsolated:(*bool)(nil)}: reserved and isolated cpus overlap: [25]`'

After fixing the policy so that the reserved and isolated CPUs no longer overlap in the performanceprofile, the policy still remains in NonCompliant state and does not reflect the actual status of the performance profile.

Version-Release number of selected component (if applicable):
2.3.2-DOWNSTREAM-2021-08-26-01-04-22

How reproducible:
100%

Steps to Reproduce:
1. Create a policy that manages a performanceprofile object with overlapping reserved and isolated CPUs
2. Fix the user error in the policy configuration

Actual results:
Policy remains in NonCompliant state.

Expected results:
Policy should be in Compliant state, reflecting the actual performance profile configuration.

Additional info:
Attaching output of the policy describe. After deleting/recreating the policy it reports Compliant state.
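
The overlap reported by the webhook can be reproduced by hand from the `cpu` values in the policy. The following is a minimal Python sketch, not the webhook's actual validation code: the helper names `parse_cpuset` and `overlap` are hypothetical, and the corrected value `0-1,24` is only an assumed example of a fix, since the report does not state which values were used.

```python
def parse_cpuset(spec: str) -> set:
    """Expand a cpuset string such as "0-1,24-25" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "24-25" is inclusive on both ends.
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def overlap(reserved: str, isolated: str) -> list:
    """Return the sorted CPU ids present in both cpuset strings."""
    return sorted(parse_cpuset(reserved) & parse_cpuset(isolated))


# Values from the policy in this report: CPU 25 is in both sets.
print(overlap("0-1,24-25", "2-23,25-47"))  # [25]

# Hypothetical corrected reserved set (an assumption, not from the report).
print(overlap("0-1,24", "2-23,25-47"))  # []
```

Running a check like this before applying the policy would catch the user error without waiting for the admission webhook to reject the update.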
G2Bsync 918492362 comment gparvin Mon, 13 Sep 2021 19:07:06 UTC G2Bsync I was able to recreate this issue and discovered what appears to be the basic root cause. I have assigned it to an engineer to take a deeper look into this problem. Thanks for reporting this to us!
G2Bsync 920072429 comment gparvin Wed, 15 Sep 2021 14:29:57 UTC G2Bsync Although we have been able to recreate this problem, it looks like the behavior of the vwb performance webhook is causing this problem. Note the messages below:

```
Error updating the object `openshift-node-performance-profile`, the error is `admission webhook "vwb.performance.openshift.io" denied the request: PerformanceProfile.performance.openshift.io "openshift-node-performance-profile" is invalid: spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc000519fc0), Isolated:(*v2.CPUSet)(0xc000519fb0), BalanceIsolated:(*bool)(nil)}: reserved and isolated cpus overlap: [25]`
Error updating the object `openshift-node-performance-profile`, the error is `admission webhook "vwb.performance.openshift.io" denied the request: PerformanceProfile.performance.openshift.io "openshift-node-performance-profile" is invalid: spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc000510670), Isolated:(*v2.CPUSet)(0xc000510660), BalanceIsolated:(*bool)(nil)}: reserved and isolated cpus overlap: [25]`
```

Each time the configuration policy controller attempts to apply changes, it is getting back a different message due to what appears to be addresses being included in the message. This prevents our controller from understanding it's really the same failure each time the configuration merge is attempted. This results in our controller alerting that there's a NonCompliance frequently enough to cause events to be merged as being correlated.

At this point in time we are viewing this problem as an issue in the perf webhook, but it's important to note that you can recognize the configuration failure immediately when the policy is changed and becomes NonCompliant. Instead of deleting and recreating the policy to resolve the discrepancy between the policy and the actual state, you can "Disable" the policy and then re-enable it. That avoids having to delete and recreate the policy, which of course also works to resolve the issue.
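
To illustrate why the two messages quoted in this comment compare as different even though they describe the same failure, here is a minimal Python sketch. It is not the configuration policy controller's code and not a fix that was applied; the names `POINTER` and `normalize` are hypothetical, and masking the hex addresses is only one assumed way a consumer of these messages could compare them.

```python
import re

# Matches a parenthesised Go pointer value such as "(0xc000519fc0)".
POINTER = re.compile(r"\(0x[0-9a-fA-F]+\)")


def normalize(message: str) -> str:
    """Replace pointer addresses with a fixed placeholder."""
    return POINTER.sub("(0xADDR)", message)


# The varying part of the two webhook messages quoted above.
first = (
    "spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc000519fc0), "
    "Isolated:(*v2.CPUSet)(0xc000519fb0), BalanceIsolated:(*bool)(nil)}: "
    "reserved and isolated cpus overlap: [25]"
)
second = (
    "spec.cpu: Invalid value: v2.CPU{Reserved:(*v2.CPUSet)(0xc000510670), "
    "Isolated:(*v2.CPUSet)(0xc000510660), BalanceIsolated:(*bool)(nil)}: "
    "reserved and isolated cpus overlap: [25]"
)

print(first == second)                        # False
print(normalize(first) == normalize(second))  # True
```

A raw string comparison treats every retry as a new violation, while the masked strings are identical, which matches the behavior described in the comment.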
G2Bsync 936584332 comment willkutler Wed, 06 Oct 2021 16:11:38 UTC G2Bsync Closing this issue as resolved by the workaround @gparvin posted, since we have not received any feedback over the last 3 weeks.