Created attachment 1793483 [details] Inspect of the openshift-compliance namespace Description of problem: During the upgrade of a cluster, the compliance operator is pausing the MCP but never unpausing the MCP causing the cluster upgrade to look like it has failed. Checking the operator logs, you can see the operator pauses the MCP but is unable to unpause it: 2021-06-22T03:59:35.092967510Z {"level":"info","ts":1624334375.0929,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker"} 2021-06-22T03:59:35.392018016Z {"level":"info","ts":1624334375.391911,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"master"} 2021-06-22T04:00:02.297536328Z {"level":"info","ts":1624334402.2974684,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker"} 2021-06-22T04:04:15.899181939Z {"level":"error","ts":1624334655.899022,"logger":"suitectrl","msg":"Could not unpause pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).reconcileRemediations\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:498\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).Reconcile\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"} 2021-06-22T04:04:15.899394986Z {"level":"error","ts":1624334655.8992772,"logger":"suitectrl","msg":"Retriable error","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\ngithub.com/openshift/compliance-operator/pkg/controller/common.ReturnWithRetriableError\n\t/remote-source/app/pkg/controller/common/errors.go:101\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).Reconcile\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:167\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"} 2021-06-22T04:00:02.327702494Z {"level":"error","ts":1624334402.3276129,"logger":"controller","msg":"Reconciler error","controller":"compliancesuite-controller","name":"mcp-scansettingbinding-moderate","namespace":"openshift-compliance","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:237\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"} The hopes of this ticket is to make the unpausing of the MCP more resilient. Also, what is interesting is that if the first MCP pool fails to unpause, the remaining pools are also pauses. I do not see any failure messages for the master pool not being able to be unpaused in the log output. Version-Release number of selected component (if applicable): Customer was migrating from OCP 4.6.21 to 4.6.30. Operator: build-date: "2021-03-24T02:07:20.335664" version: v4.6.0 release: 134 How reproducible: Steps to Reproduce: 1.N/A Actual results: MCP is stuck in paused state after compliance operator pauses it Expected results: MCP is unpaused after the compliance operator pauses it Additional info: Customer has hit this a couple of times. Seems unpausing the MCP manually works around the issue. Not sure how to exactly force the recreate. There is also a full must-gather of the cluster attached to the linked case.
verification pass with 4.9.0-0.nightly-2021-08-29-010334 and compliance-operator.v0.1.39 $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.9.0-0.nightly-2021-08-29-010334 True False 50m Cluster version is 4.9.0-0.nightly-2021-08-29-010334 1. install compliance-operator.v0.1.35: $ oc get ip NAME CSV APPROVAL APPROVED install-8wj87 compliance-operator.v0.1.35 Automatic true $ oc get csv NAME DISPLAY VERSION REPLACES PHASE compliance-operator.v0.1.35 Compliance Operator 0.1.35 Succeeded elasticsearch-operator.5.2.0-50 OpenShift Elasticsearch Operator 5.2.0-50 Succeeded $ oc get pod NAME READY STATUS RESTARTS AGE compliance-operator-56f55446bc-fkd2c 1/1 Running 0 3m42s ocp4-openshift-compliance-pp-947f5f866-wgl2p 1/1 Running 0 2m50s rhcos4-openshift-compliance-pp-677ddc757d-2br2q 1/1 Running 0 2m50s 2. trigger scan: $ oc create -f -<<EOF > apiVersion: compliance.openshift.io/v1alpha1 > kind: ScanSettingBinding > metadata: > name: my-ssb-r > profiles: > - name: ocp4-moderate > kind: Profile > apiGroup: compliance.openshift.io/v1alpha1 > - name: rhcos4-moderate > kind: Profile > apiGroup: compliance.openshift.io/v1alpha1 > settingsRef: > name: default-auto-apply > kind: ScanSetting > apiGroup: compliance.openshift.io/v1alpha1 > EOF scansettingbinding.compliance.openshift.io/my-ssb-r created 3. when compliance is DONE and remediation is ongoing, trigger compliance operator upgrade: $ oc get suite -w NAME PHASE RESULT my-ssb-r DONE NON-COMPLIANT $ oc apply -f - << EOF > apiVersion: operators.coreos.com/v1alpha1 > kind: CatalogSource > metadata: > name: compliance-operator > namespace: openshift-marketplace > spec: > displayName: openshift-compliance-operator > publisher: Red Hat > image: quay.io/openshift-qe-optional-operators/compliance-operator-index-0.1:v0.1.39-1 > EOF catalogsource.operators.coreos.com/compliance-operator created $ oc patch sub openshift-compliance-operator -p '{"spec":{"source":"compliance-operator"}}' --type='merge' subscription.operators.coreos.com/openshift-compliance-operator patched 4. check the result: $ oc get ip NAME CSV APPROVAL APPROVED install-ls7vd compliance-operator.v0.1.39 Automatic true install-lwn9h compliance-operator.v0.1.35 Automatic true $ oc get csv NAME DISPLAY VERSION REPLACES PHASE compliance-operator.v0.1.39 Compliance Operator 0.1.39 compliance-operator.v0.1.35 Succeeded elasticsearch-operator.5.2.0-50 OpenShift Elasticsearch Operator 5.2.0-50 Succeeded $ oc get pod NAME READY STATUS RESTARTS AGE compliance-operator-69d5d77c77-9vlrq 1/1 Running 0 8m19s ocp4-openshift-compliance-pp-6d7c7db4bd-rxf46 1/1 Running 0 8m19s rhcos4-openshift-compliance-pp-c7b548bd-xj4n8 1/1 Running 0 8m25s $ oc logs pod/compliance-operator-69d5d77c77-9vlrq --all-containers | grep -i "paus" {"level":"info","ts":1630296178.7892225,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"master"} {"level":"info","ts":1630296179.08657,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"worker"} {"level":"info","ts":1630296211.3550758,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"master"} {"level":"info","ts":1630296211.378187,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"worker"} $ oc get machineconfigpools master -o yaml | grep "\spaused" paused: false $ oc get machineconfigpools worker -o yaml | grep "\spaused" paused: false
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Compliance Operator bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3214