Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1975358

Summary: Compliance Operator does not unpause pool during OCP Upgrade
Product: OpenShift Container Platform Reporter: Neil Girard <ngirard>
Component: Compliance OperatorAssignee: Juan Antonio Osorio <josorior>
Status: CLOSED ERRATA QA Contact: Prashant Dhamdhere <pdhamdhe>
Severity: high Docs Contact:
Priority: medium    
Version: 4.6.zCC: josorior, mrogers, xiyuan
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-07 06:05:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1990836    
Bug Blocks:    
Attachments:
Description Flags
Inspect of the openshift-compliance namespace none

Description Neil Girard 2021-06-23 13:43:40 UTC
Created attachment 1793483 [details]
Inspect of the openshift-compliance namespace

Description of problem:
During the upgrade of a cluster, the compliance operator is pausing the MCP but never unpausing the MCP causing the cluster upgrade to look like it has failed.

Checking the operator logs, you can see the operator pauses the MCP but is unable to unpause it:

2021-06-22T03:59:35.092967510Z {"level":"info","ts":1624334375.0929,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker"}
2021-06-22T03:59:35.392018016Z {"level":"info","ts":1624334375.391911,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"master"}
2021-06-22T04:00:02.297536328Z {"level":"info","ts":1624334402.2974684,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker"}


2021-06-22T04:04:15.899181939Z {"level":"error","ts":1624334655.899022,"logger":"suitectrl","msg":"Could not unpause pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).reconcileRemediations\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:498\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).Reconcile\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"}
2021-06-22T04:04:15.899394986Z {"level":"error","ts":1624334655.8992772,"logger":"suitectrl","msg":"Retriable error","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\ngithub.com/openshift/compliance-operator/pkg/controller/common.ReturnWithRetriableError\n\t/remote-source/app/pkg/controller/common/errors.go:101\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).Reconcile\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:167\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"}
2021-06-22T04:00:02.327702494Z {"level":"error","ts":1624334402.3276129,"logger":"controller","msg":"Reconciler error","controller":"compliancesuite-controller","name":"mcp-scansettingbinding-moderate","namespace":"openshift-compliance","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:237\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"}

The hopes of this ticket is to make the unpausing of the MCP more resilient. Also, what is interesting is that if the first MCP pool fails to unpause, the remaining pools are also pauses.  I do not see any failure messages for the master pool not being able to be unpaused in the log output.

Version-Release number of selected component (if applicable):
Customer was migrating from OCP 4.6.21 to 4.6.30.

Operator:
build-date: "2021-03-24T02:07:20.335664"
version: v4.6.0
release: 134

How reproducible:


Steps to Reproduce:
1.N/A

Actual results:
MCP is stuck in paused state after compliance operator pauses it

Expected results:
MCP is unpaused after the compliance operator pauses it

Additional info:
Customer has hit this a couple of times.  Seems unpausing the MCP manually works around the issue.  Not sure how to exactly force the recreate.

There is also a full must-gather of the cluster attached to the linked case.

Comment 5 xiyuan 2021-08-30 04:25:46 UTC
verification pass with 4.9.0-0.nightly-2021-08-29-010334 and compliance-operator.v0.1.39
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-29-010334   True        False         50m     Cluster version is 4.9.0-0.nightly-2021-08-29-010334

1. install compliance-operator.v0.1.35:
$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-8wj87   compliance-operator.v0.1.35   Automatic   true
$ oc get csv
NAME                              DISPLAY                            VERSION    REPLACES   PHASE
compliance-operator.v0.1.35       Compliance Operator                0.1.35                Succeeded
elasticsearch-operator.5.2.0-50   OpenShift Elasticsearch Operator   5.2.0-50              Succeeded
$ oc get pod
NAME                                              READY   STATUS    RESTARTS   AGE
compliance-operator-56f55446bc-fkd2c              1/1     Running   0          3m42s
ocp4-openshift-compliance-pp-947f5f866-wgl2p      1/1     Running   0          2m50s
rhcos4-openshift-compliance-pp-677ddc757d-2br2q   1/1     Running   0          2m50s

2. trigger scan:
$ oc create -f -<<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ScanSettingBinding
> metadata:
>   name: my-ssb-r
> profiles:
>   - name: ocp4-moderate
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
>   - name: rhcos4-moderate
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
> settingsRef:
>   name: default-auto-apply
>   kind: ScanSetting
>   apiGroup: compliance.openshift.io/v1alpha1
> EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created

3. when compliance is DONE and remediation is ongoing, trigger compliance operator upgrade:
$ oc get suite -w
NAME       PHASE     RESULT
my-ssb-r   DONE          NON-COMPLIANT

$ oc apply -f - << EOF
> apiVersion: operators.coreos.com/v1alpha1
> kind: CatalogSource
> metadata:
>   name: compliance-operator  
>   namespace: openshift-marketplace
> spec:
>   displayName: openshift-compliance-operator  
>   publisher: Red Hat
>   image: quay.io/openshift-qe-optional-operators/compliance-operator-index-0.1:v0.1.39-1
> EOF
catalogsource.operators.coreos.com/compliance-operator created
$ oc patch sub openshift-compliance-operator  -p '{"spec":{"source":"compliance-operator"}}' --type='merge'
subscription.operators.coreos.com/openshift-compliance-operator patched


4. check the result:
$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-ls7vd   compliance-operator.v0.1.39   Automatic   true
install-lwn9h   compliance-operator.v0.1.35   Automatic   true
$ oc get csv
NAME                              DISPLAY                            VERSION    REPLACES                      PHASE
compliance-operator.v0.1.39       Compliance Operator                0.1.39     compliance-operator.v0.1.35   Succeeded
elasticsearch-operator.5.2.0-50   OpenShift Elasticsearch Operator   5.2.0-50                                 Succeeded

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
compliance-operator-69d5d77c77-9vlrq            1/1     Running   0          8m19s
ocp4-openshift-compliance-pp-6d7c7db4bd-rxf46   1/1     Running   0          8m19s
rhcos4-openshift-compliance-pp-c7b548bd-xj4n8   1/1     Running   0          8m25s
$ oc logs pod/compliance-operator-69d5d77c77-9vlrq --all-containers  | grep -i "paus"
{"level":"info","ts":1630296178.7892225,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"master"}
{"level":"info","ts":1630296179.08657,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"worker"}
{"level":"info","ts":1630296211.3550758,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"master"}
{"level":"info","ts":1630296211.378187,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"worker"}

$ oc get machineconfigpools master -o yaml | grep "\spaused"
  paused: false
$ oc get machineconfigpools worker -o yaml | grep "\spaused"
  paused: false

Comment 7 errata-xmlrpc 2021-09-07 06:05:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Compliance Operator bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3214