1975358 – Compliance Operator does not unpause pool during OCP Upgrade

Bug 1975358 - Compliance Operator does not unpause pool during OCP Upgrade

Summary: Compliance Operator does not unpause pool during OCP Upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Compliance Operator
Sub Component:
Version:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Juan Antonio Osorio
QA Contact:	Prashant Dhamdhere
Docs Contact:
URL:
Whiteboard:
Depends On:	1990836
Blocks:
TreeView+	depends on / blocked

Reported:	2021-06-23 13:43 UTC by Neil Girard
Modified:	2024-12-20 20:18 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-09-07 06:05:14 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Inspect of the openshift-compliance namespace (2.65 MB, application/x-tar) 2021-06-23 13:43 UTC, Neil Girard	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift compliance-operator pull 660	0	None	open	Bug 1975358: Refresh pool reference before trying to unpause it	2021-06-23 15:12:40 UTC
Red Hat Product Errata	RHBA-2021:3214	0	None	None	None	2021-09-07 06:05:28 UTC

Description Neil Girard 2021-06-23 13:43:40 UTC

Created attachment 1793483 [details]
Inspect of the openshift-compliance namespace

Description of problem:
During the upgrade of a cluster, the compliance operator is pausing the MCP but never unpausing the MCP causing the cluster upgrade to look like it has failed.

Checking the operator logs, you can see the operator pauses the MCP but is unable to unpause it:

2021-06-22T03:59:35.092967510Z {"level":"info","ts":1624334375.0929,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker"}
2021-06-22T03:59:35.392018016Z {"level":"info","ts":1624334375.391911,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"master"}
2021-06-22T04:00:02.297536328Z {"level":"info","ts":1624334402.2974684,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker"}


2021-06-22T04:04:15.899181939Z {"level":"error","ts":1624334655.899022,"logger":"suitectrl","msg":"Could not unpause pool","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","MachineConfigPool.Name":"worker","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).reconcileRemediations\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:498\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).Reconcile\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"}
2021-06-22T04:04:15.899394986Z {"level":"error","ts":1624334655.8992772,"logger":"suitectrl","msg":"Retriable error","Request.Namespace":"openshift-compliance","Request.Name":"mcp-scansettingbinding-moderate","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\ngithub.com/openshift/compliance-operator/pkg/controller/common.ReturnWithRetriableError\n\t/remote-source/app/pkg/controller/common/errors.go:101\ngithub.com/openshift/compliance-operator/pkg/controller/compliancesuite.(*ReconcileComplianceSuite).Reconcile\n\t/remote-source/app/pkg/controller/compliancesuite/compliancesuite_controller.go:167\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"}
2021-06-22T04:00:02.327702494Z {"level":"error","ts":1624334402.3276129,"logger":"controller","msg":"Reconciler error","controller":"compliancesuite-controller","name":"mcp-scansettingbinding-moderate","namespace":"openshift-compliance","error":"Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io \"worker\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/remote-source/deps/gomod/pkg/mod/github.com/go-logr/zapr.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:237\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:209\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.2/pkg/internal/controller/controller.go:188\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/remote-source/deps/gomod/pkg/mod/k8s.io/apimachinery.5/pkg/util/wait/wait.go:90"}

The hopes of this ticket is to make the unpausing of the MCP more resilient. Also, what is interesting is that if the first MCP pool fails to unpause, the remaining pools are also pauses.  I do not see any failure messages for the master pool not being able to be unpaused in the log output.

Version-Release number of selected component (if applicable):
Customer was migrating from OCP 4.6.21 to 4.6.30.

Operator:
build-date: "2021-03-24T02:07:20.335664"
version: v4.6.0
release: 134

How reproducible:


Steps to Reproduce:
1.N/A

Actual results:
MCP is stuck in paused state after compliance operator pauses it

Expected results:
MCP is unpaused after the compliance operator pauses it

Additional info:
Customer has hit this a couple of times.  Seems unpausing the MCP manually works around the issue.  Not sure how to exactly force the recreate.

There is also a full must-gather of the cluster attached to the linked case.

Comment 5 xiyuan 2021-08-30 04:25:46 UTC

verification pass with 4.9.0-0.nightly-2021-08-29-010334 and compliance-operator.v0.1.39
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-29-010334   True        False         50m     Cluster version is 4.9.0-0.nightly-2021-08-29-010334

1. install compliance-operator.v0.1.35:
$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-8wj87   compliance-operator.v0.1.35   Automatic   true
$ oc get csv
NAME                              DISPLAY                            VERSION    REPLACES   PHASE
compliance-operator.v0.1.35       Compliance Operator                0.1.35                Succeeded
elasticsearch-operator.5.2.0-50   OpenShift Elasticsearch Operator   5.2.0-50              Succeeded
$ oc get pod
NAME                                              READY   STATUS    RESTARTS   AGE
compliance-operator-56f55446bc-fkd2c              1/1     Running   0          3m42s
ocp4-openshift-compliance-pp-947f5f866-wgl2p      1/1     Running   0          2m50s
rhcos4-openshift-compliance-pp-677ddc757d-2br2q   1/1     Running   0          2m50s

2. trigger scan:
$ oc create -f -<<EOF
> apiVersion: compliance.openshift.io/v1alpha1
> kind: ScanSettingBinding
> metadata:
>   name: my-ssb-r
> profiles:
>   - name: ocp4-moderate
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
>   - name: rhcos4-moderate
>     kind: Profile
>     apiGroup: compliance.openshift.io/v1alpha1
> settingsRef:
>   name: default-auto-apply
>   kind: ScanSetting
>   apiGroup: compliance.openshift.io/v1alpha1
> EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created

3. when compliance is DONE and remediation is ongoing, trigger compliance operator upgrade:
$ oc get suite -w
NAME       PHASE     RESULT
my-ssb-r   DONE          NON-COMPLIANT

$ oc apply -f - << EOF
> apiVersion: operators.coreos.com/v1alpha1
> kind: CatalogSource
> metadata:
>   name: compliance-operator  
>   namespace: openshift-marketplace
> spec:
>   displayName: openshift-compliance-operator  
>   publisher: Red Hat
>   image: quay.io/openshift-qe-optional-operators/compliance-operator-index-0.1:v0.1.39-1
> EOF
catalogsource.operators.coreos.com/compliance-operator created
$ oc patch sub openshift-compliance-operator  -p '{"spec":{"source":"compliance-operator"}}' --type='merge'
subscription.operators.coreos.com/openshift-compliance-operator patched


4. check the result:
$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-ls7vd   compliance-operator.v0.1.39   Automatic   true
install-lwn9h   compliance-operator.v0.1.35   Automatic   true
$ oc get csv
NAME                              DISPLAY                            VERSION    REPLACES                      PHASE
compliance-operator.v0.1.39       Compliance Operator                0.1.39     compliance-operator.v0.1.35   Succeeded
elasticsearch-operator.5.2.0-50   OpenShift Elasticsearch Operator   5.2.0-50                                 Succeeded

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
compliance-operator-69d5d77c77-9vlrq            1/1     Running   0          8m19s
ocp4-openshift-compliance-pp-6d7c7db4bd-rxf46   1/1     Running   0          8m19s
rhcos4-openshift-compliance-pp-c7b548bd-xj4n8   1/1     Running   0          8m25s
$ oc logs pod/compliance-operator-69d5d77c77-9vlrq --all-containers  | grep -i "paus"
{"level":"info","ts":1630296178.7892225,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"master"}
{"level":"info","ts":1630296179.08657,"logger":"suitectrl","msg":"Pausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"worker"}
{"level":"info","ts":1630296211.3550758,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"master"}
{"level":"info","ts":1630296211.378187,"logger":"suitectrl","msg":"Unpausing pool","Request.Namespace":"openshift-compliance","Request.Name":"my-ssb-r","MachineConfigPool.Name":"worker"}

$ oc get machineconfigpools master -o yaml | grep "\spaused"
  paused: false
$ oc get machineconfigpools worker -o yaml | grep "\spaused"
  paused: false

Comment 7 errata-xmlrpc 2021-09-07 06:05:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Compliance Operator bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3214

Note You need to log in before you can comment on or make changes to this bug.