Bug 1889745 - Operator controller continuously busy looping
Summary: Operator controller continuously busy looping
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.5
Hardware: Unspecified
OS: Unspecified
medium
high
Target Milestone: ---
: 4.6.z
Assignee: Nick Hale
QA Contact: kuiwang
URL:
Whiteboard:
Depends On: 1888073
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-10-20 13:45 UTC by OpenShift BugZilla Robot
Modified: 2020-11-16 14:38 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-16 14:37:43 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 1852 None closed [release-4.6] Bug 1889745: fix(operators): prevent hotloop 2020-11-23 11:57:04 UTC
Red Hat Product Errata RHBA-2020:4987 None None None 2020-11-16 14:38:00 UTC

Description OpenShift BugZilla Robot 2020-10-20 13:45:16 UTC
+++ This bug was initially created as a clone of Bug #1888073 +++

i installed a 4.6 cluster and installed OCS and ACM operators.  olm-operator now uses a consistent 1 cpu of actual usage.  when i go into the logs, it is reconciling in a super tight loop.  i see this about 5-6 per second for both OCS and ACM

2020-10-13T13:44:47.147Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.195Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.338Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.400Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.487Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.588Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.653Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.734Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.819Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.874Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}

--- Additional comment from sjenning@redhat.com on 2020-10-14 02:56:07 UTC ---

Saw this as well.  Indicative of a racing change on the resource:

2020-10-14T02:54:25.816Z	DEBUG	controllers.operator	reconciling operator	{"request": "/advanced-cluster-management.open-cluster-management"}
2020-10-14T02:54:25.891Z	ERROR	controllers.operator	Could not update Operator status	{"request": "/advanced-cluster-management.open-cluster-management", "error": "Operation cannot be fulfilled on operators.operators.coreos.com \"advanced-cluster-management.open-cluster-management\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
	/build/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*OperatorReconciler).Reconcile
	/build/pkg/controller/operators/operator_controller.go:148
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
2020-10-14T02:54:25.892Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "operator", "request": "/advanced-cluster-management.open-cluster-management", "error": "Operation cannot be fulfilled on operators.operators.coreos.com \"advanced-cluster-management.open-cluster-management\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
	/build/vendor/github.com/go-logr/zapr/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90

--- Additional comment from sjenning@redhat.com on 2020-10-14 14:57:58 UTC ---

I confirmed that if I do 

oc get operators.operators.coreos.com -w

I see rapid changes to the operators in the tight reconciliation loop.

I also change the apiserver audit profile to WriteRequestBodies and confirmed that all the updates are coming from the system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount and the "userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format"

Looking at the audit log, it looks like there is constant churn on status.components.refs

--- Additional comment from sjenning@redhat.com on 2020-10-14 18:40:53 UTC ---

Confirmed that is it the change to the operator itself that is triggering the Reconcile using

https://github.com/operator-framework/operator-lifecycle-manager/pull/1810/files

If it was caused by any of the Watchesif resources, we would see it being enqueued in the requests in mapComponentRequests

https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/operator_controller.go#L69

https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/operator_controller.go#L218-L228

--- Additional comment from sjenning@redhat.com on 2020-10-15 18:38:18 UTC ---

This is pretty bad.  Raising severity to high.

Comment 4 kuiwang 2020-11-06 02:57:48 UTC
verify it on 4.6. LGTM

--
[root@preserve-olm-env 1889745]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-11-05-171833   True        False         22m     Cluster version is 4.6.0-0.nightly-2020-11-05-171833
[root@preserve-olm-env 1889745]# oc get pod -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-67745dcff4-wxq7p   1/1     Running   0          15m
olm-operator-6f9b978c56-d9bxt       1/1     Running   0          15m
packageserver-7f9c7f74b-4cbn9       1/1     Running   0          15m
packageserver-7f9c7f74b-vvvx7       1/1     Running   0          18m

[root@preserve-olm-env 1889745]# oc exec olm-operator-6f9b978c56-d9bxt -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.16.1
git commit: aba96a92b6e17452ed239598c1d266ba8d6a1b51

[root@preserve-olm-env 1889745]# oc new-project test-storage
Now using project "test-storage" on server "https://api.kui110625.qe.devcluster.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname


[root@preserve-olm-env 1889745]# cat og-ocs.yaml 
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og-single
  namespace: test-storage
spec:
  targetNamespaces:
  - test-storage

[root@preserve-olm-env 1889745]# oc apply -f og-ocs.yaml 
operatorgroup.operators.coreos.com/og-single created

[root@preserve-olm-env 1889745]# cat sub-ocs.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ocs-operator
  namespace: test-storage
spec:
  channel: stable-4.5
  installPlanApproval: Automatic
  name: ocs-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: ocs-operator.v4.5.1


[root@preserve-olm-env 1889745]# oc apply -f sub-ocs.yaml 
subscription.operators.coreos.com/ocs-operator created


[root@preserve-olm-env 1889745]# oc get csv
NAME                  DISPLAY                       VERSION   REPLACES   PHASE
ocs-operator.v4.5.1   OpenShift Container Storage   4.5.1                Installing
[root@preserve-olm-env 1889745]# oc get csv
NAME                  DISPLAY                       VERSION   REPLACES   PHASE
ocs-operator.v4.5.1   OpenShift Container Storage   4.5.1                Succeeded

[root@preserve-olm-env 1889745]# oc new-project open-cluster-management
Now using project "open-cluster-management" on server "https://api.kui110625.qe.devcluster.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

[root@preserve-olm-env 1889745]# cat og-acm.yaml 
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og-single
  namespace: open-cluster-management
spec:
  targetNamespaces:
  - open-cluster-management


[root@preserve-olm-env 1889745]# oc apply -f og-acm.yaml 
operatorgroup.operators.coreos.com/og-single created
[root@preserve-olm-env 1889745]# cat sub-acm.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: advanced-cluster-management
  namespace: open-cluster-management
spec:
  channel: release-2.0
  installPlanApproval: Automatic
  name: advanced-cluster-management
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: advanced-cluster-management.v2.0.3

[root@preserve-olm-env 1889745]# oc apply -f sub-acm.yaml 
subscription.operators.coreos.com/advanced-cluster-management created

[root@preserve-olm-env 1889745]# oc get csv -n open-cluster-management
NAME                                 DISPLAY                                      VERSION   REPLACES                             PHASE
advanced-cluster-management.v2.0.3   Advanced Cluster Management for Kubernetes   2.0.3     advanced-cluster-management.v2.0.2   Installing
advanced-cluster-management.v2.0.4   Advanced Cluster Management for Kubernetes   2.0.4     advanced-cluster-management.v2.0.3   

[root@preserve-olm-env 1889745]# oc get csv -n open-cluster-management
NAME                                 DISPLAY                                      VERSION   REPLACES                             PHASE
advanced-cluster-management.v2.0.3   Advanced Cluster Management for Kubernetes   2.0.3     advanced-cluster-management.v2.0.2   Replacing
advanced-cluster-management.v2.0.4   Advanced Cluster Management for Kubernetes   2.0.4     advanced-cluster-management.v2.0.3   Installing
[root@preserve-olm-env 1889745]# oc get csv -n open-cluster-management
NAME                                 DISPLAY                                      VERSION   REPLACES                             PHASE
advanced-cluster-management.v2.0.4   Advanced Cluster Management for Kubernetes   2.0.4     advanced-cluster-management.v2.0.3   Succeeded

[root@preserve-olm-env 1889745]# oc logs --since=5s olm-operator-6f9b978c56-d9bxt -n openshift-operator-lifecycle-manager|grep DEBUG|grep "ocs-operator.test-storage"|grep "reconciling operator"
2020-11-06T02:50:23.161Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.test-storage"}
2020-11-06T02:50:23.210Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.test-storage"}
2020-11-06T02:50:25.325Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.test-storage"}
2020-11-06T02:50:25.423Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.test-storage"}
2020-11-06T02:50:27.299Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.test-storage"}
2020-11-06T02:50:27.395Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.test-storage"}

[root@preserve-olm-env 1889745]# oc logs --since=180s olm-operator-6f9b978c56-d9bxt -n openshift-operator-lifecycle-manager|grep ERROR|grep "advanced-cluster-management.open-cluster-management"
[root@preserve-olm-env 1889745]#

Comment 7 errata-xmlrpc 2020-11-16 14:37:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4987


Note You need to log in before you can comment on or make changes to this bug.