Bug 1888073 - Operator controller continuously busy looping
Summary: Operator controller continuously busy looping
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Nick Hale
QA Contact: kuiwang
URL:
Whiteboard:
Depends On:
Blocks: 1889745
 
Reported: 2020-10-14 00:21 UTC by Ben Luddy
Modified: 2022-10-11 08:21 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:25:41 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub: operator-framework/operator-lifecycle-manager pull 1816 (closed), "Bug 1888073: prevent no-op hotlooping on Operators", last updated 2021-02-18 21:26:24 UTC
- Red Hat Product Errata: RHSA-2020:5633, last updated 2021-02-24 15:26:09 UTC

Internal Links: 1903034

Description Ben Luddy 2020-10-14 00:21:49 UTC
I installed a 4.6 cluster and installed the OCS and ACM operators. olm-operator now shows a consistent 1 CPU of actual usage. Looking at the logs, it is reconciling in an extremely tight loop; I see the following about 5-6 times per second for both OCS and ACM:

2020-10-13T13:44:47.147Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.195Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.338Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.400Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.487Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.588Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.653Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.734Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.819Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.874Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}

Comment 1 Seth Jennings 2020-10-14 02:56:07 UTC
Saw this as well. It is indicative of a racing change on the resource:

2020-10-14T02:54:25.816Z	DEBUG	controllers.operator	reconciling operator	{"request": "/advanced-cluster-management.open-cluster-management"}
2020-10-14T02:54:25.891Z	ERROR	controllers.operator	Could not update Operator status	{"request": "/advanced-cluster-management.open-cluster-management", "error": "Operation cannot be fulfilled on operators.operators.coreos.com \"advanced-cluster-management.open-cluster-management\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
	/build/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*OperatorReconciler).Reconcile
	/build/pkg/controller/operators/operator_controller.go:148
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
2020-10-14T02:54:25.892Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "operator", "request": "/advanced-cluster-management.open-cluster-management", "error": "Operation cannot be fulfilled on operators.operators.coreos.com \"advanced-cluster-management.open-cluster-management\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
	/build/vendor/github.com/go-logr/zapr/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
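
For reference, the standard client-go answer to this kind of write conflict is to re-read the object and retry the update. A minimal sketch follows; the helper name, the recompute callback, and the Operator type's import path are placeholders for illustration, not OLM's actual code.

package sketch

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	operatorsv1 "github.com/operator-framework/api/pkg/operators/v1"
)

// updateStatusWithRetry re-reads the Operator and retries the status write
// whenever the API server reports a conflict ("the object has been modified").
// recompute is whatever mutation the caller wants applied, e.g. refreshing
// status.components.refs. All names here are placeholders.
func updateStatusWithRetry(ctx context.Context, c client.Client, key client.ObjectKey, recompute func(*operatorsv1.Operator)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Fetch the current version so the write is applied against the
		// latest resourceVersion rather than a stale cached copy.
		in := &operatorsv1.Operator{}
		if err := c.Get(ctx, key, in); err != nil {
			return err
		}
		recompute(in)
		return c.Status().Update(ctx, in)
	})
}

Retrying only masks the conflict, though; it does not address the underlying churn that keeps generating the writes in the first place.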

Comment 2 Seth Jennings 2020-10-14 14:57:58 UTC
I confirmed that if I do 

oc get operators.operators.coreos.com -w

I see rapid changes to the operators in the tight reconciliation loop.

I also changed the apiserver audit profile to WriteRequestBodies and confirmed that all of the updates are coming from system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount with "userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format".

Looking at the audit log, it looks like there is constant churn on status.components.refs.
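
The fix that eventually landed upstream (pull 1816, "Bug 1888073: prevent no-op hotlooping on Operators", linked above) guards against exactly this kind of self-inflicted churn. The general shape of such a guard, shown here only as a sketch with placeholder names and an approximate import path rather than OLM's actual change, is to compare the freshly computed status against what is already on the cluster and skip the write when nothing changed:

package sketch

import (
	"context"
	"reflect"

	"sigs.k8s.io/controller-runtime/pkg/client"

	operatorsv1 "github.com/operator-framework/api/pkg/operators/v1"
)

// syncStatus writes the desired status only when it actually differs from
// what is already stored, so the controller does not keep requeuing itself
// over writes that change nothing.
func syncStatus(ctx context.Context, c client.Client, in *operatorsv1.Operator, desired operatorsv1.OperatorStatus) error {
	if reflect.DeepEqual(in.Status, desired) {
		// Nothing changed: skip the update so no new watch event is
		// generated for this Operator.
		return nil
	}
	out := in.DeepCopy()
	out.Status = desired
	return c.Status().Update(ctx, out)
}

With the no-op writes skipped, the controller's watch on operators.operators.coreos.com stops seeing a fresh update after every reconcile, so the work queue drains instead of refilling itself.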

Comment 3 Seth Jennings 2020-10-14 18:40:53 UTC
Confirmed that it is the change to the Operator resource itself that is triggering the Reconcile, using

https://github.com/operator-framework/operator-lifecycle-manager/pull/1810/files

If it were caused by any of the resources registered via Watches(), we would see the request being enqueued through mapComponentRequests (see the sketch after the links below):

https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/operator_controller.go#L69

https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/operator_controller.go#L218-L228
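
For context, the controller wires up two event sources (see the operator_controller.go links above): the Operator resource itself via For(), and its component resources via Watches() plus mapComponentRequests. Roughly, using the controller-runtime API of that era and placeholder types rather than the exact OLM source:

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"

	operatorsv1 "github.com/operator-framework/api/pkg/operators/v1"
)

// OperatorReconciler is a stand-in for OLM's reconciler.
type OperatorReconciler struct {
	client.Client
}

func (r *OperatorReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	// Reconciliation logic elided.
	return ctrl.Result{}, nil
}

// mapComponentRequests stands in for OLM's mapping function, which resolves
// a watched component object back to the Operator(s) that own it.
func (r *OperatorReconciler) mapComponentRequests(obj handler.MapObject) []reconcile.Request {
	return nil
}

func (r *OperatorReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Changes to the Operator resource itself are enqueued directly,
		// bypassing mapComponentRequests entirely...
		For(&operatorsv1.Operator{}).
		// ...while changes to watched component kinds (Pods used here as
		// one example) are mapped back to an Operator request.
		Watches(
			&source.Kind{Type: &corev1.Pod{}},
			&handler.EnqueueRequestsFromMapFunc{
				ToRequests: handler.ToRequestsFunc(r.mapComponentRequests),
			},
		).
		Complete(r)
}

Events from the For() source never pass through mapComponentRequests, which is consistent with the loop being driven by the Operator's own status updates.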

Comment 4 Seth Jennings 2020-10-15 18:38:18 UTC
This is pretty bad.  Raising severity to high.

Comment 6 kuiwang 2020-10-26 00:41:25 UTC
Verified it. LGTM

--
[root@preserve-olm-env 1888073]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-21-001511   True        False         6h4m    Cluster version is 4.7.0-0.nightly-2020-10-21-001511

[root@preserve-olm-env 1888073]# oc get pod -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-764f988744-qsrrf   1/1     Running   0          6h1m
olm-operator-5b867f9884-rbplb       1/1     Running   0          6h1m
packageserver-68c5c96db8-6cf8p      1/1     Running   0          5h59m
packageserver-68c5c96db8-8gf8w      1/1     Running   0          6h1m
[root@preserve-olm-env 1888073]# oc exec olm-operator-5b867f9884-rbplb -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.16.1
git commit: 8979865647c29e2541c8efd65162cdba1f242f70

[root@preserve-olm-env 1888073]# cat og-ocs.yaml 
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og-single
  namespace: openshift-storage
spec:
  targetNamespaces:
  - openshift-storage

[root@preserve-olm-env 1888073]# oc apply -f og-ocs.yaml 
operatorgroup.operators.coreos.com/og-single created

[root@preserve-olm-env 1888073]# cat sub-ocs.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ocs-operator
  namespace: openshift-storage
spec:
  channel: stable-4.5
  installPlanApproval: Automatic
  name: ocs-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: ocs-operator.v4.5.1

[root@preserve-olm-env 1888073]# oc apply -f sub-ocs.yaml 
subscription.operators.coreos.com/ocs-operator created

[root@preserve-olm-env 1888073]# oc get csv -n openshift-storage
NAME                  DISPLAY                       VERSION   REPLACES   PHASE
ocs-operator.v4.5.1   OpenShift Container Storage   4.5.1                Succeeded

[root@preserve-olm-env 1888073]# cat og-acm.yaml 
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: og-single
  namespace: open-cluster-management
spec:
  targetNamespaces:
  - open-cluster-management

[root@preserve-olm-env 1888073]# oc apply -f og-acm.yaml 
operatorgroup.operators.coreos.com/og-single created

[root@preserve-olm-env 1888073]# cat sub-acm.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: advanced-cluster-management
  namespace: open-cluster-management
spec:
  channel: release-2.0
  installPlanApproval: Automatic
  name: advanced-cluster-management
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: advanced-cluster-management.v2.0.3


[root@preserve-olm-env 1888073]# oc apply -f sub-acm.yaml 
subscription.operators.coreos.com/advanced-cluster-management created

[root@preserve-olm-env 1888073]# oc get csv -n open-cluster-management
NAME                                 DISPLAY                                      VERSION   REPLACES                             PHASE
advanced-cluster-management.v2.0.3   Advanced Cluster Management for Kubernetes   2.0.3     advanced-cluster-management.v2.0.2   Succeeded

[root@preserve-olm-env 1888073]# oc logs --since=5s olm-operator-5b867f9884-rbplb -n openshift-operator-lifecycle-manager|grep DEBUG|grep "ocs-operator.openshift-storage"|grep "reconciling operator"
2020-10-22T05:45:36.629Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:36.676Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:38.135Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:38.215Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:39.622Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:39.695Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:41.137Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:41.201Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}

[root@preserve-olm-env 1888073]# oc logs --since=180s olm-operator-5b867f9884-rbplb -n openshift-operator-lifecycle-manager|grep ERROR|grep "advanced-cluster-management.open-cluster-management"
[root@preserve-olm-env 1888073]# 

--

Comment 19 Tereza 2021-01-20 17:08:32 UTC
*** Bug 1917537 has been marked as a duplicate of this bug. ***

Comment 22 errata-xmlrpc 2021-02-24 15:25:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

