Bug 1888073
| Summary: | Operator controller continuously busy looping | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ben Luddy <bluddy> |
| Component: | OLM | Assignee: | Nick Hale <nhale> |
| OLM sub component: | OLM | QA Contact: | kuiwang |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | cmoore, fmarting, jbeakley, krizza, nhale, rbobek, sjenning, tgabriel, vkochuku, xzha |
| Version: | 4.5 | Keywords: | UpcomingSprint |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:25:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1889745 | | |
Saw this as well. Indicative of a racing change on the resource:
2020-10-14T02:54:25.816Z DEBUG controllers.operator reconciling operator {"request": "/advanced-cluster-management.open-cluster-management"}
2020-10-14T02:54:25.891Z ERROR controllers.operator Could not update Operator status {"request": "/advanced-cluster-management.open-cluster-management", "error": "Operation cannot be fulfilled on operators.operators.coreos.com \"advanced-cluster-management.open-cluster-management\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/build/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators.(*OperatorReconciler).Reconcile
/build/pkg/controller/operators/operator_controller.go:148
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
2020-10-14T02:54:25.892Z ERROR controller-runtime.controller Reconciler error {"controller": "operator", "request": "/advanced-cluster-management.open-cluster-management", "error": "Operation cannot be fulfilled on operators.operators.coreos.com \"advanced-cluster-management.open-cluster-management\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/build/vendor/github.com/go-logr/zapr/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/build/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
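The conflict error itself is ordinary optimistic concurrency from the apiserver. For reference, the standard client-go remedy (not necessarily what OLM does here) is to re-fetch the latest version and retry the status write. A minimal sketch, assuming OLM's cluster-scoped Operator type; the import path and helper name are illustrative, not OLM's actual code:

package operators

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for OLM's Operator API types.
	operatorsv1 "github.com/operator-framework/operator-lifecycle-manager/pkg/api/apis/operators/v1"
)

// updateOperatorStatus retries the status update on 409 Conflict, re-reading
// the latest object on each attempt so the write applies to the current
// resourceVersion. Hypothetical helper for illustration.
func updateOperatorStatus(ctx context.Context, c client.Client, name string, mutate func(*operatorsv1.OperatorStatus)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var in operatorsv1.Operator
		if err := c.Get(ctx, types.NamespacedName{Name: name}, &in); err != nil {
			return err
		}
		mutate(&in.Status)
		return c.Status().Update(ctx, &in)
	})
}

Retrying would only mask this bug, though: most of the updates succeed, which is exactly why the loop keeps spinning.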
I confirmed that if I do oc get operators.operators.coreos.com -w I see rapid changes to the operators in the tight reconciliation loop. I also changed the apiserver audit profile to WriteRequestBodies and confirmed that all the updates are coming from system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount with "userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format".

Looking at the audit log, there is constant churn on status.components.refs.

Confirmed that it is the change to the Operator itself that is triggering the Reconcile, using https://github.com/operator-framework/operator-lifecycle-manager/pull/1810/files. If it were caused by any of the resources watched via Watches, we would see it being enqueued in the requests in mapComponentRequests:

https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/operator_controller.go#L69
https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/operator_controller.go#L218-L228

This is pretty bad. Raising severity to high.
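One general way to break a self-triggering loop like this, sketched below and not necessarily the fix that shipped, is to skip the status write when the freshly computed status is semantically equal to what is already on the cluster: every no-op Update still bumps resourceVersion, emits a watch event, and re-enqueues the Operator. (A related cause of this pattern is nondeterministic ordering in a status list such as status.components.refs, which makes every rebuild look like a change; sorting the refs before comparing removes that churn.) The helper below reuses the assumed OLM API package from the sketch above:

package operators

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/controller-runtime/pkg/client"

	operatorsv1 "github.com/operator-framework/operator-lifecycle-manager/pkg/api/apis/operators/v1"
)

// updateStatusIfChanged writes status only when it differs from the current
// object, so a steady-state reconcile produces no watch event and the
// controller quiesces. Hypothetical helper for illustration.
func updateStatusIfChanged(ctx context.Context, c client.Client, in *operatorsv1.Operator, components *operatorsv1.Components) error {
	updated := in.DeepCopy()
	updated.Status.Components = components
	if equality.Semantic.DeepEqual(in.Status, updated.Status) {
		return nil // steady state: no write, no event, no requeue
	}
	return c.Status().Update(ctx, updated)
}

--

Verify it. LGTM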
--
[root@preserve-olm-env 1888073]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-0.nightly-2020-10-21-001511 True False 6h4m Cluster version is 4.7.0-0.nightly-2020-10-21-001511
[root@preserve-olm-env 1888073]# oc get pod -n openshift-operator-lifecycle-manager
NAME READY STATUS RESTARTS AGE
catalog-operator-764f988744-qsrrf 1/1 Running 0 6h1m
olm-operator-5b867f9884-rbplb 1/1 Running 0 6h1m
packageserver-68c5c96db8-6cf8p 1/1 Running 0 5h59m
packageserver-68c5c96db8-8gf8w 1/1 Running 0 6h1m
[root@preserve-olm-env 1888073]# oc exec olm-operator-5b867f9884-rbplb -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.16.1
git commit: 8979865647c29e2541c8efd65162cdba1f242f70
[root@preserve-olm-env 1888073]# cat og-ocs.yaml
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
name: og-single
namespace: openshift-storage
spec:
targetNamespaces:
- openshift-storage
[root@preserve-olm-env 1888073]# oc apply -f og-ocs.yaml
operatorgroup.operators.coreos.com/og-single created
[root@preserve-olm-env 1888073]# cat sub-ocs.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: ocs-operator
namespace: openshift-storage
spec:
channel: stable-4.5
installPlanApproval: Automatic
name: ocs-operator
source: redhat-operators
sourceNamespace: openshift-marketplace
startingCSV: ocs-operator.v4.5.1
[root@preserve-olm-env 1888073]# oc apply -f sub-ocs.yaml
subscription.operators.coreos.com/ocs-operator created
[root@preserve-olm-env 1888073]# oc get csv -n openshift-storage
NAME DISPLAY VERSION REPLACES PHASE
ocs-operator.v4.5.1 OpenShift Container Storage 4.5.1 Succeeded
[root@preserve-olm-env 1888073]# cat og-acm.yaml
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
name: og-single
namespace: open-cluster-management
spec:
targetNamespaces:
- open-cluster-management
[root@preserve-olm-env 1888073]# oc apply -f og-acm.yaml
operatorgroup.operators.coreos.com/og-single created
[root@preserve-olm-env 1888073]# cat sub-acm.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
name: advanced-cluster-management
namespace: open-cluster-management
spec:
channel: release-2.0
installPlanApproval: Automatic
name: advanced-cluster-management
source: redhat-operators
sourceNamespace: openshift-marketplace
startingCSV: advanced-cluster-management.v2.0.3
[root@preserve-olm-env 1888073]# oc apply -f sub-acm.yaml
subscription.operators.coreos.com/advanced-cluster-management created
[root@preserve-olm-env 1888073]# oc get csv -n open-cluster-management
NAME DISPLAY VERSION REPLACES PHASE
advanced-cluster-management.v2.0.3 Advanced Cluster Management for Kubernetes 2.0.3 advanced-cluster-management.v2.0.2 Succeeded
[root@preserve-olm-env 1888073]# oc logs --since=5s olm-operator-5b867f9884-rbplb -n openshift-operator-lifecycle-manager|grep DEBUG|grep "ocs-operator.openshift-storage"|grep "reconciling operator"
2020-10-22T05:45:36.629Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:36.676Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:38.135Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:38.215Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:39.622Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:39.695Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:41.137Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-22T05:45:41.201Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
[root@preserve-olm-env 1888073]# oc logs --since=180s olm-operator-5b867f9884-rbplb -n openshift-operator-lifecycle-manager|grep ERROR|grep "advanced-cluster-management.open-cluster-management"
[root@preserve-olm-env 1888073]#
--
*** Bug 1917537 has been marked as a duplicate of this bug. ***

--

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
I installed a 4.6 cluster and installed the OCS and ACM operators. olm-operator now consistently uses 1 full CPU. When I go into the logs, it is reconciling in a super tight loop; I see this about 5-6 times per second for both OCS and ACM:

2020-10-13T13:44:47.147Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.195Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.338Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.400Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.487Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.588Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.653Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.734Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.819Z DEBUG controllers.operator reconciling operator {"request": "/ocs-operator.openshift-storage"}
2020-10-13T13:44:47.874Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "operator", "request": "/ocs-operator.openshift-storage"}
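For anyone hitting a similar self-inflicted loop in their own controller-runtime controller, another common mitigation (again, not necessarily the fix applied for this bug) is to filter out status-only update events on the primary resource with a predicate. GenerationChangedPredicate drops updates that do not bump metadata.generation, which status-subresource writes never do; this relies on the CRD having the status subresource enabled. The reconciler below is a self-contained stub reusing the assumed names from the sketches above:

package operators

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	operatorsv1 "github.com/operator-framework/operator-lifecycle-manager/pkg/api/apis/operators/v1"
)

// OperatorReconciler is stubbed here so the sketch compiles on its own; OLM's
// real reconciler lives in pkg/controller/operators/operator_controller.go.
type OperatorReconciler struct{}

func (r *OperatorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *OperatorReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// GenerationChangedPredicate drops update events that leave
		// metadata.generation unchanged, so the controller's own status
		// updates stop re-enqueuing the Operator.
		For(&operatorsv1.Operator{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}

The trade-off is that the controller no longer reacts to external edits of its own status, so this only fits controllers that fully own that status.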