Description of problem:
The expected behavior of an operator is to spend some time doing work when a primary resource (for HCO this is the HyperConverged CustomResource) is created, then reach a steady state where reconciliation happens only occasionally. Currently, the HCO reconciles continuously.

How reproducible:
Always

Steps to Reproduce:
1. Create an OpenShift cluster
2. Deploy HCO
3. Get the logs (`kubectl logs -n $HCO_NAMESPACE $HCO_OPERATOR_POD -f`) or get HCO's metrics (`kubectl run -n kubevirt-hyperconverged -it --rm --restart=Never hco-metrics --image=registry.access.redhat.com/ubi7/ubi-minimal:latest -- curl http://hyperconverged-cluster-operator-metrics.$HCO_NAMESPACE.svc.cluster.local:8383/metrics | grep 'reconcile'`)

Actual results:
With the HyperConverged resource existing for ~6 minutes, the HCO has already reconciled ~200 times (26 error, 1 requeue, 172 success). That is about 33 reconciles per minute.

Expected results:
HCO should reconcile no more than 3 times per minute with only one primary resource to reconcile.
Operator-SDK provides a GenerationChangedPredicate that allows us to filter out updates to our Status/Metadata: https://github.com/operator-framework/operator-sdk/blob/947a464dbe968b8af147049e76e40f787ccb0847/pkg/predicate/predicate.go#L27 Newer versions of controller-runtime have the GenerationChangedPredicate: https://godoc.org/github.com/kubernetes-sigs/controller-runtime/pkg/predicate#GenerationChangedPredicate
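As a rough illustration of what such a predicate does, the sketch below reimplements the generation comparison with plain Go types. The `object` and `updateEvent` types and the `generationChanged` function are hypothetical stand-ins for this example, not the Operator-SDK or controller-runtime API. The idea is that `metadata.generation` is typically bumped only when the spec changes, so updates that leave it untouched (for example status writes) can be dropped before they reach the reconcile queue.

```go
package main

import "fmt"

// object is a hypothetical stand-in for the metadata of a Kubernetes object.
type object struct {
	Generation      int64  // typically incremented when the spec changes
	ResourceVersion string // changes on every write, including status updates
}

// updateEvent is a hypothetical stand-in for an update notification that
// carries the old and new copies of the object.
type updateEvent struct {
	Old, New *object
}

// generationChanged reports whether an update should trigger a reconcile.
// Updates with a missing object or an unchanged generation are filtered out.
func generationChanged(e updateEvent) bool {
	if e.Old == nil || e.New == nil {
		return false
	}
	return e.Old.Generation != e.New.Generation
}

func main() {
	// A status-only write: resourceVersion moves, generation does not.
	statusOnly := updateEvent{
		Old: &object{Generation: 1, ResourceVersion: "100"},
		New: &object{Generation: 1, ResourceVersion: "101"},
	}
	// A spec edit: generation moves as well.
	specChange := updateEvent{
		Old: &object{Generation: 1, ResourceVersion: "101"},
		New: &object{Generation: 2, ResourceVersion: "102"},
	}
	fmt.Println("status-only update reconciles:", generationChanged(statusOnly))
	fmt.Println("spec update reconciles:", generationChanged(specChange))
}
```

With a filter like this in place, the operator's own status updates on the HyperConverged resource would no longer re-trigger reconciliation, which is one plausible source of the continuous loop described above.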
https://github.com/kubevirt/hyperconverged-cluster-operator/pull/320 https://github.com/kubevirt/hyperconverged-cluster-operator/pull/318 (on master)
Verification logs attached.
Per communication with Simone, 4k reconciliation runs in 12 days, with more than 500 pod restarts in the openshift-cnv namespace, is a good result (about 0.23 reconciles per minute on average).

Note: The restarts may be caused by cluster issues, namely the API and machine-config:

$ oc get co --all-namespaces
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
cloud-credential                           4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
cluster-autoscaler                         4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
console                                    4.4.0-0.nightly-2020-03-02-011520   True        False         False      10d
csi-snapshot-controller                    4.4.0-0.nightly-2020-03-02-011520   True        False         False      25h
dns                                        4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
etcd                                       4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
image-registry                             4.4.0-0.nightly-2020-03-02-011520   True        False         False      5d17h
ingress                                    4.4.0-0.nightly-2020-03-02-011520   True        False         False      5d17h
insights                                   4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
kube-apiserver                             4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
kube-controller-manager                    4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
kube-scheduler                             4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
kube-storage-version-migrator              4.4.0-0.nightly-2020-03-02-011520   True        False         False      2d15h
machine-api                                4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
machine-config                             4.4.0-0.nightly-2020-03-02-011520   False       False         True       39m
marketplace                                4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
monitoring                                 4.4.0-0.nightly-2020-03-02-011520   False       True          True       48m
network                                    4.4.0-0.nightly-2020-03-02-011520   True        True          True       12d
node-tuning                                4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
openshift-apiserver                        4.4.0-0.nightly-2020-03-02-011520   True        False         True       41h
openshift-controller-manager               4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
openshift-samples                          4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
operator-lifecycle-manager                 4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-03-02-011520   True        False         False      5d16h
service-ca                                 4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
service-catalog-apiserver                  4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
service-catalog-controller-manager         4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
storage                                    4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d

[cnv-qe-jenkins@cnv-executor-ysegev-4-3 ~]$ oc get pods --all-namespaces | grep Terminating
openshift-apiserver                 apiserver-7dc8755f76-44x55           1/1   Terminating   0   42h
openshift-machine-config-operator   etcd-quorum-guard-64c6489cb7-bbgl9   1/1   Terminating   0   42h
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:2011