Bug 1757784

Summary: HCO reconciles continuously
Product: Container Native Virtualization (CNV) Reporter: David Zager <dzager>
Component: Installation Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA QA Contact: Irina Gulina <igulina>
Severity: high Docs Contact:
Priority: high    
Version: 2.1.1 CC: cnv-qe-bugs, ncredi, rhallise, talayan
Target Milestone: ---   
Target Release: 2.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: hyperconverged-cluster-operator-container-v2.3.0-47 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-04 19:10:36 UTC Type: Bug

Description David Zager 2019-10-02 12:57:31 UTC
Description of problem: The expected behavior of an operator is to spend some time doing work when a primary resource (for HCO, the HyperConverged custom resource) is created, and then reach a steady state in which reconciliation happens only occasionally. Currently, the HCO reconciles continuously.


How reproducible: Always


Steps to Reproduce:
1. Create an OpenShift cluster
2. Deploy HCO
3. Get the logs (`kubectl logs -n $HCO_NAMESPACE $HCO_OPERATOR_POD -f`) or get HCO's metrics (`kubectl run -n kubevirt-hyperconverged -it --rm --restart=Never hco-metrics --image=registry.access.redhat.com/ubi7/ubi-minimal:latest -- curl http://hyperconverged-cluster-operator-metrics.$HCO_NAMESPACE.svc.cluster.local:8383/metrics | grep 'reconcile'`)

Actual results:
With the HyperConverged resource existing for ~6 minutes, the HCO has already reconciled ~200 times (26 error, 1 requeue, 172 success). That's about 33 reconciles per minute.


Expected results:
HCO should reconcile no more than 3 times per minute when there is only one primary resource to reconcile.

Comment 3 David Zager 2019-10-07 18:05:22 UTC
Operator-SDK provides a GenerationChangedPredicate that allows us to filter out updates to our Status/Metadata: https://github.com/operator-framework/operator-sdk/blob/947a464dbe968b8af147049e76e40f787ccb0847/pkg/predicate/predicate.go#L27

Newer versions of controller-runtime have the GenerationChangedPredicate: https://godoc.org/github.com/kubernetes-sigs/controller-runtime/pkg/predicate#GenerationChangedPredicate
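
To illustrate the idea behind such a predicate, here is a simplified, standard-library-only model. It is not the operator-sdk or controller-runtime implementation: the `object` and `updateEvent` types and the `generationChanged` function are hypothetical stand-ins. The sketch assumes that `metadata.generation` is bumped only on spec changes (which for custom resources generally requires the status subresource to be enabled), while `resourceVersion` changes on every write.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a Kubernetes object; only the
// metadata fields relevant to this illustration are modelled.
type object struct {
	Generation      int64  // assumed to change only when the spec changes
	ResourceVersion string // changes on every write, including status updates
}

// updateEvent is a hypothetical model of an update notification carrying
// the old and the new version of the object.
type updateEvent struct {
	Old object
	New object
}

// generationChanged reports whether an update should trigger a reconcile.
// Updates that leave the generation untouched (status or metadata only)
// are filtered out.
func generationChanged(e updateEvent) bool {
	return e.Old.Generation != e.New.Generation
}

func main() {
	statusOnly := updateEvent{
		Old: object{Generation: 1, ResourceVersion: "100"},
		New: object{Generation: 1, ResourceVersion: "101"},
	}
	specChange := updateEvent{
		Old: object{Generation: 1, ResourceVersion: "101"},
		New: object{Generation: 2, ResourceVersion: "102"},
	}

	fmt.Println(generationChanged(statusOnly)) // false: skip reconcile
	fmt.Println(generationChanged(specChange)) // true: reconcile
}
```

With a filter like this, the operator's own status writes no longer feed back into its work queue, which is one plausible source of the continuous reconcile loop described in this bug.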

Comment 10 Irina Gulina 2020-03-16 13:36:57 UTC
Verification logs attached. Per communication with Simone, 4k reconciliation runs in 12 days, with more than 500 restarts of pods in the openshift-cnv namespace, is a good result.

Note: Restarts may be caused by cluster issues, namely API and machine-config:

$ oc get co --all-namespaces
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
cloud-credential                           4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
cluster-autoscaler                         4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
console                                    4.4.0-0.nightly-2020-03-02-011520   True        False         False      10d
csi-snapshot-controller                    4.4.0-0.nightly-2020-03-02-011520   True        False         False      25h
dns                                        4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
etcd                                       4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
image-registry                             4.4.0-0.nightly-2020-03-02-011520   True        False         False      5d17h
ingress                                    4.4.0-0.nightly-2020-03-02-011520   True        False         False      5d17h
insights                                   4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
kube-apiserver                             4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
kube-controller-manager                    4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
kube-scheduler                             4.4.0-0.nightly-2020-03-02-011520   True        False         True       12d
kube-storage-version-migrator              4.4.0-0.nightly-2020-03-02-011520   True        False         False      2d15h
machine-api                                4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
machine-config                             4.4.0-0.nightly-2020-03-02-011520   False       False         True       39m
marketplace                                4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
monitoring                                 4.4.0-0.nightly-2020-03-02-011520   False       True          True       48m
network                                    4.4.0-0.nightly-2020-03-02-011520   True        True          True       12d
node-tuning                                4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
openshift-apiserver                        4.4.0-0.nightly-2020-03-02-011520   True        False         True       41h
openshift-controller-manager               4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
openshift-samples                          4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
operator-lifecycle-manager                 4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-03-02-011520   True        False         False      5d16h
service-ca                                 4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
service-catalog-apiserver                  4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
service-catalog-controller-manager         4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
storage                                    4.4.0-0.nightly-2020-03-02-011520   True        False         False      12d
[cnv-qe-jenkins@cnv-executor-ysegev-4-3 ~]$ oc get pods --all-namespaces | grep Terminating
openshift-apiserver                                     apiserver-7dc8755f76-44x55                                        1/1     Terminating         0          42h
openshift-machine-config-operator                       etcd-quorum-guard-64c6489cb7-bbgl9                                1/1     Terminating         0          42h

Comment 13 errata-xmlrpc 2020-05-04 19:10:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:2011