Bug 1749890 - When deleting objects owned by the CVO, some components are not recreated for 30-40 minutes [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Clayton Coleman
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-09-06 16:47 UTC by Clayton Coleman
Modified: 2020-03-13 09:28 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:40:50 UTC
Target Upstream Version:
jiajliu: needinfo? (ccoleman)


Attachments


Links
System                  ID                                           Private  Priority  Status  Summary                                                                   Last Updated
Github                  openshift cluster-version-operator pull 245  0        None      closed  Bug 1749890: Iterate through the payload in chunks during reconciliation  2020-06-18 11:14:36 UTC
Red Hat Product Errata  RHBA-2019:2922                               0        None      None    None                                                                      2019-10-16 06:41:01 UTC

Description Clayton Coleman 2019-09-06 16:47:54 UTC
While testing recovery in mass deletion scenarios on the cluster, the current CVO reconcile algorithm takes too long to fully recover elements it owns.  For instance, deleting all operator deployments in openshift-* took over 40m to recover because many of the operators weren't recreated until 16+ syncs had happened.

We should change the reconcile distribution algorithm to bound how likely this is to happen.

This is important for 4.2 so that we can add tests that verify recovery.
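
To illustrate what "bounding" could mean here, the sketch below splits the payload into fixed-size chunks and rotates which chunk goes first on each sync, so every chunk is visited first at least once every len(chunks) syncs. This is only an assumption-laden illustration: the function names, the chunk size, and the rotation scheme are hypothetical and are not taken from the CVO source or from the linked pull request.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_chunks(items: Sequence[T], chunk_size: int) -> List[List[T]]:
    """Split items into consecutive chunks of at most chunk_size elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [list(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]


def sync_order(items: Sequence[T], chunk_size: int, sync_number: int) -> List[T]:
    """Return the visiting order for one sync (hypothetical scheme).

    The chunk list is rotated by sync_number, so each chunk is handled
    first once every len(chunks) syncs. An early failure in one sync then
    cannot starve the same manifests indefinitely.
    """
    chunks = split_chunks(items, chunk_size)
    if not chunks:
        return []
    shift = sync_number % len(chunks)
    rotated = chunks[shift:] + chunks[:shift]
    return [item for chunk in rotated for item in chunk]
```

With a scheme like this, the worst-case delay before a given manifest is attempted first is bounded by the number of chunks rather than by the luck of a random ordering.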

Comment 2 liujia 2019-09-17 09:01:24 UTC
Cannot reproduce it on 4.2.0-0.nightly-2019-09-04-102339.

I wrote a script that deletes all operator deployments owned by the CVO and then measures how long they take to recover.
Here is the test log (see my script in the attachment):
////
There are 27 deployments needed to be deleted......
Start!!!
deployment.extensions "authentication-operator" deleted
deployment.extensions "cloud-credential-operator" deleted
deployment.extensions "cluster-autoscaler-operator" deleted
deployment.extensions "console-operator" deleted
deployment.extensions "dns-operator" deleted
deployment.extensions "cluster-image-registry-operator" deleted
deployment.extensions "ingress-operator" deleted
deployment.extensions "insights-operator" deleted
deployment.extensions "kube-apiserver-operator" deleted
deployment.extensions "kube-controller-manager-operator" deleted
deployment.extensions "openshift-kube-scheduler-operator" deleted
deployment.extensions "machine-api-operator" deleted
deployment.extensions "machine-config-operator" deleted
deployment.extensions "marketplace-operator" deleted
deployment.extensions "cluster-monitoring-operator" deleted
deployment.extensions "network-operator" deleted
deployment.extensions "cluster-node-tuning-operator" deleted
deployment.extensions "openshift-apiserver-operator" deleted
deployment.extensions "openshift-controller-manager-operator" deleted
deployment.extensions "cluster-samples-operator" deleted
deployment.extensions "olm-operator" deleted
deployment.extensions "catalog-operator" deleted
deployment.extensions "packageserver" deleted
deployment.extensions "service-ca-operator" deleted
deployment.extensions "openshift-service-catalog-apiserver-operator" deleted
deployment.extensions "openshift-service-catalog-controller-manager-operator" deleted
deployment.extensions "cluster-storage-operator" deleted
Waiting for the deployment authentication-operator recover......
Waiting for the deployment authentication-operator recover......
Waiting for the deployment authentication-operator recover......
Deployment authentication-operator recovered!
next deployment......
Deployment cloud-credential-operator recovered!
next deployment......
Deployment cluster-autoscaler-operator recovered!
next deployment......
Deployment console-operator recovered!
next deployment......
Deployment dns-operator recovered!
next deployment......
Deployment cluster-image-registry-operator recovered!
next deployment......
Deployment ingress-operator recovered!
next deployment......
Deployment insights-operator recovered!
next deployment......
Deployment kube-apiserver-operator recovered!
next deployment......
Deployment kube-controller-manager-operator recovered!
next deployment......
Deployment openshift-kube-scheduler-operator recovered!
next deployment......
Deployment machine-api-operator recovered!
next deployment......
Deployment machine-config-operator recovered!
next deployment......
Deployment marketplace-operator recovered!
next deployment......
Deployment cluster-monitoring-operator recovered!
next deployment......
Deployment network-operator recovered!
next deployment......
Deployment cluster-node-tuning-operator recovered!
next deployment......
Deployment openshift-apiserver-operator recovered!
next deployment......
Deployment openshift-controller-manager-operator recovered!
next deployment......
Deployment cluster-samples-operator recovered!
next deployment......
Deployment olm-operator recovered!
next deployment......
Deployment catalog-operator recovered!
next deployment......
Deployment packageserver recovered!
next deployment......
Deployment service-ca-operator recovered!
next deployment......
Deployment openshift-service-catalog-apiserver-operator recovered!
next deployment......
Deployment openshift-service-catalog-controller-manager-operator recovered!
next deployment......
Deployment cluster-storage-operator recovered!
next deployment......
End!!!
Recover time:00:03:20
////
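
The attached script itself is not shown in this report. The delete-and-wait procedure behind the log above could look roughly like the following sketch. It assumes `oc` is on the PATH and logged in to the cluster, and it treats a deployment as "recovered" as soon as the object exists again; the real script may check readiness instead. All function names are hypothetical.

```python
import subprocess
import time
from typing import Callable, Iterable, Tuple


def delete_deployment(namespace: str, name: str) -> None:
    """Delete one deployment with the oc CLI (assumes a logged-in session)."""
    subprocess.run(["oc", "delete", "deployment", name, "-n", namespace], check=True)


def deployment_exists(namespace: str, name: str) -> bool:
    """Return True if `oc get deployment` succeeds for the object."""
    result = subprocess.run(
        ["oc", "get", "deployment", name, "-n", namespace],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def format_elapsed(seconds: float) -> str:
    """Format a duration as HH:MM:SS, matching the 'Recover time' lines."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def wait_for_recovery(
    targets: Iterable[Tuple[str, str]],
    exists: Callable[[str, str], bool] = deployment_exists,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll each (namespace, name) in turn until it exists; return elapsed seconds."""
    start = clock()
    for namespace, name in targets:
        while not exists(namespace, name):
            print(f"Waiting for the deployment {name} recover......")
            sleep(poll_seconds)
        print(f"Deployment {name} recovered!")
    return clock() - start
```

A run would call `delete_deployment` for every target first, then `wait_for_recovery`, and print `format_elapsed` of the result. The `exists`, `sleep`, and `clock` parameters are injectable only so the loop can be exercised without a cluster.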

# oc get deployment --all-namespaces
NAMESPACE                                               NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
openshift-apiserver-operator                            openshift-apiserver-operator                            1/1     1            1           2m5s
openshift-authentication-operator                       authentication-operator                                 1/1     1            1           116s
openshift-authentication                                oauth-openshift                                         2/2     2            2           14m
openshift-cloud-credential-operator                     cloud-credential-operator                               1/1     1            1           2m9s
openshift-cluster-machine-approver                      machine-approver                                        1/1     1            1           24m
openshift-cluster-node-tuning-operator                  cluster-node-tuning-operator                            1/1     1            1           119s
openshift-cluster-samples-operator                      cluster-samples-operator                                1/1     1            1           2m1s
openshift-cluster-storage-operator                      cluster-storage-operator                                1/1     1            1           2m1s
openshift-cluster-version                               cluster-version-operator                                1/1     1            1           25m
openshift-console-operator                              console-operator                                        1/1     1            1           2m8s
openshift-console                                       console                                                 2/2     2            2           14m
openshift-console                                       downloads                                               2/2     2            2           18m
openshift-controller-manager-operator                   openshift-controller-manager-operator                   1/1     1            1           2m9s
openshift-dns-operator                                  dns-operator                                            1/1     1            1           118s
openshift-image-registry                                cluster-image-registry-operator                         1/1     1            1           2m11s
openshift-image-registry                                image-registry                                          1/1     1            1           15m
openshift-ingress-operator                              ingress-operator                                        1/1     1            1           2m7s
openshift-ingress                                       router-default                                          1/2     2            1           18m
openshift-insights                                      insights-operator                                       1/1     1            1           2m11s
openshift-kube-apiserver-operator                       kube-apiserver-operator                                 1/1     1            1           2m2s
openshift-kube-controller-manager-operator              kube-controller-manager-operator                        1/1     1            1           115s
openshift-kube-scheduler-operator                       openshift-kube-scheduler-operator                       1/1     1            1           117s
openshift-machine-api                                   cluster-autoscaler-operator                             1/1     1            1           2m5s
openshift-machine-api                                   machine-api-controllers                                 1/1     1            1           22m
openshift-machine-api                                   machine-api-operator                                    1/1     1            1           2m2s
openshift-machine-config-operator                       etcd-quorum-guard                                       3/3     3            3           22m
openshift-machine-config-operator                       machine-config-controller                               1/1     1            1           22m
openshift-machine-config-operator                       machine-config-operator                                 1/1     1            1           2m4s
openshift-marketplace                                   certified-operators                                     1/1     1            1           18m
openshift-marketplace                                   community-operators                                     1/1     1            1           18m
openshift-marketplace                                   marketplace-operator                                    1/1     1            1           119s
openshift-marketplace                                   redhat-operators                                        1/1     1            1           18m
openshift-monitoring                                    cluster-monitoring-operator                             1/1     1            1           118s
openshift-monitoring                                    grafana                                                 1/1     1            1           14m
openshift-monitoring                                    kube-state-metrics                                      1/1     1            1           15m
openshift-monitoring                                    openshift-state-metrics                                 1/1     1            1           15m
openshift-monitoring                                    prometheus-adapter                                      2/2     2            2           13m
openshift-monitoring                                    prometheus-operator                                     1/1     1            1           15m
openshift-monitoring                                    telemeter-client                                        1/1     1            1           15m
openshift-network-operator                              network-operator                                        1/1     1            1           2m5s
openshift-operator-lifecycle-manager                    catalog-operator                                        1/1     1            1           115s
openshift-operator-lifecycle-manager                    olm-operator                                            1/1     1            1           115s
openshift-operator-lifecycle-manager                    packageserver                                           2/2     2            2           106s
openshift-service-ca-operator                           service-ca-operator                                     1/1     1            1           2m4s
openshift-service-ca                                    apiservice-cabundle-injector                            1/1     1            1           22m
openshift-service-ca                                    configmap-cabundle-injector                             1/1     1            1           22m
openshift-service-ca                                    service-serving-cert-signer                             1/1     1            1           22m
openshift-service-catalog-apiserver-operator            openshift-service-catalog-apiserver-operator            1/1     1            1           2m
openshift-service-catalog-controller-manager-operator   openshift-service-catalog-controller-manager-operator   1/1     1            1           2m8s

According to the test result above, the recovery time is a little over 3 minutes.

I re-tested several rounds against the same cluster, and the recovery time was under 10 minutes every time.
Recover time:00:03:20
Recover time:00:09:23
Recover time:00:07:22
Recover time:00:08:22

@Clayton Coleman
Could you confirm whether my reproduction steps are correct?

Comment 4 liujia 2019-09-24 02:06:15 UTC
Since QE cannot reproduce the bug, I only checked that PR #245 was merged into the 4.2 release branch.

[root@preserve-jliu-worker cluster-version-operator]# git branch -a --contains=0725bd53c7
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
  remotes/origin/release-4.2
  remotes/origin/release-4.3
  remotes/origin/release-4.4

If anyone hits the issue again, feel free to reopen this bug.

Comment 5 errata-xmlrpc 2019-10-16 06:40:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

