While testing recovery in mass-deletion scenarios on the cluster, we found that the current CVO reconcile algorithm takes too long to fully recover the resources it owns. For instance, after deleting all operator deployments in the openshift-* namespaces, recovery took over 40 minutes because many of the operators weren't recreated until 16+ syncs had happened. We should change the reconcile distribution algorithm to bound how likely this is to happen. This is important for 4.2 so that we can add tests to verify recovery.
Cannot reproduce this on 4.2.0-0.nightly-2019-09-04-102339. I wrote a script to delete all operator deployments owned by the CVO, then checked the recovery time for these deployments. Here is the test log (refer to my script in the attachment):

There are 27 deployments needed to be deleted...... Start!!!
deployment.extensions "authentication-operator" deleted
deployment.extensions "cloud-credential-operator" deleted
deployment.extensions "cluster-autoscaler-operator" deleted
deployment.extensions "console-operator" deleted
deployment.extensions "dns-operator" deleted
deployment.extensions "cluster-image-registry-operator" deleted
deployment.extensions "ingress-operator" deleted
deployment.extensions "insights-operator" deleted
deployment.extensions "kube-apiserver-operator" deleted
deployment.extensions "kube-controller-manager-operator" deleted
deployment.extensions "openshift-kube-scheduler-operator" deleted
deployment.extensions "machine-api-operator" deleted
deployment.extensions "machine-config-operator" deleted
deployment.extensions "marketplace-operator" deleted
deployment.extensions "cluster-monitoring-operator" deleted
deployment.extensions "network-operator" deleted
deployment.extensions "cluster-node-tuning-operator" deleted
deployment.extensions "openshift-apiserver-operator" deleted
deployment.extensions "openshift-controller-manager-operator" deleted
deployment.extensions "cluster-samples-operator" deleted
deployment.extensions "olm-operator" deleted
deployment.extensions "catalog-operator" deleted
deployment.extensions "packageserver" deleted
deployment.extensions "service-ca-operator" deleted
deployment.extensions "openshift-service-catalog-apiserver-operator" deleted
deployment.extensions "openshift-service-catalog-controller-manager-operator" deleted
deployment.extensions "cluster-storage-operator" deleted
Waiting for the deployment authentication-operator recover......
Waiting for the deployment authentication-operator recover......
Waiting for the deployment authentication-operator recover......
Deployment authentication-operator recovered! next deployment......
Deployment cloud-credential-operator recovered! next deployment......
Deployment cluster-autoscaler-operator recovered! next deployment......
Deployment console-operator recovered! next deployment......
Deployment dns-operator recovered! next deployment......
Deployment cluster-image-registry-operator recovered! next deployment......
Deployment ingress-operator recovered! next deployment......
Deployment insights-operator recovered! next deployment......
Deployment kube-apiserver-operator recovered! next deployment......
Deployment kube-controller-manager-operator recovered! next deployment......
Deployment openshift-kube-scheduler-operator recovered! next deployment......
Deployment machine-api-operator recovered! next deployment......
Deployment machine-config-operator recovered! next deployment......
Deployment marketplace-operator recovered! next deployment......
Deployment cluster-monitoring-operator recovered! next deployment......
Deployment network-operator recovered! next deployment......
Deployment cluster-node-tuning-operator recovered! next deployment......
Deployment openshift-apiserver-operator recovered! next deployment......
Deployment openshift-controller-manager-operator recovered! next deployment......
Deployment cluster-samples-operator recovered! next deployment......
Deployment olm-operator recovered! next deployment......
Deployment catalog-operator recovered! next deployment......
Deployment packageserver recovered! next deployment......
Deployment service-ca-operator recovered! next deployment......
Deployment openshift-service-catalog-apiserver-operator recovered! next deployment......
Deployment openshift-service-catalog-controller-manager-operator recovered! next deployment......
Deployment cluster-storage-operator recovered! next deployment......
End!!!
Recover time:00:03:20

# oc get deployment --all-namespaces
NAMESPACE  NAME  READY  UP-TO-DATE  AVAILABLE  AGE
openshift-apiserver-operator  openshift-apiserver-operator  1/1  1  1  2m5s
openshift-authentication-operator  authentication-operator  1/1  1  1  116s
openshift-authentication  oauth-openshift  2/2  2  2  14m
openshift-cloud-credential-operator  cloud-credential-operator  1/1  1  1  2m9s
openshift-cluster-machine-approver  machine-approver  1/1  1  1  24m
openshift-cluster-node-tuning-operator  cluster-node-tuning-operator  1/1  1  1  119s
openshift-cluster-samples-operator  cluster-samples-operator  1/1  1  1  2m1s
openshift-cluster-storage-operator  cluster-storage-operator  1/1  1  1  2m1s
openshift-cluster-version  cluster-version-operator  1/1  1  1  25m
openshift-console-operator  console-operator  1/1  1  1  2m8s
openshift-console  console  2/2  2  2  14m
openshift-console  downloads  2/2  2  2  18m
openshift-controller-manager-operator  openshift-controller-manager-operator  1/1  1  1  2m9s
openshift-dns-operator  dns-operator  1/1  1  1  118s
openshift-image-registry  cluster-image-registry-operator  1/1  1  1  2m11s
openshift-image-registry  image-registry  1/1  1  1  15m
openshift-ingress-operator  ingress-operator  1/1  1  1  2m7s
openshift-ingress  router-default  1/2  2  1  18m
openshift-insights  insights-operator  1/1  1  1  2m11s
openshift-kube-apiserver-operator  kube-apiserver-operator  1/1  1  1  2m2s
openshift-kube-controller-manager-operator  kube-controller-manager-operator  1/1  1  1  115s
openshift-kube-scheduler-operator  openshift-kube-scheduler-operator  1/1  1  1  117s
openshift-machine-api  cluster-autoscaler-operator  1/1  1  1  2m5s
openshift-machine-api  machine-api-controllers  1/1  1  1  22m
openshift-machine-api  machine-api-operator  1/1  1  1  2m2s
openshift-machine-config-operator  etcd-quorum-guard  3/3  3  3  22m
openshift-machine-config-operator  machine-config-controller  1/1  1  1  22m
openshift-machine-config-operator  machine-config-operator  1/1  1  1  2m4s
openshift-marketplace  certified-operators  1/1  1  1  18m
openshift-marketplace  community-operators  1/1  1  1  18m
openshift-marketplace  marketplace-operator  1/1  1  1  119s
openshift-marketplace  redhat-operators  1/1  1  1  18m
openshift-monitoring  cluster-monitoring-operator  1/1  1  1  118s
openshift-monitoring  grafana  1/1  1  1  14m
openshift-monitoring  kube-state-metrics  1/1  1  1  15m
openshift-monitoring  openshift-state-metrics  1/1  1  1  15m
openshift-monitoring  prometheus-adapter  2/2  2  2  13m
openshift-monitoring  prometheus-operator  1/1  1  1  15m
openshift-monitoring  telemeter-client  1/1  1  1  15m
openshift-network-operator  network-operator  1/1  1  1  2m5s
openshift-operator-lifecycle-manager  catalog-operator  1/1  1  1  115s
openshift-operator-lifecycle-manager  olm-operator  1/1  1  1  115s
openshift-operator-lifecycle-manager  packageserver  2/2  2  2  106s
openshift-service-ca-operator  service-ca-operator  1/1  1  1  2m4s
openshift-service-ca  apiservice-cabundle-injector  1/1  1  1  22m
openshift-service-ca  configmap-cabundle-injector  1/1  1  1  22m
openshift-service-ca  service-serving-cert-signer  1/1  1  1  22m
openshift-service-catalog-apiserver-operator  openshift-service-catalog-apiserver-operator  1/1  1  1  2m
openshift-service-catalog-controller-manager-operator  openshift-service-catalog-controller-manager-operator  1/1  1  1  2m8s

According to the above test result, the recovery time is about 3+ minutes. I re-tested several rounds against the same cluster; the recovery times were all < 10 minutes:

Recover time:00:03:20
Recover time:00:09:23
Recover time:00:07:22
Recover time:00:08:22

@Clayton Coleman Could you confirm whether my reproduction steps are correct?
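The attached script itself is not reproduced in this report; a minimal sketch of such a reproduction script might look like the following. The namespace/name pairs passed in, the RUN_REPRODUCE guard, and the helper names are illustrative assumptions, not the original attachment; in practice the list of CVO-owned deployments would come from the release payload manifests rather than being hard-coded.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: delete CVO-owned operator Deployments, then time how
# long it takes the CVO to re-create them all.
set -euo pipefail

# Format an elapsed time in seconds as HH:MM:SS, matching the
# "Recover time:00:03:20" lines in the log above.
fmt_duration() {
  local s=$1
  printf '%02d:%02d:%02d' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

reproduce() {
  # Arguments are "namespace/name" pairs of CVO-owned Deployments.
  local deployments=("$@")
  local start end d ns name
  start=$(date +%s)
  # Delete every Deployment first, so recovery of all of them overlaps.
  for d in "${deployments[@]}"; do
    ns=${d%/*}; name=${d#*/}
    oc -n "$ns" delete deployment "$name"
  done
  # Then wait for each one to be re-created and report Available.
  for d in "${deployments[@]}"; do
    ns=${d%/*}; name=${d#*/}
    until oc -n "$ns" get deployment "$name" >/dev/null 2>&1; do
      echo "Waiting for the deployment $name recover......"
      sleep 5
    done
    oc -n "$ns" wait deployment "$name" --for=condition=Available --timeout=15m
  done
  end=$(date +%s)
  echo "Recover time:$(fmt_duration $((end - start)))"
}

# Only run against a live cluster; set RUN_REPRODUCE=1 to enable.
if command -v oc >/dev/null 2>&1 && [[ "${RUN_REPRODUCE:-}" == "1" ]]; then
  reproduce \
    openshift-dns-operator/dns-operator \
    openshift-ingress-operator/ingress-operator
fi
```

Deleting everything up front before waiting matches the log above (all 27 deletions, then the wait loop), so the reported time measures concurrent recovery rather than the sum of 27 sequential recoveries.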
Since QE cannot reproduce the bug, I just checked that pr#245 is merged into v4.2:

[root@preserve-jliu-worker cluster-version-operator]# git branch -a --contains=0725bd53c7
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
  remotes/origin/release-4.2
  remotes/origin/release-4.3
  remotes/origin/release-4.4

If you hit the issue again, feel free to re-open it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days