Description of problem:

The rollback job https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 consistently fails rollbacks, albeit with a different operator failing on each run. Example snippet:

Feb 10 12:34:22.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
---------------------------------------------------------
Received interrupt. Running AfterSuite...
^C again to terminate immediately
Feb 10 12:34:25.365: INFO: Running AfterSuite actions on all nodes
Feb 10 12:34:25.365: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Feb 10 12:34:25.721: INFO: Condition Ready of node ip-10-0-149-249.us-west-1.compute.internal is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
Feb 10 12:34:25.721: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Feb 10 12:34:26.006: INFO: Condition Ready of node ip-10-0-149-249.us-west-1.compute.internal is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
STEP: Destroying namespace "e2e-k8s-sig-apps-daemonset-upgrade-8560" for this suite.
Feb 10 12:34:32.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:32.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:34:42.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:42.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:34:52.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:52.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:02.996: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:02.996: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:12.996: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:12.996: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:14.241: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Feb 10 12:35:20.109: INFO: namespace e2e-k8s-sig-apps-daemonset-upgrade-8560 deletion completed in 54.102780159s
Feb 10 12:35:20.109: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-k8s-service-upgrade-4709" for this suite.
Feb 10 12:35:22.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:22.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
SIGABRT: abort
PC=0x462161 m=0 sigcode=0

This run failed on kube-controller-manager, but each run has a different operator failing (openshift-apiserver-operator, openshift-service-catalog-apiserver-operator, etc.).

This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1791863, but I cannot say for sure, so I am opening this bug to track it.
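For anyone retracing the triage: the Progressing/Failing messages above are the ClusterVersion conditions reported by the cluster-version operator. As a minimal sketch (standard oc invocations, not part of the job output), the same conditions can be pulled from a live cluster with:

$ oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
$ oc adm upgrade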
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/458/artifacts/e2e-aws-upgrade/must-gather/registry-svc-ci-openshift-org-ocp-4-2-2020-02-10-210857-sha256-01670016606aad0ada65e65764488dfdc6d448d65cf91273226674e934e70679/namespaces/openshift-apiserver-operator/core/pods.yaml

Per the pods.yaml above, the deployment is not available. The pod is Ready=false, with no errors, no restarts, and no probes defined. Moving on to the Node objects to see what's going on.
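For reference, the same Ready/restart information can be pulled from a live cluster instead of the must-gather dump; a minimal sketch using standard oc jsonpath (namespace taken from the artifact above):

$ oc -n openshift-apiserver-operator get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}Ready={.status.conditions[?(@.type=="Ready")].status}{"\t"}restarts={.status.containerStatuses[0].restartCount}{"\n"}{end}'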
The node status going False could indicate an overloaded node. This PR [1] merged today to reserve more CPU and memory for the kubelet and CRI-O, which should help.

1. https://github.com/openshift/machine-config-operator/pull/1450
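For context, the PR changes the MCO's default templates rather than anything an admin has to apply. The same kind of reservation can also be expressed manually through a KubeletConfig; the sketch below is illustrative only (the pool label and the cpu/memory values are placeholders, not what the PR ships):

$ cat > kubelet-system-reserved.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""  # placeholder; match a label on your MachineConfigPool
  kubeletConfig:
    systemReserved:
      cpu: 500m     # placeholder reservation
      memory: 1Gi   # placeholder reservation
EOF
$ oc apply -f kubelet-system-reserved.yaml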
Note: the fix has only gone into master so far, so the 4.1-to-4.2 upgrade job would not see the benefit yet. Going to check the release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3-to-4.4 job.
*** This bug has been marked as a duplicate of bug 1800319 ***