Bug 1801771

Summary: [buildcop] e2e-aws-upgrade-rollback-4.1-to-4.2 consistently broken on Could not update deployment xxx
Product: OpenShift Container Platform Reporter: Yu Qi Zhang <jerzhang>
Component: Node Assignee: Ryan Phillips <rphillips>
Status: CLOSED DUPLICATE QA Contact: Sunil Choudhary <schoudha>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.z CC: aos-bugs, jokerman
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-24 19:54:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yu Qi Zhang 2020-02-11 15:48:04 UTC
Description of problem:

The rollback job https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 consistently fails rollbacks, albeit on different operators per run. Example snippet:

Feb 10 12:34:22.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)

---------------------------------------------------------
Received interrupt.  Running AfterSuite...
^C again to terminate immediately
Feb 10 12:34:25.365: INFO: Running AfterSuite actions on all nodes
Feb 10 12:34:25.365: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Feb 10 12:34:25.721: INFO: Condition Ready of node ip-10-0-149-249.us-west-1.compute.internal is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
Feb 10 12:34:25.721: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Feb 10 12:34:26.006: INFO: Condition Ready of node ip-10-0-149-249.us-west-1.compute.internal is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
STEP: Destroying namespace "e2e-k8s-sig-apps-daemonset-upgrade-8560" for this suite.
Feb 10 12:34:32.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:32.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:34:42.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:42.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:34:52.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:52.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:02.996: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:02.996: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:12.996: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:12.996: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:14.241: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Feb 10 12:35:20.109: INFO: namespace e2e-k8s-sig-apps-daemonset-upgrade-8560 deletion completed in 54.102780159s
Feb 10 12:35:20.109: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-k8s-service-upgrade-4709" for this suite.
Feb 10 12:35:22.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:22.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
SIGABRT: abort
PC=0x462161 m=0 sigcode=0


This run failed on kube-controller-manager, but each run fails on a different operator (openshift-apiserver-operator, openshift-service-catalog-apiserver-operator, etc.).

This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1791863, but I cannot say for sure, so I am opening this to track it.
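
For anyone triaging a similar run, here is a minimal sketch (not part of the original report) that dumps the same ClusterVersion conditions the test keeps polling ("Progressing", "Failing", etc.). It assumes a recent openshift/client-go whose generated Get takes a context, and a kubeconfig path in $KUBECONFIG; older clientsets use Get(name, options) without the context.

```go
// clusterversion_conditions.go: print the ClusterVersion conditions that the
// e2e test above polls every ten seconds. Sketch only; assumes a reachable
// cluster and a kubeconfig path exported in $KUBECONFIG.
package main

import (
	"context"
	"fmt"
	"os"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// The cluster-wide ClusterVersion object is named "version".
	cv, err := client.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Print every condition (Available, Progressing, Failing/Degraded, ...).
	for _, c := range cv.Status.Conditions {
		fmt.Printf("%s=%s: %s\n", c.Type, c.Status, c.Message)
	}
}
```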

Comment 2 Ryan Phillips 2020-02-11 20:35:01 UTC
The node Ready status going to false could mean an overloaded node. This PR [1] went in today to reserve more CPU and memory for the kubelet and CRI-O, which should help.

1. https://github.com/openshift/machine-config-operator/pull/1450
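
A quick way to see the effect of such a reservation on a live cluster is to compare each node's reported capacity with its allocatable resources (allocatable is capacity minus system/kube reservations and eviction thresholds). Below is a minimal client-go sketch, not tied to the PR above, assuming a kubeconfig in $KUBECONFIG and a client-go release with context-aware List calls:

```go
// node_reservations.go: show how much CPU/memory each node holds back from
// pods (capacity - allocatable), which is the headroom that kubelet/CRI-O
// reservations carve out. Sketch only.
package main

import (
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, res := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory} {
			capacity := n.Status.Capacity[res]
			allocatable := n.Status.Allocatable[res]
			// reserved = capacity - allocatable
			reserved := capacity.DeepCopy()
			reserved.Sub(allocatable)
			fmt.Printf("%s %s: capacity=%s allocatable=%s reserved=%s\n",
				n.Name, res, capacity.String(), allocatable.String(), reserved.String())
		}
	}
}
```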

Comment 3 Ryan Phillips 2020-02-11 20:37:58 UTC
Note: the fix has only gone into master so far, so the 4.1-to-4.2 upgrade job would not see the benefit yet. Going to check the release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3-to-4.4 job.

Comment 4 Ryan Phillips 2020-02-24 19:54:58 UTC

*** This bug has been marked as a duplicate of bug 1800319 ***