Bug 1801771 - [buildcop] e2e-aws-upgrade-rollback-4.1-to-4.2 consistently broken on Could not update deployment xxx
Keywords:
Status: CLOSED DUPLICATE of bug 1800319
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-11 15:48 UTC by Yu Qi Zhang
Modified: 2020-02-24 19:54 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-24 19:54:58 UTC
Target Upstream Version:
Embargoed:



Description Yu Qi Zhang 2020-02-11 15:48:04 UTC
Description of problem:

The rollback job https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 consistently fails the rollback, although a different operator fails on each run. Example snippet:

Feb 10 12:34:22.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)

---------------------------------------------------------
Received interrupt.  Running AfterSuite...
^C again to terminate immediately
Feb 10 12:34:25.365: INFO: Running AfterSuite actions on all nodes
Feb 10 12:34:25.365: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Feb 10 12:34:25.721: INFO: Condition Ready of node ip-10-0-149-249.us-west-1.compute.internal is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
Feb 10 12:34:25.721: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Feb 10 12:34:26.006: INFO: Condition Ready of node ip-10-0-149-249.us-west-1.compute.internal is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
STEP: Destroying namespace "e2e-k8s-sig-apps-daemonset-upgrade-8560" for this suite.
Feb 10 12:34:32.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:32.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:34:42.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:42.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:34:52.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:52.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:02.996: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:02.996: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:12.996: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:12.996: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:14.241: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Feb 10 12:35:20.109: INFO: namespace e2e-k8s-sig-apps-daemonset-upgrade-8560 deletion completed in 54.102780159s
Feb 10 12:35:20.109: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-k8s-service-upgrade-4709" for this suite.
Feb 10 12:35:22.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:22.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
SIGABRT: abort
PC=0x462161 m=0 sigcode=0


This run failed on kube-controller-manager, but each run has a different operator failing (openshift-apiserver-operator, openshift-service-catalog-apiserver-operator, etc.)

This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1791863, but I cannot say for sure, so I am opening this bug to track it.

Comment 2 Ryan Phillips 2020-02-11 20:35:01 UTC
The node status going to false could indicate an overloaded node. PR [1], which went in today, reserves more CPU and memory for the kubelet and CRI-O. This should help.

1. https://github.com/openshift/machine-config-operator/pull/1450
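For reference, this kind of resource reservation can also be expressed per MachineConfigPool through a KubeletConfig custom resource. The sketch below is illustrative only: the resource name is made up, and the `systemReserved` values are placeholders, not the values chosen in the linked PR.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  # hypothetical name for illustration
  name: worker-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      # standard label applied to the worker MachineConfigPool
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    # reserve capacity for system daemons (kubelet, CRI-O);
    # placeholder values, not those from the PR
    systemReserved:
      cpu: 500m
      memory: 1Gi
```

The Machine Config Operator renders this into the kubelet configuration on matching nodes, so pods can no longer schedule into the capacity reserved for system daemons.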

Comment 3 Ryan Phillips 2020-02-11 20:37:58 UTC
Note: the fix has only landed in master so far, so the 4.1-to-4.2 upgrade job would not see the benefit yet. Going to check the release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3-to-4.4 bot.

Comment 4 Ryan Phillips 2020-02-24 19:54:58 UTC

*** This bug has been marked as a duplicate of bug 1800319 ***

