Bug 1801771

Summary: [buildcop] e2e-aws-upgrade-rollback-4.1-to-4.2 consistently broken on Could not update deployment xxx
Product: OpenShift Container Platform Reporter: Yu Qi Zhang <jerzhang>
Component: Node Assignee: Ryan Phillips <rphillips>
Status: CLOSED DUPLICATE QA Contact: Sunil Choudhary <schoudha>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.z CC: aos-bugs, jokerman
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-24 19:54:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yu Qi Zhang 2020-02-11 15:48:04 UTC
Description of problem:

The rollback job https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 consistently fails rollbacks, albeit on different operators per run. Example snippet:

Feb 10 12:34:22.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)

---------------------------------------------------------
Received interrupt.  Running AfterSuite...
^C again to terminate immediately
Feb 10 12:34:25.365: INFO: Running AfterSuite actions on all nodes
Feb 10 12:34:25.365: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Feb 10 12:34:25.721: INFO: Condition Ready of node ip-10-0-149-249.us-west-1.compute.internal is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
Feb 10 12:34:25.721: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Feb 10 12:34:26.006: INFO: Condition Ready of node ip-10-0-149-249.us-west-1.compute.internal is false instead of true. Reason: NodeStatusUnknown, message: Kubelet stopped posting node status.
STEP: Destroying namespace "e2e-k8s-sig-apps-daemonset-upgrade-8560" for this suite.
Feb 10 12:34:32.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:32.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:34:42.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:42.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:34:52.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:34:52.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:02.996: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:02.996: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:12.996: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:12.996: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
Feb 10 12:35:14.241: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Feb 10 12:35:20.109: INFO: namespace e2e-k8s-sig-apps-daemonset-upgrade-8560 deletion completed in 54.102780159s
Feb 10 12:35:20.109: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-k8s-service-upgrade-4709" for this suite.
Feb 10 12:35:22.994: INFO: cluster upgrade is Progressing: Unable to apply 4.1.31: the update could not be applied
Feb 10 12:35:22.994: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (57 of 350)
SIGABRT: abort
PC=0x462161 m=0 sigcode=0


This run failed on kube-controller-manager, but each run fails on a different operator (openshift-apiserver-operator, openshift-service-catalog-apiserver-operator, etc.).

This may be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1791863, but I cannot say for sure, so I am opening this to track it.
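
For anyone triaging a similar run, here is a minimal sketch (not part of the original report) that dumps the same ClusterVersion conditions the test keeps polling ("Progressing", "Failing", etc.). It assumes a recent openshift/client-go whose generated Get takes a context, and a kubeconfig path in $KUBECONFIG; older clientsets use Get(name, options) without the context.

```go
// clusterversion_conditions.go: print the ClusterVersion conditions that the
// e2e test above polls every ten seconds. Sketch only; assumes a reachable
// cluster and a kubeconfig path exported in $KUBECONFIG.
package main

import (
	"context"
	"fmt"
	"os"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// The cluster-wide ClusterVersion object is named "version".
	cv, err := client.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Print every condition (Available, Progressing, Failing/Degraded, ...).
	for _, c := range cv.Status.Conditions {
		fmt.Printf("%s=%s: %s\n", c.Type, c.Status, c.Message)
	}
}
```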

Comment 2 Ryan Phillips 2020-02-11 20:35:01 UTC
The node Ready status going to false could mean an overloaded node. This PR [1] went in today to reserve more CPU and memory for the kubelet and CRI-O, which should help.

1. https://github.com/openshift/machine-config-operator/pull/1450
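
A quick way to see the effect of such a reservation on a live cluster is to compare each node's reported capacity with its allocatable resources (allocatable is capacity minus system/kube reservations and eviction thresholds). Below is a minimal client-go sketch, not tied to the PR above, assuming a kubeconfig in $KUBECONFIG and a client-go release with context-aware List calls:

```go
// node_reservations.go: show how much CPU/memory each node holds back from
// pods (capacity - allocatable), which is the headroom that kubelet/CRI-O
// reservations carve out. Sketch only.
package main

import (
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, res := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory} {
			capacity := n.Status.Capacity[res]
			allocatable := n.Status.Allocatable[res]
			// reserved = capacity - allocatable
			reserved := capacity.DeepCopy()
			reserved.Sub(allocatable)
			fmt.Printf("%s %s: capacity=%s allocatable=%s reserved=%s\n",
				n.Name, res, capacity.String(), allocatable.String(), reserved.String())
		}
	}
}
```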

Comment 3 Ryan Phillips 2020-02-11 20:37:58 UTC
Note: the fix has only gone into master so far, so the 4.1-to-4.2 upgrade job would not see the benefit yet. Going to check the release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3-to-4.4 job.

Comment 4 Ryan Phillips 2020-02-24 19:54:58 UTC

*** This bug has been marked as a duplicate of bug 1800319 ***