Bug 1703878

Summary:	[upgrade] Pod behind service load balancer becomes unavailable during cluster upgrade
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Cloud Compute	Assignee:	Jan Chaloupka <jchaloup>
Status:	CLOSED ERRATA	QA Contact:	Jianwei Hou <jhou>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.1.0	CC:	agarcial, gblomqui
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2019-06-04 10:48:10 UTC	Type:	Bug

Description Clayton Coleman 2019-04-29 01:29:09 UTC

In roughly one in five to one in six e2e runs, the upgrade service load balancer test fails. The test verifies that a pod behind the LB stays continuously reachable during the upgrade. There are several possible causes:

1. The PDB (PodDisruptionBudget) part of the test only runs on GCE - I will patch this shortly so that the PDB tests also run in our environment
2. Some variation of https://bugzilla.redhat.com/show_bug.cgi?id=1702414 could be impacting us
3. The MCD (machine-config daemon) is not draining nodes properly
4. The test may have an assumption that doesn't hold on OpenShift (if so, we need to discuss how to fix)

Needs investigation to determine whether we are disrupting workloads incorrectly during upgrade.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/883

Apr 28 00:25:37.121: Could not reach HTTP service through ad9d4be89694811e985461212303fb68-66553889.us-east-1.elb.amazonaws.com:80 after 2m0s

github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*ServiceTestJig).TestReachableHTTPWithRetriableErrorCodes(0xc0023f0880, 0xc0029d15e0, 0x45, 0x50, 0x909c4e0, 0x0, 0x0, 0x1bf08eb000)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service_util.go:855 +0x33c
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*ServiceTestJig).TestReachableHTTP(0xc0023f0880, 0xc0029d15e0, 0x45, 0x50, 0x1bf08eb000)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service_util.go:847 +0x75
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test.func1()

Comment 1 Clayton Coleman 2019-04-29 18:20:39 UTC

https://github.com/openshift/origin/pull/22711 will enable PDBs

Comment 2 Jan Chaloupka 2019-05-02 08:09:04 UTC

PR https://github.com/openshift/origin/pull/22711 was merged on May 1, 2019, 5:42 AM GMT+2. I went through the PRs that have already merged, and every `ci/prow/e2e-aws-upgrade` run after that timestamp was green. The `ci/prow/e2e-aws-upgrade` runs that did fail on those PRs all failed for the reason described in this report. I also checked new PRs and did not see any `ci/prow/e2e-aws-upgrade` run failing for the reasons mentioned in this report.

Comment 4 Clayton Coleman 2019-05-03 16:40:18 UTC

Seeing https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1146 in recent runs, which looks like a different failure:

May  3 15:34:41.108: Timed out waiting for service "service-test" to have a load balancer

github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*ServiceTestJig).waitForConditionOrFail(0xc002ba92c0, 0xc0029edfc0, 0x1f, 0xc0023f01a0, 0xc, 0x1176592e000, 0x4e4686e, 0x14, 0x50911d0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service_util.go:589 +0x1e9
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*ServiceTestJig).WaitForLoadBalancerOrFail(0xc002ba92c0, 0xc0029edfc0, 0x1f, 0xc0023f01a0, 0xc, 0x1176592e000, 0x25)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service_util.go:548 +0x15d
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).Setup(0xc001e585d0, 0xc0023a42c0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:52 +0x195
github.com/openshift/origin/test/e2e/upgrade.(*chaosMonkeyAdapter).Test(0xc002475840, 0xc002142960)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/e2e/upgrade/upgrade.go:165 +0x180
github.com/openshift/origin/test/e2e/upgrade.(*chaosMonkeyAdapter).Test-fm(0xc002142960)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/e2e/upgrade/upgrade.go:245 +0x34
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do.func1(0xc002142960, 0xc0027b5d50)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:89 +0x76
created by github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:86 +0xa7

Will spawn a separate bug for that.

Comment 5 Clayton Coleman 2019-05-03 16:48:52 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1706155

Comment 7 errata-xmlrpc 2019-06-04 10:48:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758