Bug 1703878

Summary: [upgrade] Pod behind service load balancer becomes unavailable during cluster upgrade
Product: OpenShift Container Platform
Component: Cloud Compute
Version: 4.1.0
Target Release: 4.1.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Keywords: Upgrades
Type: Bug
Reporter: Clayton Coleman <ccoleman>
Assignee: Jan Chaloupka <jchaloup>
QA Contact: Jianwei Hou <jhou>
CC: agarcial, gblomqui
Doc Type: If docs needed, set a value
Last Closed: 2019-06-04 10:48:10 UTC

Description Clayton Coleman 2019-04-29 01:29:09 UTC
In roughly 1/5 to 1/6 of e2e runs, the upgrade service load balancer test fails (it verifies that a pod behind the LB stays continuously reachable during the upgrade). This could be caused by several things:

1. The PDB (PodDisruptionBudget) part of the test only runs on GCE - I will patch that restriction out shortly so that the PDB tests run in our environment
2. Some variation of https://bugzilla.redhat.com/show_bug.cgi?id=1702414 could be impacting us
3. The MCD (machine-config-daemon) is not draining nodes properly
4. The test may make an assumption that doesn't hold on OpenShift (if so, we need to discuss how to fix it)

Needs investigation to determine whether we are disrupting workloads incorrectly during upgrade.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/883

Apr 28 00:25:37.121: Could not reach HTTP service through ad9d4be89694811e985461212303fb68-66553889.us-east-1.elb.amazonaws.com:80 after 2m0s

github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*ServiceTestJig).TestReachableHTTPWithRetriableErrorCodes(0xc0023f0880, 0xc0029d15e0, 0x45, 0x50, 0x909c4e0, 0x0, 0x0, 0x1bf08eb000)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service_util.go:855 +0x33c
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*ServiceTestJig).TestReachableHTTP(0xc0023f0880, 0xc0029d15e0, 0x45, 0x50, 0x1bf08eb000)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service_util.go:847 +0x75
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test.func1()

Comment 1 Clayton Coleman 2019-04-29 18:20:39 UTC
https://github.com/openshift/origin/pull/22711 will enable PDBs
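
For context, a rough sketch of why a PDB matters for this test: during a node drain, a voluntary eviction is only allowed while the number of healthy pods stays at or above the budget, so at least one backend should remain behind the load balancer. The code below is a simplified model, not the Kubernetes implementation. The `budget` type and the `minAvailable: 1` value are assumptions made for illustration, and this report does not state which budget the test actually uses.

```go
package main

import "fmt"

// budget is a hypothetical, simplified stand-in for a PodDisruptionBudget
// expressed with minAvailable.
type budget struct {
	minAvailable int
}

// disruptionsAllowed returns how many healthy pods may be evicted
// voluntarily without dropping below minAvailable.
func (b budget) disruptionsAllowed(healthy int) int {
	if d := healthy - b.minAvailable; d > 0 {
		return d
	}
	return 0
}

func main() {
	b := budget{minAvailable: 1} // assumed value, for illustration only
	for _, healthy := range []int{2, 1} {
		fmt.Printf("healthy=%d evictions allowed=%d\n", healthy, b.disruptionsAllowed(healthy))
	}
	// Output:
	// healthy=2 evictions allowed=1
	// healthy=1 evictions allowed=0
}
```

Without such a budget, a drain is free to evict every backend pod at once, which would be consistent with the unreachable-service symptom in the description.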

Comment 2 Jan Chaloupka 2019-05-02 08:09:04 UTC
PR https://github.com/openshift/origin/pull/22711 was merged on May 1, 2019, 5:42 AM GMT+2. I went through the PRs that are already merged, and all of their `ci/prow/e2e-aws-upgrade` tests went green after that timestamp. At the same time, every failed `ci/prow/e2e-aws-upgrade` run I found had failed for the reason described in this report. I also checked newer PRs and did not see any `ci/prow/e2e-aws-upgrade` test failing for the reasons mentioned in this report.

Comment 4 Clayton Coleman 2019-05-03 16:40:18 UTC
I'm seeing https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1146 in recent runs, which looks like a different failure:

May  3 15:34:41.108: Timed out waiting for service "service-test" to have a load balancer

github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*ServiceTestJig).waitForConditionOrFail(0xc002ba92c0, 0xc0029edfc0, 0x1f, 0xc0023f01a0, 0xc, 0x1176592e000, 0x4e4686e, 0x14, 0x50911d0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service_util.go:589 +0x1e9
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework.(*ServiceTestJig).WaitForLoadBalancerOrFail(0xc002ba92c0, 0xc0029edfc0, 0x1f, 0xc0023f01a0, 0xc, 0x1176592e000, 0x25)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service_util.go:548 +0x15d
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).Setup(0xc001e585d0, 0xc0023a42c0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:52 +0x195
github.com/openshift/origin/test/e2e/upgrade.(*chaosMonkeyAdapter).Test(0xc002475840, 0xc002142960)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/e2e/upgrade/upgrade.go:165 +0x180
github.com/openshift/origin/test/e2e/upgrade.(*chaosMonkeyAdapter).Test-fm(0xc002142960)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/e2e/upgrade/upgrade.go:245 +0x34
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do.func1(0xc002142960, 0xc0027b5d50)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:89 +0x76
created by github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*chaosmonkey).Do
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:86 +0xa7
				

Will spawn a separate bug for that.

Comment 5 Clayton Coleman 2019-05-03 16:48:52 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1706155

Comment 7 errata-xmlrpc 2019-06-04 10:48:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758