Bug 1793635 - Workloads on GCP are unreachable during 4.2.x to 4.3.0 upgrade sometimes
Summary: Workloads on GCP are unreachable during 4.2.x to 4.3.0 upgrade sometimes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.3.z
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1785457
Blocks:
 
Reported: 2020-01-21 18:01 UTC by Stephen Cuppett
Modified: 2023-09-07 21:32 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1785457
Environment:
Last Closed: 2020-03-24 14:32:35 UTC
Target Upstream Version:
Embargoed:




Links
System ID                              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata RHBA-2020:0858  0        None      None    None     2020-03-24 14:33:02 UTC

Description Stephen Cuppett 2020-01-21 18:01:49 UTC
+++ This bug was initially created as a clone of Bug #1785457 +++

Description of problem:

Upgrade tests for 4.2.12 to 4.3.0 in GCP are panicking [1] [2]


How reproducible:

Intermittent

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/166
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/167

Error:

Dec 19 19:12:32.526: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:42.732: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:52.526: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:59.890: INFO: Poke("http://34.74.108.57:80/echo?msg=hello"): Get http://34.74.108.57:80/echo?msg=hello: dial tcp 34.74.108.57:80: i/o timeout
Dec 19 19:12:59.894: INFO: Could not reach HTTP service through 34.74.108.57:80 after 2m0s
E1219 19:12:59.915508     263 runtime.go:78] Observed a panic: ginkgowrapper.FailurePanic{Message:"Dec 19 19:12:59.894: Could not reach HTTP service through 34.74.108.57:80 after 2m0s", Filename:"/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go", Line:915, FullStackTrace:"github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTPWithRetriableErrorCodes(0xc003dec240, 0xc004a8ab90, 0xc, 0x50, 0xa59baa0, 0x0, 0x0, 0x1bf08eb000)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:915 +0x306\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTP(...)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:896\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test.func1()\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:106 +0xa6\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc004f6fea0)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc004f6fea0, 0x77359400, 0x0, 0x1, 0xc0022742a0)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0xc004080101)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:105 +0xaf\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).Test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0x2)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:87 +0x54\ngithub.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc003f6d4c0, 0xc004086560)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/disruption/disruption.go:119 +0x33f\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc004086560, 0xc003f83740)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x76\ncreated by github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xa7"} (
Your test failed.
Ginkgo panics to prevent subsequent assertions from running.
Normally Ginkgo rescues this panic so you shouldn't see it.
But, if you make an assertion in a goroutine, Ginkgo can't capture the panic.
To circumvent this, you should call
	defer GinkgoRecover()
at the top of the goroutine that caused this panic.
)
goroutine 281 [running]:
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x5219a80, 0xc00423c480)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x5219a80, 0xc00423c480)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper.Fail.func1(0xc0021d4060, 0x54, 0x9fc703a, 0x8f, 0x393, 0xc0040f2c00, 0xbc8)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper/wrapper.go:63 +0xa1
panic(0x49a4360, 0x60dca90)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/github.com/onsi/ginkgo.Fail(0xc0021d4060, 0x54, 0xc004f6fb60, 0x1, 0x1)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go:266 +0xc8
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper.Fail(0xc0021d4060, 0x54, 0xc004f6fc08, 0x1, 0x1)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper/wrapper.go:67 +0x19b

--- Additional comment from W. Trevor King on 2019-12-19 23:11:19 UTC ---

Prettified version of the full stacktrace with \n -> real newlines:

github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTPWithRetriableErrorCodes(0xc003dec240, 0xc004a8ab90, 0xc, 0x50, 0xa59baa0, 0x0, 0x0, 0x1bf08eb000)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:915 +0x306
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTP(...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:896
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test.func1()
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:106 +0xa6
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc004f6fea0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc004f6fea0, 0x77359400, 0x0, 0x1, 0xc0022742a0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0xc004080101)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:105 +0xaf
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).Test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0x2)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:87 +0x54
github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc003f6d4c0, 0xc004086560)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/disruption/disruption.go:119 +0x33f
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc004086560, 0xc003f83740)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x76
created by github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xa7

--- Additional comment from W. Trevor King on 2019-12-19 23:17:37 UTC ---

That jig.go line is intentionally failing [1], and as Ginkgo told us, we're missing a 'defer GinkgoRecover' in chaosmonkey.  But it looks like we have one [2]?  Ah well, hopefully the test-suite folks understand this better than I do ;).

[1]: https://github.com/openshift/origin/blob/592a4a9d5d65cf50f78b01c8ecb4a99086835ca1/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go#L915
[2]: https://github.com/openshift/origin/blob/592a4a9d5d65cf50f78b01c8ecb4a99086835ca1/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go#L87-L88
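For context, the pattern Ginkgo is asking for looks roughly like this sketch (assuming the vendored Ginkgo v1 API; runInBackground is a hypothetical helper for illustration, not code from the suite):

package example

import "github.com/onsi/ginkgo"

// Any goroutine that can call ginkgo.Fail (directly or via an assertion)
// needs its own "defer ginkgo.GinkgoRecover()" so the panic raised by Fail
// is captured instead of escaping the goroutine.
func runInBackground(check func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer ginkgo.GinkgoRecover()
		check() // may call ginkgo.Fail or make assertions
	}()
	return done
}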

--- Additional comment from Maru Newby on 2019-12-21 00:58:03 UTC ---

Though the logs suggest a panic to be concerned about, this is not actually the case. The panic entries in the referenced logs are the unfortunate result of an interaction between the following:

 - ginkgo uses panic() [1] to propagate assertion failures up the call stack
 - wait.* polling methods (e.g. wait.Until, wait.PollImmediate) call runtime.HandleCrash [2] 
 - runtime.HandleCrash logs panics [3]

So if a call to ginkgo.Fail is made within a wait.* polling loop, the resulting panic will be logged before ginkgo has a chance to recover it higher up the call stack. As per this bz, this behavior is likely to be a source of confusion for the uninitiated.

It's not clear why it is important for wait.* polling methods to log panics and re-raise them, and ideally an upstream proposal would disable that behavior. Someone who wants to log a panic could always defer HandleCrash to run after the wait loop. 

1: https://github.com/openshift/origin/blob/master/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go#L266
2: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go#L151
3: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go#L47
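A minimal sketch of that interaction (assuming the vendored Ginkgo v1 and k8s.io/apimachinery wait helpers seen in the stack trace above; pollUntilReachable is a hypothetical stand-in for the service upgrade test loop):

package example

import (
	"time"

	"github.com/onsi/ginkgo"
	"k8s.io/apimachinery/pkg/util/wait"
)

// wait.Until wraps each iteration in runtime.HandleCrash, so when ginkgo.Fail
// panics inside the loop, HandleCrash logs "Observed a panic: ..." and
// re-raises it; Ginkgo only recovers it further up the stack, after the
// confusing log entry has already been written.
func pollUntilReachable(stopCh <-chan struct{}, probe func() error) {
	wait.Until(func() {
		if err := probe(); err != nil {
			ginkgo.Fail("Could not reach HTTP service: " + err.Error())
		}
	}, 10*time.Second, stopCh)
}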

--- Additional comment from Lalatendu Mohanty on 2019-12-23 16:28:53 UTC ---

As per discussion with Maru and Clayton, we should look into increasing the timeout for GCP, as the timeout was already bumped for AWS.
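For illustration only, a provider-dependent lag timeout could be selected along these lines (the constant values, the GCP case, and the helper function are hypothetical sketches of the suggestion above, not the change that was merged):

package example

import "time"

const (
	loadBalancerLagTimeoutDefault = 2 * time.Minute  // illustrative default
	loadBalancerLagTimeoutAWS     = 10 * time.Minute // AWS was already bumped upstream
	loadBalancerLagTimeoutGCP     = 10 * time.Minute // hypothetical GCP bump
)

// loadBalancerLagTimeout picks how long the e2e test waits for a load
// balancer to start passing traffic, based on the cloud provider.
func loadBalancerLagTimeout(provider string) time.Duration {
	switch provider {
	case "aws":
		return loadBalancerLagTimeoutAWS
	case "gce", "gke":
		return loadBalancerLagTimeoutGCP
	default:
		return loadBalancerLagTimeoutDefault
	}
}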

--- Additional comment from W. Trevor King on 2020-01-20 22:49:39 UTC ---



--- Additional comment from Clayton Coleman on 2020-01-20 23:12:43 UTC ---

Hold on.  This is NOT a setup issue.  The one you duped was about the cluster failing to accept traffic during an upgrade, not setup.  So by duping you lost the urgent status.  Resetting.

The duped bug has a legitimate "we drop user workload traffic on the floor during an upgrade" which is an upgrade blocker.

--- Additional comment from Stephen Cuppett on 2020-01-21 18:01:10 UTC ---

Setting to active development branch (4.4). Will create 4.3.z clone.

Comment 2 Clayton Coleman 2020-03-02 18:49:23 UTC
In general this specific bug is not a blocker to upgrades becoming available because it is not a regression from 4.1 behavior.  Instead, we are working on a comprehensive set of bug fixes that will be individually backported to mitigate the issues involved.

Comment 3 Ben Bennett 2020-03-13 13:15:00 UTC
With the merges of:
 https://github.com/openshift/multus-cni/pull/53
 https://github.com/openshift/cluster-network-operator/pull/485

I think we have addressed much of the problem.  Additional work is being done, and will be backported under separate BZs as appropriate.

Comment 6 zhaozhanqi 2020-03-16 06:24:50 UTC
Checked that this issue was not reproduced in recent jobs. Moving this bug to verified.

Comment 7 Brendan Shephard 2020-03-23 11:33:46 UTC
I have run this upgrade multiple times now on OSP13, 16, and Baremetal (on oVirt) and it has worked each time without issue. The upgrade path became available maybe a week and a half ago and I was able to click the handy upgrade button.

Are we still advising against this at this stage until this BZ is closed? Or did we already fix it, which is why the upgrade path now exists?

Comment 9 errata-xmlrpc 2020-03-24 14:32:35 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0858

