Description of problem:

Upgrade tests for 4.2.12 to 4.3.0 in GCP are panicking [1] [2]

How reproducible:

Intermittent

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/166
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/167

Error:

Dec 19 19:12:32.526: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:42.732: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:52.526: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:59.890: INFO: Poke("http://34.74.108.57:80/echo?msg=hello"): Get http://34.74.108.57:80/echo?msg=hello: dial tcp 34.74.108.57:80: i/o timeout
Dec 19 19:12:59.894: INFO: Could not reach HTTP service through 34.74.108.57:80 after 2m0s
E1219 19:12:59.915508 263 runtime.go:78] Observed a panic: ginkgowrapper.FailurePanic{Message:"Dec 19 19:12:59.894: Could not reach HTTP service through 34.74.108.57:80 after 2m0s", Filename:"/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go", Line:915, FullStackTrace:"github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTPWithRetriableErrorCodes(0xc003dec240, 0xc004a8ab90, 0xc, 0x50, 0xa59baa0, 0x0, 0x0, 0x1bf08eb000)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:915
+0x306\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTP(...)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:896\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test.func1()\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:106 +0xa6\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc004f6fea0)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc004f6fea0, 0x77359400, 0x0, 0x1, 0xc0022742a0)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0xc004080101)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:105 +0xaf\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).Test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0x2)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:87 +0x54\ngithub.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc003f6d4c0, 
0xc004086560)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/disruption/disruption.go:119 +0x33f\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc004086560, 0xc003f83740)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x76\ncreated by github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xa7"} ( Your test failed. Ginkgo panics to prevent subsequent assertions from running. Normally Ginkgo rescues this panic so you shouldn't see it. But, if you make an assertion in a goroutine, Ginkgo can't capture the panic. To circumvent this, you should call defer GinkgoRecover() at the top of the goroutine that caused this panic. 
) goroutine 281 [running]: github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x5219a80, 0xc00423c480) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3 github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82 panic(0x5219a80, 0xc00423c480) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5 github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper.Fail.func1(0xc0021d4060, 0x54, 0x9fc703a, 0x8f, 0x393, 0xc0040f2c00, 0xbc8) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper/wrapper.go:63 +0xa1 panic(0x49a4360, 0x60dca90) /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5 github.com/openshift/origin/vendor/github.com/onsi/ginkgo.Fail(0xc0021d4060, 0x54, 0xc004f6fb60, 0x1, 0x1) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go:266 +0xc8 github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper.Fail(0xc0021d4060, 0x54, 0xc004f6fc08, 0x1, 0x1) /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper/wrapper.go:67 +0x19b
Prettified version of the full stacktrace with \n -> real newlines:

github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTPWithRetriableErrorCodes(0xc003dec240, 0xc004a8ab90, 0xc, 0x50, 0xa59baa0, 0x0, 0x0, 0x1bf08eb000)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:915 +0x306
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTP(...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:896
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test.func1()
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:106 +0xa6
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc004f6fea0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc004f6fea0, 0x77359400, 0x0, 0x1, 0xc0022742a0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0xc004080101)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:105 +0xaf
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).Test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0x2)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:87 +0x54
github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc003f6d4c0, 0xc004086560)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/disruption/disruption.go:119 +0x33f
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc004086560, 0xc003f83740)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x76
created by github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xa7
That jig.go line is intentionally failing [1], and as Ginkgo told us, we're missing a 'defer GinkgoRecover' in chaosmonkey. But it looks like we have one [2]? Ah well, hopefully the test-suite folks understand this better than I do ;).

[1]: https://github.com/openshift/origin/blob/592a4a9d5d65cf50f78b01c8ecb4a99086835ca1/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go#L915
[2]: https://github.com/openshift/origin/blob/592a4a9d5d65cf50f78b01c8ecb4a99086835ca1/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go#L87-L88
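For anyone unfamiliar with the 'defer GinkgoRecover' advice: a recover deferred in one goroutine cannot catch a panic raised in another, which is why Ginkgo asks for the recover to be deferred at the top of the goroutine that makes the assertion. A minimal plain-Go sketch of that pattern (illustrative only, using a bare recover rather than Ginkgo's actual GinkgoRecover):

```go
// Plain-Go illustration of the pattern Ginkgo asks for: the recover must be
// deferred inside the goroutine that may panic; a recover in the spawning
// goroutine would not catch it.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// This plays the role of `defer GinkgoRecover()`: without a deferred
		// recover inside this goroutine, the panic would crash the process.
		defer func() {
			if r := recover(); r != nil {
				fmt.Println("recovered in goroutine:", r)
			}
		}()
		panic("assertion failed") // stand-in for a failing ginkgo assertion
	}()
	wg.Wait()
	fmt.Println("suite continues")
}
```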
Though the logs suggest a panic to be concerned about, this is not actually the case. The panic entries in the referenced logs are the unfortunate result of an interaction between the following:

- ginkgo uses panic() [1] to propagate assertion failures up the call stack
- wait.* polling methods (e.g. wait.Until, wait.PollImmediate) call runtime.HandleCrash [2]
- runtime.HandleCrash logs panics [3]

So if a call to ginkgo.Fail is made within a wait.* polling loop, the resulting panic will be logged before ginkgo has a chance to recover it higher up the call stack. As per this bz, this behavior is likely to be a source of confusion for the uninitiated.

It's not clear why it is important for wait.* polling methods to log panics and re-raise them, and ideally an upstream proposal would disable that behavior. Someone who wants to log a panic could always defer HandleCrash to run after the wait loop.

1: https://github.com/openshift/origin/blob/master/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go#L266
2: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go#L151
3: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go#L47
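The interaction above can be sketched in a few lines of plain Go (illustrative only; handleCrash stands in for runtime.HandleCrash, the panic stands in for ginkgo.Fail, and the outer recover stands in for Ginkgo's rescue of the failure panic):

```go
// Sketch of why an "Observed a panic" log line appears even though the
// panic is ultimately recovered and handled as an ordinary test failure.
package main

import "fmt"

// handleCrash mimics runtime.HandleCrash: log the panic, then re-raise it
// so callers further up the stack still see it.
func handleCrash() {
	if r := recover(); r != nil {
		fmt.Println("Observed a panic:", r) // the scary-looking log line
		panic(r)                            // ...but the panic still propagates
	}
}

// pollOnce mimics one wait.Until iteration whose body fails an assertion
// (ginkgo.Fail panics under the hood).
func pollOnce() {
	defer handleCrash()
	panic("Could not reach HTTP service")
}

func main() {
	// Mimics Ginkgo rescuing the failure panic higher up the call stack:
	// the test fails normally despite the log emitted below it.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("test failed (recovered):", r)
		}
	}()
	pollOnce()
}
```

So the log entry and the clean failure are the same panic observed at two different levels of the stack.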
As per discussion with Maru and Clayton, we should look into increasing the timeout for GCP, as the timeout was bumped for AWS.
*** Bug 1793174 has been marked as a duplicate of this bug. ***
Hold on. This is NOT a setup issue. The one you duped was the cluster failing to accept traffic during an upgrade, not setup. So by duping you lost the urgent status. Resetting. The duped bug reports a legitimate "we drop user workload traffic on the floor during an upgrade", which is an upgrade blocker.
Setting to active development branch (4.4). Will create 4.3.z clone.
The test is trying to hit a load balancer service and not a route, so it's not flowing through the haproxy ingress controller. I sifted through the workers' journal and I see the following entries around the time of the failed test:

Dec 19 19:12:59 ci-op--9d5rz-w-b-c7trv.c.openshift-gce-devel-ci.internal bash[1829]: Processing route for NIC 0/42:01:0a:00:20:03 as ens4 for 34.74.108.57
Dec 19 19:12:59 ci-op--9d5rz-w-b-c7trv.c.openshift-gce-devel-ci.internal hyperkube[1972]: I1219 19:12:59.814733 1972 prober.go:129] Readiness probe for "router-default-b5f87d8cd-qpphm_openshift-ingress(008a2c75-31cd-405d-8084-a28688b0b38a):router" succeeded

19:12:59 is the same time the network manager service starts on one of the nodes. Reassigning to the SDN team since this does not appear to be an ingress issue.
I don't have enough info yet to strongly implicate the SDN, and I think LoadBalancer Services are basically ingress issues, so I don't mind keeping it under Routing for now. But I certainly won't object to SDN folks taking a look if they're familiar with the moving parts here!
Clayton points out that this shows up in non-update jobs too, e.g. [1]:

2009/2256 Tests Failed
...
Get https://api.ci-op-j43hd3fm-91171.origin-ci-int-gce.dev.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp: lookup api.ci-op-j43hd3fm-91171.origin-ci-int-gce.dev.openshift.com on 10.142.0.37:53: no such host

with:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/971/build-log.txt | grep -c 'dial tcp: lookup api[.].* on.*:53: no such host'
4031

Looks like that's hitting ~20% of GCP e2e jobs [2], just like the originally-reported symptoms hit ~20% of update jobs [3].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/971
[2]: https://search.svc.ci.openshift.org/chart?name=e2e-gcp$&search=lookup%20api[.].*.dev.openshift.com.*no%20such%20host
[3]: https://search.svc.ci.openshift.org/chart?name=e2e-gcp-upgrade$&search=Could%20not%20reach%20HTTP%20service%20through%20.*%20after%202m0s
(In reply to W. Trevor King from comment #10)

The report is about connectivity to a LoadBalancer Service endpoint over the ingress IP. Help me understand how the cited apiserver connectivity failure examples relate? So far I'm not seeing this occur outside the context of the upgrade tests.
Just a few quick notes so far, with the first example [1].

Pods behind the LB service to which the test loses connectivity:

Dec 19 18:52:35.120: INFO: Waiting up to 2m0s for pod "service-test-mgggt" in namespace "e2e-k8s-service-upgrade-8115" to be "running and ready"
Dec 19 18:52:35.120: INFO: Waiting up to 2m0s for pod "service-test-9slxq" in namespace "e2e-k8s-service-upgrade-8115" to be "running and ready"

Here's a brief timeline from the test:

# Upgrade starts
Dec 19 18:52:47.169: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-19-131558
Dec 19 18:53:02.199: INFO: cluster upgrade is Progressing: Working towards registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-19-131558: downloading update

# First service connectivity failure
Dec 19 19:01:34.133: INFO: Poke("http://34.74.169.164:80/echo?msg=hello"): Get http://34.74.169.164:80/echo?msg=hello: dial tcp 34.74.169.164:80: i/o timeout

# Second service connectivity failure
Dec 19 19:02:06.133: INFO: Poke("http://34.74.169.164:80/echo?msg=hello"): Get http://34.74.169.164:80/echo?msg=hello: dial tcp 34.74.169.164:80: i/o timeout

# Last service connectivity failure
Dec 19 19:05:52.143: INFO: Could not reach HTTP service through 34.74.169.164:80 after 2m0s
E1219 19:05:52.143855 235 runtime.go:78] Observed a panic: ginkgowrapper.FailurePanic{Message:"Dec 19 19:05:52.143: Could not reach HTTP service through 34.74.169.164:80 after 2m0s",

As far as the worker journal, there are some potential clues that might help narrow the problem space.
Here's a successful Kubelet probe of one of the pods at the same time the test reports a failure:

Dec 19 19:02:06 ci-op--jww5s-w-d-8ghv4.c.openshift-gce-devel-ci.internal hyperkube[1972]: I1219 19:02:06.102548 1972 prober.go:125] Readiness probe for "service-test-mgggt_e2e-k8s-service-upgrade-8115(b71fdf22-2290-11ea-b0e7-42010a000002):netexec" succeeded

There are other successful probe logs around the times of the test failures. There aren't any reboots during the problem timeframe. Here's a reboot that happens much later:

Dec 19 19:20:41 ci-op--jww5s-w-b-7gvsm.c.openshift-gce-devel-ci.internal root[46108]: machine-config-daemon[38572]: initiating reboot: Node will reboot into config rendered-worker-09bd6842a658db3b5673433304ae1463

So far this leads me to believe the pods are created at the right time and are responsive, but that traffic isn't making it from the LB into the cluster, or from the ingress point to the pod. So maybe we can at least rule out the test code itself.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/166
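For context on what the test side is actually doing, here is a loose plain-Go sketch of the kind of poll-with-deadline loop behind the "Poke(...)" lines and the 2m0s budget. This is illustrative only, not the jig's real code: pokeUntilReachable and its signature are made up here, and a local httptest server stands in for the LB-backed echo service.

```go
// Sketch of an HTTP reachability poll: retry GETs against a URL until it
// answers 200 or the timeout budget is exhausted.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// pokeUntilReachable polls url until it returns 200 or timeout expires.
// (Hypothetical helper; loosely mirrors the jig's TestReachableHTTP.)
func pokeUntilReachable(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				fmt.Printf("Poke(%q): %s\n", url, body)
				return nil
			}
		}
		time.Sleep(100 * time.Millisecond) // retry interval
	}
	return fmt.Errorf("could not reach HTTP service through %s after %s", url, timeout)
}

func main() {
	// Local stand-in for the LB-backed echo service used by the upgrade test.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.URL.Query().Get("msg"))
	}))
	defer srv.Close()

	if err := pokeUntilReachable(srv.URL+"/echo?msg=hello", 5*time.Second); err != nil {
		fmt.Println(err)
	}
}
```

The point for this bug: every GET in that window failing (i/o timeout or no route to host) for the full budget is what turns transient LB disruption into a test failure, which matches the evidence above that the pods themselves stay healthy.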
Just a quick update to say that we still don't have a root cause. Clayton did find a second potentially correlated disruption through the *apiserver LB*:

Dec 19 19:01:22.980 W clusteroperator/openshift-apiserver changed Available to False: AvailableOpenShiftAPICheckFailed: Available: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)

Which is suspiciously close to the first test LB failure at 19:01:34. The apiserver is killed at 18:58:04. The kube controller manager restarted at 18:59:29. Current runs lack lots of cluster state just prior to the upgrade (like the KCM logs and prior event store), so it's tough to say whether the KCM did some unexpected reconciliation of the test LB. It's also not clear yet whether the GCP routing hack [1][2] is a red herring here (given we're seeing disruption to workers).

I think maybe I've stretched the limit of what I can figure out with post-mortem analysis, so my plan is to try moving on to reproducing with test PRs that dump more state for analysis.

[1] https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/481
[2] https://github.com/openshift/machine-config-operator/pull/1031
P.S. I still haven't totally ruled out the possibility of iptables rules inconsistencies. To rule that out, it would be useful to have a cross-node iptables state dump from right before the upgrade and right after a connectivity loss is observed.

P.P.S. Adding GCP load balancer state dumps at those critical times might also be valuable.
While debugging, I noticed we're seeing the same disruption occur even on some successful runs [1]. Apparently only sustained disruptions are fatal to the test. I now lack evidence of a strong correlation between any particular 4.3 pod startup and the disruptions. The AvailableOpenShiftAPICheckFailed correlation was also not present in this run. I think iptables dumps from test pod nodes as soon as failure is observed would be useful. GCP target pool health status as well. I wasn't able to implement either today, but hopefully the problem space is somewhat narrowed by the latest info.

[1] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/24456/pull-ci-openshift-origin-release-4.3-e2e-gcp-upgrade/165/artifacts/e2e-gcp-upgrade/container-logs/test.log
Job-detail page for the run above is [1], in case folks want to look at other artifacts or anything.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24456/pull-ci-openshift-origin-release-4.3-e2e-gcp-upgrade/165
Significant detail found in 165 [1]. Disruption logs:

Jan 24 20:57:21.095: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: connect: no route to host
Jan 24 20:58:07.120: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: i/o timeout
Jan 24 20:59:02.175: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: i/o timeout
Jan 24 21:08:56.224: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: i/o timeout
Jan 24 21:15:42.019: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: connect: no route to host

KCM logs:

I0124 21:13:41.389635 1 gce_loadbalancer.go:119] EnsureLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): ensure EXTERNAL loadbalancer
I0124 21:13:42.265759 1 gce_loadbalancer_external.go:75] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test), us-east1, , [TCP/80], [ci-op-t4f9k-w-d-8mbl7.c.openshift-gce-devel-ci.internal ci-op-t4f9k-m-0.c.openshift-gce-devel-ci.internal ci-op-t4f9k-m-1.c.openshift-gce-devel-ci.internal ci-op-t4f9k-w-b-gm7q9.c.openshift-gce-devel-ci.internal], map[])
I0124 21:13:42.265793 1 gce_loadbalancer_external.go:84] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Desired network tier "Premium".
I0124 21:13:44.385075 1 gce_loadbalancer_external.go:158] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Ensured IP address 35.237.108.235 (tier: Premium).
I0124 21:13:44.788784 1 gce_loadbalancer_external.go:229] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Service needs nodes health checks.
I0124 21:14:19.307125 1 gce_loadbalancer_external.go:514] ensureTargetPoolAndHealthCheck(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Updated target pool (with 4 hosts).
I0124 21:14:19.464967 1 gce_loadbalancer_external.go:694] Checking http health check params k8s-d6825a0e603f25b6-node
I0124 21:14:22.356575 1 gce_loadbalancer_external.go:134] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Released static IP 35.237.108.235.
I0124 21:14:22.356607 1 gce_loadbalancer.go:159] EnsureLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): done ensuring loadbalancer.
I0124 21:15:19.417780 1 gce_loadbalancer.go:172] UpdateLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): updating with 4 nodes
I0124 21:15:53.286292 1 gce_loadbalancer.go:180] UpdateLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): done updating. err: <nil>
I0124 21:17:48.384968 1 gce_loadbalancer.go:172] UpdateLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): updating with 5 nodes
I0124 21:18:08.543536 1 gce_loadbalancer.go:180] UpdateLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): done updating. err: <nil>
I0124 21:18:48.861795 1 gce_loadbalancer.go:193] EnsureLoadBalancerDeleted(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): deleting loadbalancer
I0124 21:18:48.861838 1 gce_loadbalancer_external.go:340] ensureExternalLoadBalancerDeleted(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting forwarding rule.
I0124 21:18:48.862167 1 gce_loadbalancer_external.go:322] ensureExternalLoadBalancerDeleted(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting firewall rule.
I0124 21:18:48.862181 1 gce_loadbalancer_external.go:336] ensureExternalLoadBalancerDeleted(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting IP address.
I0124 21:19:08.836267 1 gce_loadbalancer_external.go:346] ensureExternalLoadBalancerDeleted(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting target pool.
I0124 21:19:12.809910 1 gce_loadbalancer_external.go:382] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting health check adaad6bd00035412893a211500fd4294.
I0124 21:19:13.065985 1 gce_loadbalancer_external.go:399] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Health check adaad6bd00035412893a211500fd4294 is already deleted.
I0124 21:19:13.066014 1 gce_loadbalancer_external.go:404] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting health check firewall k8s-adaad6bd00035412893a211500fd4294-http-hc.
I0124 21:19:13.182689 1 gce_loadbalancer_external.go:382] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting health check k8s-d6825a0e603f25b6-node.
I0124 21:19:17.089005 1 gce_loadbalancer_external.go:404] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting health check firewall k8s-d6825a0e603f25b6-node-http-hc.
I0124 21:19:24.510827 1 gce_loadbalancer.go:201] EnsureLoadBalancerDeleted(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): done deleting loadbalancer. err: <nil>

Notice that about 2 minutes before the last pre-reboot disruption, coinciding with the 2 minute poke timeout, the KCM is messing with the GCP LB/health check. I haven't yet dug into the details of the KCM activity, but it certainly looks suspicious in this context. It's possible the 4.2 KCM (whose logs are lost post-upgrade of the pod) is doing similar things during reconciliation prior to the other disruptions. I'm going to try and preserve 4.2 KCM logs in a 4.2 PR to see.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24456/pull-ci-openshift-origin-release-4.3-e2e-gcp-upgrade/165
Reading through the history on this bug, I'm not clear if this is a temporary outage from which clusters eventually recover on their own, or if folks need to poke something to unstick the failing cluster, or how healthy the cluster is while stuck. We have populated must-gather (e.g. [1]) in at least some of these cases, so the cluster seems to be at least mostly alive even when hitting whatever the root cause issue is here. Do we have a recommended recovery procedure or anything yet? [1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/167/artifacts/e2e-gcp-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-aad43a50042f7707abb70968c9382271304aa01811953bf1ee5431bef0ac0ae9/
In 4.2.18 -> 4.3.1 GCP CI [1]:

fail [k8s.io/kubernetes/test/e2e/framework/service/jig.go:915]: Feb 5 22:40:46.161: Could not reach HTTP service through 35.243.241.128:80 after 2m0s

so this has not been magically fixed with the new releases.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/286
Updating the description to capture the current state. During upgrades there is a brief disruption to workloads running on the pod network while OVS is being updated. We are assessing solutions that would prevent this disruption. While the 4.2 to 4.3 upgrade on GCP is worse than most, the problem can occur on all releases of OpenShift.
*** Bug 1791117 has been marked as a duplicate of this bug. ***
With the merges of:

- https://github.com/openshift/cluster-network-operator/pull/484
- https://github.com/openshift/multus-cni/pull/53

I think we have addressed much of the problem.
Did not reproduce this issue in a recent upgrade. Verified this bug.
Formally linking the two pull requests from comment 31.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
Bug is CLOSED ERRATA. Nobody needs info on it anymore (or if they do, they should comment again with the information they still need).
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475