Bug 1785457 - Workloads are unreachable during upgrades sometimes
Summary: Workloads are unreachable during upgrades sometimes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1791117 1793174 (view as bug list)
Depends On:
Blocks: 1793635 1794169
Reported: 2019-12-19 22:12 UTC by Lalatendu Mohanty
Modified: 2021-06-18 07:22 UTC (History)
44 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1793635 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:20:55 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 484 0 None closed Bug 1805987: Uses the readiness indicator file option for Multus 2020-10-23 06:14:09 UTC
Github openshift multus-cni pull 53 0 None closed Bug 1809285: Removes configuration invalidation [4.4 backport] 2020-10-23 06:14:10 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:21:35 UTC

Description Lalatendu Mohanty 2019-12-19 22:12:39 UTC
Description of problem:

Upgrade tests for 4.2.12 to 4.3.0 in GCP are panicking [1] [2]


How reproducible:

Intermittent

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/166
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/167

Error:

Dec 19 19:12:32.526: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:42.732: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:52.526: INFO: cluster upgrade is Progressing: Working towards 4.3.0-0.nightly-2019-12-19-131558: 84% complete
Dec 19 19:12:59.890: INFO: Poke("http://34.74.108.57:80/echo?msg=hello"): Get http://34.74.108.57:80/echo?msg=hello: dial tcp 34.74.108.57:80: i/o timeout
Dec 19 19:12:59.894: INFO: Could not reach HTTP service through 34.74.108.57:80 after 2m0s
E1219 19:12:59.915508     263 runtime.go:78] Observed a panic: ginkgowrapper.FailurePanic{Message:"Dec 19 19:12:59.894: Could not reach HTTP service through 34.74.108.57:80 after 2m0s", Filename:"/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go", Line:915, FullStackTrace:"github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTPWithRetriableErrorCodes(0xc003dec240, 0xc004a8ab90, 0xc, 0x50, 0xa59baa0, 0x0, 0x0, 0x1bf08eb000)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:915 +0x306\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTP(...)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:896\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test.func1()\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:106 +0xa6\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc004f6fea0)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc004f6fea0, 0x77359400, 0x0, 0x1, 0xc0022742a0)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 
+0xf8\ngithub.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0xc004080101)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:105 +0xaf\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).Test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0x2)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:87 +0x54\ngithub.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc003f6d4c0, 0xc004086560)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/disruption/disruption.go:119 +0x33f\ngithub.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc004086560, 0xc003f83740)\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x76\ncreated by github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do\n\t/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xa7"} (
Your test failed.
Ginkgo panics to prevent subsequent assertions from running.
Normally Ginkgo rescues this panic so you shouldn't see it.
But, if you make an assertion in a goroutine, Ginkgo can't capture the panic.
To circumvent this, you should call
	defer GinkgoRecover()
at the top of the goroutine that caused this panic.
)
goroutine 281 [running]:
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x5219a80, 0xc00423c480)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x5219a80, 0xc00423c480)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper.Fail.func1(0xc0021d4060, 0x54, 0x9fc703a, 0x8f, 0x393, 0xc0040f2c00, 0xbc8)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper/wrapper.go:63 +0xa1
panic(0x49a4360, 0x60dca90)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/github.com/onsi/ginkgo.Fail(0xc0021d4060, 0x54, 0xc004f6fb60, 0x1, 0x1)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go:266 +0xc8
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper.Fail(0xc0021d4060, 0x54, 0xc004f6fc08, 0x1, 0x1)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/ginkgowrapper/wrapper.go:67 +0x19b

Comment 1 W. Trevor King 2019-12-19 23:11:19 UTC
Prettified version of the full stacktrace with \n -> real newlines:

github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTPWithRetriableErrorCodes(0xc003dec240, 0xc004a8ab90, 0xc, 0x50, 0xa59baa0, 0x0, 0x0, 0x1bf08eb000)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:915 +0x306
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service.(*TestJig).TestReachableHTTP(...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go:896
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test.func1()
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:106 +0xa6
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc004f6fea0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc004f6fea0, 0x77359400, 0x0, 0x1, 0xc0022742a0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0xc004080101)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:105 +0xaf
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades.(*ServiceUpgradeTest).Test(0xc0030e16e0, 0xc002aaa8c0, 0xc0022742a0, 0x2)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/upgrades/services.go:87 +0x54
github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc003f6d4c0, 0xc004086560)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/disruption/disruption.go:119 +0x33f
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1(0xc004086560, 0xc003f83740)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x76
created by github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go:87 +0xa7

Comment 2 W. Trevor King 2019-12-19 23:17:37 UTC
That jig.go line is intentionally failing [1], and as Ginkgo told us, we're missing a 'defer GinkgoRecover' in chaosmonkey.  But it looks like we have one [2]?  Ah well, hopefully the test-suite folks understand this better than I do ;).

[1]: https://github.com/openshift/origin/blob/592a4a9d5d65cf50f78b01c8ecb4a99086835ca1/vendor/k8s.io/kubernetes/test/e2e/framework/service/jig.go#L915
[2]: https://github.com/openshift/origin/blob/592a4a9d5d65cf50f78b01c8ecb4a99086835ca1/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go#L87-L88

Comment 3 Maru Newby 2019-12-21 00:58:03 UTC
Though the logs suggest a panic to be concerned about, this is not actually the case. The panic entries in the referenced logs are the unfortunate result of an interaction between the following:

 - ginkgo uses panic() [1] to propagate assertion failures up the call stack
 - wait.* polling methods (e.g. wait.Until, wait.PollImmediate) call runtime.HandleCrash [2] 
 - runtime.HandleCrash logs panics [3]

So if a call to ginkgo.Fail is made within a wait.* polling loop, the resulting panic will be logged before ginkgo has a chance to recover it higher up the call stack. As per this bz, this behavior is likely to be a source of confusion for the uninitiated.

It's not clear why it is important for wait.* polling methods to log panics and re-raise them, and ideally an upstream proposal would disable that behavior. Someone who wants to log a panic could always defer HandleCrash to run after the wait loop. 

1: https://github.com/openshift/origin/blob/master/vendor/github.com/onsi/ginkgo/ginkgo_dsl.go#L266
2: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go#L151
3: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go#L47
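The interaction described above can be reproduced with a minimal, stdlib-only sketch (no ginkgo or apimachinery involved; the names handleCrash, pollOnce, and runTest are invented stand-ins for runtime.HandleCrash, a wait.* iteration, and the GinkgoRecover layer). Note that the panic gets logged even though it is ultimately recovered higher up the stack, which is exactly why the CI logs look alarming:

```go
package main

import (
	"fmt"
)

// handleCrash mimics apimachinery's runtime.HandleCrash: it logs the panic
// and then re-raises it so callers further up the stack still see it.
func handleCrash(log *[]string) {
	if r := recover(); r != nil {
		*log = append(*log, fmt.Sprintf("Observed a panic: %v", r))
		panic(r)
	}
}

// pollOnce mimics a wait.* iteration running an assertion that fails by
// panicking, the way ginkgo.Fail does.
func pollOnce(log *[]string) {
	defer handleCrash(log)
	panic("assertion failed")
}

// runTest mimics the recovery (GinkgoRecover) at the top of the test's
// call stack, which turns the panic back into an ordinary test failure.
func runTest(log *[]string) (recovered string) {
	defer func() {
		if r := recover(); r != nil {
			recovered = fmt.Sprintf("%v", r)
		}
	}()
	pollOnce(log)
	return ""
}

func main() {
	var log []string
	rec := runTest(&log)
	fmt.Println(log[0])          // the scary "Observed a panic" entry
	fmt.Println("recovered:", rec) // ...yet the failure was handled normally
}
```

So the "Observed a panic" log line and a cleanly reported test failure are not contradictory; the former is just a side effect of where the assertion ran.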

Comment 4 Lalatendu Mohanty 2019-12-23 16:28:53 UTC
As per discussion with Maru and Clayton, we should look into increasing the timeout for GCP, as the timeout was already bumped for AWS.

Comment 5 W. Trevor King 2020-01-20 22:49:39 UTC
*** Bug 1793174 has been marked as a duplicate of this bug. ***

Comment 6 Clayton Coleman 2020-01-20 23:12:43 UTC
Hold on.  This is NOT a setup issue.  The one you duped was the cluster failing to accept traffic during an upgrade, not setup.  So by duping you lost the urgent status.  Resetting.

The duped bug has a legitimate "we drop user workload traffic on the floor during an upgrade" which is an upgrade blocker.

Comment 7 Stephen Cuppett 2020-01-21 18:01:10 UTC
Setting to active development branch (4.4). Will create 4.3.z clone.

Comment 8 Daneyon Hansen 2020-01-22 23:14:09 UTC
The test is trying to hit a load balancer service and not a route, so it's not flowing through the haproxy ingress controller. I sifted through the workers' journal and I see the following entries around the time of the failed test:

Dec 19 19:12:59 ci-op--9d5rz-w-b-c7trv.c.openshift-gce-devel-ci.internal bash[1829]: Processing route for NIC 0/42:01:0a:00:20:03 as ens4 for 34.74.108.57
Dec 19 19:12:59 ci-op--9d5rz-w-b-c7trv.c.openshift-gce-devel-ci.internal hyperkube[1972]: I1219 19:12:59.814733    1972 prober.go:129] Readiness probe for "router-default-b5f87d8cd-qpphm_openshift-ingress(008a2c75-31cd-405d-8084-a28688b0b38a):router" succeeded

19:12:59 is the same time the network manager service starts on one of the nodes.

Reassigning to the sdn team since this does not appear to be an ingress issue.

Comment 9 Dan Mace 2020-01-23 00:33:27 UTC
I don't have enough info yet to strongly implicate the SDN, and I think LoadBalancer Services are basically ingress issues, so I don't mind keeping it under Routing for now. But I certainly won't object to SDN folks taking a look if they're familiar with the moving parts here!

Comment 10 W. Trevor King 2020-01-23 05:59:57 UTC
Clayton points out that this shows up in non-update jobs too, e.g. [1]:

  2009/2256 Tests Failed
  ...
  Get https://api.ci-op-j43hd3fm-91171.origin-ci-int-gce.dev.openshift.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: dial tcp: lookup api.ci-op-j43hd3fm-91171.origin-ci-int-gce.dev.openshift.com on 10.142.0.37:53: no such host

with:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/971/build-log.txt | grep -c 'dial tcp: lookup api[.].* on.*:53: no such host'
  4031

Looks like that's hitting ~20% of GCP e2e jobs [2], just like the originally-reported symptoms hit ~20% of update jobs [3].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/971
[2]: https://search.svc.ci.openshift.org/chart?name=e2e-gcp$&search=lookup%20api[.].*.dev.openshift.com.*no%20such%20host
[3]: https://search.svc.ci.openshift.org/chart?name=e2e-gcp-upgrade$&search=Could%20not%20reach%20HTTP%20service%20through%20.*%20after%202m0s

Comment 11 Dan Mace 2020-01-23 13:57:36 UTC
(In reply to W. Trevor King from comment #10)
> [comment #10 quoted in full; trimmed here]

The report is about connectivity to a LoadBalancer Service endpoint over the ingress IP. Help me understand how the cited apiserver connectivity failure examples relate? So far I'm not seeing this occur outside the context of the upgrade tests.

Comment 12 Dan Mace 2020-01-23 14:47:03 UTC
Just a few quick notes so far, with the first example[1].

Pods behind the LB service to which the test loses connectivity:

  Dec 19 18:52:35.120: INFO: Waiting up to 2m0s for pod "service-test-mgggt" in namespace "e2e-k8s-service-upgrade-8115" to be "running and ready"
  Dec 19 18:52:35.120: INFO: Waiting up to 2m0s for pod "service-test-9slxq" in namespace "e2e-k8s-service-upgrade-8115" to be "running and ready"


Here's a brief timeline from the test:

  # Upgrade starts
  Dec 19 18:52:47.169: INFO: Starting upgrade to version= image=registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-19-131558
  Dec 19 18:53:02.199: INFO: cluster upgrade is Progressing: Working towards registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-12-19-131558: downloading update

  # First service connectivity failure
  Dec 19 19:01:34.133: INFO: Poke("http://34.74.169.164:80/echo?msg=hello"): Get http://34.74.169.164:80/echo?msg=hello: dial tcp 34.74.169.164:80: i/o timeout

  # Second service connectivity failure
  Dec 19 19:02:06.133: INFO: Poke("http://34.74.169.164:80/echo?msg=hello"): Get http://34.74.169.164:80/echo?msg=hello: dial tcp 34.74.169.164:80: i/o timeout

  # Last service connectivity failure
  Dec 19 19:05:52.143: INFO: Could not reach HTTP service through 34.74.169.164:80 after 2m0s
  E1219 19:05:52.143855     235 runtime.go:78] Observed a panic: ginkgowrapper.FailurePanic{Message:"Dec 19 19:05:52.143: Could not reach HTTP service through 34.74.169.164:80 after 2m0s",



As far as the worker journal goes, there are some potential clues that might help narrow the problem space. Here's a successful kubelet probe of one of the pods at the same time the test reports a failure:

  Dec 19 19:02:06 ci-op--jww5s-w-d-8ghv4.c.openshift-gce-devel-ci.internal hyperkube[1972]: I1219 19:02:06.102548    1972 prober.go:125] Readiness probe for "service-test-mgggt_e2e-k8s-service-upgrade-8115(b71fdf22-2290-11ea-b0e7-42010a000002):netexec" succeeded

There are other successful probe logs around the times of the test failures.

There aren't any reboots during the problem timeframe. Here's a reboot that happens much later:

  Dec 19 19:20:41 ci-op--jww5s-w-b-7gvsm.c.openshift-gce-devel-ci.internal root[46108]: machine-config-daemon[38572]: initiating reboot: Node will reboot into config rendered-worker-09bd6842a658db3b5673433304ae1463


So far this leads me to believe the pods are created at the right time and are responsive, but that traffic isn't making it from the LB into the cluster, or from the ingress point to the pod. So maybe we can at least rule out the test code itself.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/166

Comment 14 Dan Mace 2020-01-23 22:05:20 UTC
Just a quick update to say that we still don't have a root cause.

Clayton did find a second potentially correlated disruption through the *apiserver LB*:

  Dec 19 19:01:22.980 W clusteroperator/openshift-apiserver changed Available to False: AvailableOpenShiftAPICheckFailed: Available: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)

Which is suspiciously close to the first test LB failure at 19:01:34.

The apiserver is killed at 18:58:04.

Kube controller manager restarted at 18:59:29.

Current runs lack lots of cluster state just prior to the upgrade (like the KCM logs and prior event store), so it's tough to say if the KCM did some unexpected reconciliation of the test LB. It's also not clear yet whether the GCP routing hack[1][2] is a red herring here (given we're seeing disruption to workers). I think maybe I've stretched the limit of what I can figure out with post-mortem analysis, so my plan is to try moving on to reproducing with test PRs that dump more state for analysis.

[1] https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/481
[2] https://github.com/openshift/machine-config-operator/pull/1031

Comment 15 Dan Mace 2020-01-23 22:11:49 UTC
P.S., I still haven't totally ruled out the possibility of iptables rules inconsistencies. To do that it would be useful to have a cross-node iptables state dump from right before the upgrade and right after a connectivity loss is observed.

P.P.S, adding GCP load balancer state dumps at those critical times might also be valuable.

Comment 16 Dan Mace 2020-01-24 23:28:11 UTC
While debugging, I noticed we're seeing the same disruption occur even on some successful runs[1]. Apparently only sustained disruptions are fatal to the test. I now lack evidence of a strong correlation between any particular 4.3 pod startup and the disruptions. The AvailableOpenShiftAPICheckFailed correlation was also not present in this run.

I think iptables dumps from test pod nodes as soon as failure is observed would be useful. GCP target pool health status as well. I wasn't able to implement either today, but hopefully the problem space is somewhat narrowed by the latest info.


[1] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/24456/pull-ci-openshift-origin-release-4.3-e2e-gcp-upgrade/165/artifacts/e2e-gcp-upgrade/container-logs/test.log

Comment 17 W. Trevor King 2020-01-24 23:45:45 UTC
Job-detail page for^ is [1], in case folks want to look at other artifacts or anything.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24456/pull-ci-openshift-origin-release-4.3-e2e-gcp-upgrade/165

Comment 18 Dan Mace 2020-01-27 15:40:18 UTC
Significant detail found in 165[1].

Disruption logs:

Jan 24 20:57:21.095: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: connect: no route to host
Jan 24 20:58:07.120: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: i/o timeout
Jan 24 20:59:02.175: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: i/o timeout
Jan 24 21:08:56.224: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: i/o timeout
Jan 24 21:15:42.019: INFO: Poke("http://35.237.108.235:80/echo?msg=hello"): Get http://35.237.108.235:80/echo?msg=hello: dial tcp 35.237.108.235:80: connect: no route to host


KCM logs:

I0124 21:13:41.389635       1 gce_loadbalancer.go:119] EnsureLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): ensure EXTERNAL loadbalancer
I0124 21:13:42.265759       1 gce_loadbalancer_external.go:75] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test), us-east1, , [TCP/80], [ci-op-t4f9k-w-d-8mbl7.c.openshift-gce-devel-ci.internal ci-op-t4f9k-m-0.c.openshift-gce-devel-ci.internal ci-op-t4f9k-m-1.c.openshift-gce-devel-ci.internal ci-op-t4f9k-w-b-gm7q9.c.openshift-gce-devel-ci.internal], map[])
I0124 21:13:42.265793       1 gce_loadbalancer_external.go:84] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Desired network tier "Premium".
I0124 21:13:44.385075       1 gce_loadbalancer_external.go:158] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Ensured IP address 35.237.108.235 (tier: Premium).
I0124 21:13:44.788784       1 gce_loadbalancer_external.go:229] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Service needs nodes health checks.
I0124 21:14:19.307125       1 gce_loadbalancer_external.go:514] ensureTargetPoolAndHealthCheck(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Updated target pool (with 4 hosts).
I0124 21:14:19.464967       1 gce_loadbalancer_external.go:694] Checking http health check params k8s-d6825a0e603f25b6-node
I0124 21:14:22.356575       1 gce_loadbalancer_external.go:134] ensureExternalLoadBalancer(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Released static IP 35.237.108.235.
I0124 21:14:22.356607       1 gce_loadbalancer.go:159] EnsureLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): done ensuring loadbalancer.
I0124 21:15:19.417780       1 gce_loadbalancer.go:172] UpdateLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): updating with 4 nodes
I0124 21:15:53.286292       1 gce_loadbalancer.go:180] UpdateLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): done updating. err: <nil>
I0124 21:17:48.384968       1 gce_loadbalancer.go:172] UpdateLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): updating with 5 nodes
I0124 21:18:08.543536       1 gce_loadbalancer.go:180] UpdateLoadBalancer(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): done updating. err: <nil>
I0124 21:18:48.861795       1 gce_loadbalancer.go:193] EnsureLoadBalancerDeleted(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): deleting loadbalancer
I0124 21:18:48.861838       1 gce_loadbalancer_external.go:340] ensureExternalLoadBalancerDeleted(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting forwarding rule.
I0124 21:18:48.862167       1 gce_loadbalancer_external.go:322] ensureExternalLoadBalancerDeleted(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting firewall rule.
I0124 21:18:48.862181       1 gce_loadbalancer_external.go:336] ensureExternalLoadBalancerDeleted(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting IP address.
I0124 21:19:08.836267       1 gce_loadbalancer_external.go:346] ensureExternalLoadBalancerDeleted(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting target pool.
I0124 21:19:12.809910       1 gce_loadbalancer_external.go:382] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting health check adaad6bd00035412893a211500fd4294.
I0124 21:19:13.065985       1 gce_loadbalancer_external.go:399] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Health check adaad6bd00035412893a211500fd4294 is already deleted.
I0124 21:19:13.066014       1 gce_loadbalancer_external.go:404] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting health check firewall k8s-adaad6bd00035412893a211500fd4294-http-hc.
I0124 21:19:13.182689       1 gce_loadbalancer_external.go:382] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting health check k8s-d6825a0e603f25b6-node.
I0124 21:19:17.089005       1 gce_loadbalancer_external.go:404] DeleteExternalTargetPoolAndChecks(adaad6bd00035412893a211500fd4294(e2e-k8s-service-upgrade-4150/service-test)): Deleting health check firewall k8s-d6825a0e603f25b6-node-http-hc.
I0124 21:19:24.510827       1 gce_loadbalancer.go:201] EnsureLoadBalancerDeleted(ci-op-t4f9k, e2e-k8s-service-upgrade-4150, service-test, adaad6bd00035412893a211500fd4294, us-east1): done deleting loadbalancer. err: <nil>


Notice that about 2 minutes before the last pre-reboot disruption, coinciding with the 2 minute poke timeout, the KCM is messing with the GCP LB/health check. I haven't yet dug into the details of the KCM activity, but it certainly looks suspicious in this context. It's possible the 4.2 KCM (whose logs are lost post-upgrade of the pod) is doing similar things during reconciliation prior to the other disruptions. I'm going to try and preserve 4.2 KCM logs in a 4.2 PR to see.


[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24456/pull-ci-openshift-origin-release-4.3-e2e-gcp-upgrade/165

Comment 21 W. Trevor King 2020-02-05 18:14:52 UTC
Reading through the history on this bug, I'm not clear if this is a temporary outage from which clusters eventually recover on their own, or if folks need to poke something to unstick the failing cluster, or how healthy the cluster is while stuck.  We have populated must-gather (e.g. [1]) in at least some of these cases, so the cluster seems to be at least mostly alive even when hitting whatever the root cause issue is here.  Do we have a recommended recovery procedure or anything yet?

[1]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/167/artifacts/e2e-gcp-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-aad43a50042f7707abb70968c9382271304aa01811953bf1ee5431bef0ac0ae9/

Comment 22 W. Trevor King 2020-02-06 00:19:49 UTC
In 4.2.18 -> 4.3.1 GCP CI [1]:

  fail [k8s.io/kubernetes/test/e2e/framework/service/jig.go:915]: Feb  5 22:40:46.161: Could not reach HTTP service through 35.243.241.128:80 after 2m0s

so this has not been magically fixed with the new releases.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/286

Comment 24 Clayton Coleman 2020-02-19 17:32:23 UTC
Updating the description to capture the current state.  During upgrades there is a brief disruption to workloads running on the pod network while OVS is being updated.  We are assessing solutions that would prevent this disruption.  While 4.2 to 4.3 GCP is worse, the problem can occur on all releases of OpenShift.

Comment 28 Ben Bennett 2020-03-10 13:21:50 UTC
*** Bug 1791117 has been marked as a duplicate of this bug. ***

Comment 31 Ben Bennett 2020-03-12 16:35:51 UTC
With the merges of:
- https://github.com/openshift/cluster-network-operator/pull/484
- https://github.com/openshift/multus-cni/pull/53

I think we have addressed much of the problem.
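For context, the cluster-network-operator change wires up Multus's readiness-indicator-file option, which makes Multus wait for the default network's CNI config file to exist on disk before serving CNI requests, instead of acting on a stale or missing config while the SDN is being upgraded. A hypothetical sketch of a Multus CNI config using that option (the file path and delegate shown are illustrative, not the operator's actual rendered config):

```json
{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "readinessindicatorfile": "/var/run/multus/cni/net.d/80-openshift-network.conf",
  "delegates": [
    {
      "cniVersion": "0.3.1",
      "name": "openshift-sdn",
      "type": "openshift-sdn"
    }
  ]
}
```

Together with the multus-cni change removing configuration invalidation, this avoids the window during upgrades where pod-network CNI operations could race against the default network's config being rewritten.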

Comment 35 zhaozhanqi 2020-03-13 03:55:40 UTC
Did not reproduce this issue in recent upgrade testing. Verified this bug.

Comment 38 W. Trevor King 2020-03-31 04:47:50 UTC
Formally linking the two pull requests from comment 31.

Comment 40 errata-xmlrpc 2020-05-04 11:20:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 41 W. Trevor King 2020-05-17 04:30:23 UTC
Bug is CLOSED ERRATA.  Nobody needs info on it anymore (or if they do, they should comment again with the information they still need).

Comment 42 W. Trevor King 2021-04-05 17:36:41 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

