https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23312#1:build-log.txt%3A8500

Failing tests:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]

Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20200327-112944.xml
error: 1 fail, 0 pass, 0 skip (43m24s)
2020/03/27 11:29:45 Container test in pod e2e-aws-upgrade failed, exit code 1, reason Error
2020/03/27 11:38:24 Copied 209.98MB of artifacts from e2e-aws-upgrade to /logs/artifacts/e2e-aws-upgrade
2020/03/27 11:38:24 Releasing lease for "aws-quota-slice"
2020/03/27 11:38:24 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/03/27 11:38:25 Ran for 1h21m45s
error: could not run steps: step e2e-aws-upgrade failed: template pod "e2e-aws-upgrade" failed: the pod ci-op-vj6dl884/e2e-aws-upgrade failed after 1h19m34s (failed containers: test): ContainerFailed one or more containers exited

Container test exited with code 1, reason Error
---
controller-operator" cannot get resource "configmaps" in API group "" in the namespace "kube-system"\n
Mar 27 11:28:54.151 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-869484c64d-zz24w node/ip-10-0-155-99.us-east-2.compute.internal deleted
Mar 27 11:28:54.165 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-b485d75d6-v6d6t Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-b485d75d6-v6d6t to ip-10-0-155-99.us-east-2.compute.internal
Mar 27 11:28:56.249 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-b485d75d6-v6d6t Container image "registry.svc.ci.openshift.org/ocp/4.5-2020-03-27-101459@sha256:8e2d144bf788ba690befe8476d93fd102c0f6f7abba931a318ff881e4ec39e6f" already present on machine
Mar 27 11:28:56.564 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-b485d75d6-v6d6t Created container guard
Mar 27 11:28:56.605 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-b485d75d6-v6d6t Started container guard
Mar 27 11:28:59.156 I ns/openshift-etcd-operator deployment/etcd-operator unhealthy members: ip-10-0-155-99.us-east-2.compute.internal,ip-10-0-139-80.us-east-2.compute.internal,ip-10-0-128-131.us-east-2.compute.internal (10 times)
Mar 27 11:29:12.183 I ns/openshift-etcd-operator deployment/etcd-operator unhealthy members: ip-10-0-139-80.us-east-2.compute.internal,ip-10-0-128-131.us-east-2.compute.internal (11 times)
Mar 27 11:29:19.758 W clusterversion/version cluster reached 4.5.0-0.ci-2020-03-27-101459
Mar 27 11:29:19.758 W clusterversion/version changed Progressing to False: Cluster version is 4.5.0-0.ci-2020-03-27-101459
Mar 27 11:29:44.235 I test="[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" failed
Actual error for this job:

error waiting for deployment "dp" status to match expectation: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/apps/v1/namespaces/e2e-k8s-sig-apps-deployment-upgrade-3212/deployments/dp: unexpected EOF

The "unexpected EOF" suggests this is a networking issue, so it might be something for the SDN team.
There are two issues here:

1. Mar 27 11:29:43.558: INFO: API was unreachable during disruption for at least 15s of 43m21s (1%)

We should not see this on a 4.4 -> 4.5 upgrade on AWS, since we have put in a fix for graceful shutdown: both kube-apiserver and openshift-apiserver should be able to finish serving in-flight requests and terminate gracefully.

2. fail [k8s.io/kubernetes/test/e2e/upgrades/apps/deployments.go:67]: Unexpected error:
    <*errors.errorString | 0xc000f1d610>: {
        s: "error waiting for deployment \"dp\" status to match expectation: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/apps/v1/namespaces/e2e-k8s-sig-apps-deployment-upgrade-3212/deployments/dp: unexpected EOF",
    }
    error waiting for deployment "dp" status to match expectation: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/apps/v1/namespaces/e2e-k8s-sig-apps-deployment-upgrade-3212/deployments/dp: unexpected EOF
    occurred

This also points to kube-apiserver not responding ("unexpected EOF"). Given this, we need to keep it in 4.4 and investigate further.
- clusteroperator objects seem to be reporting ok.
- I have gone through the kube-apiserver logs and didn't see anything relevant that could be an issue.
- Checked the SDN logs; nothing pops out, given my limited knowledge.

From the test log, I can see the following:

Mar 27 10:46:42.746 - 3s E kube-apiserver Kube API is not responding to GET requests
...
Mar 27 10:46:47.165 I kube-apiserver Kube API started responding to GET requests
Mar 27 10:54:05.746 E kube-apiserver Kube API is not responding to GET requests
...
Mar 27 10:54:06.036 I kube-apiserver Kube API started responding to GET requests
Mar 27 11:15:51.746 E kube-apiserver Kube API is not responding to GET requests
...
Mar 27 11:15:51.919 I kube-apiserver Kube API started responding to GET requests

And the "unexpected EOF" errors the test encounters coincide with those windows:

Mar 27 10:46:46.803: INFO: Get pod "pod-secrets-cd6fdddb-3bd7-487a-bb46-06dc85de2591" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-1475" failed, ignoring for 2s. Error: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-1475/pods/pod-secrets-cd6fdddb-3bd7-487a-bb46-06dc85de2591: unexpected EOF
Mar 27 10:46:46.803: INFO: Get pod "pod-configmap-9c585800-9bd7-4e75-98b1-f44c4bc41341" in namespace "e2e-k8s-sig-storage-sig-api-machinery-configmap-upgrade-4568" failed, ignoring for 2s. Error: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-k8s-sig-storage-sig-api-machinery-configmap-upgrade-4568/pods/pod-configmap-9c585800-9bd7-4e75-98b1-f44c4bc41341: unexpected EOF

kube-apiserver was NOT responding to requests from 10:46:42 to 10:46:47, and the "unexpected EOF" above occurred at 10:46:46. But I expected the test to keep trying and pass eventually: the wait's poll interval is 2s and it times out after 5m (see the polling sketch at the end of this comment).
https://github.com/openshift/kubernetes/blob/d6035f3e0d79dd05628ef42231beae97806a06ad/test/e2e/framework/deployment/wait.go#L34

I also see the following in the test log:

"Your test failed. Ginkgo panics to prevent subsequent assertions from running. Normally Ginkgo rescues this panic so you shouldn't see it. But, if you make an assertion in a goroutine, Ginkgo can't capture the panic. To circumvent this, you should call defer GinkgoRecover()"

Does this mean we have a test running in a goroutine that does not call "defer GinkgoRecover()"? (See the second sketch below.)

The test in question is here: https://github.com/openshift/kubernetes/blob/master/test/e2e/upgrades/apps/deployments.go#L67. It is supposed to poll every 2s, but I don't see enough poll attempts:

Mar 27 10:46:22.639: INFO: deployment status: v1.DeploymentStatus{...}
Mar 27 10:46:24.667: INFO: deployment status: v1.DeploymentStatus{...}

and then the Ginkgo panic follows. Could it be that the panic (from a different test) caused this test to abort and fail?

I also did a search in CI; there appear to be only 3 incidents like this in the last 14 days:
https://search.svc.ci.openshift.org/?search=error+waiting+for+deployment.*status+to+match+expectation.*unexpected+EOF&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

My findings don't point to a root cause yet.
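One possible explanation for why the wait did not keep retrying, as a minimal sketch rather than the actual framework code (the helper names and the simulated Get below are invented for illustration): the deployment wait helpers are built on wait.PollImmediate from k8s.io/apimachinery, and a condition function that returns a non-nil error terminates the poll immediately. If the helper propagates the Get error, a single transient "unexpected EOF" fails the wait even though the timeout is 5m.

package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// getDeploymentStatus stands in for the client-go Get call the e2e helper
// makes; here it simply simulates one transient API-server disruption.
var calls int

func getDeploymentStatus() (complete bool, err error) {
	calls++
	if calls == 1 {
		return false, errors.New("unexpected EOF") // transient disruption
	}
	return true, nil
}

// waitForDeployment mirrors the assumed polling pattern (2s interval, 5m
// timeout). wait.PollImmediate stops as soon as the condition func returns a
// non-nil error, so propagating the Get error aborts the wait on the first
// "unexpected EOF" instead of retrying until the timeout.
func waitForDeployment() error {
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		done, err := getDeploymentStatus()
		if err != nil {
			// Returning "false, nil" here would instead swallow the
			// transient error and keep polling.
			return false, err
		}
		return done, nil
	})
}

func main() {
	fmt.Println(waitForDeployment()) // prints "unexpected EOF"
}

If the real helper behaves like this sketch, the single EOF at 10:46:46 would be enough to fail the deployment wait, which would match the error we see.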
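On the GinkgoRecover question: when an assertion runs in a goroutine, a failure is raised as a panic outside Ginkgo's control unless the goroutine defers GinkgoRecover(), which is what the quoted message is hinting at. A minimal, hypothetical sketch of the pattern (Ginkgo v1 style; the suite and spec names are made up):

package example_test

import (
	"testing"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

func TestExample(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Example Suite")
}

var _ = Describe("assertions in goroutines", func() {
	It("recovers failures raised off the main test goroutine", func() {
		done := make(chan struct{})
		go func() {
			// Without this deferred call, a failed assertion in this
			// goroutine panics outside Ginkgo's control and can take the
			// whole test process down with it.
			defer GinkgoRecover()
			defer close(done)
			Expect(1 + 1).To(Equal(2))
		}()
		<-done
	})
})

If some test sharing the process is missing that deferred call, one failing assertion could abort the run, which would be consistent with the deployment poll stopping after only two attempts.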
*** This bug has been marked as a duplicate of bug 1817588 ***