Bug 1818071 - upgrade failed on 'controller-operator" cannot get resource "configmaps" in API group "" in the namespace "kube-system"'
Keywords:
Status: CLOSED DUPLICATE of bug 1817588
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abu Kashem
QA Contact: Xingxing Xia
URL:
Whiteboard: buildcop
Depends On:
Blocks:
 
Reported: 2020-03-27 15:26 UTC by Hongkai Liu
Modified: 2020-04-28 16:01 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-28 16:01:26 UTC
Target Upstream Version:
Embargoed:



Description Hongkai Liu 2020-03-27 15:26:18 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23312#1:build-log.txt%3A8500

Failing tests:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]
Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20200327-112944.xml
error: 1 fail, 0 pass, 0 skip (43m24s)
2020/03/27 11:29:45 Container test in pod e2e-aws-upgrade failed, exit code 1, reason Error
2020/03/27 11:38:24 Copied 209.98MB of artifacts from e2e-aws-upgrade to /logs/artifacts/e2e-aws-upgrade
2020/03/27 11:38:24 Releasing lease for "aws-quota-slice"
2020/03/27 11:38:24 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/03/27 11:38:25 Ran for 1h21m45s
error: could not run steps: step e2e-aws-upgrade failed: template pod "e2e-aws-upgrade" failed: the pod ci-op-vj6dl884/e2e-aws-upgrade failed after 1h19m34s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
---
controller-operator" cannot get resource "configmaps" in API group "" in the namespace "kube-system"\n
Mar 27 11:28:54.151 W ns/openshift-machine-config-operator pod/etcd-quorum-guard-869484c64d-zz24w node/ip-10-0-155-99.us-east-2.compute.internal deleted
Mar 27 11:28:54.165 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-b485d75d6-v6d6t Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-b485d75d6-v6d6t to ip-10-0-155-99.us-east-2.compute.internal
Mar 27 11:28:56.249 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-b485d75d6-v6d6t Container image "registry.svc.ci.openshift.org/ocp/4.5-2020-03-27-101459@sha256:8e2d144bf788ba690befe8476d93fd102c0f6f7abba931a318ff881e4ec39e6f" already present on machine
Mar 27 11:28:56.564 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-b485d75d6-v6d6t Created container guard
Mar 27 11:28:56.605 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-b485d75d6-v6d6t Started container guard
Mar 27 11:28:59.156 I ns/openshift-etcd-operator deployment/etcd-operator unhealthy members: ip-10-0-155-99.us-east-2.compute.internal,ip-10-0-139-80.us-east-2.compute.internal,ip-10-0-128-131.us-east-2.compute.internal (10 times)
Mar 27 11:29:12.183 I ns/openshift-etcd-operator deployment/etcd-operator unhealthy members: ip-10-0-139-80.us-east-2.compute.internal,ip-10-0-128-131.us-east-2.compute.internal (11 times)
Mar 27 11:29:19.758 W clusterversion/version cluster reached 4.5.0-0.ci-2020-03-27-101459
Mar 27 11:29:19.758 W clusterversion/version changed Progressing to False: Cluster version is 4.5.0-0.ci-2020-03-27-101459
Mar 27 11:29:44.235 I test="[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" failed
Failing tests:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]
Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20200327-112944.xml
error: 1 fail, 0 pass, 0 skip (43m24s)

Comment 1 W. Trevor King 2020-03-31 05:02:09 UTC
Actual error for this job:

  error waiting for deployment "dp" status to match expectation: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/apps/v1/namespaces/e2e-k8s-sig-apps-deployment-upgrade-3212/deployments/dp: unexpected EOF

EOF suggests this is a networking thing, so it might be something for the SDN team.

Comment 2 Abu Kashem 2020-04-07 18:07:32 UTC
There are two issues here:

Mar 27 11:29:43.558: INFO: API was unreachable during disruption for at least 15s of 43m21s (1%):

We should not see this on a 4.4 -> 4.5 upgrade on AWS, since we have put in a fix for graceful shutdown: both kube-apiserver and openshift-apiserver should be able to serve in-flight requests and terminate gracefully.
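
For illustration only (this is not the apiserver's termination code), a minimal Go net/http sketch of what graceful shutdown means here: on SIGTERM the server stops accepting new connections but drains in-flight requests before exiting, so clients don't see an unexpected EOF mid-request.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Block until the process is asked to terminate (e.g. during an upgrade rollout).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Give in-flight requests up to 60s to complete before exiting.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown did not complete cleanly: %v", err)
	}
}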




fail [k8s.io/kubernetes/test/e2e/upgrades/apps/deployments.go:67]: Unexpected error:
    <*errors.errorString | 0xc000f1d610>: {
        s: "error waiting for deployment \"dp\" status to match expectation: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/apps/v1/namespaces/e2e-k8s-sig-apps-deployment-upgrade-3212/deployments/dp: unexpected EOF",
    }
    error waiting for deployment "dp" status to match expectation: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/apps/v1/namespaces/e2e-k8s-sig-apps-deployment-upgrade-3212/deployments/dp: unexpected EOF
occurred

This also relates to the kube-apiserver not responding - the client saw an unexpected EOF.

Given this, we need to keep it in 4.4 and investigate further.

Comment 3 Abu Kashem 2020-04-16 00:36:06 UTC
- clusteroperator objects seem to be reporting ok.
- I have gone through the kube-apiserver logs and didn't see anything relevant that could be an issue.
- Checked the SDN logs; nothing pops out, given my limited knowledge of them.


From the test log, I can see the following:
Mar 27 10:46:42.746 - 3s    E kube-apiserver Kube API is not responding to GET requests
...
Mar 27 10:46:47.165 I kube-apiserver Kube API started responding to GET requests


Mar 27 10:54:05.746 E kube-apiserver Kube API is not responding to GET requests
...
Mar 27 10:54:06.036 I kube-apiserver Kube API started responding to GET requests


Mar 27 11:15:51.746 E kube-apiserver Kube API is not responding to GET requests
...
Mar 27 11:15:51.919 I kube-apiserver Kube API started responding to GET requests



And the "unexpected EOF" error the test encounters coincide
Mar 27 10:46:46.803: INFO: Get pod "pod-secrets-cd6fdddb-3bd7-487a-bb46-06dc85de2591" in namespace "e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-1475" failed, ignoring for 2s. Error: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-k8s-sig-storage-sig-api-machinery-secret-upgrade-1475/pods/pod-secrets-cd6fdddb-3bd7-487a-bb46-06dc85de2591: unexpected EOF
Mar 27 10:46:46.803: INFO: Get pod "pod-configmap-9c585800-9bd7-4e75-98b1-f44c4bc41341" in namespace "e2e-k8s-sig-storage-sig-api-machinery-configmap-upgrade-4568" failed, ignoring for 2s. Error: Get https://api.ci-op-vj6dl884-77109.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-k8s-sig-storage-sig-api-machinery-configmap-upgrade-4568/pods/pod-configmap-9c585800-9bd7-4e75-98b1-f44c4bc41341: unexpected EOF



kube-apiserver was NOT responding to requests from 10:46:42 to 10:46:47, and the above "unexpected EOF" errors occurred at 10:46:46.



But I expected the test to keep retrying and eventually pass; the test's poll interval is 2s and it times out after 5m.
https://github.com/openshift/kubernetes/blob/d6035f3e0d79dd05628ef42231beae97806a06ad/test/e2e/framework/deployment/wait.go#L34
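
A minimal sketch of the retry behavior I would expect (not the actual wait.go code; checkDeployment is a hypothetical stand-in for the real status check): with a 2s poll interval and 5m timeout, a transient API error such as an unexpected EOF should just mean "not done yet" rather than ending the wait.

package waitsketch

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func waitForDeployment(checkDeployment func() (bool, error)) error {
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		done, err := checkDeployment()
		if err != nil {
			// Returning the error here would end the poll immediately;
			// returning (false, nil) keeps retrying until the timeout.
			log.Printf("transient error, will retry in 2s: %v", err)
			return false, nil
		}
		return done, nil
	})
}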


I also see the following in the test log:
"Your test failed.
Ginkgo panics to prevent subsequent assertions from running.
Normally Ginkgo rescues this panic so you shouldn't see it.

But, if you make an assertion in a goroutine, Ginkgo can't capture the panic.
To circumvent this, you should call

	defer GinkgoRecover()"


Does this mean we have a test running in a goroutine that does not have "defer GinkgoRecover()"?
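
For reference, a minimal sketch of the pattern that message is about (illustrative only, not this test's code): any goroutine that makes assertions needs a deferred GinkgoRecover() so a failed assertion's panic is caught by Ginkgo instead of escaping the suite.

package upgrades_test

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

var _ = It("asserts from a goroutine safely", func() {
	done := make(chan struct{})
	go func() {
		// Without this deferred call, a failed Expect panics outside of
		// Ginkgo's control and the suite prints the message quoted above.
		defer GinkgoRecover()
		defer close(done)
		Expect(1 + 1).To(Equal(2))
	}()
	<-done
})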

The test in question is here https://github.com/openshift/kubernetes/blob/master/test/e2e/upgrades/apps/deployments.go#L67.
It's supposed to poll every 2s, but I don't see enough poll attempts:

Mar 27 10:46:22.639: INFO: deployment status: v1.DeploymentStatus{...}
Mar 27 10:46:24.667: INFO: deployment status: v1.DeploymentStatus{...}

and then the Ginkgo panic follows. Could it be that the panic (from a different test) caused this test to abort and fail?



I also did a search in CI; apparently there are only 3 incidents like this in the last 14 days:
https://search.svc.ci.openshift.org/?search=error+waiting+for+deployment.*status+to+match+expectation.*unexpected+EOF&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

My findings don't point to a root cause yet.

Comment 4 Ben Parees 2020-04-28 16:01:26 UTC

*** This bug has been marked as a duplicate of bug 1817588 ***

