I am looking at jobs with this failure from this search:

https://search.ci.openshift.org/?search=%5C%5Bsig-storage%5C%5D+PersistentVolumes+GCEPD+should+test+that+deleting+the+PV+before+the+pod+does+not+cause+pod+deletion+to+fail+on+PD+detach&maxAge=168h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Looking at the first failed job here:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328696868191866880

The first thing I see is that a large number of tests failed in this run, including the one in this bz title: 592/2494 tests failed. Looking at the first 5 failures, I see a consistent pattern:

Test 1
-------
[sig-storage] Zone Support [Feature:vsphere] Verify a pod is created and attached to a dynamically created PV, based on the allowed zones and datastore specified in storage class when there are multiple datastores with the same name under different zones across datacenters [Suite:openshift/conformance/parallel] [Suite:k8s]

Failure
--------
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename zone-support
[AfterEach] [sig-storage] Zone Support [Feature:vsphere]
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:175
Nov 17 14:27:44.556: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes.2/test/e2e/framework/framework.go:236]: Unexpected error:
    <*errors.errorString | 0xc0002d08a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Test 2
--------
[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully [Suite:openshift/conformance/parallel]

Failure
--------
STEP: Creating a kubernetes client
[BeforeEach] [sig-api-machinery][Feature:APIServer][Late]
  github.com/openshift/origin/test/extended/util/client.go:115
Nov 17 15:24:57.835: INFO: configPath is now "/tmp/configfile988887360"
Nov 17 15:24:57.835: INFO: The user is now "e2e-test-terminating-kube-apiserver-p84wk-user"
Nov 17 15:24:57.835: INFO: Creating project "e2e-test-terminating-kube-apiserver-p84wk"
Nov 17 15:24:58.036: INFO: Waiting on permissions in project "e2e-test-terminating-kube-apiserver-p84wk" ...
Nov 17 15:24:58.071: INFO: Waiting for ServiceAccount "default" to be provisioned...
Nov 17 15:24:58.213: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
Nov 17 15:24:58.348: INFO: Waiting for ServiceAccount "builder" to be provisioned...
Nov 17 15:24:58.488: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...
Nov 17 15:24:58.555: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...
Nov 17 15:24:58.698: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...
Nov 17 15:25:13.879: INFO: Project "e2e-test-terminating-kube-apiserver-p84wk" has been fully provisioned.
[It] kubelet terminates kube-apiserver gracefully [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/graceful_termination.go:20
[AfterEach] [sig-api-machinery][Feature:APIServer][Late]

Test 3
-------
[sig-storage] Projected configMap should be consumable from pods in volume with mappings and Item mode set [LinuxOnly] [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Failure
--------
[BeforeEach] [sig-storage] Projected configMap
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:174
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename projected
[AfterEach] [sig-storage] Projected configMap
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:175
STEP: Collecting events from namespace "e2e-projected-9113".
STEP: Found 1 events.
Nov 17 15:10:58.239: INFO: At 2020-11-17 15:08:58 +0000 UTC - event for e2e-projected-9113: {namespace-security-allocation-controller } CreatedSCCRanges: created SCC ranges
Nov 17 15:10:58.272: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Nov 17 15:10:58.272: INFO:
fail [k8s.io/kubernetes.2/test/e2e/framework/framework.go:236]: Unexpected error:
    <*errors.errorString | 0xc0002d08a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Test 4
------
[sig-api-machinery] Garbage collector should orphan RS created by deployment when deleteOptions.PropagationPolicy is Orphan [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Failure
-------
k8s.io/kubernetes.2/test/e2e/framework/framework.go:174
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename gc
[AfterEach] [sig-api-machinery] Garbage collector
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:175
Nov 17 15:10:57.718: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes.2/test/e2e/framework/framework.go:236]: Unexpected error:
    <*errors.errorString | 0xc000294890>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Test 5
-------
[sig-node] Downward API should provide host IP as an env var [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Failure
--------
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename downward-api
[AfterEach] [sig-node] Downward API
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:175
STEP: Collecting events from namespace "e2e-downward-api-9093".
Nov 17 15:10:57.477: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes.2/test/e2e/framework/framework.go:236]: Unexpected error:
    <*errors.errorString | 0xc0002d08a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Going down the list of test failures in this run, the first 15 failures are all the same (test failure 2 is slightly different): the test builds a namespace api object and then times out waiting for the condition.
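For what it's worth, the "first 15 failures are all the same" observation came from scanning the junit output by hand; something like the sketch below could count the pattern across every failed test in a run instead. This is only a sketch: the junit filename is a placeholder for whatever gets downloaded from the job's artifacts tab, and the element names assume the usual junit layout.

# Rough sketch: count how many failed tests in a single run hit the same
# "timed out waiting for the condition" error. The junit path is a placeholder
# for the file downloaded from the job artifacts.
import xml.etree.ElementTree as ET

JUNIT_PATH = "junit_e2e.xml"  # placeholder filename

timed_out = 0
failed = 0
for testcase in ET.parse(JUNIT_PATH).getroot().iter("testcase"):
    failure = testcase.find("failure")
    if failure is None:
        continue
    failed += 1
    if "timed out waiting for the condition" in (failure.text or ""):
        timed_out += 1

print(f"{timed_out}/{failed} failed tests timed out waiting for the condition")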
Looking at the apiserver for this timeframe, I see lots of the following logs:

E1117 14:27:44.837303 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.870483 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.897478 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.913241 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.927279 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.954517 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 15:10:38.856407 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 15:10:38.989902 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 15:10:39.184926 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 15:10:39.754515 17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]

I'm not exactly sure what these logs mean; it would appear that authentication failed, meaning the request/pod would encounter errors. I will reach out to the apiserver team on slack to get some insight.

Jan, I don't see any occurrences of any of the nodes not being ready due to network-unavailable. It appears that all nodes are available in this run.
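To see how these errors line up with the test timeouts (~14:27 and ~15:10 above), a quick bucket-by-minute count over a local copy of the kube-apiserver container log is enough; a sketch follows. The log filename is a placeholder for whatever gets pulled out of the job artifacts / must-gather.

# Rough sketch: bucket the "Unable to authenticate" apiserver errors by minute
# to check whether they cluster around the windows where the e2e tests timed out.
import re
from collections import Counter

LOG_PATH = "kube-apiserver.log"  # placeholder filename

buckets = Counter()
pattern = re.compile(r"^E(\d{4}) (\d{2}:\d{2}):\d{2}\.\d+.*Unable to authenticate the request")
with open(LOG_PATH) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            buckets[f"{m.group(1)} {m.group(2)}"] += 1

for minute, count in sorted(buckets.items()):
    print(f"{minute}  {count} auth failures")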
I also looked at this job, which has failed 573/2494 tests:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328292916908527616

Very similar to above: timeout waiting for the condition.

And this job, which has failed 432/2499 tests:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328141189987176448

All of these runs indicate a problem with the cluster, and are not specific to the test failing in this bz. I used this query to look for a test run with fewer failures, which would indicate a problem with just this test (the first 5-6 test runs from this query on aws all have 400-500 failures). I was not able to find one.

I did have a chat with the api-server team on slack, and the recommendation was to "look for a non-trivial number of jobs in which the failure in question is one of relatively few". After some thought (and hours spent debugging the large number of failures (400-500) on the above mentioned runs), I agree with this approach. I looked, and was not able to find a job where the cluster was stable (relatively low number of failures) that included this test failure.

I'm going to close this bz. If you see this test failing again, and the cluster is stable, please re-open.
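The "one of relatively few" check can be scripted once the junit files for a handful of runs are downloaded locally; a sketch is below. The directory layout/glob and filenames are placeholders, not anything produced by the CI tooling itself.

# Rough sketch: given junit files saved from several runs, print the total
# failure count per run and whether the GCEPD detach test from this bz is among
# them. A run where this test fails alongside only a handful of others would be
# worth reopening the bz for.
import glob
import xml.etree.ElementTree as ET

TEST_NAME = ("[sig-storage] PersistentVolumes GCEPD should test that deleting the PV "
             "before the pod does not cause pod deletion to fail on PD detach")

for path in sorted(glob.glob("runs/*/junit_e2e_*.xml")):  # placeholder layout
    failed = []
    for testcase in ET.parse(path).getroot().iter("testcase"):
        if testcase.find("failure") is not None:
            failed.append(testcase.get("name", ""))
    hit = any(TEST_NAME in name for name in failed)
    print(f"{path}: {len(failed)} failures, includes GCEPD detach failure: {hit}")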
This is the query I used; I forgot to include it in the above comment:

https://search.ci.openshift.org/?search=%5C%5Bsig-storage%5C%5D+PersistentVolumes+GCEPD+should+test+that+deleting+the+PV+before+the+pod+does+not+cause+pod+deletion+to+fail+on+PD+detach&maxAge=336h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job
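If anyone wants to re-check this later, the same query can be pulled programmatically; the sketch below just fetches the page above and lists the prow runs it links to. The scraping regex is an assumption about the current layout of the search results page and may need adjusting.

# Rough sketch: fetch the CI search query from this comment and list the prow
# job runs linked from the results. The regex is an assumption about how the
# results page formats its links.
import re
import urllib.request

SEARCH_URL = (
    "https://search.ci.openshift.org/?search=%5C%5Bsig-storage%5C%5D+PersistentVolumes"
    "+GCEPD+should+test+that+deleting+the+PV+before+the+pod+does+not+cause+pod+deletion"
    "+to+fail+on+PD+detach&maxAge=336h&context=1&type=bug%2Bjunit&name=4.7"
    "&maxMatches=5&maxBytes=20971520&groupBy=job"
)

with urllib.request.urlopen(SEARCH_URL) as resp:
    html = resp.read().decode("utf-8", errors="replace")

runs = sorted(set(re.findall(r"https://prow\.ci\.openshift\.org/view/gs/[^\s\"'<>]+", html)))
for run in runs:
    print(run)
print(f"{len(runs)} matching runs")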