Bug 1886675 - [sig-storage] PersistentVolumes GCEPD should test that deleting the PV before the pod does not cause pod deletion to fail on PD detach - transport is closing
Summary: [sig-storage] PersistentVolumes GCEPD should test that deleting the PV before...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Victor Pickard
QA Contact: Qin Ping
URL:
Whiteboard: OKDBlocker
Depends On: 1872470 1876174 1877128
Blocks:
 
Reported: 2020-10-09 06:09 UTC by Benjamin Gilbert
Modified: 2020-11-23 19:49 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1872470
Environment:
[sig-storage] PersistentVolumes GCEPD should test that deleting the PV before the pod does not cause pod deletion to fail on PD detach
Last Closed: 2020-11-23 19:48:13 UTC
Target Upstream Version:
Embargoed:



Comment 3 Victor Pickard 2020-11-23 19:12:34 UTC
I am looking at jobs with this failure from this search:

https://search.ci.openshift.org/?search=%5C%5Bsig-storage%5C%5D+PersistentVolumes+GCEPD+should+test+that+deleting+the+PV+before+the+pod+does+not+cause+pod+deletion+to+fail+on+PD+detach&maxAge=168h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job
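
For readability, here is what that search URL is actually asking for once the query string is decoded. A quick Go sketch, standard library only; the only input is the URL pasted above:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The search.ci.openshift.org URL from this comment, verbatim.
	raw := "https://search.ci.openshift.org/?search=%5C%5Bsig-storage%5C%5D+PersistentVolumes+GCEPD+should+test+that+deleting+the+PV+before+the+pod+does+not+cause+pod+deletion+to+fail+on+PD+detach&maxAge=168h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job"

	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	// Query() percent-decodes the parameters, so "search" prints as the
	// escaped-regex test name and maxAge as 168h (the last 7 days).
	// Map iteration order is random; one parameter per line.
	for key, vals := range u.Query() {
		fmt.Printf("%-12s %v\n", key, vals)
	}
}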

Looking at the first failed job here:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328696868191866880

The first thing I see is that a large number of tests are failing in this run, including the one in this bz title:

592/2494 tests failed.

Looking at the first 5 failures, I see a consistent pattern:

Test 1
-------
[sig-storage] Zone Support [Feature:vsphere] Verify a pod is created and attached to a dynamically created PV, based on the allowed zones and datastore specified in storage class when there are multiple datastores with the same name under different zones across datacenters [Suite:openshift/conformance/parallel] [Suite:k8s] 

Failure
--------
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename zone-support
[AfterEach] [sig-storage] Zone Support [Feature:vsphere]
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:175
Nov 17 14:27:44.556: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes.2/test/e2e/framework/framework.go:236]: Unexpected error:
    <*errors.errorString | 0xc0002d08a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred



Test 2
--------
[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully [Suite:openshift/conformance/parallel] 

Failure
--------

STEP: Creating a kubernetes client
[BeforeEach] [sig-api-machinery][Feature:APIServer][Late]
  github.com/openshift/origin/test/extended/util/client.go:115
Nov 17 15:24:57.835: INFO: configPath is now "/tmp/configfile988887360"
Nov 17 15:24:57.835: INFO: The user is now "e2e-test-terminating-kube-apiserver-p84wk-user"
Nov 17 15:24:57.835: INFO: Creating project "e2e-test-terminating-kube-apiserver-p84wk"
Nov 17 15:24:58.036: INFO: Waiting on permissions in project "e2e-test-terminating-kube-apiserver-p84wk" ...
Nov 17 15:24:58.071: INFO: Waiting for ServiceAccount "default" to be provisioned...
Nov 17 15:24:58.213: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
Nov 17 15:24:58.348: INFO: Waiting for ServiceAccount "builder" to be provisioned...
Nov 17 15:24:58.488: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...
Nov 17 15:24:58.555: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...
Nov 17 15:24:58.698: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...
Nov 17 15:25:13.879: INFO: Project "e2e-test-terminating-kube-apiserver-p84wk" has been fully provisioned.
[It] kubelet terminates kube-apiserver gracefully [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/apiserver/graceful_termination.go:20
[AfterEach] [sig-api-machinery][Feature:APIServer][Late]



Test 3
-------
: [sig-storage] Projected configMap should be consumable from pods in volume with mappings and Item mode set [LinuxOnly] [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] 


Failure
--------
[BeforeEach] [sig-storage] Projected configMap
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:174
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename projected
[AfterEach] [sig-storage] Projected configMap
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:175
STEP: Collecting events from namespace "e2e-projected-9113".
STEP: Found 1 events.
Nov 17 15:10:58.239: INFO: At 2020-11-17 15:08:58 +0000 UTC - event for e2e-projected-9113: {namespace-security-allocation-controller } CreatedSCCRanges: created SCC ranges
Nov 17 15:10:58.272: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Nov 17 15:10:58.272: INFO: 

fail [k8s.io/kubernetes.2/test/e2e/framework/framework.go:236]: Unexpected error:
    <*errors.errorString | 0xc0002d08a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred


Test 4
------
: [sig-api-machinery] Garbage collector should orphan RS created by deployment when deleteOptions.PropagationPolicy is Orphan [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] 

Failure
-------
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:174
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename gc
[AfterEach] [sig-api-machinery] Garbage collector
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:175

Nov 17 15:10:57.718: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes.2/test/e2e/framework/framework.go:236]: Unexpected error:
    <*errors.errorString | 0xc000294890>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Test 5
-------
: [sig-node] Downward API should provide host IP as an env var [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

Failure
--------
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename downward-api
[AfterEach] [sig-node] Downward API
  k8s.io/kubernetes.2/test/e2e/framework/framework.go:175
STEP: Collecting events from namespace "e2e-downward-api-9093".
Nov 17 15:10:57.477: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes.2/test/e2e/framework/framework.go:236]: Unexpected error:
    <*errors.errorString | 0xc0002d08a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Going down the list of test failures in this run, the first 15 failures are all the same (test failure 2 is slightly different): the test builds a namespace API object and then times out waiting for the condition.
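
For context, the bare "timed out waiting for the condition" string is the generic error the k8s.io/apimachinery wait helpers return whenever a polled condition never becomes true; the e2e framework's namespace setup polls this way, so the message by itself does not say which condition was being waited on. A minimal sketch of that pattern (illustrative only, not the framework code; assumes k8s.io/apimachinery is in go.mod):

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Poll a condition that never becomes true, the way the e2e framework
	// polls the apiserver while building a namespace in BeforeEach. The
	// condition below is a stand-in; the real one checks the cluster.
	err := wait.PollImmediate(100*time.Millisecond, 500*time.Millisecond, func() (bool, error) {
		return false, nil // pretend the apiserver never confirms the object
	})
	// Prints "timed out waiting for the condition" -- the same opaque
	// string seen in every failure quoted above.
	fmt.Println(err)
}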

Looking at the apiserver for these timeframes, I see lots of the following logs:


E1117 14:27:44.837303      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.870483      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.897478      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.913241      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.927279      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 14:27:44.954517      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]


E1117 15:10:38.856407      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 15:10:38.989902      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 15:10:39.184926      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]
E1117 15:10:39.754515      17 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, token lookup failed]


I'm not exactly sure what these logs mean; it would appear that auth failed, meaning the request/pod would encounter errors. I will reach out to the apiserver team on Slack to get some insight.
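
One plausible reading of these lines (to be confirmed with the apiserver team) is that the bracketed pair is an aggregate: the apiserver tries each of its token authenticators in turn, one reports "invalid bearer token", another fails the token lookup, and the authentication filter logs both errors together. A toy sketch of that union pattern, standard library only and not the real apiserver code:

package main

import (
	"errors"
	"fmt"
	"strings"
)

// tokenAuthenticator is a stand-in for a token authenticator: each one
// either accepts the bearer token or returns an error.
type tokenAuthenticator func(token string) error

// unionAuthenticate tries each authenticator in order; if none succeeds it
// aggregates the individual errors into one bracketed message, which is the
// shape of the "[invalid bearer token, token lookup failed]" lines above.
func unionAuthenticate(token string, auths []tokenAuthenticator) error {
	var msgs []string
	for _, a := range auths {
		err := a(token)
		if err == nil {
			return nil // someone recognized the token
		}
		msgs = append(msgs, err.Error())
	}
	return fmt.Errorf("[%s]", strings.Join(msgs, ", "))
}

func main() {
	auths := []tokenAuthenticator{
		func(string) error { return errors.New("invalid bearer token") }, // e.g. OAuth access token check
		func(string) error { return errors.New("token lookup failed") },  // e.g. service account token lookup
	}
	if err := unionAuthenticate("some-stale-token", auths); err != nil {
		fmt.Println("Unable to authenticate the request due to an error:", err)
	}
}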




Jan,
I don't see any occurrences of nodes not being ready due to network-unavailable. It appears that all nodes are available in this run.

Comment 4 Victor Pickard 2020-11-23 19:48:13 UTC
I also looked at this job, which has failed 573/2494 tests.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328292916908527616

Very similar to the above: timeouts waiting for the condition.

And this job, which has failed 432/2499 tests:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1328141189987176448


All of these runs indicate a problem with the cluster, and are not specific to the test failing in this bz.



I used this query to look for a test run that had fewer failures (the first 5-6 test runs from this query on AWS all have 400-500 failures), which would indicate a problem with just this test. I was not able to find one.

I did have a chat with the apiserver team on Slack, and the recommendation was to "look for a non-trivial number of jobs in which the failure in question is one of relatively few". After some thought (and hours spent debugging the large number of failures, 400-500, on the above-mentioned runs), I agree with this approach. I looked and was not able to find a job where the cluster was stable (a relatively low number of failures) that included this test failure.
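
If anyone wants to script the "relatively few failures" check across jobs, a rough sketch that tallies one downloaded junit artifact is below; the file name is hypothetical and the attributes assume the usual junit testsuite layout:

package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// junitSuite pulls only the counters we care about from the root element of
// a junit report (either <testsuite> or a <testsuites> wrapper that carries
// the same tests/failures attributes).
type junitSuite struct {
	Tests    int `xml:"tests,attr"`
	Failures int `xml:"failures,attr"`
}

func main() {
	// Hypothetical local path: a junit XML downloaded from the job's
	// artifacts page on Prow.
	data, err := os.ReadFile("junit_e2e.xml")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var s junitSuite
	if err := xml.Unmarshal(data, &s); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d/%d tests failed\n", s.Failures, s.Tests)
	// A run worth re-triaging against this bz would be one where this ratio
	// is small, i.e. the failure in question is one of relatively few.
}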

I'm going to close this bz. If you see this test failing again, and the cluster is stable, please re-open.

