Bug 1750968

Summary: [4.2] e2e-gcp-serial: [sig-storage] ... [Serial] tests are failing
Product: OpenShift Container Platform
Reporter: Brenton Leanhardt <bleanhar>
Component: Storage
Assignee: Hemant Kumar <hekumar>
Status: CLOSED ERRATA
QA Contact: Qin Ping <piqin>
Severity: high
Docs Contact:
Priority: high
Version: 4.2.0
CC: aos-bugs, aos-storage-staff, bbennett, bchilds, chuffman, deads, eparis, fbertina, hekumar, lxia, mfojtik, miabbott, piqin
Target Milestone: ---
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1749882
Environment:
Last Closed: 2019-10-16 06:40:52 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1749882
Bug Blocks:

Description Brenton Leanhardt 2019-09-10 21:03:55 UTC
+++ This bug was initially created as a clone of Bug #1749882 +++

Example:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/101

We see a variety of tests failing, from CSI, Local Volume, and Volume Metrics, such as:

[sig-storage] [Serial] Volume metrics should create metrics for total time taken in volume operations in P/V Controller [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: gce-localssd-scsi-fs] [Serial] [Testpattern: Pre-provisioned PV (ext3)] volumes should allow exec of files on the volume [Suite:openshift/conformance/serial] [Suite:k8s] 
[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)] provisioning should create and delete block persistent volumes [Suite:openshift/conformance/serial] [Suite:k8s] 

In addition, this seems to be similar to the CRI-O hostNetwork issue, as the following is found in the build.log:

csi-gce-pd-node-rv4dv/gce-pd-driver: W0906 11:10:28.546739       1 gce.go:126] GOOGLE_APPLICATION_CREDENTIALS env var not set
csi-gce-pd-node-rv4dv/gce-pd-driver: I0906 11:10:28.546776       1 gce.go:128] Using DefaultTokenSource &oauth2.reuseTokenSource{new:google.computeSource{account:""}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
csi-gce-pd-node-rv4dv/gce-pd-driver: E0906 11:10:29.570703       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-rv4dv/gce-pd-driver: E0906 11:10:35.586678       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-rv4dv/gce-pd-driver: E0906 11:10:40.578686       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-rv4dv/gce-pd-driver: E0906 11:10:45.634635       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-rv4dv/gce-pd-driver: E0906 11:10:50.626754       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-rv4dv/gce-pd-driver: E0906 11:10:55.618659       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-rv4dv/gce-pd-driver: E0906 11:11:00.610726       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-rv4dv/gce-pd-driver: E0906 11:11:01.634650       1 gce.go:135] error fetching initial token: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: connect: connection refused
csi-gce-pd-node-rv4dv/gce-pd-driver: F0906 11:11:01.634690       1 main.go:62] Failed to get cloud provider: timed out waiting for the condition
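
For readers unfamiliar with this failure shape: the log above shows the driver retrying the metadata-server token fetch roughly every five seconds, then exiting with "timed out waiting for the condition". The following is a minimal Python sketch of that poll-until-timeout pattern, not the driver's actual Go code. The names `poll_until` and `make_token_check`, and the injectable `sleep`/`clock` parameters, are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, List


def poll_until(condition: Callable[[], bool], interval: float, timeout: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> None:
    """Call condition() every `interval` seconds until it returns True.

    Raises TimeoutError once `timeout` seconds have elapsed without success.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return
        if clock() >= deadline:
            raise TimeoutError("timed out waiting for the condition")
        sleep(interval)


def make_token_check(fetch: Callable[[], str],
                     errors: List[str]) -> Callable[[], bool]:
    """Wrap a token fetcher so that network errors are recorded, not raised."""
    def check() -> bool:
        try:
            fetch()
            return True
        except OSError as exc:  # ConnectionRefusedError is a subclass of OSError
            errors.append(f"error fetching initial token: {exc}")
            return False
    return check
```

With a `fetch` that always raises `ConnectionRefusedError`, every attempt is recorded as an error and the final exception carries the same message as the fatal log line, which is the sequence seen in build.log.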

Note: The CSI tests should have been disabled previously under https://github.com/openshift/origin/pull/23720 .

Comment 1 Eric Paris 2019-09-10 21:06:03 UTC
We know PR 23720 was inadequate and are addressing more CSI tests in https://github.com/openshift/origin/pull/23760

Comment 2 Brenton Leanhardt 2019-09-10 21:06:51 UTC
A number of CSI tests are still failing even after the majority were disabled.  The fastest way to get this working may be to turn on hostNetwork for the entire serial gcp suite.  If you look here you can see the definition of the serial job and which template it currently uses:

https://github.com/openshift/release/blob/69ffbdb41b4efdb435c97b1512fb671fe74e2246/ci-operator/jobs/openshift/release/openshift-release-release-4.2-periodics.yaml#L1051

You wouldn't want to edit that template because it is used by all the test jobs.  I'm assuming you'd create a new template like this one but add in the hostNetwork configuration in the spec.  Here's an example:

https://github.com/openshift/release/blob/69ffbdb41b4efdb435c97b1512fb671fe74e2246/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L70

If that fails, another option would be to create a separate test suite for running the serial tests that require hostNetwork.
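
As a rough illustration of the "add in the hostNetwork configuration in the spec" step, here is a minimal Python sketch that sets `hostNetwork: true` on a copy of a Pod manifest. `hostNetwork` is a standard Kubernetes PodSpec field; the helper `with_host_network` and the trimmed manifest are hypothetical and do not reproduce the real template contents.

```python
import copy


def with_host_network(pod: dict) -> dict:
    """Return a copy of a Pod manifest with spec.hostNetwork set to True."""
    patched = copy.deepcopy(pod)
    patched.setdefault("spec", {})["hostNetwork"] = True
    return patched


# Hypothetical, heavily trimmed Pod manifest; not the real template content.
pod = {
    "kind": "Pod",
    "apiVersion": "v1",
    "metadata": {"name": "e2e-gcp-serial"},
    "spec": {"restartPolicy": "Never", "containers": []},
}

patched = with_host_network(pod)
```

Working on a copy leaves the shared template untouched, which matches the advice above to create a new template instead of editing the one used by all test jobs.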

Comment 3 Brenton Leanhardt 2019-09-10 21:10:55 UTC
Eric tells me CSI is not supported on GCP.  In that case please just use this bug to track the fix to disable the appropriate tests.

Comment 4 Eric Paris 2019-09-10 21:16:24 UTC
I believe the CSI failure above is #1750926

This should be used to look into any problems with local volumes and volume metrics on GCP.

CSI tests should be disabled.
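
To illustrate what disabling tests by name can look like, here is a minimal Python sketch that filters test names against a skip list. It is not the mechanism openshift/origin uses; `SKIP_PATTERNS`, `should_skip`, and `select_tests` are hypothetical names, and the single pattern only reflects the statement above that CSI tests should be disabled. The test names are the three quoted in the description.

```python
import re
from typing import Iterable, List, Pattern

# Hypothetical skip list: only the CSI suite, per the comment above.
SKIP_PATTERNS: List[Pattern[str]] = [
    re.compile(r"\[sig-storage\] CSI Volumes"),
]


def should_skip(test_name: str,
                patterns: Iterable[Pattern[str]] = SKIP_PATTERNS) -> bool:
    """Return True if any skip pattern matches the test name."""
    return any(p.search(test_name) for p in patterns)


def select_tests(names: Iterable[str],
                 patterns: Iterable[Pattern[str]] = SKIP_PATTERNS) -> List[str]:
    """Return the test names that are not skipped, in their original order."""
    patterns = list(patterns)
    return [n for n in names if not should_skip(n, patterns)]


# Test names quoted in the bug description.
FAILING = [
    "[sig-storage] [Serial] Volume metrics should create metrics for total time taken in volume operations in P/V Controller [Suite:openshift/conformance/serial] [Suite:k8s]",
    "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: gce-localssd-scsi-fs] [Serial] [Testpattern: Pre-provisioned PV (ext3)] volumes should allow exec of files on the volume [Suite:openshift/conformance/serial] [Suite:k8s]",
    "[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)] provisioning should create and delete block persistent volumes [Suite:openshift/conformance/serial] [Suite:k8s]",
]

remaining = select_tests(FAILING)
```

With this skip list, `remaining` keeps the volume metrics and local volume tests, which are the ones this bug is meant to track.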

Comment 14 David Eads 2019-09-13 13:53:13 UTC
I think the offending tests have now been skipped. Have a look at recent results.

Comment 15 Fabio Bertinatto 2019-09-17 12:24:13 UTC
gce-localssd-scsi-fs tests are being skipped. No recent failures:

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-serial-4.2&show-stale-tests=

Comment 17 Qin Ping 2019-09-18 06:37:51 UTC
All the failed test cases are skipped in 4.2.0-0.nightly-2019-09-13-032440.

Comment 18 errata-xmlrpc 2019-10-16 06:40:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

Comment 19 Christian Huffman 2019-11-26 15:28:37 UTC
*** Bug 1749882 has been marked as a duplicate of this bug. ***