Bug 1811341

Summary:	Subpath test pod did not start within 5 minutes
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Storage	Assignee:	OpenShift Storage Bugzilla Bot <ocp-storage-bot>
Storage sub component:	Storage	QA Contact:	Wei Duan <wduan>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	low
Priority:	unspecified	CC:	aos-bugs, jokerman, jsafrane, piqin, vlaad
Version:	4.4
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-25 20:41:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Clayton Coleman 2020-03-07 21:07:34 UTC

Subpath test pod did not start within 5 minutes, a number of errors reported in pod events, none of which indicate why the pod never came out of pending:

4.4 release job failed https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/1936

We need to understand why this happened (could be a crio issue, a volume issue, or an apiserver issue).  I consider this a release blocker unless we can triage this down to why we failed - generally lower level bugs surface as "couldn't do this thing within a normal amount of time".  The test itself is not flaky that I can see.


Mar  7 14:10:06.569: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for gluster-server: {default-scheduler } Scheduled: Successfully assigned e2e-provisioning-1408/gluster-server to ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal
Mar  7 14:10:06.569: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-subpath-test-gluster-preprovisionedpv-6g7j: {default-scheduler } Scheduled: Successfully assigned e2e-provisioning-1408/pod-subpath-test-gluster-preprovisionedpv-6g7j to ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:03:40 +0000 UTC - event for gluster-server: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} Pulling: Pulling image "gcr.io/kubernetes-e2e-test-images/volume/gluster:1.0"
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:04:25 +0000 UTC - event for gluster-server: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} Pulled: Successfully pulled image "gcr.io/kubernetes-e2e-test-images/volume/gluster:1.0"
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:04:25 +0000 UTC - event for gluster-server: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} Created: Created container gluster-server
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:04:26 +0000 UTC - event for gluster-server: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} Started: Started container gluster-server
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:04:27 +0000 UTC - event for pvc-pmzmz: {persistentvolume-controller } ProvisioningFailed: storageclass.storage.k8s.io "e2e-provisioning-1408" not found
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:04:48 +0000 UTC - event for pod-subpath-test-gluster-preprovisionedpv-6g7j: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} FailedMount: Unable to attach or mount volumes: unmounted volumes=[test-volume liveness-probe-volume default-token-6lsm6], unattached volumes=[test-volume liveness-probe-volume default-token-6lsm6]: error processing PVC e2e-provisioning-1408/pvc-pmzmz: failed to fetch PVC from API server: persistentvolumeclaims "pvc-pmzmz" is forbidden: User "system:node:ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal" cannot get resource "persistentvolumeclaims" in API group "" in the namespace "e2e-provisioning-1408": no relationship found between node "ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal" and this object
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:04:49 +0000 UTC - event for pod-subpath-test-gluster-preprovisionedpv-6g7j: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} FailedMount: MountVolume.NewMounter initialization failed for volume "gluster-vr8wh" : endpoints "gluster-server" not found
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:07:02 +0000 UTC - event for pod-subpath-test-gluster-preprovisionedpv-6g7j: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} FailedMount: Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[liveness-probe-volume default-token-6lsm6 test-volume]: timed out waiting for the condition
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:09:19 +0000 UTC - event for pod-subpath-test-gluster-preprovisionedpv-6g7j: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} FailedMount: Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[test-volume liveness-probe-volume default-token-6lsm6]: timed out waiting for the condition
Mar  7 14:10:06.569: INFO: At 2020-03-07 14:09:56 +0000 UTC - event for gluster-server: {kubelet ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal} Killing: Stopping container gluster-server
Mar  7 14:10:06.574: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
Mar  7 14:10:06.574: INFO: 
Mar  7 14:10:06.598: INFO: skipping dumping cluster info - cluster too large
Mar  7 14:10:06.598: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-provisioning-1408" for this suite.
Mar  7 14:10:06.637: INFO: Running AfterSuite actions on all nodes
Mar  7 14:10:06.637: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/framework/util.go:894]: Unexpected error:
    <*errors.errorString | 0xc0050e6860>: {
        s: "expected pod \"pod-subpath-test-gluster-preprovisionedpv-6g7j\" success: Gave up after waiting 5m0s for pod \"pod-subpath-test-gluster-preprovisionedpv-6g7j\" to be \"success or failure\"",
    }
    expected pod "pod-subpath-test-gluster-preprovisionedpv-6g7j" success: Gave up after waiting 5m0s for pod "pod-subpath-test-gluster-preprovisionedpv-6g7j" to be "success or failure"
occurred

Comment 1 Ryan Phillips 2020-03-11 01:26:51 UTC

The logs were not attached to the CI build for some reason or the artifact server is down... I'll check again tomorrow.

Comment 2 Ryan Phillips 2020-03-11 16:49:03 UTC

Looks like a test issue:

Mar 07 14:04:48.879871 ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal hyperkube[1341]: I0307 14:04:48.876956    1341 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-provisioning-1408", Name:"pod-subpath-test-gluster-preprovisionedpv-6g7j", UID:"e470c7a6-e860-4c6a-949b-5f028fc7c620", APIVersion:"v1", ResourceVersion:"34487", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to attach or mount volumes: unmounted volumes=[test-volume liveness-probe-volume default-token-6lsm6], unattached volumes=[test-volume liveness-probe-volume default-token-6lsm6]: error processing PVC e2e-provisioning-1408/pvc-pmzmz: failed to fetch PVC from API server: persistentvolumeclaims "pvc-pmzmz" is forbidden: User "system:node:ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal" cannot get resource "persistentvolumeclaims" in API group "" in the namespace "e2e-provisioning-1408": no relationship found between node "ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal" and this object


"pvc-pmzmz" is forbidden: User "system:node:ci-op-8fxjt-w-d-npxj8.c.openshift-gce-devel-ci.internal"

Comment 5 Hemant Kumar 2020-06-18 21:58:02 UTC

I think this still might be a problem. For azure environments there are other bugs that prevent the pod from coming up but for GCP or AWS, this will indicates a bug in system somewhere (or e2e). I will investigate this in more detail in upstream.

A more recent run on GCP - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1273584118654832640

Comment 7 Hemant Kumar 2020-07-29 16:00:40 UTC

I think the root cause of the bugs associated with gluster tests is, we are creating endpoints without service entries. This can cause endpoint to become unavailable. Opened a PR to fix this - https://github.com/kubernetes/kubernetes/pull/93535

Comment 9 Hemant Kumar 2020-08-06 16:21:58 UTC

Since upstream PR is merged - https://github.com/kubernetes/kubernetes/pull/93535 , I expect that when OCP-4.6 is rebased against released version of k8s, we will automatically get the fix.