Bug 1827863 - [Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel] timing out
Summary: [Feature:Platform][Smoke] Managed cluster should ensure pods use downstream i...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Duplicates: 1813967
Depends On:
Blocks: 1874524
 
Reported: 2020-04-25 00:38 UTC by Miciah Dashiel Butler Masters
Modified: 2020-09-01 14:31 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1874524
Environment:
Last Closed: 2020-07-01 10:07:58 UTC
Target Upstream Version:
Embargoed:



Description Miciah Dashiel Butler Masters 2020-04-25 00:38:54 UTC
Description of problem:

The "Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy" test is failing frequently with what appears to be a timeout.  I am seeing the following:

    started: (0/3/2144) "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]"

Eventually followed by the following:

    ---------------------------------------------------------
    Received interrupt.  Running AfterSuite...
    ^C again to terminate immediately
    Apr 24 12:23:53.994: INFO: Running AfterSuite actions on all nodes
    Apr 24 12:23:53.994: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
    Apr 24 12:23:54.089: INFO: Running AfterSuite actions on node 1
    
    failed: (15m0s) 2020-04-24T12:23:54 "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]"

It looks like the test is timing out after 15 minutes, and the test runner is sending a signal to terminate the test.
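For illustration, here is a minimal Go sketch (not the actual openshift/origin runner) of how a suite runner can enforce a per-test deadline and terminate the test process once it is exceeded. The 15-minute value and the `openshift-tests run-test` invocation are assumptions based on the log output above.

    // timeout_sketch.go - hedged illustration only, not the real test runner.
    package main

    import (
        "context"
        "fmt"
        "os/exec"
        "time"
    )

    func main() {
        perTestTimeout := 15 * time.Minute // matches the observed "failed: (15m0s)" threshold

        ctx, cancel := context.WithTimeout(context.Background(), perTestTimeout)
        defer cancel()

        start := time.Now()
        // Hypothetical invocation of a single test case; exec.CommandContext kills
        // the child process once the context deadline is exceeded.
        cmd := exec.CommandContext(ctx, "openshift-tests", "run-test",
            "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images ...")
        err := cmd.Run()

        if ctx.Err() == context.DeadlineExceeded {
            fmt.Printf("failed: (%s) test timed out and was terminated\n", time.Since(start).Round(time.Second))
            return
        }
        if err != nil {
            fmt.Printf("failed: %v\n", err)
            return
        }
        fmt.Println("passed")
    }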

I am seeing an especially large number of failures that follow this pattern on e2e-vsphere-upi.  For example:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/623
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/618
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/617

I did see some failures following this pattern for a few other platforms:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/8271/rehearse-8271-release-openshift-ocp-installer-e2e-openstack-ppc64le-4.3/57
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1817

Using search.svc, I found many more failures of the same test, but those failures often result from missing-image or i/o timeout errors, whereas the failures above appear to be the test timing out and being terminated by the test runner.

https://search.svc.ci.openshift.org/?search=Managed%20cluster%20should%20ensure%20pods%20use%20downstream%20images%20from%20our%20release%20image%20with%20proper%20ImagePullPolicy

Comment 3 Oleg Bulatov 2020-04-27 17:00:31 UTC
Apr 24 12:13:42.904: INFO: Running 'oc --namespace=openshift-console --config=/tmp/admin.kubeconfig exec downloads-6fcb9d8c68-dpjsw -c download-server -- cat /etc/redhat-release'
Apr 24 12:15:58.357: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)

Apr 24 12:15:58.357: INFO: Running 'oc --namespace=openshift-controller-manager-operator --config=/tmp/admin.kubeconfig exec openshift-controller-manager-operator-9495b7dbf-8vnvr -c operator -- cat /etc/redhat-release'
Apr 24 12:15:59.049: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)

...

Apr 24 12:21:15.331: INFO: Running 'oc --namespace=openshift-etcd --config=/tmp/admin.kubeconfig exec etcd-member-control-plane-2 -c etcd-member -- cat /etc/redhat-release'
Apr 24 12:23:30.999: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)


Sometimes `oc exec` takes considerable time (over 2 minutes for etcd-member-control-plane-2) just to run `cat`. I don't know what could be causing such delays, so I'm moving this to CLI.
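A rough Go sketch that times the same `oc exec ... cat /etc/redhat-release` call per pod could help quantify where the delay is spent. The pod, container, and kubeconfig values below are examples taken from the log excerpt above; this is not part of the test itself.

    // exec_timing_sketch.go - hedged illustration for measuring per-pod `oc exec` latency.
    package main

    import (
        "fmt"
        "os/exec"
        "strings"
        "time"
    )

    func main() {
        // Example targets copied from the log excerpt; extend as needed.
        targets := []struct{ namespace, pod, container string }{
            {"openshift-console", "downloads-6fcb9d8c68-dpjsw", "download-server"},
            {"openshift-etcd", "etcd-member-control-plane-2", "etcd-member"},
        }

        for _, t := range targets {
            start := time.Now()
            out, err := exec.Command("oc",
                "--namespace="+t.namespace,
                "--config=/tmp/admin.kubeconfig", // kubeconfig path as it appears in the logs
                "exec", t.pod, "-c", t.container,
                "--", "cat", "/etc/redhat-release").CombinedOutput()
            elapsed := time.Since(start).Round(time.Millisecond)
            if err != nil {
                fmt.Printf("%s/%s: error after %s: %v\n", t.namespace, t.pod, elapsed, err)
                continue
            }
            fmt.Printf("%s/%s: %q (%s)\n", t.namespace, t.pod, strings.TrimSpace(string(out)), elapsed)
        }
    }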

Comment 4 Maciej Szulik 2020-04-28 10:29:42 UTC
It looks like the problem is not with any particular component per se, but rather with the overall timeout on the test.
I'm moving this to the cloud team, since the lag appears to be caused by the vSphere installation.

Comment 5 Ben Parees 2020-04-28 16:43:00 UTC
*** Bug 1813967 has been marked as a duplicate of this bug. ***

Comment 6 Alberto 2020-05-12 08:30:43 UTC
The vSphere CI/dev environment has very limited resources, which might be impacting the timing. I'm assigning this to be tracked by the team that owns this test. I'd suggest increasing the timeout and revisiting this once vSphere CI is migrated to a new environment with more capacity: https://docs.google.com/document/d/1f26SLA_nYpKopYUJ5_YpAKRtxAXs6x1l-sYz359utgE/edit?ts=5ea70e3a

https://github.com/openshift/origin/blob/7c3ca66a9dfce672a21172425856598e2d1a9916/cmd/openshift-tests/e2e.go#L92

https://github.com/openshift/origin/blob/5c167724f4a2c63064acf19c90e0445ad384f5d8/pkg/test/ginkgo/cmd_runsuite.go#L181

Comment 7 Danil Grigorev 2020-05-12 08:51:03 UTC
Raising the timeout to 20 minutes: https://github.com/openshift/origin/pull/24968
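For reference, a hedged Go sketch of the kind of per-test timeout option that the linked code paths and the PR adjust; the struct and field names here are assumptions, not the actual openshift/origin API.

    // timeout_option_sketch.go - assumed names; the real change lives in the PR above.
    package main

    import (
        "fmt"
        "time"
    )

    // suiteOptions mimics a runner configuration that carries a per-test deadline.
    type suiteOptions struct {
        testTimeout time.Duration
    }

    func main() {
        opts := suiteOptions{testTimeout: 15 * time.Minute} // previous value observed in CI
        opts.testTimeout = 20 * time.Minute                 // the bump proposed in the PR above
        fmt.Printf("per-test timeout: %s\n", opts.testTimeout)
    }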

Comment 9 Alberto 2020-05-12 11:30:58 UTC
Hey Oleg, it seemed natural to me based on the issue. Can you think of a better component home? Otherwise, please feel free to move it back to the cloud team.

Comment 10 Oleg Bulatov 2020-05-12 19:45:11 UTC
This is a sig-arch test. It covers the platform as a whole, so I don't know a better component.

