Description of problem: The "Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy" test is failing frequently with what appears to be a timeout. I am seeing the following:

started: (0/3/2144) "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]"

eventually followed by:

---------------------------------------------------------
Received interrupt.  Running AfterSuite...
^C again to terminate immediately
Apr 24 12:23:53.994: INFO: Running AfterSuite actions on all nodes
Apr 24 12:23:53.994: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Apr 24 12:23:54.089: INFO: Running AfterSuite actions on node 1

failed: (15m0s) 2020-04-24T12:23:54 "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]"

It looks like the test is timing out after 15 minutes, at which point the test runner sends a signal to terminate it.

I am seeing an especially large number of failures following this pattern on e2e-vsphere-upi. For example:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/623
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/618
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/617

I did see some failures following this pattern on a few other platforms:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/8271/rehearse-8271-release-openshift-ocp-installer-e2e-openstack-ppc64le-4.3/57
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1817

Using search.svc, I found many more failures of the same test, but those often result from missing-image or i/o timeout errors, whereas the failures above appear to be the test timing out and being terminated by the test runner.

https://search.svc.ci.openshift.org/?search=Managed%20cluster%20should%20ensure%20pods%20use%20downstream%20images%20from%20our%20release%20image%20with%20proper%20ImagePullPolicy
Apr 24 12:13:42.904: INFO: Running 'oc --namespace=openshift-console --config=/tmp/admin.kubeconfig exec downloads-6fcb9d8c68-dpjsw -c download-server -- cat /etc/redhat-release'
Apr 24 12:15:58.357: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)
Apr 24 12:15:58.357: INFO: Running 'oc --namespace=openshift-controller-manager-operator --config=/tmp/admin.kubeconfig exec openshift-controller-manager-operator-9495b7dbf-8vnvr -c operator -- cat /etc/redhat-release'
Apr 24 12:15:59.049: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)
...
Apr 24 12:21:15.331: INFO: Running 'oc --namespace=openshift-etcd --config=/tmp/admin.kubeconfig exec etcd-member-control-plane-2 -c etcd-member -- cat /etc/redhat-release'
Apr 24 12:23:30.999: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)

Sometimes `oc exec` takes considerable time to run `cat` (over 2 minutes for etcd-member-control-plane-2 above). I don't know what may cause such delays, so I'm moving this to CLI.
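If it helps whoever picks this up, the slow round-trips can be reproduced outside the suite by timing the same commands the test runs. A minimal Go sketch, assuming `oc` on PATH and the admin kubeconfig at /tmp/admin.kubeconfig; the pod names are copied from the excerpt above and will differ per cluster:

```go
// timing_probe.go: a throwaway diagnostic that times the same `oc exec`
// calls the test makes, to spot slow pods. Not part of the test suite.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func main() {
	// Pod names taken from the log excerpt above; adjust per cluster.
	targets := []struct{ ns, pod, container string }{
		{"openshift-console", "downloads-6fcb9d8c68-dpjsw", "download-server"},
		{"openshift-etcd", "etcd-member-control-plane-2", "etcd-member"},
	}
	for _, t := range targets {
		start := time.Now()
		out, err := exec.Command("oc",
			"--namespace="+t.ns,
			"--config=/tmp/admin.kubeconfig",
			"exec", t.pod, "-c", t.container,
			"--", "cat", "/etc/redhat-release",
		).CombinedOutput()
		// Print the round-trip time; on the vSphere runs above this is
		// where the minutes go.
		fmt.Printf("%s/%s took %v err=%v out=%s\n",
			t.ns, t.pod, time.Since(start), err, strings.TrimSpace(string(out)))
	}
}
```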
It looks like the problem is not with any component per se, but rather with the overall timeout on the test. I'm moving this to the cloud team, since the lags appear to be caused by the vSphere installation.
*** Bug 1813967 has been marked as a duplicate of this bug. ***
The vSphere CI/dev environment has very limited resources, which might be impacting the timing. I'm assigning this to be tracked by the team that owns this test. I'd suggest increasing the timeout and revisiting once vSphere CI is migrated to a new environment with more capacity.

https://docs.google.com/document/d/1f26SLA_nYpKopYUJ5_YpAKRtxAXs6x1l-sYz359utgE/edit?ts=5ea70e3a

https://github.com/openshift/origin/blob/7c3ca66a9dfce672a21172425856598e2d1a9916/cmd/openshift-tests/e2e.go#L92
https://github.com/openshift/origin/blob/5c167724f4a2c63064acf19c90e0445ad384f5d8/pkg/test/ginkgo/cmd_runsuite.go#L181
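For context on the mechanism: the second link points at where the suite enforces a per-test deadline. Roughly, each test runs as a child process and is killed when its deadline passes, which is what produces the `failed: (15m0s) ...` line in the description. The sketch below only illustrates that pattern; the invocation and names are simplified, not origin's actual code:

```go
// A simplified sketch of the per-test timeout pattern in the linked
// cmd_runsuite.go; illustrative only, not origin's actual code.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func runTest(name string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// Each test runs as a child process; CommandContext kills it once
	// the deadline passes, and the runner records the test as failed.
	cmd := exec.CommandContext(ctx, "openshift-tests", "run-test", name)
	err := cmd.Run()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("failed: (%v) %q", timeout, name)
	}
	return err
}

func main() {
	name := "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy"
	if err := runTest(name, 15*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```

With that shape, slow `oc exec` round-trips inside the test body burn the budget without any individual command failing, which matches the logs above.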
Setting the timeout to 20 minutes: https://github.com/openshift/origin/pull/24968
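Conceptually the change just raises the suite's per-test budget; something like the following, where `testTimeout` is a hypothetical name and the actual diff is in the PR:

```go
// Illustrative only; the real change lives in cmd/openshift-tests/e2e.go
// via the PR above.
package sketch

import "time"

const testTimeout = 20 * time.Minute // raised from 15 * time.Minute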
Hey Oleg, it seemed natural to me based on the issue. Can you think of a better component home? Please feel free to move it back to the cloud team otherwise.
This is a sig-arch test. It covers the platform as a whole, so I don't know a better component.
I'm closing this, as I don't see this timeout happening any more:

https://search.apps.build01.ci.devcluster.openshift.com/?search=Managed+cluster+should+ensure+pods+use+downstream+images+from+our+release+image+with+proper+ImagePullPolicy&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

I'll reopen if this is identified again.