Bug 1827863 - [Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel] timing out
Summary: [Feature:Platform][Smoke] Managed cluster should ensure pods use downstream i...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Duplicates: 1813967
Depends On:
Blocks: 1874524
 
Reported: 2020-04-25 00:38 UTC by Miciah Dashiel Butler Masters
Modified: 2020-09-01 14:31 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1874524
Environment:
Last Closed: 2020-07-01 10:07:58 UTC
Target Upstream Version:
Embargoed:



Description Miciah Dashiel Butler Masters 2020-04-25 00:38:54 UTC
Description of problem:

The "Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy" test is failing frequently with what appears to be a timeout.  I am seeing the following:

    started: (0/3/2144) "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]"

Eventually followed by the following:

    ---------------------------------------------------------
    Received interrupt.  Running AfterSuite...
    ^C again to terminate immediately
    Apr 24 12:23:53.994: INFO: Running AfterSuite actions on all nodes
    Apr 24 12:23:53.994: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
    Apr 24 12:23:54.089: INFO: Running AfterSuite actions on node 1
    
    failed: (15m0s) 2020-04-24T12:23:54 "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel]"

It looks like the test is timing out after 15 minutes, and the test runner is sending a signal to terminate the test.
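For illustration, here is a minimal Go sketch (not the actual openshift/origin runner) of how a suite runner can enforce a per-test deadline and terminate the test process once it is exceeded. The 15-minute value and the `openshift-tests run-test` invocation are assumptions based on the log output above.

    // timeout_sketch.go - hedged illustration only, not the real test runner.
    package main

    import (
        "context"
        "fmt"
        "os/exec"
        "time"
    )

    func main() {
        perTestTimeout := 15 * time.Minute // matches the observed "failed: (15m0s)" threshold

        ctx, cancel := context.WithTimeout(context.Background(), perTestTimeout)
        defer cancel()

        start := time.Now()
        // Hypothetical invocation of a single test case; exec.CommandContext kills
        // the child process once the context deadline is exceeded.
        cmd := exec.CommandContext(ctx, "openshift-tests", "run-test",
            "[Feature:Platform][Smoke] Managed cluster should ensure pods use downstream images ...")
        err := cmd.Run()

        if ctx.Err() == context.DeadlineExceeded {
            fmt.Printf("failed: (%s) test timed out and was terminated\n", time.Since(start).Round(time.Second))
            return
        }
        if err != nil {
            fmt.Printf("failed: %v\n", err)
            return
        }
        fmt.Println("passed")
    }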

I am seeing an especially large number of failures that follow this pattern on e2e-vsphere-upi.  For example:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/623
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/618
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.2/617

I did see some failures following this pattern for a few other platforms:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/8271/rehearse-8271-release-openshift-ocp-installer-e2e-openstack-ppc64le-4.3/57
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-4.3/1817

Using search.svc, I found many more failures of the same test, but those failures often result from missing-image or i/o timeout errors, whereas the failures above appear to be the test timing out and being terminated by the test runner.

https://search.svc.ci.openshift.org/?search=Managed%20cluster%20should%20ensure%20pods%20use%20downstream%20images%20from%20our%20release%20image%20with%20proper%20ImagePullPolicy

Comment 3 Oleg Bulatov 2020-04-27 17:00:31 UTC
Apr 24 12:13:42.904: INFO: Running 'oc --namespace=openshift-console --config=/tmp/admin.kubeconfig exec downloads-6fcb9d8c68-dpjsw -c download-server -- cat /etc/redhat-release'
Apr 24 12:15:58.357: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)

Apr 24 12:15:58.357: INFO: Running 'oc --namespace=openshift-controller-manager-operator --config=/tmp/admin.kubeconfig exec openshift-controller-manager-operator-9495b7dbf-8vnvr -c operator -- cat /etc/redhat-release'
Apr 24 12:15:59.049: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)

...

Apr 24 12:21:15.331: INFO: Running 'oc --namespace=openshift-etcd --config=/tmp/admin.kubeconfig exec etcd-member-control-plane-2 -c etcd-member -- cat /etc/redhat-release'
Apr 24 12:23:30.999: INFO: Image relase info:Red Hat Enterprise Linux Server release 7.8 (Maipo)


Sometimes `oc exec` takes considerable time (over 2 minutes for etcd-member-control-plane-2) just to run `cat`. I don't know what could be causing such delays, so I'm moving this to CLI.
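A rough Go sketch that times the same `oc exec ... cat /etc/redhat-release` call per pod could help quantify where the delay is spent. The pod, container, and kubeconfig values below are examples taken from the log excerpt above; this is not part of the test itself.

    // exec_timing_sketch.go - hedged illustration for measuring per-pod `oc exec` latency.
    package main

    import (
        "fmt"
        "os/exec"
        "strings"
        "time"
    )

    func main() {
        // Example targets copied from the log excerpt; extend as needed.
        targets := []struct{ namespace, pod, container string }{
            {"openshift-console", "downloads-6fcb9d8c68-dpjsw", "download-server"},
            {"openshift-etcd", "etcd-member-control-plane-2", "etcd-member"},
        }

        for _, t := range targets {
            start := time.Now()
            out, err := exec.Command("oc",
                "--namespace="+t.namespace,
                "--config=/tmp/admin.kubeconfig", // kubeconfig path as it appears in the logs
                "exec", t.pod, "-c", t.container,
                "--", "cat", "/etc/redhat-release").CombinedOutput()
            elapsed := time.Since(start).Round(time.Millisecond)
            if err != nil {
                fmt.Printf("%s/%s: error after %s: %v\n", t.namespace, t.pod, elapsed, err)
                continue
            }
            fmt.Printf("%s/%s: %q (%s)\n", t.namespace, t.pod, strings.TrimSpace(string(out)), elapsed)
        }
    }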

Comment 4 Maciej Szulik 2020-04-28 10:29:42 UTC
It looks like the problem is not with any particular component per se, but rather with the overall timeout on the test.
I'm moving this to the cloud team, since the lag appears to be caused by the vSphere installation.

Comment 5 Ben Parees 2020-04-28 16:43:00 UTC
*** Bug 1813967 has been marked as a duplicate of this bug. ***

Comment 6 Alberto 2020-05-12 08:30:43 UTC
The vSphere CI/dev environment has very limited resources, which might be impacting the timing. I'm assigning this to be tracked by the team that owns this test. I'd suggest increasing the timeout and revisiting this once vSphere CI is migrated to a new environment with more capacity: https://docs.google.com/document/d/1f26SLA_nYpKopYUJ5_YpAKRtxAXs6x1l-sYz359utgE/edit?ts=5ea70e3a

https://github.com/openshift/origin/blob/7c3ca66a9dfce672a21172425856598e2d1a9916/cmd/openshift-tests/e2e.go#L92

https://github.com/openshift/origin/blob/5c167724f4a2c63064acf19c90e0445ad384f5d8/pkg/test/ginkgo/cmd_runsuite.go#L181

Comment 7 Danil Grigorev 2020-05-12 08:51:03 UTC
Raising the timeout to 20 minutes: https://github.com/openshift/origin/pull/24968
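For reference, a hedged Go sketch of the kind of per-test timeout option that the linked code paths and the PR adjust; the struct and field names here are assumptions, not the actual openshift/origin API.

    // timeout_option_sketch.go - assumed names; the real change lives in the PR above.
    package main

    import (
        "fmt"
        "time"
    )

    // suiteOptions mimics a runner configuration that carries a per-test deadline.
    type suiteOptions struct {
        testTimeout time.Duration
    }

    func main() {
        opts := suiteOptions{testTimeout: 15 * time.Minute} // previous value observed in CI
        opts.testTimeout = 20 * time.Minute                 // the bump proposed in the PR above
        fmt.Printf("per-test timeout: %s\n", opts.testTimeout)
    }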

Comment 9 Alberto 2020-05-12 11:30:58 UTC
Hey Oleg, it seemed natural to me based on the issue. Can you think of a better component home? Otherwise, please feel free to move it back to the cloud team.

Comment 10 Oleg Bulatov 2020-05-12 19:45:11 UTC
This is a sig-arch test. It covers the platform as a whole, so I don't know a better component.

