See https://search.apps.build01.ci.devcluster.openshift.com/?search=Configured+cluster+with+non-gu+workload&maxAge=24h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Also, 2 runs from 2 separate PRs of mine in openshift/origin, where those PRs only modify unrelated tests:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25014/pull-ci-openshift-origin-master-e2e-aws-serial/1276838464473534464
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25191/pull-ci-openshift-origin-master-e2e-aws-serial/1276851544263757824

I started noticing churn on this around 5 PM Eastern on Friday, June 27. First instance from one of my PRs:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25191/pull-ci-openshift-origin-master-e2e-aws-serial/1276622124408115200
The failing test is among the simplest for topology manager, and perhaps in general: to check for non-regression when the topology manager is enabled, it wants to run a single pod with a single container that requests 2500 millicores. Simple as that. The test wants to request >= 2 cores, so we could narrow down the request a bit, but I don't think that is the right direction.

The test fails with:

Jun 29 14:19:13.752: INFO: At 2020-06-29 14:14:13 +0000 UTC - event for test-2kztd: {default-scheduler } FailedScheduling: 0/6 nodes are available: 6 Insufficient cpu.
Jun 29 14:19:13.826: INFO: POD NODE PHASE GRACE CONDITIONS
Jun 29 14:19:13.826: INFO: test-2kztd Pending [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2020-06-29 14:14:13 +0000 UTC Unschedulable 0/6 nodes are available: 6 Insufficient cpu.}]
Jun 29 14:19:13.827: INFO:
Jun 29 14:19:13.902: INFO: test-2kztd[e2e-test-topology-manager-q769l].container[test-0].log

This is surprising, especially considering how simple the test is. I believe some other test that ran before it didn't free cluster resources fast enough, so when this test ran, it was resource starved. My next step is to investigate the logs to see if there is a common pattern in the tests run before this one.
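For reference, a minimal sketch (in Go, roughly the shape of an e2e pod definition; the names, image and command below are illustrative assumptions, not the actual test code) of the kind of pod this test asks the scheduler to place: a single container with a 2500m CPU request.

// Sketch only: a single-container pod requesting 2500 millicores.
// The scheduler reports "Insufficient cpu" when no node has that much
// allocatable CPU left.
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func nonGuWorkloadPod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "test-", // the failing pod shows up as e.g. "test-2kztd"
		},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers: []v1.Container{
				{
					Name:    "test-0",
					Image:   "busybox", // placeholder image, not the test's
					Command: []string{"sleep", "3600"},
					Resources: v1.ResourceRequirements{
						// Requests only (no limits), i.e. a non-guaranteed pod.
						Requests: v1.ResourceList{
							v1.ResourceCPU: resource.MustParse("2500m"),
						},
					},
				},
			},
		},
	}
}

With 2500m requested, a single busy or not-yet-cleaned-up node can be enough to leave all 6 nodes below the threshold, which matches the "0/6 nodes are available: 6 Insufficient cpu" event above.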
The test was gating and blocking progress, so https://github.com/openshift/origin/pull/25225 was merged. However, we still need to understand what broke; I'll keep investigating.
Hi Walid and Gabe, any progress on verifying the issue?
I'll defer to Walid as QA contact, but I just ran https://search.ci.openshift.org/?search=Configured+cluster+with+non-gu+workload&maxAge=48h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job and the 4.6 hits are only for passing tests, so I'm fine with verifying.
Explanation of what broke, and of the fix we delivered: https://github.com/openshift/origin/pull/25231
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196