Bug 1851623
| Field | Value |
|---|---|
| Summary | [Serial][sig-node][Feature:TopologyManager] Configured cluster with non-gu workload should run with no regressions with single pod, single container requesting multiple cores [Suite:openshift/conformance/serial] consistently failing in 4.6/master |
| Product | OpenShift Container Platform |
| Component | Node |
| Reporter | Gabe Montero <gmontero> |
| Assignee | Francesco Romani <fromani> |
| QA Contact | Walid A. <wabouham> |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.6 |
| Target Release | 4.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Last Closed | 2020-10-27 16:09:46 UTC |
| CC | aos-bugs, carangog, ddharwar, fromani, fsimonce, jokerman, mifiedle, msivak, rphillips, wabouham, weinliu, yjoseph |
| Bug Blocks | 1857220 |
Description
Gabe Montero
2020-06-27 15:28:51 UTC
The failing test is among the simplest for topology manager, and perhaps in general: to check for regressions when topology manager is enabled, it wants to run a single pod with a single container which requests 2500 millicores. Simple as that. The test wants to request >= 2 cores, so we could lower the request a bit, but I think this is not the right direction.

The test fails with:

Jun 29 14:19:13.752: INFO: At 2020-06-29 14:14:13 +0000 UTC - event for test-2kztd: {default-scheduler } FailedScheduling: 0/6 nodes are available: 6 Insufficient cpu.
Jun 29 14:19:13.826: INFO: POD NODE PHASE GRACE CONDITIONS
Jun 29 14:19:13.826: INFO: test-2kztd Pending [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2020-06-29 14:14:13 +0000 UTC Unschedulable 0/6 nodes are available: 6 Insufficient cpu.}]
Jun 29 14:19:13.827: INFO:
Jun 29 14:19:13.902: INFO: test-2kztd[e2e-test-topology-manager-q769l].container[test-0].log

This is surprising, especially considering how simple the test is. I believe some other test that ran earlier didn't free its cluster resources fast enough, so when this test runs, it is resource starved. The next step for me is to investigate the logs to see whether there is a common pattern in the tests that run before this one.

The test was gating and blocking progress, so https://github.com/openshift/origin/pull/25225 was merged. However, we still need to understand what broke; I'll keep investigating.

Hi Walid and Gabe, any progress on verifying the issue?
I'll defer to Walid as QA contact, but I just ran https://search.ci.openshift.org/?search=Configured+cluster+with+non-gu+workload&maxAge=48h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job and the 4.6 hits are all for passing tests, so I'm fine with verifying.

Explanation of what broke, and of the fix we delivered: https://github.com/openshift/origin/pull/25231

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
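
For readers unfamiliar with the "Insufficient cpu" event quoted in this thread, here is a minimal sketch of the resource-starvation hypothesis raised above: a pod fits a node only if the node's allocatable CPU minus the CPU already requested by other pods covers the new request. This is an illustrative model, not the scheduler's code. The `Node` class, the helper names, and the 4000m/2000m figures are assumptions made up for the example; only the 2500 millicore request and the six-node count come from the report.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a worker node; all values are in millicores."""
    name: str
    allocatable_millicores: int
    requested_millicores: List[int] = field(default_factory=list)

    def free_millicores(self) -> int:
        # Scheduling is based on the sum of CPU *requests*, not on live usage.
        return self.allocatable_millicores - sum(self.requested_millicores)


def schedulable_nodes(nodes: List[Node], request_millicores: int) -> List[str]:
    """Return the names of nodes with enough unrequested CPU for the pod."""
    return [n.name for n in nodes if n.free_millicores() >= request_millicores]


def failure_message(nodes: List[Node], request_millicores: int) -> Optional[str]:
    """Mimic the wording of the event in the report when no node fits."""
    if schedulable_nodes(nodes, request_millicores):
        return None
    return f"0/{len(nodes)} nodes are available: {len(nodes)} Insufficient cpu."


if __name__ == "__main__":
    # Assumed scenario: pods from an earlier test still hold 2000m on each
    # of six 4000m nodes, so a 2500m request cannot be placed anywhere.
    nodes = [Node(f"node-{i}", 4000, [2000]) for i in range(6)]
    print(failure_message(nodes, 2500))

    # Once the leftover pod on one node is gone, the same request fits there.
    nodes[0].requested_millicores.clear()
    print(schedulable_nodes(nodes, 2500))
```

Under these assumed numbers the pod stays pending until at least one node's leftover requests are released, which matches the pattern suspected in the thread: a simple test failing only because of what ran before it.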