Bug 1851623

Summary: [Serial][sig-node][Feature:TopologyManager] Configured cluster with non-gu workload should run with no regressions with single pod, single container requesting multiple cores [Suite:openshift/conformance/serial] consistently failing in 4.6/master
Product: OpenShift Container Platform Reporter: Gabe Montero <gmontero>
Component: Node Assignee: Francesco Romani <fromani>
Status: CLOSED ERRATA QA Contact: Walid A. <wabouham>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.6 CC: aos-bugs, carangog, ddharwar, fromani, fsimonce, jokerman, mifiedle, msivak, rphillips, wabouham, weinliu, yjoseph
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:09:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1857220    

Comment 3 Francesco Romani 2020-06-30 08:11:33 UTC
The failing test is among the simplest for Topology Manager, and perhaps in general: to check for regressions when Topology Manager is enabled, it runs a single pod with a single container that requests 2500 millicores. Simple as that.
The test wants to request >= 2 cores, so we could lower the request a bit, but I think this is not the right direction.

The test fails with:

Jun 29 14:19:13.752: INFO: At 2020-06-29 14:14:13 +0000 UTC - event for test-2kztd: {default-scheduler } FailedScheduling: 0/6 nodes are available: 6 Insufficient cpu.
Jun 29 14:19:13.826: INFO: POD         NODE  PHASE    GRACE  CONDITIONS
Jun 29 14:19:13.826: INFO: test-2kztd        Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2020-06-29 14:14:13 +0000 UTC Unschedulable 0/6 nodes are available: 6 Insufficient cpu.}]
Jun 29 14:19:13.827: INFO: 
Jun 29 14:19:13.902: INFO: test-2kztd[e2e-test-topology-manager-q769l].container[test-0].log 

This is surprising, especially considering how simple the test is.
I believe some other test that ran earlier didn't free cluster resources fast enough, so when this test runs, it is resource starved. The next step for me is to investigate the logs to see if there is a common pattern in the tests that run before this one.
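
To illustrate the suspected failure mode, here is a minimal Python sketch of a per-node CPU fit check. It is a simplified model and not the kube-scheduler's actual code: the helper names (parse_cpu_millicores, nodes_that_fit), the node sizes, and the already-requested CPU are all made-up assumptions. Only the 2500 millicore request and the 6-node count come from the log above.

```python
from dataclasses import dataclass


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "2500m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


@dataclass
class Node:
    name: str
    allocatable_m: int  # CPU the node offers to pods, in millicores (assumed)
    requested_m: int    # CPU already requested by pods on the node (assumed)

    def free_m(self) -> int:
        return self.allocatable_m - self.requested_m


def nodes_that_fit(nodes, request: str):
    """Return the names of nodes with enough unrequested CPU for the pod."""
    need = parse_cpu_millicores(request)
    return [n.name for n in nodes if n.free_m() >= need]


# Hypothetical cluster: 6 nodes where leftover pods still hold 1200m each.
nodes = [Node(f"node-{i}", allocatable_m=3500, requested_m=1200) for i in range(6)]

fits = nodes_that_fit(nodes, "2500m")
print(f"{len(fits)}/{len(nodes)} nodes can fit the pod")
```

With these invented numbers each node has 2300m free, so a 2500m request fits nowhere while a 2000m request would fit everywhere. That is why lowering the request could hide the problem without explaining which pods are still holding the CPU.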

Comment 4 Francesco Romani 2020-06-30 18:42:07 UTC
The test was gating and blocking progress, so https://github.com/openshift/origin/pull/25225 was merged. However, we still need to understand what broke; I'll keep investigating.

Comment 8 Weinan Liu 2020-07-14 16:04:50 UTC
Hi Walid and Gabe,

Any progress on verifying the issue?

Comment 9 Gabe Montero 2020-07-14 16:46:45 UTC
I'll defer to Walid as QA contact, but I just ran this search: https://search.ci.openshift.org/?search=Configured+cluster+with+non-gu+workload&maxAge=48h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job

All of the 4.6 hits are for passing tests, so I'm fine with verifying.

Comment 10 Francesco Romani 2020-07-15 13:57:21 UTC
Explanation of what broke and of the fix we delivered: https://github.com/openshift/origin/pull/25231

Comment 16 errata-xmlrpc 2020-10-27 16:09:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196