Bug 1851623

Summary:	[Serial][sig-node][Feature:TopologyManager] Configured cluster with non-gu workload should run with no regressions with single pod, single container requesting multiple cores [Suite:openshift/conformance/serial] consistently failing in 4.6/master
Product:	OpenShift Container Platform	Reporter:	Gabe Montero <gmontero>
Component:	Node	Assignee:	Francesco Romani <fromani>
Status:	CLOSED ERRATA	QA Contact:	Walid A. <wabouham>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.6	CC:	aos-bugs, carangog, ddharwar, fromani, fsimonce, jokerman, mifiedle, msivak, rphillips, wabouham, weinliu, yjoseph
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-10-27 16:09:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1857220

Description Gabe Montero 2020-06-27 15:28:51 UTC

See https://search.apps.build01.ci.devcluster.openshift.com/?search=Configured+cluster+with+non-gu+workload&maxAge=24h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Also, 2 runs from 2 separate PRs of mine in openshift/origin where those PRs are only modifying unrelated tests

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25014/pull-ci-openshift-origin-master-e2e-aws-serial/1276838464473534464

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25191/pull-ci-openshift-origin-master-e2e-aws-serial/1276851544263757824

Started noticing churn on this at 5 PM Eastern, Friday June 27

First instance from one of my PRs:  https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25191/pull-ci-openshift-origin-master-e2e-aws-serial/1276622124408115200

Comment 3 Francesco Romani 2020-06-30 08:11:33 UTC

The failing test is among the simplest for topology manager, and perhaps in general: to check for regressions when topology manager is enabled, it runs a single pod with a single container which requests 2500 millicores. Simple as that.
The test wants to request >= 2 cores, so we could lower the request a bit, but I think this is not the right direction.

The test fails with:

Jun 29 14:19:13.752: INFO: At 2020-06-29 14:14:13 +0000 UTC - event for test-2kztd: {default-scheduler } FailedScheduling: 0/6 nodes are available: 6 Insufficient cpu.
Jun 29 14:19:13.826: INFO: POD         NODE  PHASE    GRACE  CONDITIONS
Jun 29 14:19:13.826: INFO: test-2kztd        Pending         [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2020-06-29 14:14:13 +0000 UTC Unschedulable 0/6 nodes are available: 6 Insufficient cpu.}]
Jun 29 14:19:13.827: INFO: 
Jun 29 14:19:13.902: INFO: test-2kztd[e2e-test-topology-manager-q769l].container[test-0].log 

Which is surprising, especially considering how simple the test is.
I believe some other test that ran earlier didn't free enough cluster resources fast enough, so when this test runs, it is resource starved. My next step is to investigate the logs to see if there is a common pattern in the tests that run before this one.
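
To illustrate the hypothesis above, here is a minimal sketch of the CPU fit check behind the "Insufficient cpu" message. It is a simplified model, not the actual scheduler code: the Node class, the schedule function, and all capacity numbers are hypothetical, and only the 2500 millicore request and the 6-node count come from the log above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical, simplified view of a worker node's CPU accounting."""
    name: str
    allocatable_millicores: int
    # CPU requests (in millicores) of pods still bound to this node,
    # e.g. pods left over from a previous test that were not yet deleted.
    pod_requests_millicores: List[int] = field(default_factory=list)

    def free_millicores(self) -> int:
        # CPU that can still be requested: allocatable minus existing requests.
        return self.allocatable_millicores - sum(self.pod_requests_millicores)


def schedule(nodes: List[Node], request_millicores: int) -> Tuple[List[str], str]:
    """Return the names of nodes that fit the request, plus a failure message.

    The message imitates the one in the log; it assumes every rejected node
    is rejected for CPU, which is the only resource modelled here.
    """
    fitting = [n.name for n in nodes if n.free_millicores() >= request_millicores]
    if fitting:
        return fitting, ""
    total = len(nodes)
    return fitting, f"0/{total} nodes are available: {total} Insufficient cpu."


if __name__ == "__main__":
    # Assumed numbers: 6 nodes with 3500m allocatable each, every one still
    # carrying 1500m of requests from an earlier test (2000m free < 2500m).
    nodes = [Node(f"node-{i}", 3500, [1500]) for i in range(6)]
    print(schedule(nodes, 2500))

    # Once the leftover pod on one node is gone, the same request fits there.
    nodes[0].pod_requests_millicores.clear()
    print(schedule(nodes, 2500))
```

In this model the request is rejected on every node purely because earlier pods still hold their CPU requests; nothing about the test pod itself is wrong, which matches the suspicion that the failure depends on what ran before.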

Comment 4 Francesco Romani 2020-06-30 18:42:07 UTC

The test was gating and blocking progress, so https://github.com/openshift/origin/pull/25225 was merged. However, we still need to understand what broke; I'll keep investigating.

Comment 8 Weinan Liu 2020-07-14 16:04:50 UTC

Hi Walid, and Gabe,

Any progress on verifying the issue?

Comment 9 Gabe Montero 2020-07-14 16:46:45 UTC

I'll defer to Walid as QA contact, but I just ran https://search.ci.openshift.org/?search=Configured+cluster+with+non-gu+workload&maxAge=48h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job

and the 4.6 hits are only for passing tests

so I'm fine with verifying

Comment 10 Francesco Romani 2020-07-15 13:57:21 UTC

Explanation of what broke, and about the fix we delivered: https://github.com/openshift/origin/pull/25231

Comment 16 errata-xmlrpc 2020-10-27 16:09:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196