See https://search.apps.build01.ci.devcluster.openshift.com/?search=Configured+cluster+with+non-gu+workload&maxAge=24h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Also, 2 runs from 2 separate PRs of mine in openshift/origin, where those PRs only modify unrelated tests:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25014/pull-ci-openshift-origin-master-e2e-aws-serial/1276838464473534464
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25191/pull-ci-openshift-origin-master-e2e-aws-serial/1276851544263757824

I started noticing churn on this around 5 PM Eastern on Friday, June 27. First instance from one of my PRs:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/pr-logs/pull/25191/pull-ci-openshift-origin-master-e2e-aws-serial/1276622124408115200
The failing test is among the simplest for topology manager, and perhaps in general: to check for non-regression when the topology manager is enabled, it wants to run a single pod with a single container that requests 2500 millicores. Simple as that. The test wants to request >= 2 cores, so we could narrow down the request a bit, but I don't think that is the right direction.

The test fails with:

Jun 29 14:19:13.752: INFO: At 2020-06-29 14:14:13 +0000 UTC - event for test-2kztd: {default-scheduler } FailedScheduling: 0/6 nodes are available: 6 Insufficient cpu.
Jun 29 14:19:13.826: INFO: POD NODE PHASE GRACE CONDITIONS
Jun 29 14:19:13.826: INFO: test-2kztd Pending [{PodScheduled False 0001-01-01 00:00:00 +0000 UTC 2020-06-29 14:14:13 +0000 UTC Unschedulable 0/6 nodes are available: 6 Insufficient cpu.}]
Jun 29 14:19:13.827: INFO:
Jun 29 14:19:13.902: INFO: test-2kztd[e2e-test-topology-manager-q769l].container[test-0].log

This is surprising, especially considering how simple the test is. I believe some other test that ran before it didn't free cluster resources fast enough, so when this test ran, it was resource starved. My next step is to investigate the logs to see if there is a common pattern in the tests run before this one.
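For reference, a minimal sketch (in Go, roughly the shape of an e2e pod definition; the names, image and command below are illustrative assumptions, not the actual test code) of the kind of pod this test asks the scheduler to place: a single container with a 2500m CPU request.

// Sketch only: a single-container pod requesting 2500 millicores.
// The scheduler reports "Insufficient cpu" when no node has that much
// allocatable CPU left.
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func nonGuWorkloadPod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "test-", // the failing pod shows up as e.g. "test-2kztd"
		},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers: []v1.Container{
				{
					Name:    "test-0",
					Image:   "busybox", // placeholder image, not the test's
					Command: []string{"sleep", "3600"},
					Resources: v1.ResourceRequirements{
						// Requests only (no limits), i.e. a non-guaranteed pod.
						Requests: v1.ResourceList{
							v1.ResourceCPU: resource.MustParse("2500m"),
						},
					},
				},
			},
		},
	}
}

With 2500m requested, a single busy or not-yet-cleaned-up node can be enough to leave all 6 nodes below the threshold, which matches the "0/6 nodes are available: 6 Insufficient cpu" event above.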
The test was gating and blocking progress, so https://github.com/openshift/origin/pull/25225 was merged. However, we still need to understand what broke; I'll keep investigating.
Hi Walid and Gabe, any progress on verifying the issue?
I'll defer to Walid as QA contact, but I just ran https://search.ci.openshift.org/?search=Configured+cluster+with+non-gu+workload&maxAge=48h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job and the 4.6 hits are only for passing tests, so I'm fine with verifying.
Explanation of what broke, and of the fix we delivered: https://github.com/openshift/origin/pull/25231
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196