Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1797015

Summary: endurance cluster went unhealthy
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Monitoring
Assignee: Paul Gier <pgier>
Status: CLOSED WORKSFORME
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.3.z
CC: alegrand, anpicker, aos-bugs, ccoleman, erooth, jokerman, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1808358 (view as bug list)
Environment:
Last Closed: 2020-04-30 12:30:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1808358

Description Ben Parees 2020-01-31 18:29:30 UTC
Description of problem:
Our endurance cluster went unhealthy after about six days. Each day we run the e2e test suite against it.

When it first went unhealthy, a node reported NotReady. All nodes are now reporting Ready again, but there are 360+ pods stuck in "Terminating", along with many lingering e2e namespaces.

I've stopped new e2e runs against the cluster, but the existing pods are not going away.
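With access to the affected cluster, the stuck pods could be enumerated with something like the following sketch (assumes `kubectl` access; `oc` behaves the same on OpenShift, and `<namespace>`/`<pod>` are placeholders, not values from this bug):

```shell
# Count pods stuck in Terminating across all namespaces
# (STATUS is the 4th column of the default pod listing).
kubectl get pods --all-namespaces --no-headers \
  | awk '$4 == "Terminating"' \
  | wc -l

# Inspect one stuck pod for finalizers that may be blocking deletion.
kubectl -n <namespace> get pod <pod> -o yaml | grep -A3 finalizers
```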



Version-Release number of selected component (if applicable):
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.3.0-0.nightly-2020-01-26-194903
Kubernetes Version: v1.16.2


Cluster credentials and node access can be provided upon request.

Comment 2 Ben Parees 2020-02-10 23:27:11 UTC
Still seeing 304 pods in Terminating as of today; I don't think the count has changed significantly in the last week-plus.

Comment 3 Ryan Phillips 2020-02-26 22:07:20 UTC
Looks like the node exporter is running without resource limits. The node exporter pod should have some sort of reasonable limit on it so it does not consume a majority of the resources on the system. Perhaps reassign to the monitoring team?

Comment 4 Ben Parees 2020-02-26 22:10:47 UTC
It might be worth filing a bug against them (I've already emailed to ask whether they plan to backport the resource constraints they set in 4.4 to 4.3), but I'd still expect the cluster to recover even if a pod temporarily consumed a huge amount of memory. The nodes have plenty of free memory currently, no?

Comment 5 Ryan Phillips 2020-02-26 22:43:48 UTC
After more triage, I believe part of the issue is fixed by the kubepods.slice cgroup memory limit fix (https://github.com/openshift/origin/pull/24596, https://bugzilla.redhat.com/show_bug.cgi?id=1802687). With the cgroup limit in place, a rogue pod will be correctly OOM-killed.
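The fix above caps the memory of all pod cgroups under kubepods.slice. On a cgroup-v1 node, the effective cap could be checked with something like this sketch (path assumes cgroup v1, which was the default at the time; not taken from this bug):

```shell
# On the node (e.g. after `oc debug node/<node>` and `chroot /host`),
# print the memory cap applied to kubepods.slice, the parent cgroup
# containing every pod's cgroup. A huge sentinel value means no cap.
cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
```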

The second part of the fix is setting a resource limit on the node-exporter. From what I can tell, the monitoring operator does not set resource limits on the node exporter. I'm reassigning this ticket to the monitoring team with a request to add a pod resource limit to the node-exporter.
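One way to confirm the missing limits, assuming the daemonset is named `node-exporter` in the `openshift-monitoring` namespace (the default layout; sketch, not output from this bug):

```shell
# Print the resources stanza of each container in the node-exporter
# daemonset; an absent "limits" key means the container runs uncapped.
kubectl -n openshift-monitoring get daemonset node-exporter \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
```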

Comment 19 Ben Parees 2020-04-29 21:57:18 UTC
No, I don't have a current cluster where this can be observed.