Bug 1797015
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | endurance cluster went unhealthy | | |
| Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
| Component: | Monitoring | Assignee: | Paul Gier <pgier> |
| Status: | CLOSED WORKSFORME | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.z | CC: | alegrand, anpicker, aos-bugs, ccoleman, erooth, jokerman, kakkoyun, lcosic, mloibl, pkrupa, surbania |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1808358 (view as bug list) | Environment: | |
| Last Closed: | 2020-04-30 12:30:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1808358 | | |
|
Description
Ben Parees
2020-01-31 18:29:30 UTC
Still seeing 304 pods in Terminating as of today; I don't think it has changed significantly in the last week+.

Looks like the node exporter is running unrestricted, without limits. The node-exporter pod should have some sort of reasonable limit on it so it does not use a majority of the resources on the system. Perhaps reassign to the monitoring team?

It might be worth spawning a bug against them (I've emailed them already to ask if they plan to backport the resource constraints they set in 4.4 to 4.3), but I'd still expect the cluster to recover even if a pod temporarily consumed a huge amount of memory. The nodes have plenty of free memory currently, no?

After more triage, I believe part of the issue is fixed with the kubepods.slice cgroup memory limit fix (https://github.com/openshift/origin/pull/24596, https://bugzilla.redhat.com/show_bug.cgi?id=1802687). The cgroup limit fix will correctly OOM-kill a rogue pod. The second part of this fix is restricting the node-exporter by setting a resource limit on it; from what I can tell, the monitoring operator does not set a resource limit on the node exporter. I'm reassigning this ticket to the monitoring team with the request to add a pod limit to the node-exporter.

I don't have a current cluster where this can be observed, no.
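For context, the limit the comment asks for would look roughly like the following sketch. This is a hypothetical manifest fragment, not the actual change shipped by the cluster-monitoring-operator; the request/limit values shown are illustrative only. With a memory limit set, the kernel's cgroup OOM killer terminates the container if it exceeds the limit, instead of letting it consume node memory unbounded.

```yaml
# Hypothetical sketch: resource requests and a memory limit on the
# node-exporter DaemonSet container. The values below are illustrative
# assumptions, not the constraints actually chosen for 4.4/4.5.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: openshift-monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:latest
        resources:
          requests:
            cpu: 8m        # illustrative value
            memory: 32Mi   # illustrative value
          limits:
            memory: 180Mi  # exceeding this triggers an OOM kill of the container
```

Note that only a memory limit is sketched here; CPU limits throttle rather than kill, so a memory limit is what prevents the "rogue pod consumes a majority of node resources" failure mode described above.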