Description of problem:
- When multiple CronJobs fail, they bring down worker nodes one by one across the entire cluster until all nodes are down and the cluster becomes unstable.

Version-Release number of selected component (if applicable):
- Tested the behavior in 4.8.24
- The customer also reports the behavior in ROSA 4.8.13 (not verified by me).

How reproducible:
100%

Steps to Reproduce:
1. Start with a cluster that has 3 worker nodes.
2. Create several CronJobs using the attached CronJob file (an illustrative sketch is shown below). Once they start failing, the first node goes down within an hour, and within roughly two more hours the entire cluster is down.
3. The failed Jobs do not release memory until the node crashes.

Actual results:
- Nodes enter the NotReady state one by one, and eventually the whole cluster goes down.

Expected results:
- Failed CronJob pods should be terminated and their memory released.

Additional info:
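The customer's CronJob file is attached to the bug and is not reproduced here. For illustration only, a minimal sketch of the kind of manifest that could trigger this behavior: a frequently scheduled CronJob whose pods always fail and that sets no resource requests or limits. The name, schedule, image, and command below are assumptions, not the customer's actual CronJob.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Allow
  jobTemplate:
    spec:
      backoffLimit: 6
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: fail
            image: busybox
            # Exits non-zero, so every run fails and the Job keeps retrying.
            command: ["/bin/sh", "-c", "sleep 30 && exit 1"]
            # No resources.requests/limits are set, so accumulating failed
            # pods can consume node memory unchecked.

With no limits set and new failed pods piling up every minute, memory pressure on the node grows until it becomes NotReady, matching the reported symptom.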
The patch was merged into 4.8 last night. Marking this one as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2042175

*** This bug has been marked as a duplicate of bug 2042175 ***