Bug 2052378

Summary: Failed Cronjobs bring down the whole cluster with excessive memory utilization.
Product: OpenShift Container Platform Reporter: Hradayesh Shukla <hshukla>
Component: kube-apiserver Assignee: Abu Kashem <akashem>
Status: CLOSED DUPLICATE QA Contact: Ke Wang <kewang>
Severity: high Docs Contact:
Priority: medium    
Version: 4.8 CC: aos-bugs, dumilbur, grodrigu, jcallen, mbargenq, mfojtik, mharri, nagrawal, sgrunert, wking, xxia
Target Milestone: --- Flags: sgrunert: needinfo-
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-03 13:17:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Hradayesh Shukla 2022-02-09 06:54:52 UTC
Description of problem:
- When multiple CronJobs fail, they bring down the worker nodes one by one until every node in the cluster is down and the cluster becomes unstable.


Version-Release number of selected component (if applicable):
- Tested the behavior in 4.8.24.
- The customer also reported that the behavior occurs in ROSA 4.8.13 (I have not tested it).

How reproducible:
100% 


Steps to Reproduce:
1. The cluster has 3 worker nodes. 
2. Create some CronJobs using the attached CronJob file. Once they start failing, the first node goes down within an hour, and within 2 more hours the entire cluster is down.
3. The failed jobs do not release their RAM until the node crashes.


Actual results:
- Nodes enter NotReady state one by one & eventually the whole cluster goes down completely. 
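
For anyone reproducing this, the sketch below is one possible way to track which nodes have gone NotReady over time. It assumes the input is the parsed JSON of a node list (for example, what `oc get nodes -o json` prints), where each node reports a condition of type "Ready" with status "True", "False", or "Unknown". The function name and the sample data are illustrative and are not taken from this bug or its attachment.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # Pick the Ready condition, if the node reports one at all.
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing condition, "False", and "Unknown" are all treated as NotReady.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


# Illustrative sample shaped like a node list; the node names are made up.
sample = {
    "items": [
        {"metadata": {"name": "worker-0"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "worker-1"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
        {"metadata": {"name": "worker-2"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
}

print(not_ready_nodes(sample))
```

To use it against a live cluster, the JSON node list can be loaded with `json.load` from a file or from standard input and passed to the function at regular intervals while the CronJobs are failing.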


Expected results:
- Failed CronJob pods should be terminated and their memory should be released.


Additional info:

Comment 14 Sascha Grunert 2022-03-03 13:17:50 UTC
The patch was merged into 4.8 last night. Marking this one as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2042175

*** This bug has been marked as a duplicate of bug 2042175 ***