Description  Hradayesh Shukla
2022-02-09 06:54:52 UTC
Description of problem:
- When multiple CronJobs fail, they bring down the worker nodes one by one until every node in the cluster is down and the cluster becomes unstable.
Version-Release number of selected component (if applicable):
- Reproduced the behavior on 4.8.24.
- The customer reports the same behavior on ROSA 4.8.13 (not verified by me).
How reproducible:
100%
Steps to Reproduce:
1. Start with a cluster that has 3 worker nodes.
2. Create several CronJobs from the attached CronJob file. Once their jobs start failing, the first node goes down within about an hour, and within roughly two more hours the entire cluster is down.
3. The failed job pods do not release their memory until the node crashes.
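The attached CronJob file is not reproduced here; as a stand-in, a minimal sketch of a CronJob of this shape (the name, image, and command are placeholders, not the customer's actual workload) might look like:

```yaml
# Hypothetical stand-in for the attached CronJob file: a job that
# always fails, so failed pods accumulate on the worker nodes.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob            # placeholder name
spec:
  schedule: "*/1 * * * *"          # fire every minute
  concurrencyPolicy: Allow
  jobTemplate:
    spec:
      backoffLimit: 6              # default retry count per Job
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: job
            image: registry.access.redhat.com/ubi8/ubi-minimal  # placeholder image
            command: ["/bin/sh", "-c", "exit 1"]                # always fail
```

With a one-minute schedule, each firing leaves additional failed pods behind, which matches the observed accumulation over the first hour.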
Actual results:
- Nodes enter the NotReady state one by one, and eventually the whole cluster goes down.
Expected results:
- Pods from failed CronJob runs should be terminated and their memory released.
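This does not address the underlying leak, but for reference, the CronJob and Job APIs expose fields that bound how many failed pods can accumulate and how much memory each may hold. A hedged sketch (name and image are placeholders; exact field availability depends on the cluster version):

```yaml
# Sketch: spec fields that limit failed-pod retention and per-pod memory.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: bounded-cronjob            # placeholder name
spec:
  schedule: "*/1 * * * *"
  failedJobsHistoryLimit: 1        # keep at most one failed Job around
  jobTemplate:
    spec:
      backoffLimit: 2              # fewer retries per Job
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: job
            image: registry.access.redhat.com/ubi8/ubi-minimal  # placeholder image
            command: ["/bin/sh", "-c", "exit 1"]
            resources:
              limits:
                memory: 128Mi      # cap memory per pod
```

Even with these limits in place, the expectation is that the kubelet reclaims memory from failed pods without operator intervention; the report is that this reclamation is not happening.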
Additional info: