Created attachment 1577379 [details]
cluster listings

Description of problem:
After a long run period (>30 days), the cloud credentials operator on a starter cluster exited with OOMKilled.

Version-Release number of selected component (if applicable):
4.1.0-rc.4

How reproducible:
Unknown

Steps to Reproduce:
1. Cluster with 928 namespaces
2. Allow the cluster to run for > 30 days
3.

Actual results:
Pod OOMKilled and sitting in ContainerCreating.

Expected results:
A 3.x starter cluster could have > 15k projects (but I expect run time is a factor here).

Additional info:
see attached listings
Created attachment 1582752 [details]
kubelet log for wedged cloud-credential-operator pod
- https://bugzilla.redhat.com/show_bug.cgi?id=1711402 is tracking the fix for the memory limits for this pod.
- https://bugzilla.redhat.com/show_bug.cgi?id=1722604 was discovered as the likely root cause of the excessive memory use.
- https://bugzilla.redhat.com/show_bug.cgi?id=1701326#c7 looks like it is tracking the kubelet error: https://bugzilla.redhat.com/attachment.cgi?id=1582752
- I've not seen the hostname-changing issue on 4.1.2, so I'm assuming this has been fixed.

*** This bug has been marked as a duplicate of bug 1722604 ***
I think the OOM will persist purely because of the number of namespaces, due to https://bugzilla.redhat.com/show_bug.cgi?id=1723892, which is proposed for backport to 4.1.z.