Description of problem:
On July 15, 2019, all CI jobs based on ci-operator were failing because the Namespace SCC Allocation Controller ran out of UID ranges to assign to the temporary namespaces created by ci-operator, resulting in failure to schedule Pods in these namespaces.
According to Clayton, running out of UID ranges should not happen as long as there are not too many namespaces in existence at the same time; the fact that it did happen could indicate a bug in the code that is supposed to reclaim the ranges.
The post-mortem of the failure is here: https://docs.google.com/document/d/1TIvAlE8nGAcrJ5If_EqYKsie4W4ldT9MhYwpF8Bn16I/edit#heading=h.hv3tj1sat0ir
The Slack thread with the investigation process is here: https://coreos.slack.com/archives/CBN38N3MW/p1563194038029200
Unfortunately, we restarted the masters to apply some changes during the investigation process, so we lost the relevant logs. I will attach a fragment from one of the master apiservers that I have available, if that's of any help.
This problem reoccurred today. This is a failed test from the first failing PR I see: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/750
Hongkai Liu restarted the master controllers and now tests are running.
Slack thread: https://coreos.slack.com/archives/CEKNRGF25/p1578188048085100
This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.
As such, we're marking this bug as "LifecycleStale".
If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
Apparently this is still a problem on the api.ci cluster; the last logged occurrence seems to be Patrick's comment above on 2020-05-01.
Heh, Patrick's comment was actually in January, not May, but because the world likes its stories, we just had an occurrence *today*.
This happens whenever a namespace is created and then deleted. CI creates ~12-14k namespaces per week. With the default config, only 100k namespaces can be uniquely allocated. Therefore this will fail roughly every 8 weeks (100k divided by 12-14k per week is about 7-8 weeks). If you restart the controller, it restarts the clock.
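For anyone who wants to redo the arithmetic with their own numbers, here is a minimal sketch. The 100k capacity and the 12-14k weekly rate come from the comment above; the function name is made up for illustration, and the model assumes no range is ever reclaimed between controller restarts.

```python
def weeks_until_exhaustion(total_blocks: int, namespaces_per_week: int) -> float:
    """Weeks until every UID range has been handed out, assuming none is reclaimed."""
    if namespaces_per_week <= 0:
        raise ValueError("namespaces_per_week must be positive")
    return total_blocks / namespaces_per_week


# Capacity quoted in this bug for the default config.
TOTAL_BLOCKS = 100_000

# Weekly namespace creation rates quoted in this bug for CI.
for rate in (12_000, 14_000):
    weeks = weeks_until_exhaustion(TOTAL_BLOCKS, rate)
    print(f"{rate} namespaces/week -> {weeks:.1f} weeks")
```

A cluster with a lower namespace churn would hit the same wall, just later, so the interval is only a property of the CI workload, not of the bug.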
If this is not fixed in 4.4... it needs to be fixed. Bumping both urgency and severity because this is a "cluster stops working after 8 weeks" bug.
This may not be marked stale.
I don't think we'll be fixing this in 3.11 and there's already bug 1808588 which is tracking that exact same thing. I'm closing this
as a duplicate. The quick and dirty workaround up to 4.5 is to restart kube-controller-manager, which triggers the repair method
responsible for cleaning the unused ranges.
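
To illustrate why a restart helps, here is a toy model of the behaviour described in this bug. It is not the actual controller code: the class and method names are hypothetical, and the only assumptions are the ones stated in the comments above, namely that deleting a namespace leaves its range marked as used and that the repair step frees everything not held by a live namespace.

```python
class ToyRangeAllocator:
    """Toy model of a UID range allocator that leaks ranges on namespace deletion."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.allocated = set()  # block indices currently marked as in use
        self.live = {}          # live namespace name -> block index

    def create_namespace(self, name):
        # Hand out the first block that is not marked as in use.
        for block in range(self.total_blocks):
            if block not in self.allocated:
                self.allocated.add(block)
                self.live[name] = block
                return block
        raise RuntimeError("no UID ranges left to allocate")

    def delete_namespace(self, name):
        # The leak: the namespace goes away, but its block stays marked as in use.
        del self.live[name]

    def repair(self):
        # Stand-in for the repair run on controller start:
        # keep only the blocks that live namespaces still hold.
        self.allocated = set(self.live.values())


alloc = ToyRangeAllocator(total_blocks=3)
for i in range(3):
    alloc.create_namespace(f"ci-op-{i}")
    alloc.delete_namespace(f"ci-op-{i}")

try:
    alloc.create_namespace("ci-op-3")
except RuntimeError as err:
    print("before repair:", err)

alloc.repair()
print("after repair: got block", alloc.create_namespace("ci-op-3"))
```

In this model the allocator is exhausted even though no namespace is alive, and a single repair pass makes every block available again, which matches the "restart resets the clock" observation.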
*** This bug has been marked as a duplicate of bug 1808588 ***