Description of problem:
On July 15, 2019, all CI jobs based on ci-operator failed because the Namespace SCC Allocation Controller ran out of UID ranges to assign to the temporary namespaces that ci-operator creates. As a result, Pods could not be scheduled in those namespaces.
According to Clayton, UID range exhaustion should not happen unless a very large number of namespaces exist simultaneously, so its occurrence could indicate a bug in the code that is supposed to reclaim the ranges.
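To make the "should not happen" reasoning concrete, here is a rough sketch of how one could compare the allocator's capacity with the ranges actually held by live namespaces. It assumes the allocator range is the commonly used default of 1000000000-1999999999/10000 (not confirmed for this cluster) and that each namespace records its block in the openshift.io/sa.scc.uid-range annotation. The helper names are made up for illustration.

```python
import json

# Annotation where a namespace's assigned UID block is recorded.
UID_RANGE_ANNOTATION = "openshift.io/sa.scc.uid-range"

# Assumed allocator range; the value on the affected cluster may differ.
ASSUMED_ALLOCATOR_RANGE = "1000000000-1999999999/10000"


def total_blocks(spec):
    """Return how many blocks a '<start>-<end>/<block_size>' range provides."""
    bounds, block_size = spec.split("/")
    start, end = bounds.split("-")
    return (int(end) - int(start) + 1) // int(block_size)


def blocks_in_use(namespace_list):
    """Collect the distinct UID blocks held by namespaces in a parsed
    'oc get namespaces -o json' document."""
    used = set()
    for item in namespace_list.get("items", []):
        annotations = item.get("metadata", {}).get("annotations") or {}
        value = annotations.get(UID_RANGE_ANNOTATION)
        if value:
            used.add(value)
    return used


def summarize(namespace_json, spec=ASSUMED_ALLOCATOR_RANGE):
    """Return (blocks held by live namespaces, total blocks available)."""
    used = blocks_in_use(json.loads(namespace_json))
    return len(used), total_blocks(spec)
```

Feeding the output of `oc get namespaces -o json` to `summarize` gives the two numbers to compare. Under the assumed range there are 100000 blocks, so if the controller reports exhaustion while live namespaces hold far fewer blocks than that, the difference would point to blocks that were never released.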
The post-mortem of the failure is here: https://docs.google.com/document/d/1TIvAlE8nGAcrJ5If_EqYKsie4W4ldT9MhYwpF8Bn16I/edit#heading=h.hv3tj1sat0ir
The Slack thread with the investigation process is here: https://coreos.slack.com/archives/CBN38N3MW/p1563194038029200
Unfortunately, we restarted the masters during the investigation to apply some changes, so the relevant logs were lost. I will attach a fragment that I still have from one of the master apiservers, in case it helps.
This problem recurred today. Here is a failed test from the first failing PR I can see: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/750
Hongkai Liu restarted the master controllers, and tests are now running.
Slack thread: https://coreos.slack.com/archives/CEKNRGF25/p1578188048085100