Bug 1730434 - api.ci ran out of UID ranges to assign to newly created namespaces
Status: NEW
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Auth
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Stefan Schimanski
QA Contact: Wei Sun
Depends On:
Reported: 2019-07-16 17:40 UTC by Petr Muller
Modified: 2020-01-05 03:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed:
Target Upstream Version:

Description Petr Muller 2019-07-16 17:40:15 UTC
Description of problem:

On July 15, 2019, all CI jobs based on ci-operator were failing because the Namespace SCC Allocation Controller ran out of UID ranges to assign to the temporary namespaces created by ci-operator. As a result, Pods could not be scheduled in these namespaces.

According to Clayton, the cluster should not run out of UID ranges unless a very large number of namespaces exist at the same time, so the exhaustion could indicate a bug in the code that is supposed to reclaim the ranges.
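
To illustrate the suspected failure mode, here is a minimal sketch of a block allocator that hands out fixed-size UID ranges and runs dry when ranges are never released. This is a toy model, not the actual controller code: the class `UIDRangeAllocator`, the function `simulate`, the `ci-op-N` namespace names, and the three-block pool are all hypothetical and chosen only to keep the example small.

```python
# Toy model of UID range allocation; NOT the real OpenShift controller.
# All names and the pool size below are hypothetical.


class UIDRangeAllocator:
    def __init__(self, start, end, block_size):
        self.start = start
        self.block_size = block_size
        # Number of whole blocks that fit in the inclusive range [start, end].
        self.total_blocks = (end - start + 1) // block_size
        self.owner_by_block = {}  # block index -> namespace name

    def allocate(self, namespace):
        # Hand out the lowest free block, formatted as "<first uid>/<size>".
        for block in range(self.total_blocks):
            if block not in self.owner_by_block:
                self.owner_by_block[block] = namespace
                first_uid = self.start + block * self.block_size
                return f"{first_uid}/{self.block_size}"
        raise RuntimeError("no UID ranges left to allocate")

    def release(self, namespace):
        # Return every block owned by the namespace to the pool.
        for block, owner in list(self.owner_by_block.items()):
            if owner == namespace:
                del self.owner_by_block[block]


def simulate(leak):
    # Pool of 3 blocks; try to create 10 short-lived namespaces in sequence.
    allocator = UIDRangeAllocator(1_000_000_000, 1_000_029_999, 10_000)
    created = 0
    try:
        for i in range(10):
            namespace = f"ci-op-{i}"
            allocator.allocate(namespace)
            created += 1
            if not leak:
                # Namespace deleted and its range reclaimed.
                allocator.release(namespace)
    except RuntimeError:
        pass  # pool exhausted
    return created
```

With reclamation working (`simulate(leak=False)`) all 10 namespaces get a range even though only 3 blocks exist, because only one namespace is alive at a time. With reclamation broken (`simulate(leak=True)`) the pool is exhausted after 3 namespaces. For scale, a range of 1000000000-1999999999 split into blocks of 10000 (which I believe is the 3.11 default, but I have not checked it against the api.ci configuration) would hold 100,000 blocks.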

Additional information:

The post-mortem of the failure is here: https://docs.google.com/document/d/1TIvAlE8nGAcrJ5If_EqYKsie4W4ldT9MhYwpF8Bn16I/edit#heading=h.hv3tj1sat0ir

The Slack thread with the investigation process is here: https://coreos.slack.com/archives/CBN38N3MW/p1563194038029200

Unfortunately, we restarted the masters to apply some changes during the investigation process, so we lost the relevant logs. I will attach a fragment from one of the master apiservers that I have available, if that's of any help.

Comment 3 Patrick Dillon 2020-01-05 03:02:02 UTC
This problem recurred today. Here is a failed test from the first failing PR I can see: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/750

Hongkai Liu restarted the master controllers, and tests are now running again.

Slack thread: https://coreos.slack.com/archives/CEKNRGF25/p1578188048085100
