1730434 – api.ci ran out of UID ranges to assign to newly created namespaces

Summary: api.ci ran out of UID ranges to assign to newly created namespaces

Keywords:
Status:	CLOSED DUPLICATE of bug 1808588
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	apiserver-auth
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Stefan Schimanski
QA Contact:	Wei Sun
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-16 17:40 UTC by Petr Muller
Modified:	2020-06-04 15:03 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-06-04 15:03:49 UTC
Target Upstream Version:
Embargoed:

Description Petr Muller 2019-07-16 17:40:15 UTC

Description of problem:

On July 15, 2019, all CI jobs based on ci-operator were failing because the Namespace SCC Allocation Controller ran out of UID ranges to assign to the temporary namespaces created by ci-operator. As a result, Pods could not be scheduled in these namespaces.

According to Clayton, running out of UID ranges should not happen as long as there are not too many namespaces existing simultaneously; the fact that it did could indicate a bug in the code that is supposed to reclaim the ranges.

Additional information:

The post-mortem of the failure is here: https://docs.google.com/document/d/1TIvAlE8nGAcrJ5If_EqYKsie4W4ldT9MhYwpF8Bn16I/edit#heading=h.hv3tj1sat0ir

The Slack thread with the investigation process is here: https://coreos.slack.com/archives/CBN38N3MW/p1563194038029200

Unfortunately, we restarted the masters to apply some changes during the investigation process, so we lost the relevant logs. I will attach a fragment from one of the master apiservers that I have available, if that's of any help.

Comment 3 Patrick Dillon 2020-01-05 03:02:02 UTC

This problem reoccurred today. This is a failed test from the first failing PR I see: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/750

Hongkai Liu restarted the master controllers and now tests are running.

Slack thread: https://coreos.slack.com/archives/CEKNRGF25/p1578188048085100

Comment 4 Michal Fojtik 2020-05-19 13:12:43 UTC

This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale".

If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 5 Petr Muller 2020-05-25 09:37:02 UTC

Apparently this is still happening on the api.ci cluster; the last logged occurrence seems to be Patrick's comment above on 2020-05-01.

Comment 6 Petr Muller 2020-05-25 10:01:13 UTC

Heh, Patrick's comment was actually from January, not May, but because the world likes its stories, we just had an occurrence *today*.

Comment 7 Clayton Coleman 2020-06-01 13:25:30 UTC

This happens when a namespace is created and deleted.  CI creates ~12-14k namespaces per week.  With the default config, 100k namespaces can be uniquely allocated.  Therefore this will fail roughly every 8 weeks.  If you restart the controller, it restarts the clock.
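
The estimate above can be checked with a little arithmetic. The sketch below assumes the UID allocator range is 1000000000-1999999999/10000 (believed to be the OpenShift default, but not stated in this bug), which yields 100,000 blocks and so matches the 100k figure. The function names are hypothetical and exist only for this illustration.

```python
def parse_uid_range(spec):
    """Parse '<start>-<end>/<block size>' into three integers."""
    bounds, block_size = spec.split("/")
    start, end = bounds.split("-")
    return int(start), int(end), int(block_size)


def total_blocks(spec):
    """Number of per-namespace UID blocks the range can hold."""
    start, end, block_size = parse_uid_range(spec)
    return (end - start + 1) // block_size


def weeks_until_exhaustion(blocks, namespaces_per_week):
    """Weeks until the pool is empty, assuming no block is ever reclaimed."""
    return blocks / namespaces_per_week


# Assumption: default allocator range, not taken from this bug report.
ASSUMED_RANGE = "1000000000-1999999999/10000"

blocks = total_blocks(ASSUMED_RANGE)
for rate in (12_000, 14_000):  # namespaces per week, from the comment above
    weeks = weeks_until_exhaustion(blocks, rate)
    print(f"{rate} namespaces/week: about {weeks:.1f} weeks")
```

At 12k to 14k namespaces per week this works out to somewhere between 7 and 8.5 weeks, consistent with the "every 8 weeks" estimate.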

If this is not fixed in 4.4... it needs to be fixed.  Bumping both urgency and severity because this is a "cluster stops working after 8 weeks".

This may not be marked stale.

Comment 8 Maciej Szulik 2020-06-04 15:03:49 UTC

I don't think we'll be fixing this in 3.11 and there's already bug 1808588 which is tracking that exact same thing. I'm closing this as a duplicate. The quick and dirty workaround up to 4.5 is to restart kube-controller-manager, which triggers the repair method responsible for cleaning the unused ranges.
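
To illustrate why a restart helps, here is a toy model, not the actual kube-controller-manager code. It assumes an allocator that marks a block as used when a namespace is created but does not free it on deletion, plus a repair step that rebuilds the used set from the namespaces that still exist. All class and method names are hypothetical.

```python
class ToyRangeAllocator:
    """Simplified stand-in for a UID range allocator (illustration only)."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.used = set()        # block indexes currently marked as allocated
        self.by_namespace = {}   # live namespace name -> block index

    def create_namespace(self, name):
        # Hand out the lowest block that is not marked as used.
        for block in range(self.total_blocks):
            if block not in self.used:
                self.used.add(block)
                self.by_namespace[name] = block
                return block
        raise RuntimeError("no UID ranges left to allocate")

    def delete_namespace(self, name):
        # Assumed leak: the namespace disappears, but its block stays marked.
        del self.by_namespace[name]

    def repair(self):
        # Rebuild the used set from live namespaces, dropping leaked blocks.
        self.used = set(self.by_namespace.values())


allocator = ToyRangeAllocator(total_blocks=3)
for i in range(3):
    allocator.create_namespace(f"ci-op-{i}")
    allocator.delete_namespace(f"ci-op-{i}")

try:
    allocator.create_namespace("ci-op-new")
except RuntimeError as err:
    print("before repair:", err)

allocator.repair()  # stands in for what a controller restart triggers
print("after repair, block:", allocator.create_namespace("ci-op-new"))
```

In this model no namespaces are alive when the pool runs dry, which is the situation described in this bug: exhaustion depends on the total number of namespaces ever created since the last repair, not on how many exist at once.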

*** This bug has been marked as a duplicate of bug 1808588 ***
