Bug 1730434

Summary: api.ci ran out of UID ranges to assign to newly created namespaces
Product: OpenShift Container Platform
Component: apiserver-auth
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: urgent
Priority: urgent
Reporter: Petr Muller <pmuller>
Assignee: Stefan Schimanski <sttts>
QA Contact: Wei Sun <wsun>
CC: aos-bugs, ccoleman, maszulik, mfojtik, nagrawal, padillon, skuznets
Type: Bug
Last Closed: 2020-06-04 15:03:49 UTC

Description Petr Muller 2019-07-16 17:40:15 UTC
Description of problem:

On July 15, 2019, all CI jobs based on ci-operator were failing because the namespace SCC allocation controller ran out of UID ranges to assign to the temporary namespaces created by ci-operator, so Pods in those namespaces could not be scheduled.

According to Clayton, running out of UID ranges should not happen as long as there are not too many namespaces existing simultaneously; if it does happen, it could indicate a bug in the code that is supposed to reclaim unused ranges.
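
For context on the mechanism: the range controller stamps each new namespace with an openshift.io/sa.scc.uid-range annotation carved out of a single cluster-wide UID range, and with the usual defaults (1000000000-1999999999 in blocks of 10000) only about 100k blocks exist. Below is a minimal, illustrative sketch of that bookkeeping; it is not the actual controller code, and the allocator type and function names are made up for illustration.

// Illustrative-only sketch (not the actual OpenShift controller code) of the
// bookkeeping behind the openshift.io/sa.scc.uid-range annotation: the default
// cluster-wide range 1000000000-1999999999 is handed out in 10000-UID blocks,
// so at most 100000 namespaces can hold a block at once. If blocks belonging
// to deleted namespaces are never released, the pool eventually drains.
package main

import (
    "errors"
    "fmt"
)

const (
    rangeStart = 1000000000 // first UID of the cluster-wide range (default)
    blockSize  = 10000      // UIDs given to each namespace (default)
    blockCount = 100000     // (1999999999 - 1000000000 + 1) / blockSize
)

// allocator is a hypothetical stand-in for the controller's range allocator.
type allocator struct {
    used map[int]string // block index -> namespace currently holding it
    next int            // where to start scanning for a free block
}

func (a *allocator) allocate(ns string) (string, error) {
    for i := 0; i < blockCount; i++ {
        idx := (a.next + i) % blockCount
        if _, taken := a.used[idx]; !taken {
            a.used[idx] = ns
            a.next = idx + 1
            // Annotation value format is "<firstUID>/<size>".
            return fmt.Sprintf("%d/%d", rangeStart+idx*blockSize, blockSize), nil
        }
    }
    // This is the state api.ci reached: every block still marked as in use.
    return "", errors.New("uid range is full")
}

// release must run when a namespace is deleted; the bug is about this
// reclamation not keeping up (or not happening) on a long-lived cluster.
func (a *allocator) release(ns string) {
    for idx, owner := range a.used {
        if owner == ns {
            delete(a.used, idx)
        }
    }
}

func main() {
    a := &allocator{used: map[int]string{}}
    r, _ := a.allocate("ci-op-example")
    fmt.Println("openshift.io/sa.scc.uid-range:", r) // e.g. 1000000000/10000
    a.release("ci-op-example")
}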

Additional information:

The post-mortem of the failure is here: https://docs.google.com/document/d/1TIvAlE8nGAcrJ5If_EqYKsie4W4ldT9MhYwpF8Bn16I/edit#heading=h.hv3tj1sat0ir

The Slack thread with the investigation process is here: https://coreos.slack.com/archives/CBN38N3MW/p1563194038029200

Unfortunately, we restarted the masters to apply some changes during the investigation process, so we lost the relevant logs. I will attach a fragment from one of the master apiservers that I have available, if that's of any help.

Comment 3 Patrick Dillon 2020-01-05 03:02:02 UTC
This problem recurred today. Here is a failed test from the first failing PR I see: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.3/750

Hongkai Liu restarted the master controllers and now tests are running.

Slack thread: https://coreos.slack.com/archives/CEKNRGF25/p1578188048085100

Comment 4 Michal Fojtik 2020-05-19 13:12:43 UTC
This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale".

If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 5 Petr Muller 2020-05-25 09:37:02 UTC
Apparently this is still a problem on the api.ci cluster; the last logged occurrence seems to be Patrick's comment above on 2020-05-01.

Comment 6 Petr Muller 2020-05-25 10:01:13 UTC
Heh, Patrick's comment was actually from January, not May, but because the world likes its stories, we just had an occurrence *today*.

Comment 7 Clayton Coleman 2020-06-01 13:25:30 UTC
This happens when namespaces are created and deleted. CI creates ~12-14k namespaces per week, and the default config allows only ~100k namespaces to be uniquely allocated, so this will fail roughly every 8 weeks. Restarting the controller restarts the clock.
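
Spelling out that arithmetic (a throwaway back-of-the-envelope check, nothing more):

// Back-of-the-envelope check of the figures above: ~100000 allocatable
// blocks divided by 12-14k namespaces per week gives roughly 7-8 weeks
// before the range is exhausted.
package main

import "fmt"

func main() {
    const blocks = 100000.0
    for _, perWeek := range []float64{12000, 14000} {
        fmt.Printf("%.0f namespaces/week -> exhausted in ~%.1f weeks\n", perWeek, blocks/perWeek)
    }
}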

If this is not fixed in 4.4... it needs to be fixed. Bumping both urgency and severity because this is a "cluster stops working after 8 weeks" issue.

This may not be marked stale.

Comment 8 Maciej Szulik 2020-06-04 15:03:49 UTC
I don't think we'll be fixing this in 3.11, and there's already bug 1808588 tracking the exact same thing, so I'm closing this as a duplicate. The quick-and-dirty workaround prior to 4.5 is to restart kube-controller-manager, which triggers the repair method responsible for cleaning up the unused ranges.
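
For intuition about what that repair does, here is an illustrative-only sketch (not the actual kube-controller-manager code): on startup the set of in-use blocks is rebuilt solely from the ranges still annotated on existing namespaces, so blocks that belonged to deleted namespaces drop out of the allocated set.

// Illustrative-only sketch of the repair-on-restart idea: rebuild the
// "allocated" set from ranges still present on live namespaces, so ranges
// whose namespaces are gone become free again. Values below are made up.
package main

import "fmt"

func main() {
    // Hypothetical openshift.io/sa.scc.uid-range values read back from the
    // namespaces that still exist at controller startup ("<firstUID>/<size>").
    liveRanges := []string{"1000000000/10000", "1000020000/10000"}

    allocated := map[string]bool{}
    for _, r := range liveRanges {
        allocated[r] = true // only ranges with a surviving owner stay allocated
    }
    fmt.Println("blocks still considered allocated after repair:", len(allocated))
}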

*** This bug has been marked as a duplicate of bug 1808588 ***