Bug 1808588 - rangeallocations.data is never updated when a project is removed
Summary: rangeallocations.data is never updated when a project is removed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Maciej Szulik
QA Contact: Mike Fiedler
URL:
Whiteboard:
Duplicates: 1730434 1850144
Depends On:
Blocks: 1858798
 
Reported: 2020-02-28 21:29 UTC by Ryan Howe
Modified: 2023-12-15 17:25 UTC (History)
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The UID range allocation is never updated when a project is removed; only restarting the kube-controller-manager pod triggered the repair procedure that cleared the range.
Consequence: It is possible to exhaust the UID range on a cluster with high namespace create+remove turnover.
Fix: Periodically run the repair job.
Result: The UID range allocation should be freed periodically (currently every 8 hours), which should not require additional kube-controller-manager restarts and should ensure the range is not exhausted.
Clone Of:
Clones: 1858798
Environment:
Last Closed: 2020-10-27 15:55:50 UTC
Target Upstream Version:
Embargoed:


Attachments
cluster-policy-controller log (1.27 MB, text/plain)
2020-07-07 06:01 UTC, Brendan Shirren
scc-uid rangeallocation (114.09 KB, text/plain)
2020-07-10 01:22 UTC, Brendan Shirren


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-policy-controller pull 28 0 None closed Bug 1808588: add UID deallocation logic 2021-01-29 08:19:39 UTC
Red Hat Knowledge Base (Solution) 4951691 0 None None None 2020-04-01 23:08:29 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:55:55 UTC

Comment 6 Brendan Shirren 2020-03-31 05:13:56 UTC
Are we sure this is the openshift-apiserver component? It was originally filed under the openshift-controller-manager component.

The only reference I see to the "scc-uid" RangeAllocation is in pkg/security/controller/namespace_scc_allocation_controller.go from openshift-controller-manager.

I suspect the problem might be in func (c *NamespaceSCCAllocationController) allocate(ns *corev1.Namespace):

        // do uid allocation.  We reserve the UID we want first, lock it in etcd, then update the namespace.
        // We allocate by reading in a giant bit int bitmap (one bit per offset location), finding the next step,
        // then calculating the offset location


I believe the function above updates the bitmap in the "scc-uid" RangeAllocation, and it is only the "scc-uid" RangeAllocation that shows this behaviour.


[1] https://github.com/openshift/openshift-controller-manager/blob/79fb7a5f3d8417766529150986eb5648bf65b733/pkg/security/controller/namespace_scc_allocation_controller.go#L139
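
For illustration only, here is a rough sketch of that allocation scheme, treating the RangeAllocation data as a big-integer bitmap with one bit per UID block offset. The function and variable names are hypothetical and do not match the actual code in [1]:

package main

import (
    "fmt"
    "math/big"
)

// allocateNextUIDBlock finds the first clear bit in the bitmap, sets it to
// reserve that UID block, and returns its offset. Running out of clear bits
// corresponds to the "uid range exceeded" error seen later in this bug.
func allocateNextUIDBlock(bitmap *big.Int, maxOffsets int) (int, bool) {
    for offset := 0; offset < maxOffsets; offset++ {
        if bitmap.Bit(offset) == 0 {
            bitmap.SetBit(bitmap, offset, 1) // reserve this block
            return offset, true
        }
    }
    return 0, false // bitmap full: "uid range exceeded"
}

func main() {
    bitmap := big.NewInt(0)
    if offset, ok := allocateNextUIDBlock(bitmap, 100000); ok {
        fmt.Println("allocated offset", offset)
    }
}

As the bug summary says, nothing updates the data when a project is removed, so between restarts the allocation only ever grows.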

Comment 13 Maciej Szulik 2020-05-14 11:44:01 UTC
There's a repair method that is called on restart, so, while not ideal, I can suggest restarting the kube-controller-manager pods,
which contain cluster-policy-controller. That repair method scans the current namespaces and fixes the existing
range allocations to match the namespace state. That's all I can suggest for now.

I doubt we'll be able to fix this in the short term; from what I've learned it's not an easy task to do right away, so we'll have to
tackle this as an RFE if needed, since that functionality never existed other than the Repair mentioned above.
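
(For illustration only: a minimal sketch of how such a repair could be run on a timer instead of only at startup, which is the direction the eventual fix took per the Doc Text above. runRepair is a hypothetical placeholder; the real wiring lives in cluster-policy-controller.)

package main

import (
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// runRepair is a hypothetical stand-in for the repair procedure described
// above: scan the current namespaces and rebuild the scc-uid RangeAllocation
// so it matches the namespaces that actually exist.
func runRepair() {
    // ... reconcile rangeallocations.security.openshift.io scc-uid here ...
}

func main() {
    stopCh := make(chan struct{})
    // Run the repair immediately and then on every tick (the shipped fix uses
    // an 8 hour interval), so freeing released UID ranges no longer depends on
    // kube-controller-manager restarts.
    wait.Until(runRepair, 8*time.Hour, stopCh)
}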

Comment 14 Maciej Szulik 2020-06-04 15:03:49 UTC
*** Bug 1730434 has been marked as a duplicate of this bug. ***

Comment 17 Maciej Szulik 2020-06-18 10:03:20 UTC
This is being actively worked on.

Comment 18 Steve Kuznetsov 2020-06-24 17:09:30 UTC
*** Bug 1850144 has been marked as a duplicate of this bug. ***

Comment 19 Brendan Shirren 2020-07-03 00:42:04 UTC
controller-manager logs:

E0701 15:13:21.517402       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 15:23:21.516920       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 15:33:21.518902       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 15:43:21.518227       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 15:53:21.519810       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 16:03:21.519510       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 16:13:21.520017       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 16:18:50.805760       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 16:28:50.806520       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
E0701 16:38:50.807027       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)


I0620 06:49:57.542182       1 ingress.go:294] Starting controller
I0620 06:49:57.565701       1 factory.go:85] deploymentconfig controller caches are synced. Starting workers.
E0620 06:49:57.567247       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
goroutine 38416 [running]:
github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1ef6e00, 0x25b9310)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1ef6e00, 0x25b9310)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/openshift-controller-manager/vendor/github.com/openshift/library-go/pkg/apps/appsutil.SetCancelledByNewerDeployment(...)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/github.com/openshift/library-go/pkg/apps/appsutil/util.go:313
github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig.(*DeploymentConfigController).cancelRunningRollouts.func1(0x7f7d07243eb0, 0x0)
        /go/src/github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig/deploymentconfig_controller.go:382 +0x19c
github.com/openshift/openshift-controller-manager/vendor/k8s.io/client-go/util/retry.OnError.func1(0x2038500, 0x2221f01, 0xc00068a080)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/client-go/util/retry/util.go:64 +0x3c
github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x4014000000000000, 0x3fb999999999999a, 0x4, 0x0, 0xc00068a080, 0x413798, 0x30)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:292 +0x51
github.com/openshift/openshift-controller-manager/vendor/k8s.io/client-go/util/retry.OnError(0x989680, 0x4014000000000000, 0x3fb999999999999a, 0x4, 0x0, 0x234b3f8, 0xc0022ae480, 0xc0022ae450, 0xc0019bb548)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/client-go/util/retry/util.go:63 +0xb2
github.com/openshift/openshift-controller-manager/vendor/k8s.io/client-go/util/retry.RetryOnConflict(...)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/client-go/util/retry/util.go:83
github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig.(*DeploymentConfigController).cancelRunningRollouts(0xc000642280, 0xc0008e87e0, 0xc00190c058, 0x1, 0x1, 0xc0021fc120, 0x1, 0x0)
        /go/src/github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig/deploymentconfig_controller.go:364 +0x1c2
github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig.(*DeploymentConfigController).Handle(0xc000642280, 0xc0008e87e0, 0x0, 0x0)
        /go/src/github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig/deploymentconfig_controller.go:152 +0x2180
github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig.(*DeploymentConfigController).work(0xc000642280, 0xc00044bb00)
        /go/src/github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig/factory.go:222 +0x1f8
github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig.(*DeploymentConfigController).worker(0xc000642280)
        /go/src/github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig/factory.go:195 +0x2b
github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0019b6050)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0019b6050, 0x3b9aca00, 0x0, 0x1, 0xc00045c660)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc0019b6050, 0x3b9aca00, 0xc00045c660)
        /go/src/github.com/openshift/openshift-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig.(*DeploymentConfigController).Run
        /go/src/github.com/openshift/openshift-controller-manager/pkg/apps/deploymentconfig/factory.go:88 +0x1af

I0620 06:49:57.590185       1 buildconfig_controller.go:200] Starting buildconfig controller
I0620 06:49:57.630722       1 templateinstance_finalizer.go:193] Starting TemplateInstanceFinalizer controller
I0620 06:49:57.639704       1 templateinstance_controller.go:296] Starting TemplateInstance controller

Comment 20 Brendan Shirren 2020-07-03 05:47:34 UTC
Apologies, please ignore my previous update; it seems unrelated to this issue. No further logs to provide currently.

Comment 21 Brendan Shirren 2020-07-07 05:33:22 UTC
kube-controller-manager pod cluster-policy-controller container:

2020-06-20T06:52:49.42737378Z E0620 06:52:49.427336       1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: the server could not find the requested resource (get rangeallocations.security.openshift.io scc-uid)
2020-06-20T06:52:49.433217038Z E0620 06:52:49.433190       1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: the server could not find the requested resource (get rangeallocations.security.openshift.io scc-uid)

2020-06-30T20:52:51.274460847Z E0630 20:52:51.274409       1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: Operation cannot be fulfilled on 
namespaces "calico-system": the object has been modified; please apply your changes to the latest version and try again

2020-07-01T10:04:21.311371413Z E0701 10:04:21.311315       1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: uid range exceeded
2020-07-01T10:04:21.323634787Z E0701 10:04:21.323608       1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: uid range exceeded
2020-07-01T10:04:21.339554799Z E0701 10:04:21.339525       1 namespace_scc_allocation_controller.go:334] error syncing namespace, it will be retried: uid range exceeded

Comment 22 Brendan Shirren 2020-07-07 06:01:28 UTC
Created attachment 1700107 [details]
cluster-policy-controller log

Comment 23 Maciej Szulik 2020-07-07 08:26:21 UTC
Some of the retry errors will be fixed with https://bugzilla.redhat.com/show_bug.cgi?id=1829327 (4.4) https://bugzilla.redhat.com/show_bug.cgi?id=1829328 (4.3).
The uid range error is separate and is currently being worked on.

Comment 27 Maciej Szulik 2020-07-09 11:01:58 UTC
I'm currently working on addressing the comments Clayton left on that PR.

Comment 28 Brendan Shirren 2020-07-10 01:22:08 UTC
Created attachment 1700506 [details]
scc-uid rangeallocation

scc-uid RangeAllocation from a cluster affected by the UID range exceeded issue

Comment 35 Mike Fiedler 2020-07-27 18:34:22 UTC
Verified on 4.6.0-0.nightly-2020-07-25-091217

On 4.5.z, after creating and deleting thousands of projects multiple times, the count from

oc get rangeallocations scc-uid -o yaml | grep -o "/" | wc -l

increased continuously.

Running on 4.6.0-0.nightly-2020-07-25-091217, after deleting projects the same count (oc get rangeallocations scc-uid -o yaml | grep -o "/" | wc -l) returned to 17, which was the pre-test number.

Comment 36 Arvin Amirian 2020-09-01 21:50:16 UTC
I have a case of this where the UID count continually increases in a brand new 4.5 cluster. I initially ran into it right after cluster creation with 4.5.3 and had to reset the count manually right away. Now a new 4.5.7 cluster is exhibiting the same behavior, but the increase in the UID count is a little slower.

oc get rangeallocations scc-uid -o yaml | grep -o "/" | wc -l
4882


oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.7     True        False         141m    Cluster version is 4.5.7

Comment 37 Maciej Szulik 2020-09-02 09:37:48 UTC
> I have a case of this where the uid continually increases in a brand new 4.5
> cluster. Initially ran into it right after cluster creation with 4.5.3 and
> would have to reset the count manually right away. Now a new 4.5.7
> exhibiting the same behavior but the increase in uid count is a little
> slower.

The increase will always happen, as that's how the mechanism works; it will only release the unused allocations
once they are released, i.e. when the namespace/project is removed.
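
(To make that concrete, a hypothetical sketch of the release side that the linked "add UID deallocation logic" PR introduced: when a namespace is removed, the bit covering its UID block is cleared so the offset can be handed out again. Illustration only, not the actual cluster-policy-controller code.)

package main

import (
    "fmt"
    "math/big"
)

// releaseUIDBlock clears the bit for a removed namespace's UID block so that
// offset becomes available to future namespaces, keeping the count from
// growing without bound.
func releaseUIDBlock(bitmap *big.Int, offset int) {
    bitmap.SetBit(bitmap, offset, 0)
}

func main() {
    bitmap := big.NewInt(0)
    bitmap.SetBit(bitmap, 17, 1) // pretend offset 17 was allocated to a project
    releaseUIDBlock(bitmap, 17)  // project deleted: free the block
    fmt.Println("offset 17 still allocated:", bitmap.Bit(17) == 1)
}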

Comment 38 Arvin Amirian 2020-09-11 15:00:14 UTC
So what is the solution for a brand new cluster encountering this? The stats I posted are typical and the odd time (1/10) it will be north of 10k and nothing can be deployed on the cluster.

Comment 39 Maciej Szulik 2020-09-14 10:50:43 UTC
(In reply to aamirian from comment #38)
> So what is the solution for a brand new cluster encountering this? The stats
> I posted are typical and the odd time (1/10) it will be north of 10k and
> nothing can be deployed on the cluster.

This was backported to all previous versions, so I'd suggest upgrading the cluster.

Comment 40 Arvin Amirian 2020-09-18 16:56:44 UTC
Backported to which version? We are running 4.5.7 and the issue still exists.

Comment 41 W. Trevor King 2020-09-18 17:07:43 UTC
The 4.5 backport landed in 4.5.5 [1], so open a new bug for 4.5.7 issues?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1858798#c8

Comment 45 errata-xmlrpc 2020-10-27 15:55:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

