Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1508716

Summary: [master_public_1033]Can't release the used quota when delete the resource
Product: OpenShift Container Platform
Reporter: zhou ying <yinzhou>
Component: Master
Assignee: Dan Mace <dmace>
Status: CLOSED ERRATA
QA Contact: Wang Haoran <haowang>
Severity: high
Priority: medium
Version: unspecified
CC: aos-bugs, jokerman, mfojtik, mmccomas
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2019-01-30 15:10:24 UTC
Type: Bug

Description zhou ying 2017-11-02 02:59:38 UTC
Description of problem:
After creating an object count quota and a resource, deleting the resource does not release the used quota count.

Version-Release number of selected component (if applicable):
v1.9.0-alpha.2.36+4aaf39a5c0bc3d

How reproducible:
always

Steps to Reproduce:
1. Set up a Kubernetes environment.
2. Create an object count quota:
  `kubectl create quota test --hard=count/deployments.extensions=2,count/replicasets.extensions=4,count/pods=3,count/secrets=4`
  Output: `resourcequota "test" created`
3. Create a resource:
  `kubectl run nginx --image=nginx --replicas=2`
4. Check the object count quota.
5. Delete the resource.
6. Check the object count quota again.


Actual results:
6. After the resource is deleted, the used quota count is still not released.


Expected results:
6. The used quota count should be released.


Additional info:

Comment 1 zhou ying 2017-11-02 05:28:58 UTC
Can't reproduce this issue every time.

Comment 2 zhou ying 2017-11-02 05:40:55 UTC
[root@ip-172-18-11-218 kubernetes]# cluster/kubectl.sh get po 
No resources found.
[root@ip-172-18-11-218 kubernetes]# cluster/kubectl.sh describe quota
Name:                         test
Namespace:                    default
Resource                      Used  Hard
--------                      ----  ----
count/deployments.extensions  0     2
count/pods                    2     3
count/replicasets.extensions  0     4
count/secrets                 1     4
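
The mismatch in the output above (no pods listed, yet `count/pods` shows 2 used) can be checked mechanically. The Go sketch below is only an illustration: `parseUsed` and `overcharged` are hypothetical helpers, the table text is copied from the output above, and the live count of 0 pods is taken from the `get po` result. It assumes the three-column `Resource / Used / Hard` layout shown here.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// quotaTable is the `describe quota` output from this comment.
const quotaTable = `Name:                         test
Namespace:                    default
Resource                      Used  Hard
--------                      ----  ----
count/deployments.extensions  0     2
count/pods                    2     3
count/replicasets.extensions  0     4
count/secrets                 1     4`

// parseUsed (hypothetical helper) reads the Used column of each
// three-field data row; header and separator rows are skipped because
// their second field is not an integer.
func parseUsed(table string) map[string]int {
	used := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(table))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 3 {
			continue
		}
		n, err := strconv.Atoi(fields[1])
		if err != nil {
			continue
		}
		used[fields[0]] = n
	}
	return used
}

// overcharged (hypothetical helper) returns the resources whose recorded
// usage exceeds the live object count. Only resources present in live
// are compared.
func overcharged(used, live map[string]int) []string {
	var out []string
	for name, n := range live {
		if used[name] > n {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// `get po` returned "No resources found.", so the live pod count is 0.
	live := map[string]int{"count/pods": 0}
	fmt.Println(overcharged(parseUsed(quotaTable), live)) // prints [count/pods]
}
```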

Comment 3 Derek Carr 2017-11-02 14:21:11 UTC
Are you sure the pods were not still terminating?

Comment 4 zhou ying 2017-11-03 02:16:22 UTC
Derek Carr:
    Yes, please see comment 2 (https://bugzilla.redhat.com/show_bug.cgi?id=1508716#c2): `get po` showed no pods, but the used quota was not released. After the cluster is restarted, the issue can no longer be reproduced.

Comment 7 Dan Mace 2018-01-09 20:34:31 UTC
I was able to reproduce suspicious behavior on master with a simpler setup:

$ openshift start master ...

$ oc new-project test
$ oc create quota cm --hard=count/configmaps=1
$ oc create cm test --from-literal=foo=bar && oc delete cm/test && oc describe quota

Repeat the `oc create cm test ...` command until quota becomes stale; the user will be overcharged for what seems like forever:

$ oc get cm && oc describe quota
No resources found.
Name:             cm
Namespace:        test
Resource          Used  Hard
--------          ----  ----
count/configmaps  1     1

Restarting the master corrects the issue. Setting the quota resync period to something shorter had no apparent effect:

kubernetesMasterConfig:
  controllerArguments:
    resource-quota-sync-period:
    - "1m"

I'll begin debugging.

Comment 8 Dan Mace 2018-01-10 20:18:19 UTC
There are actually several confounding issues here which need untangled:

1. Origin is starting up two ResourceQuotaController instances (one in its own controller manager, and another via kube's controller manager)
2. The ResourceQuotaController has a worker pool deadlock condition I just discovered, related to periodic discovery resync
3. The Origin ResourceQuotaController instance isn't setting up periodic discovery resync, avoiding the deadlock issue (but breaking discovery sync)
4. The other ResourceQuotaController instance via the kube controller manager has discovery resync enabled, triggering the deadlock bug, and often the kube ResourceQuotaController will steal all the work from the Origin instance (as they share informers)

There will probably be at least three things to do from here:

1. Fix the upstream ResourceQuotaController deadlock issue
2. Fix ResourceQuotaController bootstrapping in origin so we're only starting one controller
3. Ensure ResourceQuotaController discovery sync is enabled during bootstrapping for the one controller we decide to start

I'll provide updates here with origin and kube issues/PRs as I get them sorted out.

Comment 9 Dan Mace 2018-01-10 21:45:58 UTC
Upstream PR to fix the deadlock: https://github.com/kubernetes/kubernetes/pull/58107

Comment 10 Dan Mace 2018-01-11 15:44:02 UTC
Upstream 1.9 backport: https://github.com/kubernetes/kubernetes/pull/58158

Comment 11 Dan Mace 2018-01-11 15:45:23 UTC
(In reply to Dan Mace from comment #8)
> There are actually several confounding issues here which need untangled:
> 
> 1. Origin is starting up two ResourceQuotaController instances (one in its
> own controller manager, and another via kube's controller manager)

Turns out this is normal.

> 3. The Origin ResourceQuotaController instance isn't setting up periodic
> discovery resync, avoiding the deadlock issue (but breaking discovery sync)

Turns out sync isn't necessary for origin's controller (as the upstream controller already handles custom resources via discovery sync).

So, the only work product resulting from this bug will be the deadlock fix.

Comment 13 Wang Haoran 2018-02-11 07:43:58 UTC
Verified with:
openshift v3.9.0-0.42.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

Comment 15 errata-xmlrpc 2019-01-30 15:10:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0098