Created attachment 1174656 [details]
screenshot

Description of problem:
Tried the EAP app on a new project, got this problem:

Completed, with errors
Failed to create asd in project asdasd.
Cannot create service "asd". services "asd" is forbidden: Status unknown for quota: object-counts.
Created image stream "asd" successfully.
Created route "asd" successfully.
Created build config "asd" successfully.
Created deployment config "asd" successfully.

Please see screenshot attached.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Upon initial project creation, existing quota'ed objects are not accounted for and hence the platform prevents new objects (under quota) from being created until it can figure out what quota is consumed and what is available.
The problem is that if you attempt to create a template, some resources are created (those not under quota) while others are blocked (quota'ed resources). Now, if a user attempts to recreate the same template after quota is reconciled (in 30s or so), the platform will create what is missing and error out on resources that have already been created (based on name conflicts). However, users can't be expected to do that, as it is quite confusing.
Ran into what seems like the same issue this morning:

Steps to reproduce:
Created account
Selected Jenkins Quickstart

Errors:
Completed, with errors
Failed to create jenkins-persistent in project robotmaxtron.
Cannot create persistent volume claim "jenkins". persistentvolumeclaims "jenkins" is forbidden: Status unknown for quota: object-counts.
Cannot create service "jenkins". services "jenkins" is forbidden: Status unknown for quota: object-counts.
(In reply to Max Whittingham from comment #4)
> Ran into what seems like the same issue this morning:
>
> Steps to reproduce:
> Created account
> Selected Jenkins Quickstart
>
> Errors:
> Completed, with errors
> Failed to create jenkins-persistent in project robotmaxtron.
> Cannot create persistent volume claim "jenkins". persistentvolumeclaims
> "jenkins" is forbidden: Status unknown for quota: object-counts.
> Cannot create service "jenkins". services "jenkins" is forbidden: Status
> unknown for quota: object-counts.

Yes, it looks like the same issue.
This is a bug. There are two potential resolutions:

1. We improve the error message that gets displayed in this case to highlight that this might be a transient issue and that the user can retry the operation in a minute. Not optimal, but it's a quick fix, or perhaps a stopgap arrangement.

2. We block any/all resource creation until the quota is computed. This is more involved: we'll have to consider the messaging in the UI while resource creation is blocked, and how the status (initial quota computation) can be reflected in the UI to inform the user that resources can now be created.

I'll work on option 1 and reach out to folks to figure out if there is a viable approach for option 2.
*** Bug 1354651 has been marked as a duplicate of this bug. ***
I vote for option #2 provided we don't block the creation indefinitely (with a timeout perhaps).
Quota reconciliation appears to take a lot longer than 30 seconds. It's more like 5 minutes. Is there a way to automatically set the quota usage to zero on project creation, given that there shouldn't be any usage on a blank project?
> I vote for option #2 provided the we don't block the creation indefinitely (with a timeout perhaps).

No, we're already outside the acceptable duration for blocking an API call. We will work to improve the quota reconcile time for new quotas.

> Is there a way to automatically set the quota usage to zero on project creation, given that there shouldn't be any usage on a blank project?

No, quota creation is done asynchronously, right after project creation occurs. The code that handles that creation has no way to know what other resources are in the project (and in the case of automated processes creating things in the project, it might not be empty).
> We will work to improve the quota reconcile time for new quotas.

Is this a code issue or simply a load issue with the preview?

> No, quota creation is done asynchronously, right after project creation
> occurs. The code that handles that creation has no way to know what other
> resources are in the project (and in the case of automated processes
> creating things in the project, it might not be empty)

At the possible risk of sounding naive, if quota calculation is done asynchronously, could you explain how the new project case is different to resource creation at any later stage? If this is how it works, you would seem to be able to run over the quota at any time until the quota is recalculated (and resources are killed/deleted?).
> Is this a code issue or simply a load issue with the preview?

The way quota reconciliation is queued does not prioritize quotas with unknown status, which means on an environment with large numbers of quotas, new quotas with unknown status have to wait for reconciliation along with all other quotas. They should be prioritized, because their unknown status blocks all creation of the resources they quota.

> At the possible risk of sounding naive, if quota calculation is done
> asynchronously, could you explain how the new project case is different to
> resource creation at any later stage? If this is how it works, you would seem
> to be able to run over the quota at any time until the quota is recalculated
> (and resources are killed/deleted?).

There are three stages:

1. When a quota object is created, its status starts as "unknown". While in that state, objects controlled by the quota are prevented from being created. An async process counts objects controlled by that quota and updates the status with the used count.

2. When an object is created, any controlling quota object's status is incremented by 1 for that object type. This requires the status to not be in the "unknown" state. It also has the potential to increment usage counts artificially, in the rare case where the object passes admission but is not actually persisted.

3. An async process iterates over all quota objects, continuously recalculates the used counts, and updates status. This eventually corrects any artificially high quota usage counts.

The issue is that stage 1 and stage 3 share a queue. Unknown quotas should take priority over the "recalculate and correct" queue.
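The three stages can be sketched as a toy model (hypothetical names; the real logic lives in the Kubernetes ResourceQuota admission plugin and quota controller, which are written in Go). The key behavior is that a quota whose usage is still unknown rejects any create of a resource it tracks:

```python
# Toy model of the quota lifecycle described above. `UNKNOWN` stands in for
# a status.used field that the async controller has not yet populated.

UNKNOWN = None  # usage not yet calculated by the async controller

class Quota:
    def __init__(self, hard):
        self.hard = hard                        # e.g. {"services": 10}
        self.used = {r: UNKNOWN for r in hard}  # stage 1: starts unknown

    def reconcile(self, existing_objects):
        # Stages 1 and 3: the async controller counts objects the quota
        # tracks and records the result in status.used.
        for resource in self.hard:
            self.used[resource] = existing_objects.get(resource, 0)

    def admit(self, resource):
        # Stage 2: synchronous check at object-creation time.
        if resource not in self.hard:
            return True                         # not tracked by this quota
        if self.used[resource] is UNKNOWN:
            raise RuntimeError(
                f'{resource} is forbidden: Status unknown for quota')
        if self.used[resource] + 1 > self.hard[resource]:
            raise RuntimeError(f'{resource} exceeds quota')
        self.used[resource] += 1                # optimistic increment
        return True

quota = Quota({"services": 10})
try:
    quota.admit("services")          # rejected: usage still unknown
except RuntimeError as err:
    print(err)
quota.reconcile({"services": 3})     # async calculation completes
print(quota.admit("services"))       # now admitted
```

This also makes the failure window visible: between quota creation and the first reconcile, every create of a tracked resource fails, which is exactly the "Status unknown for quota: object-counts" error in this report.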
The proposal is to update the quota controller to prioritize quotas whose status is unknown in a separate queue from the normal quota sync queue. This should reduce the latency of initial quota calculation.
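The proposed scheme can be sketched as two queues (an assumed structure for illustration, not the actual controller code): quotas whose status is still unknown go into a dedicated queue that workers drain before the regular resync queue, so initial usage calculation is not stuck behind thousands of routine recalculations.

```python
from collections import deque

priority_queue = deque()   # newly created quotas with unknown status
resync_queue = deque()     # periodic recalculation of all other quotas

def enqueue(quota):
    # A quota with no status.used yet jumps the line.
    if quota["used"] is None:
        priority_queue.append(quota)
    else:
        resync_queue.append(quota)

def next_work_item():
    # Workers always prefer the priority queue.
    if priority_queue:
        return priority_queue.popleft()
    if resync_queue:
        return resync_queue.popleft()
    return None

# Thousands of existing quotas are already waiting for a routine resync...
for i in range(4500):
    enqueue({"name": f"quota-{i}", "used": {}})
# ...but a brand-new quota with unknown status is picked up first.
enqueue({"name": "new-project-quota", "used": None})
print(next_work_item()["name"])   # new-project-quota
```

The latency win comes purely from ordering: the amount of work per quota is unchanged, but a new quota no longer waits behind 4500 routine resyncs.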
@Derek - If the current latency is say, 60 seconds, can you provide a rough estimate of the latency after the mentioned change?
@Jacob - not yet. Basically, instead of going to the end of a line with 4500 things in it, it will go to the front of the line.
Based on what we're seeing, it's getting half filled in (hard present, used missing), so it's already near the top of the queue; it's just ending up with missing usage somehow.
After more evaluation, we have determined the following:

1) There are 2 quota controllers (Kubernetes + OpenShift) that are racing on the quota on new-project. The OpenShift one appears to win most often, and will set the status.hard for the quota but not the status.used (since it doesn't know how).

2) The Kubernetes controller is doing a lot more work per quota, and does not differentiate new quotas from any other quota that needs work. If there are 1500 or so projects, we have about 4500 quotas, and there are 5 workers working through that queue. We could look to bump up the number of workers, and we could have a separate queue for creates.

3) Depending on who wins the race in #1, quota calculation for Kubernetes objects lags.

4) We can make a tweak to the controller to ignore evaluators that are not on the quota document. For Online, it's not clear this has any impact on the Kubernetes controller, but it may help the OpenShift controller for images.

Long term, we need to get the quota controller working on shared informers, and we need to see how that functions with lags on rapid delete+add scenarios without reservations per https://github.com/kubernetes/kubernetes/pull/20113
Upstream PR opened to skip evaluating all evaluators per quota. https://github.com/kubernetes/kubernetes/pull/29134 This should improve the amount of time it takes to sync quotas in the queue by ensuring we only make listing calls for resources in that particular quota.
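A simplified illustration of that optimization (hypothetical names; the actual change is in the Go quota controller): the sync loop only issues LIST calls for resource types the quota actually constrains, rather than running every registered evaluator.

```python
LIST_CALLS = []

def list_resources(resource_type):
    LIST_CALLS.append(resource_type)   # stands in for an API-server LIST call
    return []                          # pretend the namespace is empty

ALL_EVALUATORS = ["pods", "services", "secrets", "persistentvolumeclaims",
                  "replicationcontrollers", "configmaps"]

def sync_quota(hard):
    usage = {}
    for resource_type in ALL_EVALUATORS:
        if resource_type not in hard:
            continue                   # skip: this quota doesn't track it
        usage[resource_type] = len(list_resources(resource_type))
    return usage

sync_quota({"services": 10, "secrets": 20})
print(LIST_CALLS)   # only the two tracked types, not all six
```

With thousands of quotas in the queue, dropping the untracked LIST calls per sync multiplies into a substantial reduction in API-server load and queue latency.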
Pulls for priority queuing of initial usage calculations. https://github.com/kubernetes/kubernetes/pull/29133 https://github.com/openshift/origin/pull/9915
I confirmed our double controller theory on prod. In openshift 3.3, the controllers have been collapsed into a single controller, so you'll never end up in a half-evaluated state, but there's probably more to do in admission. I think I'd say that any quota that is trying to quota the thing being created (spec.hard), should block object creation until the usage is satisfied. That doesn't seem to be the case today.
OSE PR to improve quota controller performance by reducing number of queries: https://github.com/openshift/ose/pull/309
3.2 pick to reduce latency using a secondary queue for out of date spec/status/usage: https://github.com/openshift/ose/pull/310
The quota controller was updated to be more responsive:

1) When creating a new quota in a project, calculation of that quota's usage is prioritized over other quotas in the system, to improve the new-project experience.

2) When calculating usage stats for a quota, the controller was made more efficient to reduce unnecessary LIST calls against the API server for resources not tracked by the quota.
Version-Release: STG (openshift v3.2.1.10-1-g668ed0a, kubernetes v1.2.0-36-g4a3f9c5, etcd 2.2.5)

We can't test multiple quotas in a project or create multiple projects with a common account due to STG restrictions. We tested creating and deleting a template repeatedly in a project to check how quickly quota usage for pod-related resources is calculated, and whether resources like secrets and PVCs affect resource recreation. Generally, mem/cpu quota usage was released within 0~3s, the "Status unknown" error didn't appear, and the app could run in the end. So I'm moving the bug to verified, thank you all.

Here is a verification clip:

[root@qwang_laptop qwang]# oc new-project test; oc describe project test; oc new-app eap64-mysql-persistent-s2i; oc describe project test; oc delete all --all; oc describe project test; sleep 3; oc describe project test; oc new-app eap64-mysql-persistent-s2i; oc describe project test

Now using project "test" on server "https://api.dev-preview-stg.openshift.com:443".

You can add applications to this project with the 'new-app' command. For example, try:

    $ oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-hello-world.git

to build a new hello-world application in Ruby.
Name:			test
Created:		5 seconds ago
Labels:			<none>
Annotations:		openshift.io/description=
			openshift.io/display-name=
			openshift.io/requester=qwang1
			openshift.io/sa.scc.mcs=s0:c84,c24
			openshift.io/sa.scc.supplemental-groups=1007020000/10000
			openshift.io/sa.scc.uid-range=1007020000/10000
Display Name:		<none>
Description:		<none>
Status:			Active
Node Selector:		<none>
Quota:
	Name:		compute-resources
	Resource	Used	Hard
	--------	----	----
	limits.cpu	0	4
	limits.memory	0	2Gi

	Name:		compute-resources-timebound
	Resource	Used	Hard
	--------	----	----
	limits.cpu	0	3
	limits.memory	0	1536Mi

	Name:			object-counts
	Resource		Used	Hard
	--------		----	----
	persistentvolumeclaims	0	2
	replicationcontrollers	0	50
	secrets			9	20
	services		0	10
Resource limits:
	Name:		resource-limits
	Type		Resource	Min	Max	Default
	----		--------	---	---	---
	Pod		cpu		29m	2	-
	Pod		memory		150Mi	1Gi	-
	Container	cpu		29m	2	1
	Container	memory		150Mi	1Gi	512Mi

--> Deploying template "eap64-mysql-persistent-s2i" in project "openshift" for "eap64-mysql-persistent-s2i"

     With parameters:
	Memory Limit=512Mi
	MySQL Memory Limit=512Mi
	Application Name=eap-app
	HTTP Hostname=
	HTTPS Hostname=
	Source Repository URL=https://github.com/jboss-openshift/openshift-quickstarts
	Source Repository Reference=1.2
	Context Directory=todolist/todolist-jdbc
	JNDI Name=java:jboss/datasources/TodoListDS
	Database Name=root
	Volume Capacity=1Gi
	Queue Names=
	Topic Names=
	HTTPS Secret=eap-app-secret
	HTTPS Keystore=keystore.jks
	HTTPS Name=jboss
	HTTPS Password=mykeystorepass
	Minimum Database Pool Size=
	Maximum Database Pool Size=
	Database Transaction Isolation=
	Lower Case Table Names=
	Maximum Connections=
	Minimum Word Length=
	Maximum Word Length=
	MySQL AIO=
	HornetQ Cluster Administrator Password=WUauFSfW # generated
	Database Username=userEby # generated
	Database User Password=pVc3NQJu # generated
	GitHub Webhook Secret=Unp0VjMt # generated
	Generic Build Webhook Secret=0sCtTBIK # generated
	ImageStream Namespace=openshift
	JGroups Encryption Secret=eap-app-secret
	JGroups Encryption Keystore=jgroups.jceks
	JGroups Encryption Name=secret-key
	JGroups Ecryption Password=password
	JGroups Cluster Password=pWo1Trj2 # generated
	Custom Maven Mirror URL=https://mirror.openshift.com/nexus/content/groups/public/

--> Creating resources with label app=eap-app ...
    service "eap-app" created
    service "secure-eap-app" created
    service "eap-app-mysql" created
    route "eap-app" created
    route "secure-eap-app" created
    imagestream "eap-app" created
    buildconfig "eap-app" created
    serviceaccount "eap-service-account" created
    secret "eap-app-secret" created
    deploymentconfig "eap-app" created
    deploymentconfig "eap-app-mysql" created
    persistentvolumeclaim "eap-app-mysql-claim" created
--> Success
    Build scheduled, use 'oc logs -f bc/eap-app' to track its progress.
    Run 'oc status' to view your app.

Name:			test
Created:		20 seconds ago
Labels:			<none>
Annotations:		openshift.io/description=
			openshift.io/display-name=
			openshift.io/requester=qwang1
			openshift.io/sa.scc.mcs=s0:c84,c24
			openshift.io/sa.scc.supplemental-groups=1007020000/10000
			openshift.io/sa.scc.uid-range=1007020000/10000
Display Name:		<none>
Description:		<none>
Status:			Active
Node Selector:		<none>
Quota:
	Name:		compute-resources
	Resource	Used	Hard
	--------	----	----
	limits.cpu	0	4
	limits.memory	0	2Gi

	Name:		compute-resources-timebound
	Resource	Used	Hard
	--------	----	----
	limits.cpu	2	3
	limits.memory	1Gi	1536Mi

	Name:			object-counts
	Resource		Used	Hard
	--------		----	----
	persistentvolumeclaims	1	2
	replicationcontrollers	1	50
	secrets			13	20
	services		3	10
Resource limits:
	Name:		resource-limits
	Type		Resource	Min	Max	Default
	----		--------	---	---	---
	Pod		cpu		29m	2	-
	Pod		memory		150Mi	1Gi	-
	Container	cpu		29m	2	1
	Container	memory		150Mi	1Gi	512Mi

buildconfig "eap-app" deleted
imagestream "eap-app" deleted
deploymentconfig "eap-app" deleted
deploymentconfig "eap-app-mysql" deleted
route "eap-app" deleted
route "secure-eap-app" deleted
service "eap-app" deleted
service "eap-app-mysql" deleted
service "secure-eap-app" deleted
pod "eap-app-1-build" deleted
pod "eap-app-mysql-1-deploy" deleted

Name:			test
Created:		47 seconds ago
Labels:			<none>
Annotations:		openshift.io/description=
			openshift.io/display-name=
			openshift.io/requester=qwang1
			openshift.io/sa.scc.mcs=s0:c84,c24
			openshift.io/sa.scc.supplemental-groups=1007020000/10000
			openshift.io/sa.scc.uid-range=1007020000/10000
Display Name:		<none>
Description:		<none>
Status:			Active
Node Selector:		<none>
Quota:
	Name:		compute-resources
	Resource	Used	Hard
	--------	----	----
	limits.cpu	0	4
	limits.memory	0	2Gi

	Name:		compute-resources-timebound
	Resource	Used	Hard
	--------	----	----
	limits.cpu	0	3
	limits.memory	0	1536Mi

	Name:			object-counts
	Resource		Used	Hard
	--------		----	----
	persistentvolumeclaims	1	2
	replicationcontrollers	0	50
	secrets			13	20
	services		0	10
Resource limits:
	Name:		resource-limits
	Type		Resource	Min	Max	Default
	----		--------	---	---	---
	Pod		cpu		29m	2	-
	Pod		memory		150Mi	1Gi	-
	Container	cpu		29m	2	1
	Container	memory		150Mi	1Gi	512Mi

Name:			test
Created:		54 seconds ago
Labels:			<none>
Annotations:		openshift.io/description=
			openshift.io/display-name=
			openshift.io/requester=qwang1
			openshift.io/sa.scc.mcs=s0:c84,c24
			openshift.io/sa.scc.supplemental-groups=1007020000/10000
			openshift.io/sa.scc.uid-range=1007020000/10000
Display Name:		<none>
Description:		<none>
Status:			Active
Node Selector:		<none>
Quota:
	Name:		compute-resources
	Resource	Used	Hard
	--------	----	----
	limits.cpu	0	4
	limits.memory	0	2Gi

	Name:		compute-resources-timebound
	Resource	Used	Hard
	--------	----	----
	limits.cpu	0	3
	limits.memory	0	1536Mi

	Name:			object-counts
	Resource		Used	Hard
	--------		----	----
	persistentvolumeclaims	1	2
	replicationcontrollers	0	50
	secrets			13	20
	services		0	10
Resource limits:
	Name:		resource-limits
	Type		Resource	Min	Max	Default
	----		--------	---	---	---
	Pod		cpu		29m	2	-
	Pod		memory		150Mi	1Gi	-
	Container	cpu		29m	2	1
	Container	memory		150Mi	1Gi	512Mi

--> Deploying template "eap64-mysql-persistent-s2i" in project "openshift" for "eap64-mysql-persistent-s2i"

     With parameters:
	Memory Limit=512Mi
	MySQL Memory Limit=512Mi
	Application Name=eap-app
	HTTP Hostname=
	HTTPS Hostname=
	Source Repository URL=https://github.com/jboss-openshift/openshift-quickstarts
	Source Repository Reference=1.2
	Context Directory=todolist/todolist-jdbc
	JNDI Name=java:jboss/datasources/TodoListDS
	Database Name=root
	Volume Capacity=1Gi
	Queue Names=
	Topic Names=
	HTTPS Secret=eap-app-secret
	HTTPS Keystore=keystore.jks
	HTTPS Name=jboss
	HTTPS Password=mykeystorepass
	Minimum Database Pool Size=
	Maximum Database Pool Size=
	Database Transaction Isolation=
	Lower Case Table Names=
	Maximum Connections=
	Minimum Word Length=
	Maximum Word Length=
	MySQL AIO=
	HornetQ Cluster Administrator Password=NrUMkcEp # generated
	Database Username=userQlb # generated
	Database User Password=ODS1BoeD # generated
	GitHub Webhook Secret=PSgsLjel # generated
	Generic Build Webhook Secret=R0UyGsrR # generated
	ImageStream Namespace=openshift
	JGroups Encryption Secret=eap-app-secret
	JGroups Encryption Keystore=jgroups.jceks
	JGroups Encryption Name=secret-key
	JGroups Ecryption Password=password
	JGroups Cluster Password=PKFbI51d # generated
	Custom Maven Mirror URL=https://mirror.openshift.com/nexus/content/groups/public/

--> Creating resources with label app=eap-app ...
    service "eap-app" created
    service "secure-eap-app" created
    service "eap-app-mysql" created
    route "eap-app" created
    route "secure-eap-app" created
    imagestream "eap-app" created
    buildconfig "eap-app" created
    error: serviceaccounts "eap-service-account" already exists
    error: secrets "eap-app-secret" already exists
    deploymentconfig "eap-app" created
    deploymentconfig "eap-app-mysql" created
    error: persistentvolumeclaims "eap-app-mysql-claim" already exists

Name:			test
Created:		About a minute ago
Labels:			<none>
Annotations:		openshift.io/description=
			openshift.io/display-name=
			openshift.io/requester=qwang1
			openshift.io/sa.scc.mcs=s0:c84,c24
			openshift.io/sa.scc.supplemental-groups=1007020000/10000
			openshift.io/sa.scc.uid-range=1007020000/10000
Display Name:		<none>
Description:		<none>
Status:			Active
Node Selector:		<none>
Quota:
	Name:		compute-resources
	Resource	Used	Hard
	--------	----	----
	limits.cpu	0	4
	limits.memory	0	2Gi

	Name:		compute-resources-timebound
	Resource	Used	Hard
	--------	----	----
	limits.cpu	2	3
	limits.memory	1Gi	1536Mi

	Name:			object-counts
	Resource		Used	Hard
	--------		----	----
	persistentvolumeclaims	2	2
	replicationcontrollers	1	50
	secrets			14	20
	services		3	10
Resource limits:
	Name:		resource-limits
	Type		Resource	Min	Max	Default
	----		--------	---	---	---
	Pod		cpu		29m	2	-
	Pod		memory		150Mi	1Gi	-
	Container	cpu		29m	2	1
	Container	memory		150Mi	1Gi	512Mi
*** Bug 1356877 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1608
The customer is getting the error "Error from server (InternalError): Internal error occurred: services "glusterfs-cluster" is forbidden: status unknown for quota: object-counts" whenever he creates a new project using oc new-project, so I am re-opening this bug since the same issue is occurring in OpenShift v3.9. This might be a regression.

To be precise, the customer looked at the code upstream and reckons that this error is caused by the quota controller's inability to update the stats before new objects are created from the template file. The error message from the log points to the Kubernetes ResourceQuota controller found in k8s.io/kubernetes/plugin/pkg/admission/resourcequota. The message "status unknown for quota: %s" can be found in quotaEvaluator.checkRequest. Within that function, the controller is trying to get usage statistics for a given ResourceQuota. For this case, this suggests a race condition: the project template creates ResourceQuota objects right before it creates the endpoint and service objects for glusterfs-cluster. By the time the latter get created, the resource quota statistics seem not yet available, causing project creation to fail.

I am setting the severity to high, since this is very important for the customer: he is not able to create new projects, which hinders their business. I will attach the logs which I have to date from the customer.
Hi Vedanti, it would be nice if you could provide steps to reproduce this issue, and tell whether it happens every single time a customer wants to create a secret or just sometimes.