Description of problem:
Our Dev Sandbox (https://developers.redhat.com/developer-sandbox) runs on an OSD 4.6.8 cluster.
When creating a new application in that cluster, we randomly hit the following error:
"Operation cannot be fulfilled on clusterresourcequotas.quota.openshift.io "<quota_resource_name>": the object has been modified; please apply your changes to the latest version and try again"
The OpenShift Developer Console (web console) team did the initial investigation, and it turned out that the API server randomly returns 409 errors when the UI tries to create various resources: BuildConfig, Service, Route, etc.
Every user in our cluster has a fairly detailed cluster quota covering all of their user namespaces: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml#L6-L39
As you can see, there are count limits for different resources. It looks like creating resources in the user namespaces triggers an update of the quota resource's status, which leads to a 409 conflict that is not handled properly.
You can find more details in JIRA: https://issues.redhat.com/browse/ODC-5390
Version-Release number of selected component (if applicable):
Many of our users complain that they hit this issue, but it's not 100% reproducible.
Steps to Reproduce:
1. Create a cluster quota: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml#L6-L39
2. In the Web Console, in the Developer perspective, go to Add -> Import from Git.
3. Select a builder image (e.g. Node.js) and specify the Git options.
4. Click Create.
5. The error above appears.
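For reference, the quota created in step 1 looks roughly like this. This is a trimmed sketch of the linked template: the object name, selector annotation value, and counts here are illustrative, not the actual Sandbox values (see the linked cluster.yaml for the real list).

```yaml
# Hypothetical, trimmed ClusterResourceQuota sketch based on the linked template
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: for-user1            # hypothetical name
spec:
  selector:
    annotations:
      openshift.io/requester: user1   # selects all of this user's namespaces
  quota:
    hard:
      count/deploymentconfigs.apps.openshift.io: "30"
      count/buildconfigs.build.openshift.io: "30"
      count/routes.route.openshift.io: "10"
      count/services: "10"
      count/secrets: "100"
```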
The dev console team has already done the investigation: https://issues.redhat.com/browse/ODC-5390
We created that BZ after that investigation.
It doesn't seem to be a console/UI issue.
This is affecting our production in a severe way. We also have customer presentations running today, along with additional PR/marketing efforts, so we need this to work. This ticket has now sat around for two days and apparently no one is looking at it. Again, this severely affects our production and interferes with actual customer relationships TODAY. Please take a look.
Dev Console does nothing special here. It just creates a bunch of resources: Deployment, BuildConfig, Route, Service, etc. Resource creation randomly fails, and it can happen if you use `oc` or anything else as well, so it's not specific to the dev console.
Are you saying that ClusterResourceQuotas are useless/broken in OCP? This doesn't look like a limitation to me but rather a bug. Couldn't the admission plugin (or whatever is trying to modify the quota resource) retry when it gets a 409?
Also, regarding the severity of the issue: while it's not 100% reproducible, it seems that about 50% of our users experience it, and many of them can reproduce it almost every time. Our clusters are multi-tenant, so we have to enforce quotas.
Just for your information: to minimize the impact on our users, we dropped most of the count limits from the ClusterResourceQuota. All users in the Sandbox platform are, by default, provisioned with the basic tier (as referenced before: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/basic/cluster.yaml), which now contains only the most critical count limits. As a result, the probability of hitting this error should be minimal for now. However, this is just a temporary workaround, definitely not a long-term solution!
If you need to reproduce the issue directly in the Sandbox platform, please, let us know - we can change your user-account to another tier with different ClusterResourceQuota that contains all needed counter limits.
If you need to reproduce the issue in your OCP cluster, then use this template for creating the ClusterResourceQuota: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml
I can reproduce the same quota error using `oc` and `curl`.
Here's my test scenario.
First, create five YAML files, `1.yaml` ... `5.yaml`, each containing a resource that counts against the quota. I used `Route`.
Run multiple `oc create` commands:
oc create -f 1.yaml & oc create -f 2.yaml & ...
Run multiple `curl` commands:
curl -H "Authorization: Bearer ..." -H "Content-Type: application/yaml" -v --data-binary @./1.yaml https://.../apis/route.openshift.io/v1/namespaces/mytestsnamespace/routes & ...
When using `curl` to create multiple resources simultaneously, I fairly consistently hit the 409 error on the quota resource. This definitely looks like a bug to me, and the onus shouldn't be on the client to retry the request: the client did not modify the quota resource directly and therefore shouldn't have gotten this error.
`oc` took more attempts. I increased the number of files from 5 to 10 when testing with `oc` and eventually got the error.
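For anyone who wants to script the repro above, here's a rough sketch. The namespace, route names, and backing service are placeholders, and the parallel `oc create` step only runs if `oc` is on the PATH; the point is simply that concurrent creations of quota-counted resources contend on the single quota object's status update.

```shell
#!/bin/sh
# Sketch: generate N Route manifests that all count against the same
# ClusterResourceQuota, then create them in parallel to provoke the 409.
NAMESPACE=mytestnamespace   # placeholder

for i in 1 2 3 4 5; do
  cat > "$i.yaml" <<EOF
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: quota-conflict-test-$i
spec:
  to:
    kind: Service
    name: some-service   # placeholder backend
EOF
done

# Fire the creations concurrently; with a count quota on routes this
# fairly reliably surfaces the conflict on the quota resource.
if command -v oc >/dev/null 2>&1; then
  for i in 1 2 3 4 5; do
    oc create -n "$NAMESPACE" -f "$i.yaml" &
  done
  wait
fi
```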
@sudha This is clearly not related to anything DevConsole is doing and needs to be reassigned to whichever team in OpenShift manages the quotas. Since this is impacting DevSandbox service we need to put an "urgent" tag on this.
Every client is expected to retry on 409 errors; that's part of the Kube API contract. So it looks like the dev console is not following the rules.
Nevertheless, the clusterresourcequota admission could do that internally to create less noise, see https://github.com/openshift/apiserver-library-go/pull/42.
Note that https://github.com/openshift/apiserver-library-go/pull/42 is just an attempt to make this situation less likely; we identified that it is not trivial.
The correct fix (which must be done anyway, even with the above PR, since the PR can never remove all possibilities of 409s) is in the dev console: retry on 409.
Moving back to dev console.
In https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml you have more than 20 resources in one quota object. The quota admission logic essentially serializes creations and deletions of all these resources, so it's no surprise that this leads to conflicts.
As a workaround, it would probably help a lot to split the quota object into many more independent quota objects, e.g. one per resource.
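To illustrate the suggested workaround (a sketch only; the names and counts are made up): instead of one object carrying every count limit, each counted resource type gets its own quota object, so concurrent creations of different kinds no longer contend on a single status update.

```yaml
# Hypothetical split: one ClusterResourceQuota per counted resource
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: for-user1-routes
spec:
  selector:
    annotations:
      openshift.io/requester: user1
  quota:
    hard:
      count/routes.route.openshift.io: "10"
---
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: for-user1-services
spec:
  selector:
    annotations:
      openshift.io/requester: user1
  quota:
    hard:
      count/services: "10"
```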
While I acknowledge that as a client I need to handle a 409 conflict response, this is normally an issue with resource updates and is caused by the user/client themselves using stale data to perform the update. In this particular case, we're seeing a 409 on creation.
I've discussed the issue with Sam Padget and we agreed to look at implementing client-side retries. Dev note: the implementation shouldn't blindly retry all 409 responses, because updates that result in a 409 are not retriable; the request must be modified first. For creation, we can blindly repost the same request and make up to N attempts. Sam also pointed out that we should use the same mechanism to blindly retry all requests that receive a 429 error code.
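The create-retry idea can be sketched as follows. This is a shell sketch, not the actual console code: `flaky_create` is a stand-in for the real POST (here it fails twice, simulating 409 conflicts, then succeeds), and the fixed sleep stands in for real backoff that would only fire on 409/429 responses to a create.

```shell
#!/bin/sh
# Sketch of retrying resource *creation* on transient 409/429 responses.
ATTEMPTS=0
flaky_create() {
  # Stand-in for `oc create -f ...` / a curl POST; fails on the first
  # two calls (simulated 409 conflicts) and succeeds on the third.
  ATTEMPTS=$((ATTEMPTS + 1))
  [ "$ATTEMPTS" -ge 3 ]
}

retry_create() {
  max=$1; shift
  i=1
  while :; do
    if "$@"; then
      echo "created after $i attempt(s)"
      return 0
    fi
    if [ "$i" -ge "$max" ]; then
      echo "giving up after $i attempts"
      return 1
    fi
    i=$((i + 1))
    sleep 1   # real code would use backoff and check the HTTP status first
  done
}

retry_create 5 flaky_create
```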
I'd like to see the API server take responsibility for its own conflicts and pursue https://github.com/openshift/apiserver-library-go/pull/42 or a variant thereof, because it is surprising, as a client, to receive a conflict on creation that magically goes away by reissuing the exact same request.
Created attachment 1760065 [details]
Operation can be fulfilled without error when creating OpenShift resources
Verified on build version: 4.8.0-0.nightly-2021-03-01-031258
Browser version: Chrome 84
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.