Bug 1920699
Summary: Operation cannot be fulfilled on clusterresourcequotas.quota.openshift.io error when creating different OpenShift resources

Product: OpenShift Container Platform
Reporter: Alexey Kazakov <alkazako>
Component: Dev Console
Assignee: cvogt
Status: CLOSED ERRATA
QA Contact: spathak <spathak>
Severity: high
Docs Contact: Rishu Mehra <rmehra>
Priority: unspecified
Version: 4.6
CC: aballant, akanekar, aos-bugs, eparis, gercan, hmishra, jokerman, mfojtik, mjobanek, mkleinhe, nmukherj, sgutz, spathak, sponnaga, sttts, sudha
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, an API server could fail to create a resource, returning a 409 status code when there was a conflict while updating a `resource quota` resource. Consequently, the resource would fail to create and you might have had to retry the API request. With this update, the `OpenShift Console` web application retries the request up to 3 times when it receives a 409 status code, which is often sufficient for the request to complete. If a 409 status code continues to occur, an error is displayed in the console. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1920699[*BZ#1920699*])
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 22:36:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1928228
Attachments:
Description
Alexey Kazakov
2021-01-26 22:08:38 UTC
The dev console team has already done the investigation: https://issues.redhat.com/browse/ODC-5390. We created this BZ after that investigation, and it doesn't seem to be a console/UI issue. This is affecting our production in a severe way. We also have customer presentations running today, plus additional PR/marketing efforts, so we need this working. This ticket has now sat around for 2 days and apparently no one is looking at it. Again, this severely affects our production and interferes with actual customer relationships TODAY. Please take a look.

For clarification: Dev Console does nothing special here. It just creates a bunch of resources: Deployment, BuildConfig, Route, Service, etc. Resource creation randomly fails, and it can happen if you use `oc` or anything else as well, so it's not specific to the dev console. Are you saying that clusterresourcequotas are useless/broken in OCP? It doesn't look like a limitation to me but rather a bug. Could the admission plugin, or whatever is trying to modify the quota resource, retry if there is a 409?

Also, regarding the severity of the issue: while it's not 100% reproducible, it seems that about 50% of our users experience this issue, and many of them can reproduce it almost 100% of the time. Our clusters are multi-tenant, so we have to enforce quotas.

Just for your information, to minimize the impact on our users we dropped most of the counter limits from the ClusterResourceQuota. All users in the Sandbox platform are - by default - provisioned with the basic tier (as referenced before: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/basic/cluster.yaml), which now contains only the most critical counter limits. As a result, the probability of hitting this error should be minimal for now. However, this is just a temporary workaround, definitely not a long-term solution!
If you need to reproduce the issue directly in the Sandbox platform, please let us know; we can change your user account to another tier with a different ClusterResourceQuota that contains all the needed counter limits. If you need to reproduce the issue in your own OCP cluster, use this template for creating the ClusterResourceQuota: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml

I can reproduce the same quota error using `oc` and `curl`. Here's my test scenario. First create 5 YAML files, `1.yaml` ... `5.yaml`, each containing a resource which is part of the count quota (I used `Route`).

Run multiple `oc create` commands:

    oc create -f 1.yaml & oc create -f 2.yaml & ...

Run multiple `curl` commands:

    curl -H "Authorization: Bearer ..." -H "Content-Type: application/yaml" -v --data-binary @./1.yaml https://.../apis/route.openshift.io/v1/namespaces/mytestsnamespace/routes & ...

When using `curl` to create multiple resources simultaneously, I fairly consistently encounter the 409 error on the quota resource. This definitely seems like a bug to me, and the onus shouldn't be on the client to retry the request: the client did not modify the quota resource directly and therefore shouldn't have gotten this error. `oc` took more attempts; I increased the number of files from 5 to 10 when testing with `oc` and eventually got the error.

@sudha This is clearly not related to anything DevConsole is doing and needs to be reassigned to whichever team in OpenShift manages the quotas. Since this is impacting the DevSandbox service, we need to put an "urgent" tag on this.

Every client is expected to retry 409 errors; that's part of the Kube API contract. So it looks like the dev console is not following the rules. Nevertheless, the clusterresourcequota admission could do that internally to create less noise, see https://github.com/openshift/apiserver-library-go/pull/42.
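For reference, a ClusterResourceQuota with counted resources of roughly this shape is what the quota admission has to update on every matching create or delete. This is an illustrative minimal sketch, not the contents of the template linked above; the object name, selector label, and limits are hypothetical:

```yaml
apiVersion: quota.openshift.io/v1
kind: ClusterResourceQuota
metadata:
  name: for-mytestnamespace          # hypothetical name
spec:
  selector:
    labels:
      matchLabels:
        quota-owner: myuser          # hypothetical selector label
  quota:
    hard:
      # counted resources: every create/delete of a matching resource
      # forces an update of this single quota object, so concurrent
      # creates contend on its resourceVersion
      count/routes.route.openshift.io: "10"
      count/deployments.apps: "20"
      count/buildconfigs.build.openshift.io: "20"
      count/services: "10"
```

The more resource kinds a single quota object counts, the more writes are funneled through that one object, which is why splitting it up reduces the chance of a conflict.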
Note that https://github.com/openshift/apiserver-library-go/pull/42 is just an attempt to make this situation less likely; we identified, though, that this is not trivial. The correct fix (and this must be done anyway, even with the above PR, as the PR can never remove all possibilities of 409s) is in the dev console: retry on 409. Moving back to dev console.

In https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml you have more than 20 resources for one quota object. The quota admission logic basically serializes creations and deletions of all these resources, so it's no surprise that this leads to conflicts. As a workaround, it would probably help a lot to split up the quota object into many more independent quota objects, e.g. one per resource.

While I acknowledge that as a client I need to handle a 409 conflict response, this is normally an issue with resource updates and would be caused by the user/client themselves using stale data to perform an update. In this particular case, we're seeing a 409 on creation. I've discussed the issue with Sam Padget and we agreed to look at implementing client-side retries.

Dev note: The implementation shouldn't blindly retry all 409 requests, because updates that result in a 409 are not retriable; the request must be modified first. For creation, we can blindly repost the same request and make N attempts. Sam also pointed out that we should use the same mechanism to blindly retry all requests that receive a 429 error code.

I'd like to see the API server take responsibility for its own conflict and pursue https://github.com/openshift/apiserver-library-go/pull/42 or a variant thereof, because it is surprising as a client to receive a conflict on creation that can magically go away by reissuing the exact same request.

Created attachment 1760065 [details]
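The retry behavior described in the dev note can be sketched as follows. This is an illustrative Python sketch of the logic only, not the actual console implementation (which is not written in Python); `post` stands in for any zero-argument callable that issues the create request and returns an HTTP status code, and `FlakyServer` is a stand-in that mimics the transient quota conflict:

```python
import time

MAX_ATTEMPTS = 3
# Statuses safe to blindly retry for a *create*: 409 (transient quota
# conflict, per this bug) and 429 (too many requests). A 409 from an
# update must NOT be retried without refreshing the resourceVersion.
RETRIABLE_ON_CREATE = {409, 429}

def create_with_retry(post, max_attempts=MAX_ATTEMPTS, backoff=0.0):
    """Re-issue a create request up to max_attempts times on 409/429.

    Returns the final HTTP status code; a lingering 409/429 after all
    attempts is returned so the caller can surface the error.
    """
    status = None
    for attempt in range(max_attempts):
        status = post()
        if status not in RETRIABLE_ON_CREATE:
            return status
        time.sleep(backoff * (attempt + 1))  # simple linear backoff
    return status

class FlakyServer:
    """Simulated API server: fails `failures` times, then returns 201."""
    def __init__(self, failures=2, failure_code=409):
        self.failures = failures
        self.failure_code = failure_code
    def post(self):
        if self.failures > 0:
            self.failures -= 1
            return self.failure_code
        return 201

server = FlakyServer(failures=2)
print(create_with_retry(server.post))  # 201: succeeds on the 3rd attempt
```

Blindly reposting is reasonable here only because the failed create never happened server-side; the same trick applied to a 409 on an update would just resubmit stale data.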
Operation can be fulfilled without error when creating OpenShift resources
Verified on build version: 4.8.0-0.nightly-2021-03-01-031258
Browser version: Chrome 84

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.