Bug 1920699 - Operation cannot be fulfilled on clusterresourcequotas.quota.openshift.io error when creating different OpenShift resources [NEEDINFO]
Summary: Operation cannot be fulfilled on clusterresourcequotas.quota.openshift.io err...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Dev Console
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: cvogt
QA Contact: spathak@redhat.com
Docs Contact: Rishu Mehra
URL:
Whiteboard:
Depends On:
Blocks: 1928228
 
Reported: 2021-01-26 22:08 UTC by Alexey Kazakov
Modified: 2021-09-15 17:18 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, an API server could fail to create a resource, which would return a 409 status code when there was a conflict updating a `resource quota` resource. Consequently, the resource would fail to create, and you might have had to retry the API request. With this update, the `OpenShift Console` web application attempts to retry the request 3 times when receiving a 409 status code, which is often sufficient for completing the request. In the event that a 409 status code continues to occur, an error will be displayed in the console. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1920699[*BZ#1920699*])
Clone Of:
Environment:
Last Closed: 2021-07-27 22:36:44 UTC
Target Upstream Version:
jspeed: needinfo? (sudha)
sgutz: needinfo? (sudha)


Attachments
Operation can be fulfilled without error when creating OpenShift resources (2.79 MB, video/webm)
2021-03-01 20:49 UTC, spathak@redhat.com


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift console pull 8116 | 0 | None | closed | Bug 1920699: retry co-fetch on 409 POST or 429 | 2021-02-15 09:20:01 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:37:12 UTC

Description Alexey Kazakov 2021-01-26 22:08:38 UTC
Description of problem:

Our Dev Sandbox (https://developers.redhat.com/developer-sandbox) cluster is running on a 4.6.8 OSD cluster.
When creating a new application in that cluster, we randomly experience an error:
"Operation cannot be fulfilled on clusterresourcequotas.quota.openshift.io "<quota_resource_name>": the object has been modified; please apply your changes to the latest version and try again"

The OpenShift Developer Console (web console) team did the initial investigation, and it turned out that the API server randomly returns 409 errors when the UI tries to create different resources: BuildConfig, Service, Route, etc.

Every user in our cluster has a pretty detailed cluster quota covering all of the user's namespaces: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml#L6-L39

As you can see, there are count limits for different resources. It looks like creating resources in the user namespaces triggers a status update on the quota resource, which leads to a 409 conflict that is not properly handled.

You can find more details in JIRA: https://issues.redhat.com/browse/ODC-5390

Version-Release number of selected component (if applicable):

OSD 4.6.8

How reproducible:

Many of our users complain that they face this issue, but it's not 100% reproducible.

Steps to Reproduce:
1. Create a cluster quota: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml#L6-L39
2. In the Web Console, in the Developer tab, go to Add -> Import from Git.
3. Select a builder image (e.g. NodeJS) and specify the Git options.
4. Click on Create.
5. There is an error.

Comment 2 Alexey Kazakov 2021-01-28 03:07:47 UTC
The dev console team has already done the investigation: https://issues.redhat.com/browse/ODC-5390
We created this BZ after that investigation.
It doesn't seem to be a console/UI issue.

Comment 3 Michael Kleinhenz 2021-01-28 08:00:04 UTC
This is affecting our production in a severe way. Also, we have some customer presentations running today, plus additional PR/Marketing efforts. We need this to be working. This ticket has been sitting around for 2 days now and apparently no one is looking at it. Again, this severely affects our production and interferes with actual customer relationships TODAY. Please take a look.

Comment 8 Alexey Kazakov 2021-01-28 16:04:28 UTC
For clarification.
Dev Console does nothing special here. It just creates a bunch of resources: Deployment, BuildConfig, Route, Service, etc. Resource creation randomly fails. It can happen if you use oc or anything else as well, so it's not specific to the dev console.

Are you saying that ClusterResourceQuotas are useless/broken in OCP? It doesn't look like a limitation to me but rather a bug. Could the admission plugin, or whatever is trying to modify the quota resource, retry if there is a 409?

Comment 9 Alexey Kazakov 2021-01-28 16:08:44 UTC
Also, regarding the severity of the issue: while it's not 100% reproducible, it seems that about 50% of our users experience this issue. Many of them can reproduce it almost 100% of the time. Our clusters are multi-tenant, so we have to enforce quotas.

Comment 10 Matous Jobanek 2021-01-29 10:09:54 UTC
Just for your information, to minimize the impact on our users we dropped most of the counter limits from the ClusterResourceQuota. All users in the Sandbox platform are, by default, provisioned with the basic tier (as referenced before: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/basic/cluster.yaml), which now contains only the most critical counter limits. As a result, the probability of hitting this error should be minimal for now. However, this is just a temporary workaround, definitely not a long-term solution!

If you need to reproduce the issue directly in the Sandbox platform, please let us know. We can move your user account to another tier with a different ClusterResourceQuota that contains all the needed counter limits.
If you need to reproduce the issue in your OCP cluster, then use this template for creating the ClusterResourceQuota: https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml

Comment 11 cvogt 2021-01-29 19:17:36 UTC
I can reproduce the same quota error using `oc` and `curl`.

Here's my test scenario.
First create 5 yaml files `1.yaml` ... `5.yaml` each containing a resource which is part of the count quota. I used `Route`.

Run multiple `oc create` commands:
oc create -f 1.yaml & oc create -f 2.yaml & ...

Run multiple `curl` commands:
curl -H "Authorization: Bearer ..." -H "Content-Type: application/yaml" -v --data-binary @./1.yaml https://.../apis/route.openshift.io/v1/namespaces/mytestsnamespace/routes & ...

When using `curl` to create multiple resources simultaneously, I fairly consistently encounter the 409 error on the quota resource. This to me definitely seems like a bug and the onus shouldn't be on the client to retry the request. The client did not modify the quota resource directly and therefore shouldn't have gotten this error.
`oc` took more attempts. I increased the number of files from 5 to 10 when testing with `oc` and eventually got the error.

Comment 12 Steve Gutz 2021-02-01 11:13:12 UTC
@sudha This is clearly not related to anything DevConsole is doing and needs to be reassigned to whichever team in OpenShift manages the quotas. Since this is impacting the DevSandbox service, we need to put an "urgent" tag on this.

Comment 13 Stefan Schimanski 2021-02-01 15:18:16 UTC
Every client is expected to retry 409 errors. That's part of the Kube API contract. So it looks like dev console is not following the rules.

Nevertheless, the clusterresourcequota admission could do that internally to create less noise, see https://github.com/openshift/apiserver-library-go/pull/42.

Comment 14 Stefan Schimanski 2021-02-01 15:30:00 UTC
Note that https://github.com/openshift/apiserver-library-go/pull/42 is just an attempt to make this situation less likely. We found, though, that this is not trivial.

The correct fix is in dev console: retry on 409. This must be done anyway, even with the PR above, because the PR can never remove all possibilities of 409s.

Moving back to dev console.

Comment 15 Stefan Schimanski 2021-02-01 16:15:59 UTC
In https://github.com/codeready-toolchain/host-operator/blob/master/deploy/templates/nstemplatetiers/advanced/cluster.yaml you have more than 20 resources for one quota object. The quota admission logic basically serializes creations and deletions to all these resources. It's no surprise that this leads to conflicts.

As a workaround, it would probably help a lot to split up the quota object into many more independent quota objects, e.g. one per resource.
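
To illustrate why a single shared quota object produces conflicts and why splitting it helps, here is a minimal toy model of optimistic concurrency in Python. It is not the actual quota admission code; `QuotaObject`, `Conflict`, and `admit_interleaved` are hypothetical names, and the model assumes every creation reads the quota before any of them writes it back (the worst case for simultaneous requests).

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 caused by a stale resource version."""


class QuotaObject:
    """Toy quota object guarded by a version counter (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.resource_version = 1
        self.used = {}

    def read(self):
        # Return the current version together with a copy of the usage counts.
        return self.resource_version, dict(self.used)

    def write(self, seen_version, used):
        # Reject the write if someone else updated the object in the meantime.
        if seen_version != self.resource_version:
            raise Conflict(f"{self.name}: the object has been modified")
        self.used = used
        self.resource_version += 1


def admit_interleaved(quota_for, kinds):
    """Simulate simultaneous creations: all reads happen before any write.

    quota_for(kind) returns the quota object that counts that kind.
    Returns the kinds whose quota update hit a conflict.
    """
    in_flight = []
    for kind in kinds:
        quota = quota_for(kind)
        version, used = quota.read()
        in_flight.append((kind, quota, version, used))

    conflicts = []
    for kind, quota, version, used in in_flight:
        used[kind] = used.get(kind, 0) + 1
        try:
            quota.write(version, used)
        except Conflict:
            conflicts.append(kind)
    return conflicts


kinds = ["routes", "services", "buildconfigs"]

# One quota object counting every kind: only the first writer succeeds.
shared = QuotaObject("for-user")
shared_conflicts = admit_interleaved(lambda kind: shared, kinds)

# One quota object per kind: the creations never touch the same object.
split = {kind: QuotaObject(f"for-user-{kind}") for kind in kinds}
split_conflicts = admit_interleaved(lambda kind: split[kind], kinds)

print("shared quota conflicts:", shared_conflicts)
print("split quota conflicts:", split_conflicts)
```

In this model the split layout only removes conflicts between different resource kinds; two simultaneous creations of the same kind would still collide, which is why client-side retries remain necessary.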

Comment 16 cvogt 2021-02-02 16:34:19 UTC
While I acknowledge that as a client I need to handle a 409 conflict response, this is normally an issue with resource updates and would be caused by the user/client themselves using stale data to perform an update. In this particular case, we're seeing a 409 on creation.

I've discussed the issue with Sam Padget and we agreed to look at implementing client-side retries. Dev note: The implementation shouldn't blindly retry all 409 responses, because updates that result in a 409 are not retriable; the request must be modified first. For creation, we can blindly repost the same request and make up to N attempts. Sam also pointed out that we should use the same mechanism to blindly retry all requests when receiving a 429 error code.
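
The retry rule described in this dev note could be sketched roughly as follows. This is a Python illustration only, not the console's implementation; `is_retriable`, `request_with_retry`, and `scripted` are hypothetical names, `send` is an assumed callable that issues the identical request and returns its HTTP status code, and the default of 3 retries is taken from the Doc Text above.

```python
import time


def is_retriable(method, status):
    """Decide whether the identical request may be sent again.

    429 is retried for every method. 409 is retried only for POST (creation),
    because an update that conflicts must be modified before resending.
    """
    if status == 429:
        return True
    return status == 409 and method.upper() == "POST"


def request_with_retry(send, method, retries=3, delay=0.0):
    """Call send() and repeat it up to `retries` more times while retriable.

    Returns the last status code seen, so a persistent 409 is still
    reported to the caller.
    """
    status = send()
    attempts_left = retries
    while attempts_left > 0 and is_retriable(method, status):
        time.sleep(delay)  # optional pause between attempts
        status = send()
        attempts_left -= 1
    return status


def scripted(statuses):
    """Test helper: build a fake send() that replays the given status codes."""
    calls = []
    remaining = iter(statuses)

    def send():
        status = next(remaining)
        calls.append(status)
        return status

    return send, calls


# A creation that conflicts twice and then succeeds on the third attempt.
send, calls = scripted([409, 409, 201])
print(request_with_retry(send, "POST"), len(calls))

# An update that conflicts is returned to the caller without being resent.
send, calls = scripted([409, 200])
print(request_with_retry(send, "PUT"), len(calls))
```

Returning the final status rather than raising keeps the sketch small; a real client would surface the remaining 409 as an error to the user.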

I'd like to see the API server take responsibility for their own conflict and pursue https://github.com/openshift/apiserver-library-go/pull/42 or a variant thereof because it is surprising as a client to receive a conflict on creation that can magically go away by reissuing the exact same request.

Comment 18 spathak@redhat.com 2021-03-01 20:49:25 UTC
Created attachment 1760065
Operation can be fulfilled without error when creating OpenShift resources

Comment 19 spathak@redhat.com 2021-03-01 20:51:41 UTC
Verified on build version: 4.8.0-0.nightly-2021-03-01-031258
Browser version: Chrome 84

Comment 22 errata-xmlrpc 2021-07-27 22:36:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

