Is there any news on this topic? If any additional information is needed, please let me know. Thank you.
Does the same issue occur when using the Helm CLI?
A KCS article has been written about this issue: https://access.redhat.com/solutions/5964831.
Going to try to recreate this tomorrow. I will collect the existing status from the previously assigned engineer and try to replicate it on an OpenShift environment. It would help to know a bit more:
- Can you share the chart(s) that triggered the problem?
- Have you tried the same chart with the Helm CLI, and what was the result?
- If it fails in both scenarios, to help us determine the root cause, can you try a dry-run from the CLI and then apply the Kubernetes resources manually (see the sketch below)? This will help us understand whether the issue is on the Helm side or the Kubernetes/OpenShift side.
Regards, David
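For reference, a minimal sketch of that dry-run/manual-apply check; the release name, chart path, and namespace below are placeholders, not the customer's actual values:

~~~
# Render the chart locally to plain manifests without touching the cluster.
# (`helm install ... --dry-run --debug` also works, but its output includes
# extra release information, so `helm template` is easier to feed to kubectl.)
helm template my-release ./my-chart > rendered.yaml

# Apply the rendered manifests by hand to see whether plain kubectl/oc
# hits the same error as a Helm install does.
kubectl apply -f rendered.yaml -n my-namespace
~~~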
Thank you very much for checking this, David. The charts are attached to the description. As mentioned in comment #5, the CLI is working fine; the problem can only be reproduced through the OpenShift web interface. If any additional information is needed, please feel free to let me know.
Apologies, but I cannot see comment 5; it must be an issue with my Bugzilla login, and I am working to resolve that now. Many thanks for taking the time to respond. I will keep you posted on the progress.
I can see all private info now.
No problem, David. Thank you too for helping with this.
Hmm... One thing I noticed while packaging those charts to add to a repo is that the two charts share the same name. Can you have the client correct the name collision and try again? The charts in question are swagger-service and schema-service. I noticed because I was packaging the swagger one and got back a tar for schema. That could be the issue, because the two charts might intermittently be swapped. Can you get details on how they package and publish these charts?
From the CLI I do get an error right away:

dperaza@dperaza-mac charts % helm install davp-swagger ~/Downloads/charts/swagger-service/helm --set "database.pass=hellodavid"
NAME: davp-swagger
LAST DEPLOYED: Fri Apr 23 13:27:38 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

dperaza@dperaza-mac charts % helm install davp-schema ~/Downloads/charts/schema-service/helm --set "database.pass=hellodavid"
Error: rendered manifests contain a resource that already exists. Unable to continue with install: Secret "schema-service" in namespace "default" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "davp-schema": current value is "davp-swagger"
Also, ifx-resource-quota seems to be a resource quota specific to their environment. Can you try to gather this:

kubectl get quota --namespace=my-openshift-project

Do you see ifx-resource-quota? Can you share the output of:

kubectl get quota ifx-resource-quota -o yaml --namespace=my-openshift-project
Please find attached the quota definition requested in comment #15. Regarding comment #13, do you still need that information? If so, can you let me know exactly which details you need, please? Thank you.
Thanks for the quota. Yes, regarding comment 13, please try to find out why the two charts (swagger-service and schema-service) have the same name (see the Chart.yaml) and how they are built and published. I am not sure it is related to the root cause, but it is odd and I would like to cover all variables.
I am the developer of these services. I don't know if I packaged it wrong for you, but we don't have the same names for swagger-service and schema-service in our charts. So this is not the problem. Let me know if you need further information. Thanks.
I have not been able to recreate this issue in my single-worker test environment. I have been trying multiple scenarios with quotas on services, pods, and nodeports, and the behavior is as expected. I am going to try again tomorrow on a multi-worker sandbox we have. I am beginning to think that this is more of a Kubernetes race condition on the updates of the quota objects, as suggested in the Kubernetes issue referenced at the top by Lucas in the first comment: https://github.com/kubernetes/kubernetes/issues/67761. Helm on OpenShift just happens to run the install operations a lot faster than Helm on the client side, because the network is not a factor. This is still a theory based on the information I have so far; I will try to prove it tomorrow. It would help if you could tell me exactly which of the attached charts experienced the problem. It would also help if you posted a comment on the Kubernetes issue to show that it is still relevant, so the community takes a closer look. If I cannot recreate it tomorrow, I will recommend that we transfer this to an OpenShift engineer who has experience with the Kubernetes project and with how to get that community moving.
Thank you very much for your detailed explanation, David. We will wait for your feedback.
I was unable to recreate the issue. I tried on OpenShift 4.6, 4.7, and 4.8. In all cases I have a setup with 3 workers and 3 masters. In my quota I added pod limits too and tried playing with replicas during a brand-new install and also as part of the upgrade path. The quota object gets updated correctly from different workers and there is no race condition. Can you provide me with more specific test cases to try to recreate it? Please attach the specific chart where you found the issue and give more details about your environment (number of workers and masters). Also provide me with the exact values you passed to the chart; I am not interested in secrets and such, just the resource configuration, like the number of replicas.
I cannot run this chart in my environment because it looks like it has some specific requirements from their environment. However, it gives me some good ideas on how to design my tests to try to recreate the issue. Thanks.
Thank you for your feedback, David. In case you need anything else, feel free to reach me.
As the bug status is ON_QA, I assume that the fix has already been implemented. Can this be confirmed, please? Thank you.
No, I'm sorry, I have not been able to recreate the problem yet; do we need to move it back? Please note also that this is likely not a Helm issue but a Kubernetes issue, as you pointed out in your first post, so we should try to bring in some Kubernetes SMEs to help. What I'm trying from my side now is to recreate it, so I can see whether Helm can provide a workaround and propose it upstream.
No problem, David. I just wanted to clarify the status of the bug. We will wait for your feedback in that case. Thank you for the update.
I reproduced this in AWS with OCP 4.6.17 and OVNKubernetes (customer setup). It does not happen a lot, but it does, and the following script helped me catch it. In https://github.com/kubernetes/kubernetes/issues/67761, they talked about how this would fail when installing gitlab, so I used that helm chart.

~~~
cat << 'FOE' > test_gitlab.sh
#!/bin/bash

function create_project() {
    oc new-project gitlab || oc project gitlab
    cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
spec:
  hard:
    configmaps: "62"
    limits.cpu: "170"
    limits.memory: 260Gi
    openshift.io/imagestreams: "0"
    persistentvolumeclaims: "50"
    pods: "100"
    replicationcontrollers: "50"
    requests.cpu: "71"
    requests.memory: 205Gi
    requests.storage: 350Gi
    limits.cpu: "71"
    limits.memory: 205Gi
    requests.storage: 350Gi
    resourcequotas: "1"
    secrets: "512"
    services: "70"
    services.nodeports: "5"
EOF
    cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:
      memory: 256Mi
      cpu: 125m
    defaultRequest:
      cpu: 125m
      memory: 128Mi
    type: Container
EOF
    oc adm policy add-scc-to-user privileged -z default
    oc adm policy add-scc-to-user privileged -z gitlab-shared-secrets
}

function delete_project() {
    oc project default
    oc delete project gitlab
    if [ "$?" != "1" ]; then
        echo "Sleeping for 60 seconds to give the project time to be deleted"
        sleep 60
    fi
}

function install_gitlab() {
    helm upgrade --debug --install gitlab gitlab/gitlab --timeout 600s --set global.hosts.domain=example.com --set global.hosts.externalIP=10.10.10.10 --set certmanager-issuer.email=me
    if [ "$?" == "0" ]; then
        GREEN='\033[0;32m'
        NC='\033[0m' # No Color
        printf "${GREEN}SUCCEED${NC}\n"
    else
        RED='\033[0;31m'
        NC='\033[0m' # No Color
        printf "${RED}FAIL${NC}\n"
        exit 1
    fi
}

function uninstall_gitlab() {
    helm uninstall gitlab
}

delete_project
while true; do
    echo "=== $(date) ==="
    echo "=== Create project ==="
    create_project
    echo "=== Install gitlab ==="
    install_gitlab
    sleep 5
    echo "=== Uninstall gitlab ==="
    uninstall_gitlab
    echo "=== Delete project ==="
    delete_project
done
FOE

bash -x test_gitlab.sh 2>&1 | tee -a test_gitlab.log
~~~

I created a debug build of helm:

~~~
[akaris@linux helm]$ git diff |cat
diff --git a/pkg/kube/client.go b/pkg/kube/client.go
index 34079e7a..90388235 100644
--- a/pkg/kube/client.go
+++ b/pkg/kube/client.go
@@ -120,7 +120,40 @@ func (c *Client) IsReachable() error {
 // Create creates Kubernetes resources specified in the resource list.
 func (c *Client) Create(resources ResourceList) (*Result, error) {
 	c.Log("creating %d resource(s)", len(resources))
-	if err := perform(resources, createResource); err != nil {
+
+	createResourceLogger := func(info *resource.Info) error {
+		c.Log("Creating resource: {" +
+			"Time UnixNano: " + fmt.Sprintf("%v", time.Now().UnixNano()) + ", " +
+			"Name: " + info.Name + ", " +
+			"Namespace: " + info.Namespace + ", " +
+			"Source: " + info.Source + ", " +
+			"ResourceVersion: " + info.ResourceVersion +
+			"}")
+		j, err := json.Marshal(info.Object)
+		if err == nil {
+			c.Log("==> Object: " + string(j))
+		} else {
+			c.Log("Could not marshal Name " + info.Name + " Namespace " + info.Namespace)
+		}
+		returnErr := createResource(info)
+		if returnErr != nil {
+			c.Log(
+				"FAILED Time UnixNano: " + fmt.Sprintf("%v", time.Now().UnixNano()) +
+					" Name: " + info.Name +
+					", Namespace: " + info.Namespace +
+					", ERROR: " + returnErr.Error(),
+			)
+			errJson, errErr := json.Marshal(returnErr)
+			if errErr == nil {
+				c.Log("==> ERROR JSON: " + string(errJson))
+			} else {
+				c.Log("Could not marshal ERROR")
+			}
+		}
+		return returnErr
+	}
+
+	if err := perform(resources, createResourceLogger); err != nil {
 		return nil, err
 	}
 	return &Result{Created: resources}, nil
~~~

Please find the output for a failed run attached. In Helm, we should be able to work around this easily by retrying a `perform` if it returns a 409. I'll look at the code and see if I can commit something upstream. Notwithstanding, the actual Kubernetes issue here, https://github.com/kubernetes/kubernetes/issues/67761, should be fixed instead of asking every client to fix this on the client side. - Andreas
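For illustration, here is a minimal sketch (not the actual upstream change) of the per-resource retry-on-conflict pattern I have in mind, using client-go's retry helper. `createResource` below is a hypothetical stand-in for Helm's internal per-resource create call:

~~~
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

// createResource is a hypothetical stand-in for the per-resource create call;
// with a ResourceQuota in the namespace it can fail with a 409 Conflict when
// the quota admission controller loses the optimistic-concurrency update race.
func createResource() error {
	return nil
}

// createWithRetry re-runs the create with backoff for as long as the returned
// error is a 409 Conflict, which is the failure mode shown in the logs above.
func createWithRetry() error {
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		return createResource()
	})
}

func main() {
	if err := createWithRetry(); err != nil {
		if apierrors.IsConflict(err) {
			fmt.Println("still conflicting after retries:", err)
			return
		}
		fmt.Println("create failed:", err)
	}
}
~~~

client-go's `retry.RetryOnConflict` only re-invokes the function when the error is a 409 Conflict, so unrelated failures still surface immediately.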
Created attachment 1784158 [details] test gitlab log fail with debug
I have tried to recreate this once more today without any luck. I could not use the attached chart, but I created a chart with dependent services following the same model. I tested on 4.7 and 4.8 with the following quota:

dperaza@dperaza-mac ssm % kubectl get ResourceQuota
NAME                 AGE     REQUEST                                                                                       LIMIT
ifx-resource-quota   8m47s   count/routes.route.openshift.io: 0/20, pods: 7/100, services: 4/20, services.nodeports: 0/10

I will try once more to recreate with 4.6. I also see great comments from @akaris. I will analyze them and try to use his method to recreate the issue.
Here is another log from my own tests with the gitlab helm chart:

$ cat test_gitlab.log | grep 'ISSUE HIT' -B2
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
client.go:142: [debug] FAILED Time UnixNano: 1621278426447314774 Name: gitlab-webservice-tests, Namespace: gitlab, ERROR: Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
client.go:142: [debug] FAILED Time UnixNano: 1621278426447737148 Name: gitlab-minio-config-cm, Namespace: gitlab, ERROR: Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
kind: ConfigMap
client.go:150: [debug] ==> ERROR JSON: {"ErrStatus":{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on resourcequotas \"quota\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"quota","kind":"resourcequotas"},"code":409}}
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: PersistentVolumeClaim
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: PersistentVolumeClaim
ISSUE HIT! CONTINUING: 0
--
status.Details.Kind: resourcequotas
kind: ConfigMap
ISSUE HIT! CONTINUING: 0
Created attachment 1784365 [details] My tests - log
I was able to recreate the problem as well, thanks to the instructions by @akaris:

client.go:519: [debug] Add/Modify event for gitlab-shared-secrets-1-ozr: MODIFIED
client.go:282: [debug] Starting delete for "gitlab-shared-secrets" ServiceAccount
client.go:282: [debug] Starting delete for "gitlab-shared-secrets" Role
client.go:282: [debug] Starting delete for "gitlab-shared-secrets" RoleBinding
client.go:282: [debug] Starting delete for "gitlab-shared-secrets" ConfigMap
client.go:282: [debug] Starting delete for "gitlab-shared-secrets-1-ozr" Job
client.go:122: [debug] creating 115 resource(s)
Error: Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again
helm.go:81: [debug] Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again

@akaris also has a proposed workaround up for review upstream, https://github.com/helm/helm/pull/9713, which is a good attempt. I will support him as much as possible with additional testing in my environment. It is really up to the community to accept the changes now, since this is a workaround and not the fix; they could suggest fixing the root cause in the Kubernetes project instead. Note that this proves the problem is definitely not in the OpenShift Console, since we can recreate it from the Helm CLI. In fact, the only way for me to recreate it is to use the Helm CLI, so that I can run multiple install/uninstall cycles in a script to increase my chances of hitting it.
Meanwhile, I filed the following against our kube api-server https://bugzilla.redhat.com/show_bug.cgi?id=1962083 (no idea if that's the correct component but let's hope that it gets the ball rolling on that side of things)
I was able to reproduce in OCP 4.7:

client.go:558: [debug] gitlab-shared-secrets-1-t8n: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:519: [debug] Add/Modify event for gitlab-shared-secrets-1-t8n: MODIFIED
client.go:282: [debug] Starting delete for "gitlab-shared-secrets" ServiceAccount
client.go:282: [debug] Starting delete for "gitlab-shared-secrets" Role
client.go:282: [debug] Starting delete for "gitlab-shared-secrets" RoleBinding
client.go:282: [debug] Starting delete for "gitlab-shared-secrets" ConfigMap
client.go:282: [debug] Starting delete for "gitlab-shared-secrets-1-t8n" Job
client.go:122: [debug] creating 115 resource(s)
Error: Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again
helm.go:81: [debug] Operation cannot be fulfilled on resourcequotas "quota": the object has been modified; please apply your changes to the latest version and try again

bash-5.1$ oc version
Client Version: 4.7.5
Server Version: 4.7.11
Kubernetes Version: v1.20.0+75370d3
> Note that this proves the problem is definitely not in the OpenShift Console, since we can recreate it from the Helm CLI. In fact, the only way for me to recreate it is to use the Helm CLI, so that I can run multiple install/uninstall cycles in a script to increase my chances of hitting it.

Strange, for us it's the exact opposite: it never occurs with the CLI, but it often does with the UI.
@jweiss1 It probably has to do with the configurations of our clusters being different. Regardless, the fix from @akaris seems to be getting traction upstream, and I do think it will fix this for everyone. I have tested with his fix and can no longer reproduce the problem.
So ... the Helm upstream team doesn't even bother to comment on or review the PR. The same goes for any other PR I have tried to submit to the project, by the way. I'm removing the needinfo flag for myself, as there isn't really anything I can do here anymore :-(
@akaris Thanks for your attempt. I will continue to track PR https://github.com/helm/helm/pull/9713 and bring attention to it in the community call. I can also try to fix this issue on the OpenShift Console side. That will not help Helm CLI users, but it will fix it for OpenShift GUI users.
@jochen.weis Hi Jochen, we are currently trying to resolve the resourcequota conflict issue on the OpenShift console backend side. Could you tell us how long it takes to install your chart in a sunny-day scenario with the CLI? This matters to us because we can enable retries in the OpenShift console backend, but if the total installation time exceeds a certain amount, it will cause a request timeout in the browser on the frontend.
@llopezmo See question from @abai.
Changing NEEDINFO from llopezmo to me, as I am now owning this Support Case.
Hi @abai, the helm install command takes about 25s to finish.
Given that a retry at the console level could double the installation time in the worst-case scenario, we should first investigate how to design an async flow for Helm installs. We have already brought this up with the OpenShift console team and will work on it as a priority in our next sprints. In the meantime, our best hope for resolving this issue is for the Helm community to merge https://github.com/helm/helm/pull/9713, which retries for each resource that hits the conflict; that approach would not increase your install time by a considerable amount.
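For context, a rough sketch of what a whole-install retry amounts to from the CLI side; the release name, chart path, and error matching below are illustrative assumptions, not the console implementation. Each attempt can take the full install time, which is why the worst case roughly doubles:

~~~
#!/bin/bash
# Illustrative only: retry the whole install once if it failed specifically
# because of the resourcequota conflict. With a ~25s install, two attempts
# push the worst case towards ~50s, hence the interest in an async flow.
install() {
    helm upgrade --install my-release ./my-chart 2>&1 | tee /tmp/helm-install.log
    return "${PIPESTATUS[0]}"
}

if ! install; then
    if grep -q 'Operation cannot be fulfilled on resourcequotas' /tmp/helm-install.log; then
        echo "Quota conflict detected, retrying once..."
        install
    else
        exit 1
    fi
fi
~~~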
Do we have an update for this BZ? It seems that the upstream PR is still open and https://github.com/openshift/console/pull/9238 was closed.
The BZ says that the assignee for this BZ cannot see private comments, so I'm making the last one public. I'm also removing the needinfo on myself. At this point, I have no idea how the Helm project works and how to make a pull request land upstream. It has been 3 months with no debate around the pull request. I'm o.k. with giving up my pull request in favor of some other solution; I can also rebase it against master again.
With that said, I can rebase it if someone upstream actually looks at it and commits to merging it.
Allen, any news?
@abai Allen, as the upstream issue is still not moving forward, what do you think we could do to resolve this issue? Any updates?
Sorry for the late reply. After discussing with the team, we agreed that this bug depends on a helm/helm fix (https://github.com/helm/helm/pull/9713) and/or a Kubernetes fix (https://github.com/kubernetes/kubernetes/issues/67761), which we do not think will be merged in a timely manner unless a larger number of Helm users are impacted. However, we do see an opportunity to implement new features to mitigate this issue by adding an asynchronous wait capability to the console along with retries of the resourcequota checks. In that case it would not be in the scope of a bug fix but of a new feature, and it might have a larger impact. Therefore, please contact David Peraza (Helm Team Lead) and Stevan LeMeur (PM of Dev Tools) for feature request enquiries.
Ack, that BZ is closed.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days