Created attachment 1850704 [details]
Screenshot of ACM pod status

Description of the problem:
ACM pods in the open-cluster-management namespace keep crashing:
multicluster-operators-standalone-subscription
multicluster-operators-hub-subscription

The cause of the crashes, as logged:
helmrelease_controller.go:335] Failed to install HelmRelease container-platform/external-secrets rendered manifests contain a resource that already exists. Unable to continue with install: CustomResourceDefinition "clustersecretstores.external-secrets.io" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "container-platform": current value is "vladimir-test"

The problem was resolved by deleting the Helm release that was blocking the application deployment.

Release version: 2.4.1
Operator snapshot version:
OCP version: 4.7.30 (ARO)
Browser Info: Chrome 96.0.4664.55 (Incognito mode), macOS

Steps to reproduce:
1. Manually install a Helm package that deploys CRDs.
2. Configure ACM to install a Helm package with the same name, containing the same set of CRDs, into a different namespace.

Actual results:
The multicluster-operators.* pods start crashing, preventing ACM from installing any other applications.

Expected results:
ACM parses Helm installation errors, logs them, updates the status of the application, and continues processing other apps in the reconciliation loop.

Additional info:
A screenshot of the pod status as well as logs saved during troubleshooting will be attached.
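For reference, the "invalid ownership metadata" failure comes from Helm refusing to adopt a resource whose release annotations point at a different release namespace. The following is a simplified sketch of that check (not Helm's actual source code; the function name checkOwnership is made up for illustration):

```go
package main

import "fmt"

// checkOwnership mimics (in simplified form) the validation Helm performs
// before adopting an existing cluster resource into a release: the resource's
// meta.helm.sh/release-namespace annotation must match the namespace of the
// release being installed, otherwise the install is aborted.
func checkOwnership(annotations map[string]string, releaseNamespace string) error {
	current := annotations["meta.helm.sh/release-namespace"]
	if current != releaseNamespace {
		return fmt.Errorf(
			"invalid ownership metadata; annotation validation error: key %q must equal %q: current value is %q",
			"meta.helm.sh/release-namespace", releaseNamespace, current)
	}
	return nil
}

func main() {
	// The CRD was first installed by a release in "vladimir-test", so a later
	// install into "container-platform" fails as shown in the log above.
	existing := map[string]string{"meta.helm.sh/release-namespace": "vladimir-test"}
	if err := checkOwnership(existing, "container-platform"); err != nil {
		fmt.Println("install aborted:", err)
	}
}
```

Because CRDs are cluster-scoped, two releases in different namespaces that ship the same CRD will always collide on this check, which is exactly the situation the reproduction steps describe.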
Events logged when the multicluster-operators-standalone-subscription pod was crashing:

Kubelet may be retrying requests that are timing out in CRI-O due to system load: the requested container k8s_multicluster-operators-standalone-subscription_multicluster-operators-standalone-subscription-778bbc7d85-zq77x_open-cluster-management_701044a8-3a89-4be5-8786-e6040db21f9a_1191 is now ready and will be provided to the kubelet on next retry: error reserving ctr name k8s_multicluster-operators-standalone-subscription_multicluster-operators-standalone-subscription-778bbc7d85-zq77x_open-cluster-management_701044a8-3a89-4be5-8786-e6040db21f9a_1191 for id a7a82f22b05500ebfa2ee9c2baaac42b346f6b48c9a0b85e3af204408fe3e6a5: name is reserved
Logs from the multicluster-operators-standalone-subscription pod - exiting due to timeout:

I0106 18:08:05.860732 1 git_subscriber.go:218] git UnsubscribeItem container-platform/external-secrets-internal-management-eastus
I0106 18:08:05.860740 1 git_subscriber.go:218] git UnsubscribeItem container-platform/external-secrets-internal-management-eastus
I0106 18:08:05.860753 1 subscription_controller.go:340] Exit Reconciling subscription: container-platform/external-secrets-internal-management-eastus
I0106 18:08:07.358710 1 sync_server.go:231] stop synchronizer channel
I0106 18:08:14.497483 1 helmrelease_helper.go:112] HelmRelease is not owned by a MultiClusterHub resource: container-platform/external-secrets
I0106 18:08:14.497536 1 helmrelease_controller.go:233] Sync Release container-platform/external-secrets
I0106 18:08:14.525808 1 helmrelease_controller.go:331] Installing Release container-platform/external-secrets
E0106 18:08:15.642719 1 helmrelease_controller.go:335] Failed to install HelmRelease container-platform/external-secrets rendered manifests contain a resource that already exists. Unable to continue with install: CustomResourceDefinition "clustersecretstores.external-secrets.io" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "container-platform": current value is "vladimir-test"
E0106 18:08:37.361408 1 manager.go:191] failed waiting for all runnables to end within grace period of 30s: context deadline exceeded
Manager exited non-zero
Logs from the multicluster-operators-standalone-subscription pod - panic:

E0106 19:29:05.837466 1 helmrelease_controller.go:335] Failed to install HelmRelease container-platform/external-secrets rendered manifests contain a resource that already exists. Unable to continue with install: CustomResourceDefinition "clustersecretstores.external-secrets.io" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "container-platform": current value is "vladimir-test"
I0106 19:29:07.748397 1 sync_server.go:231] stop synchronizer channel
I0106 19:29:14.660212 1 git_subscriber_item.go:213] Git commit: c7a44e5236c7778ee787fad6a580c097763ddf19
I0106 19:29:14.669078 1 helmrelease_helper.go:118] HelmRelease is owned by a MultiClusterHub resource proceed with the removal of all CRD references: open-cluster-management/management-ingress-18d79
W0106 19:29:14.685516 1 helmrepo.go:485] subsciption.spec.package is missing for subscription: open-cluster-management/hive-clusterimagesets-subscription-fast-0
I0106 19:29:14.740013 1 panic.go:1038] exit doSubscription: open-cluster-management/hive-clusterimagesets-subscription-fast-0
E0106 19:29:14.740142 1 runtime.go:78] Observed a panic: "send on closed channel" (send on closed channel)
goroutine 4220 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1e99420, 0x2477260})
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:74 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x2473b20})
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1e99420, 0x2477260})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/open-cluster-management/multicloud-operators-subscription/pkg/synchronizer/kubernetes.(*KubeSynchronizer).AddTemplates(0xc000886000, {0xc035ae99f0, 0x4a}, {{0xc001b61c20, 0xc002080fc0}, {0xc000a0f110, 0xc00127eab0}}, {0xc0360fe000, 0x48, 0x49}, ...)
	/remote-source/multicloud-operators-subscription/app/pkg/synchronizer/kubernetes/sync_client.go:136 +0x1c5
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).doSubscription(0xc0004a5b00)
	/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:345 +0x154a
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).doSubscriptionWithRetries(0xc0004a5b00, 0xc0016157b0, 0x3)
	/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:158 +0x45
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start.func1()
	/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:147 +0x159
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02aa20aa90)
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00069cac0, {0x248e8a0, 0xc0022c5dd0}, 0x1, 0xc000a2a4e0)
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x34630b8a000, 0x0, 0xa8, 0x43dde5)
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0xc0009b8778, 0xc0009b8768)
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:90 +0x25
created by github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start
	/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:129 +0x2bd
panic: send on closed channel [recovered]
	panic: send on closed channel

goroutine 4220 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x2473b20})
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x1e99420, 0x2477260})
	/usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/open-cluster-management/multicloud-operators-subscription/pkg/synchronizer/kubernetes.(*KubeSynchronizer).AddTemplates(0xc000886000, {0xc035ae99f0, 0x4a}, {{0xc001b61c20, 0xc002080fc0}, {0xc000a0f110, 0xc00127eab0}}, {0xc0360fe000, 0x48, 0x49}, ...)
	/remote-source/multicloud-operators-subscription/app/pkg/synchronizer/kubernetes/sync_client.go:136 +0x1c5
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).doSubscription(0xc0004a5b00)
	/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:345 +0x154a
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).doSubscriptionWithRetries(0xc0004a5b00, 0xc0016157b0, 0x3)
	/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:158 +0x45
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start.func1()
	/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:147 +0x159
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02aa20aa90)
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00069cac0, {0x248e8a0, 0xc0022c5dd0}, 0x1, 0xc000a2a4e0)
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x34630b8a000, 0x0, 0xa8, 0x43dde5)
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0xc0009b8778, 0xc0009b8768)
	/remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:90 +0x25
created by github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start
	/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:129 +0x2bd
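The panic in the trace can be reproduced in isolation. The sketch below (my own simplified model, not the actual synchronizer code; sendUnsafe and trySend are names I made up) shows how a stop path that closes a work channel while a retrying subscriber goroutine can still send to it produces "send on closed channel", plus one common guard, selecting on a separate done channel instead of closing the send channel:

```go
package main

import "fmt"

// sendUnsafe reproduces the crash: sending on a channel that the stop path
// has already closed panics with "send on closed channel". The panic is
// recovered here so we can inspect it instead of crashing the process.
func sendUnsafe(work chan string, item string) (recovered interface{}) {
	defer func() { recovered = recover() }()
	work <- item
	return nil
}

// trySend is one common guard: select on a separate done channel instead of
// closing the send channel, so a late sender drops its item rather than
// panicking after shutdown.
func trySend(work chan string, done chan struct{}, item string) bool {
	select {
	case <-done:
		return false // shutdown already signalled: drop the work item
	case work <- item:
		return true
	}
}

func main() {
	work := make(chan string, 1)
	close(work) // models the "stop synchronizer channel" log line
	fmt.Println("recovered:", sendUnsafe(work, "hive-clusterimagesets-subscription-fast-0"))

	done := make(chan struct{})
	safe := make(chan string) // unbuffered, no receiver after shutdown
	close(done)
	fmt.Println("sent after shutdown:", trySend(safe, done, "item"))
}
```

This matches the sequence in the logs: "stop synchronizer channel" at 19:29:07, followed by a retrying Git subscriber goroutine (BackoffUntil) calling AddTemplates at 19:29:14 and hitting the closed channel.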
The panic crash loop seems to be coming from a Git subscription, since the panicking goroutine was created by:
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start
/remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:129 +0x2bd

The error "helmrelease_controller.go:335] Failed to install HelmRelease container-platform..." might not be the root cause.

Can you post the subscription YAML for container-platform/external-secrets-internal-management-eastus? I am interested in the spec section, specifically whether there is a TimeWindow-related spec.
Here is the content of the requested subscription at that time:

apiVersion: apps.open-cluster-management.io/v1
kind: Subscription
metadata:
  name: "external-secrets-internal-management-eastus"
  namespace: "container-platform"
  labels:
    tenant: container-platform
    acm-app: "external-secrets"
spec:
  channel: "container-platform-ch-helm/channel"
  name: "external-secrets"
  packageFilter:
    version: "0.1.0"
  placement:
    placementRef:
      name: "internal-management-eastus"
  packageOverrides:
    - packageName: "external-secrets"
      packageAlias: "external-secrets"
      packageOverrides:
        - path: spec
          value:
            secretStore:
              cluster-vault:
                provider:
                  azurekv:
                    authSecretRef:
                      clientId:
                        key: clientId
                        name: kubernetes-external-secrets
                      clientSecret:
                        key: clientSecret
                        name: kubernetes-external-secrets
                    tenantId: e17<DELETED>b6
Forget my previous comment. The real failure is at:
I0106 19:29:14.740013 1 panic.go:1038] exit doSubscription: open-cluster-management/hive-clusterimagesets-subscription-fast-0

I assume this is a Git subscription, which makes more sense for the panic exit stack trace. Can you print the output for this subscription? The namespace is open-cluster-management and the subscription name is hive-clusterimagesets-subscription-fast-0.
Created attachment 1850784 [details] hive-clusterimagesets-subscription-fast-0 subscription CR
Added the hive-clusterimagesets-subscription-fast-0 subscription CR as an attachment. Please note that ACM is currently working fine and the CR status shows all good. The issue I'm reporting happened on January 6th and I don't have a copy of the CR from that time.
I cannot reproduce this issue on the latest 2.4 development branch. I suspect this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2000274. Yerzhan, what do you think? Do you have a particular build or release channel with which I can reproduce this issue consistently? If possible, can you use a public Helm chart that deploys CRDs, so we both have a reference chart?

This is what I used:

channel spec:
  type: HelmRepo
  pathname: https://kyverno.github.io/kyverno/
  insecureSkipVerify: true

subscription spec:
  name: kyverno
  placement:
    local: true
Hi. It's quite possible that it is a duplicate. At the time we experienced this issue, the ACM cluster was not in good shape. As determined during troubleshooting, the cluster had resource issues, all secrets were managed via external-secrets, and at one point all CRs were gone. As of now, everything works fine. If you cannot reproduce it, let's close this bug. If this ever happens again, I'll reopen this bug or refer to it in a new one.
Closing as discussed. Thanks for all your help Yerzhan. *** This bug has been marked as a duplicate of bug 2000274 ***