Bug 2040500 - ACM multicluster-operators-standalone-subscription and other pods kept crashing because of an error during Helm package deployment
Status: CLOSED DUPLICATE of bug 2000274
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: App Lifecycle
Version: rhacm-2.4.z
Hardware: x86_64
OS: Other
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Mike Ng
QA Contact: Rafat Islam
Docs Contact: bswope@redhat.com
Reported: 2022-01-13 20:51 UTC by Yerzhan Beisembayev
Modified: 2022-01-28 17:33 UTC
CC: 2 users

Doc Type: If docs needed, set a value
Last Closed: 2022-01-28 17:33:14 UTC
bot-tracker-sync: needinfo+


Attachments:
Screenshot of ACM pod status (314.77 KB, image/png), 2022-01-13 20:51 UTC, Yerzhan Beisembayev
hive-clusterimagesets-subscription-fast-0 subscription CR (28.40 KB, text/plain), 2022-01-14 13:24 UTC, Yerzhan Beisembayev


Links:
Github stolostron backlog issues 19114, last updated 2022-01-13 23:11:32 UTC

Description Yerzhan Beisembayev 2022-01-13 20:51:14 UTC
Created attachment 1850704 [details]
Screenshot of ACM pod status

Description of the problem:
ACM pods in the open-cluster-management namespace keep crashing:
multicluster-operators-standalone-subscription
multicluster-operators-hub-subscription

The cause of the crashes is logged as:
helmrelease_controller.go:335] Failed to install HelmRelease container-platform/external-secrets rendered manifests contain a resource that already exists. Unable to continue with install: CustomResourceDefinition "clustersecretstores.external-secrets.io" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "container-platform": current value is "vladimir-test"
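
For context, Helm 3 records ownership metadata on every object it creates and refuses to adopt an object whose recorded release namespace differs from the namespace being installed into. A rough sketch of what the conflicting CRD's metadata likely looked like, with the values inferred from the error message above (the release name is an assumption):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clustersecretstores.external-secrets.io
  labels:
    app.kubernetes.io/managed-by: Helm
  annotations:
    # Ownership written by the earlier manual install; Helm requires
    # meta.helm.sh/release-namespace to equal the target namespace
    # ("container-platform"), so the ACM-driven install is rejected.
    meta.helm.sh/release-name: external-secrets        # assumed release name
    meta.helm.sh/release-namespace: vladimir-test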

The problem was resolved by deleting the Helm package that was preventing the application deployment.

Release version: 2.4.1

Operator snapshot version:

OCP version: 4.7.30 (ARO)

Browser Info: Chrome 96.0.4664.55 (Incognito mode) MacOS

Steps to reproduce:
1. Manually install a Helm chart that deploys CRDs.
2. Configure ACM to install a Helm chart with the same name, in a different namespace, that includes the same set of CRDs.

Actual results:
The multicluster-operators-* pods will start crashing, preventing ACM from installing any other applications.

Expected results:
ACM parses Helm installation errors, logs them, updates the status of the application, and continues processing other applications in the reconciliation loop.
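
For illustration, here is roughly the kind of status one might expect on the affected Subscription instead of a pod crash. This is a sketch against the apps.open-cluster-management.io/v1 Subscription API; the exact field names and values may differ:

apiVersion: apps.open-cluster-management.io/v1
kind: Subscription
metadata:
  name: external-secrets-internal-management-eastus
  namespace: container-platform
status:
  # Hypothetical status: the Helm error is surfaced here rather than crashing the pod.
  phase: Failed
  reason: >-
    Failed to install HelmRelease container-platform/external-secrets:
    rendered manifests contain a resource that already exists ...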

Additional info:
A screenshot of the pod status, as well as logs saved during troubleshooting, will be attached.

Comment 1 Yerzhan Beisembayev 2022-01-13 20:52:10 UTC
Events logged while the multicluster-operators-standalone-subscription pod was crashing:
Kubelet may be retrying requests that are timing out in CRI-O due to
            system load: the requested container
            k8s_multicluster-operators-standalone-subscription_multicluster-operators-standalone-subscription-778bbc7d85-zq77x_open-cluster-management_701044a8-3a89-4be5-8786-e6040db21f9a_1191
            is now ready and will be provided to the kubelet on next retry:
            error reserving ctr name
            k8s_multicluster-operators-standalone-subscription_multicluster-operators-standalone-subscription-778bbc7d85-zq77x_open-cluster-management_701044a8-3a89-4be5-8786-e6040db21f9a_1191
            for id
            a7a82f22b05500ebfa2ee9c2baaac42b346f6b48c9a0b85e3af204408fe3e6a5:
            name is reserved

Comment 2 Yerzhan Beisembayev 2022-01-13 20:53:00 UTC
Logs from multicluster-operators-standalone-subscription pod - exiting due to timeout:

I0106 18:08:05.860732       1 git_subscriber.go:218] git UnsubscribeItem container-platform/external-secrets-internal-management-eastus
I0106 18:08:05.860740       1 git_subscriber.go:218] git UnsubscribeItem container-platform/external-secrets-internal-management-eastus
I0106 18:08:05.860753       1 subscription_controller.go:340] Exit Reconciling subscription: container-platform/external-secrets-internal-management-eastus
I0106 18:08:07.358710       1 sync_server.go:231] stop synchronizer channel
I0106 18:08:14.497483       1 helmrelease_helper.go:112] HelmRelease is not owned by a MultiClusterHub resource: container-platform/external-secrets
I0106 18:08:14.497536       1 helmrelease_controller.go:233] Sync Release container-platform/external-secrets
I0106 18:08:14.525808       1 helmrelease_controller.go:331] Installing Release container-platform/external-secrets
E0106 18:08:15.642719       1 helmrelease_controller.go:335] Failed to install HelmRelease container-platform/external-secrets rendered manifests contain a resource that already exists. Unable to continue with install: CustomResourceDefinition "clustersecretstores.external-secrets.io" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "container-platform": current value is "vladimir-test"
E0106 18:08:37.361408       1 manager.go:191] failed waiting for all runnables to end within grace period of 30s: context deadline exceeded
Manager exited non-zero

Comment 3 Yerzhan Beisembayev 2022-01-13 20:53:58 UTC
Logs from multicluster-operators-standalone-subscription pod - panic:

E0106 19:29:05.837466       1 helmrelease_controller.go:335] Failed to install HelmRelease container-platform/external-secrets rendered manifests contain a resource that already exists. Unable to continue with install: CustomResourceDefinition "clustersecretstores.external-secrets.io" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "container-platform": current value is "vladimir-test"
I0106 19:29:07.748397       1 sync_server.go:231] stop synchronizer channel
I0106 19:29:14.660212       1 git_subscriber_item.go:213] Git commit: c7a44e5236c7778ee787fad6a580c097763ddf19
I0106 19:29:14.669078       1 helmrelease_helper.go:118] HelmRelease is owned by a MultiClusterHub resource proceed with the removal of all CRD references: open-cluster-management/management-ingress-18d79
W0106 19:29:14.685516       1 helmrepo.go:485] subsciption.spec.package is missing for subscription: open-cluster-management/hive-clusterimagesets-subscription-fast-0
I0106 19:29:14.740013       1 panic.go:1038] exit doSubscription: open-cluster-management/hive-clusterimagesets-subscription-fast-0
E0106 19:29:14.740142       1 runtime.go:78] Observed a panic: "send on closed channel" (send on closed channel)
goroutine 4220 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1e99420, 0x2477260})
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:74 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x2473b20})
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1e99420, 0x2477260})
    /usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/open-cluster-management/multicloud-operators-subscription/pkg/synchronizer/kubernetes.(*KubeSynchronizer).AddTemplates(0xc000886000, {0xc035ae99f0, 0x4a}, {{0xc001b61c20, 0xc002080fc0}, {0xc000a0f110, 0xc00127eab0}}, {0xc0360fe000, 0x48, 0x49}, ...)
    /remote-source/multicloud-operators-subscription/app/pkg/synchronizer/kubernetes/sync_client.go:136 +0x1c5
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).doSubscription(0xc0004a5b00)
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:345 +0x154a
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).doSubscriptionWithRetries(0xc0004a5b00, 0xc0016157b0, 0x3)
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:158 +0x45
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start.func1()
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:147 +0x159
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02aa20aa90)
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00069cac0, {0x248e8a0, 0xc0022c5dd0}, 0x1, 0xc000a2a4e0)
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x34630b8a000, 0x0, 0xa8, 0x43dde5)
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0xc0009b8778, 0xc0009b8768)
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:90 +0x25
created by github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:129 +0x2bd
panic: send on closed channel [recovered]
    panic: send on closed channel

goroutine 4220 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x2473b20})
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/runtime/runtime.go:55 +0xd8
panic({0x1e99420, 0x2477260})
    /usr/lib/golang/src/runtime/panic.go:1038 +0x215
github.com/open-cluster-management/multicloud-operators-subscription/pkg/synchronizer/kubernetes.(*KubeSynchronizer).AddTemplates(0xc000886000, {0xc035ae99f0, 0x4a}, {{0xc001b61c20, 0xc002080fc0}, {0xc000a0f110, 0xc00127eab0}}, {0xc0360fe000, 0x48, 0x49}, ...)
    /remote-source/multicloud-operators-subscription/app/pkg/synchronizer/kubernetes/sync_client.go:136 +0x1c5
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).doSubscription(0xc0004a5b00)
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:345 +0x154a
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).doSubscriptionWithRetries(0xc0004a5b00, 0xc0016157b0, 0x3)
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:158 +0x45
github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start.func1()
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:147 +0x159
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x7f02aa20aa90)
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:155 +0x67
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00069cac0, {0x248e8a0, 0xc0022c5dd0}, 0x1, 0xc000a2a4e0)
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0, 0x34630b8a000, 0x0, 0xa8, 0x43dde5)
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0, 0xc0009b8778, 0xc0009b8768)
    /remote-source/multicloud-operators-subscription/deps/gomod/pkg/mod/k8s.io/apimachinery.3/pkg/util/wait/wait.go:90 +0x25
created by github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:129 +0x2bd

Comment 4 Mike Ng 2022-01-13 22:06:15 UTC
The panic crash loop error seems to be coming from a Git subscription, which caused:

created by github.com/open-cluster-management/multicloud-operators-subscription/pkg/subscriber/git.(*SubscriberItem).Start
    /remote-source/multicloud-operators-subscription/app/pkg/subscriber/git/git_subscriber_item.go:129 +0x2bd

The error "helmrelease_controller.go:335] Failed to install HelmRelease container-platform..." might not be the root cause.

Can you post the subscription YAML for container-platform/external-secrets-internal-management-eastus? I am interested in the spec section, specifically whether there is a TimeWindow-related spec.
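
For reference, a time window on a Subscription typically looks something like the following; the values here are purely illustrative and not taken from this cluster:

spec:
  timewindow:
    # "active" deploys only inside the window; "blocked" suspends deployments inside it
    windowtype: active
    location: America/Toronto
    daysofweek: ["Monday", "Tuesday", "Wednesday"]
    hours:
      - start: "10:00AM"
        end: "5:00PM"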

Comment 5 Yerzhan Beisembayev 2022-01-13 22:16:08 UTC
Here is the content of the requested subscription at that time:

apiVersion: apps.open-cluster-management.io/v1
kind: Subscription
metadata:
  name: "external-secrets-internal-management-eastus"
  namespace: "container-platform"
  labels:
    tenant: container-platform
    acm-app: "external-secrets"
spec:
  channel: "container-platform-ch-helm/channel"
  name: "external-secrets"
  packageFilter:
    version: "0.1.0"
  placement:
    placementRef:
      name: "internal-management-eastus"
  packageOverrides:
    - packageName: "external-secrets"
      packageAlias: "external-secrets"
      packageOverrides:
        - path: spec
          value:
            secretStore:
              cluster-vault:
                provider:
                  azurekv:
                    authSecretRef:
                      clientId:
                        key: clientId
                        name: kubernetes-external-secrets
                      clientSecret:
                        key: clientSecret
                        name: kubernetes-external-secrets
                    tenantId: e17<DELETED>b6

Comment 7 Mike Ng 2022-01-14 02:18:44 UTC
Forget my previous comment. The real failure is at: 

I0106 19:29:14.740013       1 panic.go:1038] exit doSubscription: open-cluster-management/hive-clusterimagesets-subscription-fast-0

I assume this is a Git subscription, which makes more sense given the panic exit stack trace.

Can you print the output for this subscription? The namespace is open-cluster-management and the subscription name is hive-clusterimagesets-subscription-fast-0.

Comment 8 Yerzhan Beisembayev 2022-01-14 13:24:37 UTC
Created attachment 1850784 [details]
hive-clusterimagesets-subscription-fast-0 subscription CR

Comment 9 Yerzhan Beisembayev 2022-01-14 13:26:30 UTC
Added the hive-clusterimagesets-subscription-fast-0 subscription CR as an attachment.
Please note that ACM is currently working fine and the CR status shows all good.
The issue I'm reporting happened on January 6th, and I don't have a copy of the CR from that time.

Comment 10 Mike Ng 2022-01-27 21:30:31 UTC
I cannot reproduce this issue on the latest 2.4 development branch. I suspect this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2000274

Yerzhan, what do you think? Do you have a particular build or release channel with which I can reproduce this issue consistently?

If possible, can you use a public Helm chart that deploys CRDs, so we both have a reference for the chart?

This is what I used:

channel spec:
  type: HelmRepo
  pathname: https://kyverno.github.io/kyverno/
  insecureSkipVerify: true

subscription spec:
  name: kyverno
  placement:
    local: true

Comment 11 Yerzhan Beisembayev 2022-01-28 13:10:19 UTC
Hi.

It's quite possible that this is a duplicate.
At the time we experienced this issue, the ACM cluster was not in good shape.
As determined during troubleshooting, the cluster had resource issues; all secrets were managed via external-secrets, and at one point all CRs were gone.

As of now everything works fine.
If you cannot reproduce it, let's close this bug.
If this ever happens again, I'll re-open this bug or refer to it in a new one.

Comment 12 Mike Ng 2022-01-28 17:33:14 UTC
Closing as discussed. Thanks for all your help Yerzhan.

*** This bug has been marked as a duplicate of bug 2000274 ***

