Bug 2113831

Summary: Hive Project is stuck with terminating state after attempt to upgrade to ACM 2.5
Product: Red Hat Advanced Cluster Management for Kubernetes
Reporter: Mihir Lele <mlele>
Component: Installer
Assignee: Ray Harris <raharris>
Status: CLOSED NOTABUG
QA Contact: txue
Severity: urgent
Priority: urgent
Version: rhacm-2.5.z
CC: daliu, dhuynh, efried, huichen, jagray, jfindysz
Target Milestone: ---
Flags: txue: qe_test_coverage+
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2022-10-06 12:55:46 UTC
Type: Bug

Description Mihir Lele 2022-08-02 06:01:58 UTC
Description of the problem:

The Hive project is stuck in the Terminating state after an attempted upgrade to ACM 2.5.

MCH is also stuck in the Installing state.

Additional info:

Hive namespace status condition:

  - lastTransitionTime: "2022-07-26T01:35:33Z"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the
      complete list of server APIs: admission.hive.openshift.io/v1: the server is
      currently unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
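
A condition like this can also be pulled out programmatically. The following is a minimal sketch, assuming the namespace has been dumped with `kubectl get namespace <name> -o json` and parsed into a dictionary. The function name `deletion_blockers` is hypothetical and is not part of ACM or Hive. It lists every status condition whose status is "True", which for the namespace deletion conditions indicates something that is still blocking deletion.

```python
def deletion_blockers(namespace):
    """Return (type, reason) pairs for namespace conditions with status "True".

    `namespace` is the parsed JSON of a Namespace object
    (hypothetical helper, for illustration only).
    """
    conditions = namespace.get("status", {}).get("conditions", [])
    return [
        (cond.get("type"), cond.get("reason"))
        for cond in conditions
        if cond.get("status") == "True"
    ]


# Sample built from the condition quoted in this report; the second
# condition is an assumed, illustrative entry.
sample_namespace = {
    "status": {
        "phase": "Terminating",
        "conditions": [
            {
                "type": "NamespaceDeletionDiscoveryFailure",
                "status": "True",
                "reason": "DiscoveryFailed",
            },
            {
                "type": "NamespaceDeletionContentFailure",
                "status": "False",
                "reason": "ContentDeleted",
            },
        ],
    }
}

for cond_type, reason in deletion_blockers(sample_namespace):
    print(cond_type, reason)
```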


less 0020-acm-must-gather-2.tar.gz/acm-must-gather-2/registry-redhat-io-rhacm2-acm-must-gather-rhel8-sha256-385e5fb24b0f50ba4ada884ea8c4d5013393769261d221ab37f79e9e96e461e1/namespaces/rhacm/pods/multicluster-operators-standalone-subscription-7bc8d49776-wr28f/multicluster-operators-standalone-subscription/multicluster-operators-standalone-subscription/logs/current.log

2022-07-30T12:45:31.540674414Z E0730 12:45:31.540597       1 gitrepo.go:303] Get "https://github.com/stolostron/acm-hive-openshift-releases.git/info/refs?service=git-upload-pack": EOF Failed to git clone with the primary channel: Get "https://github.com/stolostron/acm-hive-openshift-releases.git/info/refs?service=git-upload-pack": EOF
2022-07-30T12:45:31.540674414Z E0730 12:45:31.540655       1 git_subscriber_item.go:265] Failed to clone git: https://github.com/stolostron/acm-hive-openshift-releases.git err: Get "https://github.com/stolostron/acm-hive-openshift-releases.git/info/refs?service=git-upload-pack": EOFUnable to clone the git repo https://github.com/stolostron/acm-hive-openshift-releases.git
2022-07-30T12:45:31.540748415Z I0730 12:45:31.540674       1 git_subscriber_item.go:268] exit doSubscription: rhacm/hive-clusterimagesets-subscription-fast-0
2022-07-30T12:45:31.540748415Z E0730 12:45:31.540680       1 git_subscriber_item.go:160] Failed to clone git: https://github.com/stolostron/acm-hive-openshift-releases.git err: Get "https://github.com/stolostron/acm-hive-openshift-releases.git/info/refs?service=git-upload-pack": EOFSubscription error.


I am not sure which tasks run in the background during the upgrade from ACM 2.4 to 2.5. However, I can see that Hive was managed by MCH in 2.4 and is managed by MCE in 2.5, so my guess is that Hive needs to be redeployed. I also did not see any evidence that the MCE deployment was triggered.

From the must-gather, this looks like a connected setup, so I assume we do not need to add the MCE annotation to the MCH manually.

Comment 4 Mihir Lele 2022-08-04 05:19:55 UTC
We were able to get past the issue with the following workaround, with help from Hive engineering:

Run: kubectl get apiservice

Look for entries whose AVAILABLE column is False and delete them.

Run: kubectl delete apiservice <service-name>
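
The lookup step can also be scripted. The following is a minimal sketch, not part of the original workaround: it assumes the output of `kubectl get apiservice -o json` has been parsed into a dictionary, and the function name `unavailable_apiservices` is hypothetical. It returns the names of APIService objects whose `Available` condition has status "False", which are the candidates for the delete command above. The sample data is illustrative only.

```python
def unavailable_apiservices(apiservice_list):
    """Return names of APIServices whose Available condition is "False".

    `apiservice_list` is the parsed JSON from `kubectl get apiservice -o json`
    (hypothetical helper, not part of ACM or Hive).
    """
    names = []
    for item in apiservice_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        for cond in conditions:
            if cond.get("type") == "Available" and cond.get("status") == "False":
                names.append(item["metadata"]["name"])
                break  # one match per APIService is enough
    return names


# Illustrative sample only; a real list comes from the cluster.
sample_apiservices = {
    "items": [
        {
            "metadata": {"name": "v1.apps"},
            "status": {"conditions": [{"type": "Available", "status": "True"}]},
        },
        {
            "metadata": {"name": "v1.admission.hive.openshift.io"},
            "status": {"conditions": [{"type": "Available", "status": "False"}]},
        },
    ]
}

for name in unavailable_apiservices(sample_apiservices):
    print(name)
```

Review the returned names before deleting anything, since an APIService can be temporarily unavailable for reasons unrelated to a removed operator.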

Comment 5 daliu 2022-08-15 02:54:19 UTC
@efried
@jagray

I am not sure whether Hive or the ACM installer should fix this. Could you take a look?

Comment 7 Eric Fried 2022-08-15 17:13:46 UTC
I believe the fix here is going to be on the ACM side. If deleting an existing deployment's namespace is part of their upgrade process, they'll need to add a step to delete the APIService (and any other non-namespaced resources). On the hive side, we need to look into better documenting the uninstallation process. (We may also try to look into having the hive-operator clean up after itself a bit better when its deployment is deleted.)

Comment 8 Eric Fried 2022-08-15 19:14:26 UTC
(In reply to Eric Fried from comment #7)
> On the hive side, we need to look into better documenting the
> uninstallation process. (We may also try to look into having the
> hive-operator clean up after itself a bit better when its deployment is
> deleted.)

https://issues.redhat.com/browse/HIVE-1998

Comment 9 bot-tracker-sync 2022-08-22 14:51:21 UTC
G2Bsync 1222456466 comment
ray-harris Mon, 22 Aug 2022 14:40:04 UTC

ACM was unable to reproduce this issue. We're not going to add code to check for this as this is the first and only time it's been reported. We'll make our SRE aware of the potential issue in case they run into it again.

This issue can be closed.