Bug 1862846 - Dependency resolution failures should be accessible to the console to avoid hanging on "Installing"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Zac Herman
QA Contact: Yadan Pei
URL:
Whiteboard:
Duplicates: 1882791 1884534
Depends On:
Blocks: 1873348
Reported: 2020-08-03 02:55 UTC by Jian Zhang
Modified: 2021-03-03 02:42 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-02 00:53:58 UTC
Target Upstream Version:
Embargoed:
spadgett: needinfo-


Description Jian Zhang 2020-08-03 02:55:36 UTC
Description of problem:
When an operator installation fails, the UI displays "Installing" indefinitely. It should display the error information instead. See the screenshot: https://user-images.githubusercontent.com/15416633/89140958-b606a480-d575-11ea-8f30-21cfdcd4328d.png

Not sure if this is the correct component; feel free to move it to the OLM component.

Version-Release number of selected component (if applicable):
Cluster version is 4.6.0-0.nightly-2020-08-02-091622
mac:~ jianzhang$ oc exec catalog-operator-869bbf9969-hsvw6 -- olm --version
OLM version: 0.16.0
git commit: 0984504f2ec07f99014bdc6211fe87525e0b0087

How reproducible:
always

Steps to Reproduce:
1. Install OCP 4.6.
2. Subscribe to the OCS operator (version 4.4.1) in the web console.


Actual results:
The UI displays "Installing" indefinitely, but the installation has in fact failed on the back end.

E0803 02:34:50.343893       1 queueinformer_operator.go:290] sync "openshift-storage" failed: [found duplicate entries for ocs-operator.v4.4.1 in {community-operators openshift-marketplace}, found duplicate entries for ocs-operator.v4.2.3 in {community-operators openshift-marketplace}]
I0803 02:34:50.344014       1 event.go:278] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"openshift-storage", UID:"90e24d18-c76a-43af-b27d-70f9218fe171", APIVersion:"v1", ResourceVersion:"75129", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' [found duplicate entries for ocs-operator.v4.4.1 in {community-operators openshift-marketplace}, found duplicate entries for ocs-operator.v4.2.3 in {community-operators openshift-marketplace}]
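
The log above shows that the failure is already recorded as a `Warning` event with reason `ResolutionFailed` on the namespace. As a rough sketch (not the console's actual code), a client could filter events for that reason and surface the message. The function name `resolution_failures` and the sample data are hypothetical; the event dictionaries are assumed to follow the core/v1 Event JSON shape (`type`, `reason`, `message`, `involvedObject`).

```python
from typing import Any, Dict, List


def resolution_failures(events: List[Dict[str, Any]], namespace: str) -> List[str]:
    """Return messages of ResolutionFailed warnings recorded against a namespace."""
    messages = []
    for event in events:
        involved = event.get("involvedObject", {})
        if (
            event.get("type") == "Warning"
            and event.get("reason") == "ResolutionFailed"
            and involved.get("kind") == "Namespace"
            and involved.get("name") == namespace
        ):
            messages.append(event.get("message", ""))
    return messages


# Sample data: the first event mirrors the log in this report,
# the second is an unrelated event included only to show the filtering.
sample_events = [
    {
        "type": "Warning",
        "reason": "ResolutionFailed",
        "involvedObject": {"kind": "Namespace", "name": "openshift-storage"},
        "message": "found duplicate entries for ocs-operator.v4.4.1 in "
                   "{community-operators openshift-marketplace}",
    },
    {
        "type": "Normal",
        "reason": "Scheduled",
        "involvedObject": {"kind": "Pod", "name": "example-pod"},
        "message": "unrelated",
    },
]

failures = resolution_failures(sample_events, "openshift-storage")
print(failures)
```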

Expected results:
The UI should display the error information or show a failed status.

Additional info:

Comment 2 Zac Herman 2020-08-20 19:23:42 UTC
jiazha - Can you give me some more details on how to reproduce this error?  I am seeing that OCS specifically can get in a bad state when you install, uninstall, and then  reinstall.  Can you get this failure with any other operator?
Also, if you delete the namespace openshift-storage after uninstalling, do you get an error next time around?

I am reliably able to install OCS every time on a clean system so far, although I can mess things up if I try.  I am trying to decide if this is something that can be fixed on the front end "Installing" page or if we need to get OLM/OCS involved.

Comment 3 Zac Herman 2020-08-20 20:39:35 UTC
I spoke to the OLM team and it seems there is no status.state value set when things go terribly wrong.  Sadly, my code triggers off that status.state to know when we are installing, when we are finished, and when there is a failure.  We will need to change how the installing page activates; so instead of activating immediately, we will need to wait most likely for the install plan to be created. As a side note, the OLM team is aware of this problem and is working on a solution but timing is TBD.
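
A minimal sketch of the idea in this comment, assuming the Subscription is available as a plain dictionary: treat a missing install plan as "not started yet" instead of showing the installing page right away. The function name `install_page_state` and the placeholder strings `waiting-for-install-plan` and `unknown` are hypothetical, and the field names `status.installPlanRef` and `status.state` are assumptions about the Subscription status shape, not a description of the console's actual code.

```python
from typing import Any, Dict


def install_page_state(subscription: Dict[str, Any]) -> str:
    """Derive a display state from a Subscription without trusting status.state alone."""
    status = subscription.get("status") or {}
    if not status.get("installPlanRef"):
        # No install plan yet: resolution may still be running or may have failed,
        # so the page should not claim that an installation is in progress.
        return "waiting-for-install-plan"
    if not status.get("state"):
        # An install plan exists but OLM has not reported a state.
        return "unknown"
    return status["state"]


print(install_page_state({"status": {}}))
print(install_page_state({"status": {"installPlanRef": {"name": "install-abc"},
                                     "state": "AtLatestKnown"}}))
```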

Comment 4 Jakub Stejskal 2020-08-21 07:11:45 UTC
We were able to "reproduce" it with the AMQ Streams operator. It has already been migrated to the new bundle image format, and when I last tried it (fc-1 build), the AMQ Streams operator was the only operator in OperatorHub. The installation didn't even start properly: the install plan wasn't created. We discussed that with Shawn Hurley, and according to him it's the same problem.

Comment 5 Zac Herman 2020-08-21 17:29:32 UTC
Since the underlying problem here is that we are not getting a meaningful status from the Subscription, the actual bug lies on the OLM side of the house.  I am transferring this bug to that team. I will create a new bug to improve the usability of the console.

Comment 7 Jian Zhang 2020-08-28 06:12:50 UTC
Hi Zac,

> Can you give me some more details on how to reproduce this error?  I am seeing that OCS specifically can get in a bad state when you install, uninstall, and then  reinstall.  Can you get this failure with any other operator?
> Also, if you delete the namespace openshift-storage after uninstalling, do you get an error next time around?

Apologies for the late reply. Yes, OCS now installs fine on OCP 4.6 because the OCS team fixed that.
You can now subscribe to the AMQ Streams operator to reproduce it.

Comment 8 Jian Zhang 2020-08-28 06:22:42 UTC
Hi Mohan,

I'm not sure why you added the `TestBlocker` label. I have removed it for now. Please feel free to add it back. Thanks!

Comment 13 Jian Zhang 2020-09-11 09:58:48 UTC
From the customer's perspective, the `Installing` status is very confusing. I have been asked about this problem many times. It would be better to fix it in 4.6; besides, it is blocking other teams' testing. Raising the priority.

Comment 15 Jakub Stejskal 2020-09-14 12:30:51 UTC
The issue we hit with AMQ Streams is a little different: according to the debugging we did with Lance Galletti, the iib image for 4.6 contains some wrong binaries, which means that no operator from it can be installed on 4.6.

Comment 16 Ben Luddy 2020-10-03 01:58:36 UTC
*** Bug 1882791 has been marked as a duplicate of this bug. ***

Comment 17 Ben Luddy 2020-10-09 22:22:47 UTC
*** Bug 1884534 has been marked as a duplicate of this bug. ***

Comment 20 Kevin Rizza 2020-10-23 17:51:19 UTC
Hi Sam,

With the release of the Operator API in 4.6, I think this bug is resolvable through some design changes in the way that the console interacts here. Most likely, this error information is available as part of the status object of the InstallPlan that was attempting to execute the installation, which should now be aggregated there. I'm reassigning to the console here. If there's more in depth work that is non trivial, we may want to close this and instead file an RFE. But to start, I think some trivial improvement can be made by starting to watch that on this page, rather than aggregate the combination of CSV+Subscription by watching both objects.
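
To illustrate the suggestion of watching the InstallPlan status, here is a hedged sketch that pulls an error message out of an InstallPlan represented as a plain dictionary. The function name `install_plan_error` and the sample message are hypothetical; the `status.phase` value `Failed` and the `Installed` condition are assumptions about the InstallPlan status shape, and whether resolution errors actually end up there is exactly what this comment leaves open.

```python
from typing import Any, Dict, Optional


def install_plan_error(install_plan: Dict[str, Any]) -> Optional[str]:
    """Return a user-facing error for a failed InstallPlan, or None if it has not failed."""
    status = install_plan.get("status") or {}
    if status.get("phase") != "Failed":
        return None
    for condition in status.get("conditions", []):
        if condition.get("type") == "Installed" and condition.get("status") == "False":
            # Prefer the detailed message, fall back to the short reason.
            return condition.get("message") or condition.get("reason")
    return "install plan failed without a reported condition"


# Hypothetical sample object for illustration only.
failed_plan = {
    "status": {
        "phase": "Failed",
        "conditions": [
            {"type": "Installed", "status": "False",
             "reason": "InstallComponentFailed", "message": "example failure message"},
        ],
    }
}

print(install_plan_error(failed_plan))
print(install_plan_error({"status": {"phase": "Complete"}}))
```

A page that watches a single object like this would only need one lookup to decide between "installing", "done", and an error message, rather than combining CSV and Subscription state.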

Comment 21 Zac Herman 2020-10-26 22:21:14 UTC
I tried installing several different operators today and then deleting them and reinstalling, then deleting the namespace and then reinstalling.  However, every time the install worked (eventually), making this difficult to debug.  When this was first assigned to the console team, I could only get OCS to fail and that was specifically fixed and was deemed OCS specific.
At this point, I am curious if anyone else can create an error specific to the original issue description.  I will leave this open for now, but if there is not a specific reproducible way to show this behavior, I will close this as fixed by changes to other areas of the code.

Comment 22 Filip Brychta 2020-11-03 13:05:32 UTC
(In reply to Zac Herman from comment #21)
> I tried installing several different operators today and then deleting them
> and reinstalling, then deleting the namespace and then reinstalling. 
> However, every time the install worked (eventually) making this difficult to
> debug.  When this was first assigned to the console team, I could only get
> OCS to fail and that was specifically fixed and was deemed OCS specific.
> At this point, I am curious if anyone else can create an error specific to
> the original issue description.  I will leave this open for now but if there
> is not a specific reproducible way to show this behavior, I will close this
> as being fix with fixes to other areas of the code.

A simple way to reproduce this is to have a typo in the channel name, as described in BZ1884534.
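
As an illustration of that reproduction path, the sketch below builds a Subscription manifest with a deliberately mistyped channel. All names (`example-operator`, `example-namespace`, `example-catalog`, and the channel `stabel`) are hypothetical placeholders and are not taken from BZ1884534; the helper `subscription_manifest` is likewise only for illustration.

```python
import json
from typing import Any, Dict


def subscription_manifest(name: str, namespace: str, package: str, channel: str,
                          source: str, source_namespace: str) -> Dict[str, Any]:
    """Build a Subscription manifest as a plain dictionary."""
    return {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "Subscription",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "name": package,
            "channel": channel,
            "source": source,
            "sourceNamespace": source_namespace,
        },
    }


manifest = subscription_manifest(
    name="example-operator",
    namespace="example-namespace",
    package="example-operator",
    channel="stabel",  # deliberate typo for a hypothetical "stable" channel
    source="example-catalog",
    source_namespace="openshift-marketplace",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be piped to `oc apply -f -` on a test cluster, after replacing the placeholders with a real package and catalog source.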

Comment 23 Zac Herman 2021-03-02 00:53:58 UTC
This bug has mutated from the original problem and now references other bugs that were closed as duplicates of this one.  I am having a difficult time understanding and reproducing the problem, especially given the steps listed above.  I am closing this out and ask that a new bug be opened describing the issue that is occurring.

