Bug 1913132 - The installation of OpenShift Virtualization reports success before it has actually succeeded
Summary: The installation of OpenShift Virtualization reports success before it has actually succeeded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Alexander Greene
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-06 04:28 UTC by Guohua Ouyang
Modified: 2021-02-24 15:50 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: OLM recently introduced a new controller that updates the deployments defined in a CSV with an environment variable used to identify the OperatorCondition owned by the operator. Consequence: each deployment was updated immediately after OLM created it, resulting in a choppy installation. Fix: OLM now creates the deployment with the OperatorCondition environment variable already set (see the sketch after these fields). Result: OLM no longer updates the deployment's environment variables immediately after creating it.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:50:15 UTC
Target Upstream Version:
Embargoed:
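
A minimal sketch of what the Doc Text above describes, using the etcd example from comment 6 below. It assumes the injected variable is OPERATOR_CONDITION_NAME, set to the name of the owning CSV/OperatorCondition; the image reference is illustrative.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-operator
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      name: etcd-operator
  template:
    metadata:
      labels:
        name: etcd-operator
    spec:
      containers:
      - name: etcd-operator
        image: quay.io/coreos/etcd-operator:v0.9.4   # image as defined in the CSV (illustrative)
        env:
        - name: OPERATOR_CONDITION_NAME              # with the fix, set by OLM when it creates the deployment,
          value: etcdoperator.v0.9.4                 # rather than through an immediate follow-up update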


Attachments
video shows install failure after it reports success (11.06 MB, application/octet-stream)
2021-01-06 04:28 UTC, Guohua Ouyang


Links
GitHub operator-framework/operator-lifecycle-manager pull 1943 (closed): Bug 1913132: Create CSV Deployments with OpCond EnvVar - last updated 2021-02-11 17:42:02 UTC
Red Hat Product Errata RHSA-2020:5633 - last updated 2021-02-24 15:50:37 UTC

Description Guohua Ouyang 2021-01-06 04:28:34 UTC
Created attachment 1744768 [details]
video shows install failure after it reports success

Description of problem:
When installing OpenShift Virtualization from the OCP console, the installation briefly reports success, then a failure appears and the installation starts over; it eventually succeeds.

Version-Release number of selected component (if applicable):
OCP 4.7 + CNV 2.6.0

How reproducible:
100%

Steps to Reproduce:
1. On the OCP console, go to Operators -> OperatorHub.
2. Enter 'OpenShift Virtualization' in the filter.
3. Select version 2.6.0 and install it.


Actual results:
The installation of OpenShift Virtualization reports success before it has actually succeeded.

Expected results:
Once the page shows success, no further failures should occur.

Additional info:

Comment 1 Yaacov Zamir 2021-01-06 08:11:04 UTC
It looks like an installer issue, moving to Installation.

Comment 2 Oren Cohen 2021-01-06 10:48:42 UTC
From my initial investigation, this looks like a regression in OLM. During the initial phase of the installation (after the Subscription is created, before the HCO CR is created), pods are rolled out and two ReplicaSets are created for each OLM-controlled deployment. This causes the HCO operator to report "Ready" for a brief moment, until another pod is rotated (a quick check for the extra rollout is sketched after the outputs below).
This behavior is observed on OCP 4.7.0-fc.0, but not on OCP 4.6.9, with the same index image containing CNV 2.6.0:
registry-proxy.engineering.redhat.com/rh-osbs/iib:36168 <==> hco-bundle-registry-container-v2.6.0-454


# installation on OCP 4.7.0-fc.0:

$ oc get rs
NAME                                         DESIRED   CURRENT   READY   AGE
cdi-operator-7d46b49c9f                      1         1         1       23m
cdi-operator-b7b778cc5                       0         0         0       23m
cluster-network-addons-operator-5ffccdf57    0         0         0       23m
cluster-network-addons-operator-759b89f64c   1         1         1       23m
hco-operator-658dc8f879                      1         1         1       23m
hco-operator-755cc7d989                      0         0         0       23m
hco-webhook-56d6fb844d                       1         1         1       23m
hco-webhook-6dd746cddb                       0         0         0       23m
hostpath-provisioner-operator-5b6f57d6d9     0         0         0       23m
hostpath-provisioner-operator-f4649cfd9      1         1         1       23m
kubevirt-ssp-operator-6dffbcbcfb             0         0         0       23m
kubevirt-ssp-operator-8649744554             1         1         1       23m
node-maintenance-operator-7d49bf99ff         0         0         0       23m
node-maintenance-operator-d5c8786c           1         1         1       23m
virt-operator-6dcf7ffb84                     0         0         0       23m
virt-operator-8696645c98                     2         2         2       23m
vm-import-operator-56bf9fccd4                0         0         0       23m
vm-import-operator-65c86b748                 1         1         1       23m

# installation on OCP 4.6.9:

$ oc get rs
NAME                                         DESIRED   CURRENT   READY   AGE
cdi-operator-7959bcd65b                      1         1         1       13m
cluster-network-addons-operator-5678b84f6b   1         1         1       13m
hco-operator-99c776db8                       1         1         1       13m
hco-webhook-795df79cd5                       1         1         1       13m
hostpath-provisioner-operator-58b48bc6fd     1         1         1       13m
kubevirt-ssp-operator-6578f4b6fc             1         1         1       13m
node-maintenance-operator-69cf9bf685         1         1         1       13m
virt-operator-64655949c7                     2         2         2       13m
virt-template-validator-844bf5ddc9           2         2         2       10m
vm-import-operator-7bbf8fb485                1         1         1       13m
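
A quick way to confirm that the second ReplicaSet comes from OLM updating the deployment right after creating it; this is a sketch, assuming the injected variable is OPERATOR_CONDITION_NAME, run in the namespace CNV was installed into:

# two rollout revisions are expected on 4.7.0-fc.0, only one on 4.6.9
$ oc rollout history deployment/hco-operator

# the only difference between the two pod templates should be the injected variable
$ oc get rs hco-operator-658dc8f879 -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'
$ oc get rs hco-operator-755cc7d989 -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'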

Comment 3 Oren Cohen 2021-01-06 11:24:53 UTC
For reference, I installed another Red Hat supported operator (ACM), and encountered the same issue:

$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-fc.0   True        False         6d19h   Cluster version is 4.7.0-fc.0

$ oc get rs -n open-cluster-management
NAME                                                        DESIRED   CURRENT   READY   AGE
cluster-manager-7457b7f8f9                                  3         3         3       31m
cluster-manager-8558df4566                                  0         0         0       30m
hive-operator-647fb55f9f                                    1         1         1       31m
hive-operator-85bcc96cff                                    0         0         0       30m
multicluster-observability-operator-5967f776c8              1         1         1       31m
multicluster-observability-operator-8465647ccd              0         0         0       30m
multicluster-operators-application-75477bf55d               1         1         1       31m
multicluster-operators-application-999757f6b                0         0         0       30m
multicluster-operators-hub-subscription-98f794f9            0         0         0       30m
multicluster-operators-hub-subscription-f6bd5bd99           1         1         1       31m
multicluster-operators-standalone-subscription-7f697f8db8   1         1         1       31m
multicluster-operators-standalone-subscription-db5ddc968    0         0         0       30m
multiclusterhub-operator-5dcbcb7bbf                         1         1         1       31m
multiclusterhub-operator-698b5dc7fc                         0         0         0       30m

This strengthens the suspicion that it's an OLM bug.

Comment 4 Oren Cohen 2021-01-06 18:54:18 UTC
Moving to OLM, as advised by @agreene

Comment 6 Jian Zhang 2021-01-08 03:36:12 UTC
Cluster version is 4.7.0-0.nightly-2021-01-07-235021

[root@preserve-olm-env data]# oc -n openshift-operator-lifecycle-manager  exec catalog-operator-69b986886c-r7hr7  -- olm --version
OLM version: 0.17.0
git commit: ac075ae4d1081a49c15c8c2edfeb71d8d3e0363e

1. Subscribe to the etcdoperator in the "default" project.
[root@preserve-olm-env data]# cat og.yaml 
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: test-og
  namespace: default 
spec:
  targetNamespaces:
  - default 
[root@preserve-olm-env data]# cat sub-etcd-community.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: default 
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.4
[root@preserve-olm-env data]# oc create -f og.yaml 
operatorgroup.operators.coreos.com/test-og created
[root@preserve-olm-env data]# oc create -f sub-etcd-community.yaml 
subscription.operators.coreos.com/etcd created

2. Check the CSV, Deployment, and ReplicaSet.
[root@preserve-olm-env data]# oc get csv -n default
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Succeeded

[root@preserve-olm-env data]# oc get deployment  -n default
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
etcd-operator   1/1     1            1           33s

[root@preserve-olm-env data]# oc get rs -n default
NAME                       DESIRED   CURRENT   READY   AGE
etcd-operator-74cd66bbff   1         1         1       43s

Only one ReplicaSet was generated, which looks good to me, so I am marking this verified. A further check is sketched below.
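
As a further check, sketched here on the assumption that the injected variable is OPERATOR_CONDITION_NAME, the deployment should show a single rollout revision and the variable should already be present in the pod template that OLM created:

# a single revision means OLM did not update the deployment after creating it
$ oc -n default rollout history deployment/etcd-operator

# OPERATOR_CONDITION_NAME should already be listed even though only one ReplicaSet exists
$ oc -n default get deployment etcd-operator -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'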

Comment 7 Oren Cohen 2021-01-12 16:53:23 UTC
@gouyang, now that the bug on OLM has been resolved, could you please verify that the CNV installation issue when using the OperatorHub UI is no longer observed?
Thanks

Comment 8 Guohua Ouyang 2021-01-18 03:17:39 UTC
(In reply to Oren Cohen from comment #7)
> @gouyang, now that the bug on OLM has been resolved, could you
> please verify that the CNV installation issue when using OperatorHub UI is
> no longer observed?
> Thanks

Verified that the issue no longer exists in the latest environment.

Comment 11 errata-xmlrpc 2021-02-24 15:50:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

