Bug 1868712 - operator pod with OLM webhooks is getting terminated and created several times during the installation
Summary: operator pod with OLM webhooks is getting terminated and created several times during the installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alexander Greene
QA Contact: yhui
URL:
Whiteboard:
Duplicates: 1874938 (view as bug list)
Depends On:
Blocks: 1868229 1874797 1892372
 
Reported: 2020-08-13 14:56 UTC by Oren Cohen
Modified: 2020-10-28 15:04 UTC (History)
10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: OLM was not reusing existing valid CA certs when installing a CSV that entered the `installing` phase multiple times. Consequence: OLM would apply a new webhook hash to the deployment, causing a new ReplicaSet to be created. The running operator would then be redeployed, possibly many times during an install. Fix: OLM now checks whether the CA already exists and reuses it if it is valid. Result: If OLM detects existing valid CAs, OLM reuses them.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:28:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 1745 0 None closed Bug 1874938: Set RevisionHistoryLimit per Deployment 2021-01-06 10:43:11 UTC
Github operator-framework operator-lifecycle-manager pull 1761 0 None closed Bug 1868712: OLM should reuse existing CA if they have not expired 2021-01-06 10:43:08 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:28:26 UTC

Description Oren Cohen 2020-08-13 14:56:08 UTC
Description of problem:
An operator that is configured to use OLM admission webhooks in its CSV (.spec.webhookdefinitions) exhibits the following behavior:
1. During the installation phase, the operator pod is constantly being terminated and a new pod is created alongside the old one; this occurs dozens of times during the deployment.
2. When (1) occurs, a new ReplicaSet is created, causing the existing active ReplicaSet to scale down to 0. All inactive ReplicaSets (with desired replicas = 0) remain in the namespace.
3. The Deployment object itself is not changed.
4. After the installation is finished and the operator is stable, deleting an arbitrary pod in the namespace causes a new operator ReplicaSet to be created and the operator pod to be terminated and recreated (see the sketch below). This does not occur when the olm-operator (in the openshift-operator-lifecycle-manager namespace) is not running.
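A minimal sketch for observing item 4, assuming the kubevirt-hyperconverged namespace used throughout this report (the pod name is an arbitrary placeholder):

$ oc delete pod <any-pod-in-the-namespace> -n kubevirt-hyperconverged
$ oc get rs -n kubevirt-hyperconverged -w   # a new hco-operator ReplicaSet appears and the old one scales to 0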

----
The described behavior does not occur when the webhook is not configured on the same operator.


Version-Release number of selected component (if applicable):
OCP 4.5.3
PackageServer 0.15.1

How reproducible:
100%

Steps to Reproduce:
I built two versions of HCO operator (which deploys CNV / openshift-virt components) bundle image - one with webhook and another one without it.

Positive flow:
1. Create a catalog source using the following bundle image:
quay.io/orenc/hco-container-registry:olm-webhooks
2. Install "KubeVirt HyperConverged Cluster Operator" in channel 1.2.0 using OperatorHub (or manually with CLI).
3. Create the "HyperConverged" CR (default settings).
4. Watch the hco-operator pod getting terminated and recreated; new ReplicaSets are created:
$ oc get rs -n kubevirt-hyperconverged
NAME                                            DESIRED   CURRENT   READY   AGE
cdi-apiserver-7dcb77db79                        1         1         1       4m28s
cdi-deployment-7f999c755                        1         1         1       4m28s
cdi-operator-54d5b958d6                         1         1         1       5m2s
cdi-uploadproxy-85f76cc48b                      1         1         1       4m27s
cluster-network-addons-operator-7658f658d4      1         1         1       5m3s
hco-operator-5476bf64f5                         0         0         0       2m11s
hco-operator-54dd9fcf59                         1         1         1       15s
hco-operator-56c4c6866f                         0         0         0       96s
hco-operator-59f65f4559                         0         0         0       3m34s
hco-operator-5bb486777c                         0         0         0       2m47s
hco-operator-64f4cfb7bb                         0         0         0       18s
hco-operator-6978d5bb9f                         0         0         0       61s
hco-operator-7995844456                         0         0         0       3m32s
hco-operator-7b69cf7c54                         0         0         0       2m49s
hco-operator-7b95cc76d9                         0         0         0       4m23s
hco-operator-cc87fccb8                          0         0         0       5m1s
hostpath-provisioner-operator-79cc779987        1         1         1       5m2s
kubemacpool-mac-controller-manager-6c8c6557c5   2         2         2       4m30s
kubevirt-ssp-operator-767c7dff98                1         1         1       5m2s
nmstate-webhook-7fcdbdb77d                      2         2         2       4m29s
virt-operator-695d9b7659                        2         2         2       5m3s
virt-template-validator-76db69664c              2         2         2       4m6s
vm-import-controller-785cb6d578                 1         1         0       4m30s
vm-import-operator-647cff486f                   1         1         1       5m2s


Negative flow:
same as previous, but use this bundle image instead:
quay.io/orenc/hco-container-registry:without-olm-webhooks
Watch that the hco-operator pod is not disrupted, not even once, and that there are no "dead" ReplicaSets.

Actual results:
The HCO pod is terminated and recreated numerous times during installation by OLM.

Expected results:
The HCO pod is not disrupted during the installation process.

Additional info:
The operator CSV (with webhooks) can be found here:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/deploy/olm-catalog/kubevirt-hyperconverged/1.2.0/kubevirt-hyperconverged-operator.v1.2.0.clusterserviceversion.yaml

Comment 2 Alexander Greene 2020-08-24 14:41:52 UTC
Sorry for the delayed response; the operator was not available in the Red Hat, Certified, or Community CatalogSources, so I needed to create a CatalogSource in which the operator was available.

Steps:
1. Create the CatalogSource:
```
$ cat <<EOF | kubectl create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource
  namespace: olm
spec:
  sourceType: grpc
  image: quay.io/kubevirt/hco-container-registry:latest
  displayName: KubeVirt HyperConverged
  publisher: Red Hat
EOF
```

2. Create a namespace with an OperatorGroup:
```
$ kubectl create ns kubevirt-hyperconverged

$ cat <<EOF | kubectl create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: hco-operatorgroup
  namespace: kubevirt-hyperconverged
spec:
  targetNamespaces:
  - "kubevirt-hyperconverged"
EOF
```

3. Create a Subscription
```
$ cat <<EOF | kubectl create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: hco-subscription
  namespace: kubevirt-hyperconverged
spec:
  channel: "1.2.0"
  name: kubevirt-hyperconverged
  source: hco-catalogsource
  sourceNamespace: olm
EOF
```

The operator was installed correctly and did not create any webhooks:
```
$ oc get csvs
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.1.0   KubeVirt HyperConverged Cluster Operator   1.1.0     kubevirt-hyperconverged-operator.v1.0.0   Succeeded

$ oc get mutatingwebhookconfigurations.admissionregistration.k8s.io --all-namespaces
No resources found

$ oc get validatingwebhookconfigurations.admissionregistration.k8s.io --all-namespaces
No resources found
```

It looks like the latest version of the registry image does not contain the webhook in the CSV:
```
$ oc get csv kubevirt-hyperconverged-operator.v1.1.0 -o yaml | grep webhook
    - description: Represents a deployment of admission control webhook to validate
      displayName: KubeVirt Template Validator admission webhook
          - validatingwebhookconfigurations
          - mutatingwebhookconfigurations
          - validatingwebhookconfigurations
          - validatingwebhookconfigurations
          - mutatingwebhookconfigurations
                  name: webhooks
      message: cluster rule:{"verbs":["create","get","list","patch","watch"],"apiGroups":["admissionregistration.k8s.io"],"resources":["validatingwebhookconfigurations"]}
      message: cluster rule:{"verbs":["get","list","watch","create","delete","update","patch"],"apiGroups":["admissionregistration.k8s.io"],"resources":["validatingwebhookconfigurations","mutatingwebhookconfigurations"]}
      message: cluster rule:{"verbs":["*"],"apiGroups":["admissionregistration.k8s.io"],"resources":["validatingwebhookconfigurations","mutatingwebhookconfigurations"]}
```

I will try to create a new CatalogSource image with the correct CSV since this is a release-blocking bug, but in the future please provide a CatalogSource that reproduces the error you are encountering.

Comment 3 Alexander Greene 2020-08-24 14:45:35 UTC
After inspecting the catalog image, it seems the webhook is only defined in version 1.2.0 of the operator, whereas I was installing version 1.1.0.

Comment 4 Oren Cohen 2020-08-24 16:16:28 UTC
@Alex, I provided an image containing the bundle with the webhooks configured. For simplicity, I removed a dependency on another component (NMO) from it. This is the updated 1.2.0 CSV version.

You can create the following catalog source:
```
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/orenc/hco-container-registry:olm-webhooks
  displayName: Openshift Virtualization
  publisher: grpc
```

Then the operator is available in OperatorHub (UI) for installation.
Or, you can deploy a Subscription and an OperatorGroup manually.

Thanks

Comment 5 Alexander Greene 2020-08-25 14:58:00 UTC
Thanks @Oren for providing a CatalogSource!

I had a 4.6 cluster up and running and was unable to reproduce the issue, as shown below. Since this bug was reported against 4.5, I will spin up a 4.5 cluster and try to reproduce there.

Testing on a 4.6 cluster:
1. Confirm the OpenShift version:
```
$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.ci-2020-08-24-152549   True        False         68m     Cluster version is 4.6.0-0.ci-2020-08-24-152549
```

2. Confirm the CatalogSource @Oren provided is installed:
```
$ oc get catsrc hco-catalogsource -n openshift-marketplace -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"annotations":{},"name":"hco-catalogsource","namespace":"openshift-marketplace"},"spec":{"displayName":"KubeVirt HyperConverged","image":"quay.io/orenc/hco-container-registry:olm-webhooks","publisher":"Red Hat","sourceType":"grpc"}}
  creationTimestamp: "2020-08-25T14:35:07Z"
  generation: 1
  managedFields:
  - apiVersion: operators.coreos.com/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:displayName: {}
        f:image: {}
        f:publisher: {}
        f:sourceType: {}
    manager: oc
    operation: Update
    time: "2020-08-25T14:35:07Z"
  - apiVersion: operators.coreos.com/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:icon:
          .: {}
          f:base64data: {}
          f:mediatype: {}
      f:status:
        .: {}
        f:connectionState:
          .: {}
          f:address: {}
          f:lastConnect: {}
          f:lastObservedState: {}
        f:registryService:
          .: {}
          f:createdAt: {}
          f:port: {}
          f:protocol: {}
          f:serviceName: {}
          f:serviceNamespace: {}
    manager: catalog
    operation: Update
    time: "2020-08-25T14:46:11Z"
  name: hco-catalogsource
  namespace: openshift-marketplace
  resourceVersion: "66353"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/openshift-marketplace/catalogsources/hco-catalogsource
  uid: a961cc46-19fb-4021-afdf-7499fa1ff437
spec:
  displayName: KubeVirt HyperConverged
  image: quay.io/orenc/hco-container-registry:olm-webhooks
  publisher: Red Hat
  sourceType: grpc
status:
  connectionState:
    address: hco-catalogsource.openshift-marketplace.svc:50051
    lastConnect: "2020-08-25T14:46:10Z"
    lastObservedState: READY
  registryService:
    createdAt: "2020-08-25T14:35:07Z"
    port: "50051"
    protocol: grpc
    serviceName: hco-catalogsource
    serviceNamespace: openshift-marketplace
```

3. Install HCO via the UI.
4. Check that the CSV was installed successfully:
```
$ oc get csvs
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.2.0   KubeVirt HyperConverged Cluster Operator   1.2.0     kubevirt-hyperconverged-operator.v1.1.0   Succeeded
```

5. Check that the CSV includes the webhookdefinition:
```
$ oc get csvs kubevirt-hyperconverged-operator.v1.2.0 -o yaml | grep -A 10 webhookdefinition
        f:webhookdefinitions: {}
    manager: catalog
    operation: Update
    time: "2020-08-25T14:47:33Z"
  - apiVersion: operators.coreos.com/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:olm.operatorGroup: {}
          f:olm.operatorNamespace: {}
--
  webhookdefinitions:
  - admissionReviewVersions:
    - v1beta1
    - v1
    containerPort: 4343
    deploymentName: hco-operator
    failurePolicy: Ignore
    generateName: validate-hco.kubevirt.io
    rules:
    - apiGroups:
      - hco.kubevirt.io
```

6. Confirm that the validatingWebhook exists:
```
$ oc get validatingwebhookconfigurations.admissionregistration.k8s.io 
NAME                             WEBHOOKS   AGE
autoscaling.openshift.io         2          84m
machine-api                      2          84m
multus.openshift.io              1          92m
prometheusrules.openshift.io     1          83m
validate-hco.kubevirt.io-xpbpq   1          8m14s
```

7. Check for pod resets:
```
$ oc get pods
NAME                                               READY   STATUS    RESTARTS   AGE
cdi-operator-688b9b4cf8-m2jk9                      1/1     Running   0          7m45s
cluster-network-addons-operator-76c6f7979b-hsgbj   1/1     Running   0          7m46s
hco-operator-564fffcbdf-xlbld                      1/1     Running   0          7m44s
hostpath-provisioner-operator-6d6dc64d97-rjkns     1/1     Running   0          7m45s
kubevirt-ssp-operator-78dc7b868d-m8cv9             1/1     Running   0          7m46s
virt-operator-54744f7b8c-dtdvr                     1/1     Running   0          7m7s
virt-operator-54744f7b8c-rdtj5                     1/1     Running   0          7m7s
vm-import-operator-58d54fd7cc-ffggp                1/1     Running   0          7m45s
```

Comment 6 Alexander Greene 2020-08-25 15:09:50 UTC
Reran the process outlined above on a 4.5 cluster - was still unable to reproduce the issue:
```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.7     True        False         79m     Cluster version is 4.5.7

$ oc get csv
NAME                                      DISPLAY                                    VERSION   REPLACES   PHASE
kubevirt-hyperconverged-operator.v1.2.0   KubeVirt HyperConverged Cluster Operator   1.2.0                Succeeded

$ oc get csv kubevirt-hyperconverged-operator.v1.2.0 -o yaml | grep -A 10 webhookdef
        f:webhookdefinitions: {}
    manager: catalog
    operation: Update
    time: "2020-08-25T15:03:34Z"
  - apiVersion: operators.coreos.com/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:olm.operatorGroup: {}
          f:olm.operatorNamespace: {}
--
  webhookdefinitions:
  - admissionReviewVersions:
    - v1beta1
    - v1
    containerPort: 4343
    deploymentName: hco-operator
    failurePolicy: Ignore
    generateName: validate-hco.kubevirt.io
    rules:
    - apiGroups:
      - hco.kubevirt.io

$ oc get pods
NAME                                              READY   STATUS    RESTARTS   AGE
cdi-operator-67947979d6-z998n                     1/1     Running   0          2m22s
cluster-network-addons-operator-bb6c88b64-xtszs   1/1     Running   0          2m23s
hco-operator-664ff45c79-nvs5x                     1/1     Running   0          2m21s
hostpath-provisioner-operator-c4c6fdd7f-zlxg9     1/1     Running   0          2m22s
kubevirt-ssp-operator-7cff58bfd-rd9q7             1/1     Running   0          2m22s
virt-operator-5d58647dd7-d8j4j                    1/1     Running   0          106s
virt-operator-5d58647dd7-g4fqp                    1/1     Running   0          106s
vm-import-operator-645fc7bcc-t2w77                1/1     Running   0          2m22s

$ oc get validatingwebhookconfigurations.admissionregistration.k8s.io 
NAME                             WEBHOOKS   AGE
autoscaling.openshift.io         2          99m
multus.openshift.io              1          107m
validate-hco.kubevirt.io-jr5q7   1          2m28s
```

Can you please share the credentials to a cluster where the issue is present?

Thanks!

Comment 7 Alexander Greene 2020-08-26 20:10:29 UTC
Attempted to reproduce again today on a 4.5.7 cluster using the steps outlined above - the CSV was installed successfully without multiple ReplicaSets being created.

Comment 8 Oren Cohen 2020-08-26 20:14:08 UTC
It happens after you apply the HCO custom resource (just deploy the default from alm-examples); a dozen more pods will then be created.
Note that a bunch of new hco-operator ReplicaSets will be created in the kubevirt-hyperconverged namespace (oc get rs).
When the installation is completed and everything settles, check the age of the hco-operator pod and see that it is much lower than that of the other pods (see the sketch below).
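A hedged way to check this - sorting pods by creation time makes a recently recreated hco-operator pod stand out:

$ oc get pods -n kubevirt-hyperconverged --sort-by=.metadata.creationTimestamp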

If it doesn't happen, I'll share my cluster with you, where it always happens.
Thanks!

Comment 9 Alexander Greene 2020-08-26 21:44:38 UTC
Thanks @Oren, I was able to reproduce the issue after creating a HyperConverged Resource.

Comment 10 Alexander Greene 2020-08-31 15:01:06 UTC
@Oren - I wanted to provide an update on this bugzilla.

It looks as though creating a Hyperconverged CR is triggering OLM to apply an update to the operator's deployment. This happens because OLM watches resources related to the operator (Deployments, webhooks, etc.) for unexpected updates as part of OLM's "Lifecycle Management" process.

Although the CSV eventually reaches the Succeeded phase, OLM updates the deployment multiple times to get it to the correct state. Each time an update is made to a deployment, a new ReplicaSet is created; typically these resources are used for rollbacks [1]. I am speaking with my team to see if we should limit the number of ReplicaSets per deployment managed by OLM, given that OLM does not support rollbacks at this time.

I am investigating why your operator is triggering these updates to its deployment, and will report back shortly.
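As a hedged illustration of the rollout behavior described above (Deployment and namespace names as used elsewhere in this bug), each Deployment update shows up as a new rollout revision backed by its own ReplicaSet:

$ oc rollout history deployment/hco-operator -n kubevirt-hyperconverged
$ oc get rs -n kubevirt-hyperconverged \
    -o custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name,DESIRED:.spec.replicas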

Ref:
[1] https://www.weave.works/blog/how-many-kubernetes-replicasets-are-in-your-cluster-

Comment 11 Oren Cohen 2020-08-31 15:38:01 UTC
Thanks @Alex,
Note that the HCO operator triggers this only when it is configured to use OLM webhooks. When the webhook is removed (and everything else stays the same), no updates are triggered to the deployment, as you can see in the parallel image:
quay.io/orenc/hco-container-registry:without-olm-webhooks

Comment 12 Alexander Greene 2020-09-01 01:15:08 UTC
@oren

I continued to investigate the issue today. You had opened this BZ because OLM makes a number of updates to the deployment associated with your CSV when a hyperconverged CR is created while the CSV defines a webhook. OLM does this for two reasons:

1. OLM does not set the revisionHistoryLimit when creating deployments, resulting in a single deployment having up to 10 ReplicaSets associated with it. These are typically kept around for rollbacks, a concept that OLM doesn't support. I have created https://github.com/operator-framework/operator-lifecycle-manager/pull/1745 to set the deployment's revisionHistoryLimit to 1, so we can see the current and previous ReplicaSet for the deployment for debugging purposes (see the sketch after this list).
2. The HCO operator updates the validating webhook so it no longer has a namespace selector. This is not supported by OLM. OLM creates the validating webhook so it is scoped to the namespaces defined in the OperatorGroup. The HCO operator modifies the webhook, causing OLM to notice that one of the resources it deployed is not in the expected state, resulting in OLM reinstalling the operator. This happens multiple times, which is why there are so many ReplicaSets for the HCO Deployment. Please update the operator so the webhook's scope is not modified.
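For point 1, a minimal sketch of the effect of the linked PR; the manual patch below is purely illustrative (OLM sets this field itself when creating the Deployment, and would likely revert out-of-band edits):

$ oc patch deployment hco-operator -n kubevirt-hyperconverged \
    --type=merge -p '{"spec":{"revisionHistoryLimit":1}}'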

Comment 13 Alexander Greene 2020-09-01 01:16:10 UTC
2b. It is possible that the operator updates other resources defined in the CSV. As a best practice, please do not modify resources defined in the CSV.

Comment 14 Yuval Turgeman 2020-09-01 06:51:47 UTC
Alex, that was our initial thought as well, so we created a version of the HCO with a webhook implementation that doesn't modify the namespace scope, but it didn't help for this bug. Oren, please correct me if I'm wrong.

Comment 15 Oren Cohen 2020-09-01 07:04:18 UTC
Correct, I compiled an HCO operator without the code that removes the namespaceSelector, i.e. all of this code was commented out:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/apis/hco/v1beta1/hyperconverged_webhook.go#L51-L77
And the issue still persisted.
In that case:
```
Every 2.0s: oc get rs                                         ocohen.tlv.csb: Thu Aug 20 10:55:45 2020

NAME                                            DESIRED   CURRENT   READY   AGE
cdi-apiserver-7dcb77db79                        1         1         1       7m30s
cdi-deployment-7f999c755                        1         1         1       7m30s
cdi-operator-8948f6999                          1         1         1       8m42s
cdi-uploadproxy-85f76cc48b                      1         1         1       7m29s
cluster-network-addons-operator-8489fb5bb5      1         1         1       8m42s
hco-operator-575765b4d9                         0         0         0       38s
hco-operator-58684f9897                         0         0         0       3m18s
hco-operator-5dd665ddc4                         1         1         0       35s
hco-operator-67549c7dcb                         0         0         0       3m16s
hco-operator-6766794f45                         0         0         0       4m17s
hco-operator-6846bff7c9                         0         0         0       2m44s
hco-operator-685dcf8b97                         0         0         0       2m5s
hco-operator-69c7cb66d8                         0         0         0       5m1s
hco-operator-7bcc988d4c                         0         0         0       2m42s
hco-operator-7cb4bc99f4                         0         0         0       2m7s
hco-operator-7dc754cdbc                         0         0         0       80s
hostpath-provisioner-operator-75468c4cd8        1         1         1       8m42s
kubemacpool-mac-controller-manager-5699f48684   1         1         1       7m31s
kubevirt-ssp-operator-9c9d45887                 1         1         1       8m42s
nmstate-webhook-5bc9777476                      2         2         2       7m30s
virt-operator-5c4bf4d4bb                        2         2         2       8m42s
virt-template-validator-76db69664c              2         2         2       7m4s
vm-import-controller-785cb6d578                 1         1         0       7m32s
vm-import-operator-6cd957b4cd                   1         1         1       8m41s
```

```
$ oc get validatingwebhookconfigurations validate-hco.kubevirt.io-shd9j -o jsonpath='{.webhooks[*].namespaceSelector}'
map[matchLabels:map[olm.operatorgroup.uid/266b9d05-1ed6-479d-9a65-d4b540fca78e:]]
```


And indeed it allowed me to create an HCO CR in any namespace, but not with a name different from "kubevirt-hyperconverged" (i.e. the webhook worked).

Comment 16 Alexander Greene 2020-09-01 22:50:04 UTC
Following up on this BZ after looking into the issue and having a discussion with @Oren offline.

I have identified why we are observing the behavior described in this bugzilla.

Some notes:
* Using the CatalogSource @Oren provided earlier, I was able to successfully install version 1.2.0 of the HCO operator which includes the Webhook.
* The HCO CSV would remain in the succeeded phase until I created a hyperconverged CR, which caused the CSV to constantly shift between the following phases: Succeeded->Failed->Pending->InstallReady->Installing->Succeeded->Failed->repeat...

It seemed odd that OLM would transition the CSV from the Succeeded Phase to the Failed Phase based on the creation of a CR until I noticed that the HCO operator was failing its ReadinessProbe check. @Oren let me know that the HCO operator reports "ready" if and only if all of its owned custom resources have condition available true, progressing false, degraded false and upgradable true. As such, when the hyperconverged CR is created, the HCO operator no longer reports that it is ready. OLM notices that a Pod fails the ReadinessCheck and moves the CSV from the succeeded state to the Failed phase. OLM then attempts to reinstall the operator. When reinstalling the operator, OLM creates a new CA and updates the deployment's pod template with an annotation containing CA information. This process repeats itself causing a number of rapid updates to the HCO Deployment and resulting in the large number of ReplicaSets mentioned in this BZ.

Part of this issue exists in both the 1.1.0 and 1.2.0 versions of the operator - OLM will always move the HCO CSV from the Succeeded phase to the Failed phase when HCO fails its readiness check. However, in 1.2.0, OLM updates the Deployment's Pod Template with a new CA hash, making the issue more apparent.
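A hedged way to observe this cycle while the hyperconverged CR is reconciling (Deployment and namespace names as above; the exact annotation keys OLM writes may vary by version):

$ oc get csv -n kubevirt-hyperconverged -w
$ oc get deployment hco-operator -n kubevirt-hyperconverged \
    -o jsonpath='{.metadata.generation}{"\n"}{.spec.template.metadata.annotations}{"\n"}'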

Comment 17 Alexander Greene 2020-09-03 13:22:17 UTC
Based on the behavior described above, this is not a bug and OLM is working as expected. OLM could handle this operator's behavior by expanding enhancement [1] to include a "Healthy" condition that would override the check described earlier that places the CSV in the Failed phase, but this would not be available until 4.7. If this is desired, please create an RFE requesting the feature.

Ref:
[1] https://github.com/operator-framework/enhancements/blob/master/enhancements/operator-conditions.md

Comment 18 Oren Cohen 2020-09-07 12:11:07 UTC
Hi @Alex,

We understand that you're not considering this an OLM bug.
We would appreciate it if you could provide us with a workaround that lets us preserve the OLM webhook in the HCO operator while keeping the deployment from fluctuating.
Is there something we could do besides splitting the webhook service into an auxiliary pod, as we are approaching feature freeze?

Thanks

Comment 19 Dan Kenigsberg 2020-09-09 12:12:18 UTC
(In reply to Alexander Greene from comment #17)
> Based on the behavior described above, this is not a bug and OLM is working
> as expected. OLM could handle this operator's behavior by expanding
> enhancement [1] to include a "Healthy" condition that would override the
> check described above that places the operator in  earlier, but this would
> not be available until 4.7. If this is desired, please create an RFE
> requesting the feature.
> 
> Ref:
> [1]
> https://github.com/operator-framework/enhancements/blob/master/enhancements/
> operator-conditions.md

Another option that OLM may take is to be less aggressive when replacing certificates. I understand why OLM should replace a certificate of a non-ready operator, but it should do this less often. I don't think it makes sense to replace a certificate that was issued by OLM only 5 minutes ago.

Currently, OLM treats operators with webhooks in (what seems to me) an untenable fashion. If the operator is found in ready=false, the certs are immediately replaced, which restarts the operator, which typically brings it back to ready=false.

Can you add a grace period to the cert replacement procedure?

Comment 20 Evan Cordell 2020-09-09 22:39:21 UTC
> When reinstalling the operator, OLM creates a new CA and updates the deployment's pod template with an annotation containing CA information.

This behavior sounds like a bug to me - OLM shouldn't recreate the CA if it isn't in need of rotation.

Comment 21 Alexander Greene 2020-09-12 15:07:36 UTC
The OLM and HCO teams have discussed this a bit offline and I wanted to provide an update to the ticket:

* OLM will be updated so it no longer recreates the CA if it exists and has not expired. This change will prevent OLM from updating the deployment with a new CA hash, allowing the HCO operator deployment to remain up and reconcile the CR.
* The change proposed above will not prevent the HCO CSV from rotating through different phases when reconciling the hyperconverged CR. The change will allow the HCO operator deployment to remain steady and reconcile the hyperconverged CR, after which HCO will report READY and the CSV will reach the Succeeded phase. This issue will persist until HCO stops reporting that it is NotReady while reconciling the hyperconverged CR, or OLM implements the communication channel mentioned in this comment [1].
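One hedged way to verify the CA reuse once the fix lands: confirm that the webhook's caBundle stays stable while the CSV cycles through phases (the webhook name carries a generated suffix that differs per install):

$ oc get validatingwebhookconfiguration validate-hco.kubevirt.io-<suffix> \
    -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | sha256sum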

Ref:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1868712#c17

Comment 22 Alexander Greene 2020-09-14 22:29:58 UTC
I wanted to provide another update on this ticket given its urgency.

I have created a PR against OLM [1] that updates OLM so it will reuse existing Certificates. I then confirmed that the bug was addressed by:

1. Creating a CatalogSource that contains the HCO operator (provided by  Yuval) and installing via the UI:
cat <<EOF | kubectl create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/yuvalturg/hco-container-registry:with_webhook
  displayName: KubeVirt HyperConverged
  publisher: Red Hat
EOF

$ oc get csvs -n kubevirt-hyperconverged
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.2.0   KubeVirt HyperConverged Cluster Operator   1.2.0     kubevirt-hyperconverged-operator.v1.1.0   Succeeded


2. Creating a Hyperconverged CR:
$ oc get hyperconverged kubevirt-hyperconverged -n kubevirt-hyperconverged -o yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: kubevirt-hyperconverged
spec:
  infra: {}
  version: 1.2.0
  workloads: {}

3. Waiting for all pods created by HCO to rollout:
$ oc get pods -n kubevirt-hyperconverged
NAME                                                  READY   STATUS    RESTARTS   AGE
bridge-marker-2zqbp                                   1/1     Running   0          21m
bridge-marker-7tsml                                   1/1     Running   0          21m
bridge-marker-82vlf                                   1/1     Running   0          21m
bridge-marker-95fpm                                   1/1     Running   0          21m
bridge-marker-t8rl5                                   1/1     Running   0          21m
bridge-marker-wjcn9                                   1/1     Running   0          21m
cdi-apiserver-5b778f4b54-zckmv                        1/1     Running   0          20m
cdi-deployment-66bf58d759-rmpql                       1/1     Running   0          20m
cdi-operator-549cfb484c-rdj7k                         1/1     Running   0          25m
cdi-uploadproxy-857b9675cd-tvxl9                      1/1     Running   0          20m
cluster-network-addons-operator-7f6ff76fc9-4n7hd      1/1     Running   0          25m
hco-operator-b94494766-45qrf                          1/1     Running   0          25m
hostpath-provisioner-operator-f846d6bbf-zdhn2         1/1     Running   0          25m
kube-cni-linux-bridge-plugin-4dg8f                    1/1     Running   0          21m
kube-cni-linux-bridge-plugin-bn2zx                    1/1     Running   0          21m
kube-cni-linux-bridge-plugin-f57j4                    1/1     Running   0          21m
kube-cni-linux-bridge-plugin-tlqv2                    1/1     Running   0          21m
kube-cni-linux-bridge-plugin-tpswt                    1/1     Running   0          21m
kube-cni-linux-bridge-plugin-zq8fd                    1/1     Running   0          21m
kubemacpool-mac-controller-manager-866c4b4c8c-sjk2f   1/1     Running   0          21m
kubevirt-node-labeller-hn6hz                          1/1     Running   0          17m
kubevirt-node-labeller-shl8n                          1/1     Running   0          17m
kubevirt-node-labeller-tk9gg                          1/1     Running   0          17m
kubevirt-ssp-operator-58d46cbcc9-q78fj                1/1     Running   0          25m
nmstate-handler-89888                                 1/1     Running   0          20m
nmstate-handler-cc94b                                 1/1     Running   0          20m
nmstate-handler-mmjbr                                 1/1     Running   0          20m
nmstate-handler-n5nct                                 1/1     Running   0          20m
nmstate-handler-q728p                                 1/1     Running   0          20m
nmstate-handler-qcv7c                                 1/1     Running   0          20m
nmstate-webhook-768885bd4c-hvtkz                      1/1     Running   0          20m
nmstate-webhook-768885bd4c-v894l                      1/1     Running   0          20m
node-maintenance-operator-5c7bbc5bdd-bvczr            1/1     Running   0          25m
ovs-cni-amd64-68b2x                                   1/1     Running   0          20m
ovs-cni-amd64-68lb7                                   1/1     Running   0          20m
ovs-cni-amd64-9hf8w                                   1/1     Running   0          20m
ovs-cni-amd64-gps84                                   1/1     Running   0          20m
ovs-cni-amd64-lx8x2                                   1/1     Running   0          20m
ovs-cni-amd64-mv6wd                                   1/1     Running   0          20m
virt-api-7f99cd7f69-cqx5t                             1/1     Running   0          20m
virt-api-7f99cd7f69-nbrtd                             1/1     Running   0          20m
virt-controller-5859979747-qj6hp                      1/1     Running   0          19m
virt-controller-5859979747-vxcnw                      1/1     Running   0          19m
virt-handler-2268m                                    1/1     Running   0          19m
virt-handler-2hx5z                                    1/1     Running   0          19m
virt-handler-nhx72                                    1/1     Running   0          19m
virt-operator-cf749c65c-njxrw                         1/1     Running   0          23m
virt-operator-cf749c65c-zjswk                         1/1     Running   0          23m
virt-template-validator-7c6d68c58c-d7txr              1/1     Running   0          18m
virt-template-validator-7c6d68c58c-v2rz6              1/1     Running   0          18m
vm-import-controller-84dd759c94-449xm                 1/1     Running   0          20m
vm-import-operator-c574869fb-jj7sk                    1/1     Running   0          25m

4. Checking the HCO CSV:
$ oc get csv -n kubevirt-hyperconverged
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.2.0   KubeVirt HyperConverged Cluster Operator   1.2.0     kubevirt-hyperconverged-operator.v1.1.0   Succeeded


Again - HCO's CSV will rotate through many installing phases while the HCO operator is not reporting that it is READY.

Ref:
[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/1761

Comment 24 yhui 2020-09-18 16:23:01 UTC
Version:
[root@preserve-olm-env memcached-operator]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-18-002612   True        False         72m     Cluster version is 4.6.0-0.nightly-2020-09-18-002612
[root@preserve-olm-env memcached-operator]# oc exec catalog-operator-9ff69c6cf-h8dhz -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.16.1
git commit: 1dd1b464f0c72893561e40881f827e3d41d6c934


Steps to test:
1. Create a catalog source using the following bundle image:
quay.io/orenc/hco-container-registry:olm-webhooks

[root@preserve-olm-env 1868712]# cat cs.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/orenc/hco-container-registry:olm-webhooks
  displayName: Openshift Virtualization
  publisher: grpc
[root@preserve-olm-env 1868712]# oc apply -f cs.yaml

2. Install "KubeVirt HyperConverged Cluster Operator" in channel 1.2.0 using OperatorHub.

[root@preserve-olm-env 1868712]# oc get pods -n kubevirt-hyperconverged
NAME                                               READY   STATUS    RESTARTS   AGE
cdi-operator-7c6788b9-46fns                        1/1     Running   0          2m3s
cluster-network-addons-operator-5589c74d55-vdbfs   1/1     Running   0          2m4s
hco-operator-59b7596b46-7qxx4                      1/1     Running   3          2m4s
hostpath-provisioner-operator-87959576d-q9n4r      1/1     Running   0          2m3s
kubevirt-ssp-operator-5b4964cb9-x4dtm              1/1     Running   0          2m4s
virt-operator-5f4cbcc474-22ghm                     1/1     Running   0          83s
virt-operator-5f4cbcc474-fkvl9                     1/1     Running   0          83s
vm-import-operator-774874c566-fjdrb                1/1     Running   0          2m3s
[root@preserve-olm-env 1868712]# oc get rs -n kubevirt-hyperconverged
NAME                                         DESIRED   CURRENT   READY   AGE
cdi-operator-7c6788b9                        1         1         1       89s
cluster-network-addons-operator-5589c74d55   1         1         1       89s
hco-operator-59b7596b46                      1         1         1       89s
hostpath-provisioner-operator-87959576d      1         1         1       88s
kubevirt-ssp-operator-5b4964cb9              1         1         1       89s
virt-operator-5f4cbcc474                     2         2         2       89s
vm-import-operator-774874c566                1         1         1       88s

3. Create the "HyperConverged" CR (default settings).

4. Watch whether the hco-operator pod gets terminated and recreated and new ReplicaSets are created:

[root@preserve-olm-env 1868712]# oc get rs -n kubevirt-hyperconverged
NAME                                         DESIRED   CURRENT   READY   AGE
cdi-operator-7c6788b9                        1         1         1       2m52s
cluster-network-addons-operator-5589c74d55   1         1         1       2m52s
hco-operator-59b7596b46                      1         1         0       2m52s
hostpath-provisioner-operator-87959576d      1         1         1       2m51s
kubevirt-ssp-operator-5b4964cb9              1         1         1       2m52s
virt-operator-5f4cbcc474                     2         2         2       2m52s
vm-import-operator-774874c566                1         1         1       2m51s

There are no duplicated "dead" ReplicaSets. 

But checking the pod status, the hco-operator pod is in CrashLoopBackOff, and the CSV is Installing.
[root@preserve-olm-env 1868712]# oc get pods -n kubevirt-hyperconverged
NAME                                               READY   STATUS             RESTARTS   AGE
cdi-operator-7c6788b9-46fns                        1/1     Running            0          2m29s
cluster-network-addons-operator-5589c74d55-vdbfs   1/1     Running            0          2m30s
hco-operator-59b7596b46-7qxx4                      0/1     CrashLoopBackOff   3          2m30s
hostpath-provisioner-operator-87959576d-q9n4r      1/1     Running            0          2m29s
kubevirt-ssp-operator-5b4964cb9-x4dtm              1/1     Running            0          2m30s
virt-operator-5f4cbcc474-22ghm                     1/1     Running            0          109s
virt-operator-5f4cbcc474-fkvl9                     1/1     Running            0          109s
vm-import-operator-774874c566-fjdrb                1/1     Running            0          2m29s
[root@preserve-olm-env 1868712]# oc get csvs -n kubevirt-hyperconverged
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.2.0   KubeVirt HyperConverged Cluster Operator   1.2.0     kubevirt-hyperconverged-operator.v1.1.0   Installing

Is this the expected behavior?

Comment 25 yhui 2020-09-18 16:38:57 UTC
Describe the pods.
[root@preserve-olm-env 1868712]# oc describe pods hco-operator-969448ffd-5lmll -n kubevirt-hyperconverged
...
Events:
  Type     Reason          Age                            From                                                 Message
  ----     ------          ----                           ----                                                 -------
  Normal   Scheduled       <invalid>                                                                           Successfully assigned kubevirt-hyperconverged/hco-operator-969448ffd-5lmll to ip-10-0-147-194.us-east-2.compute.internal
  Normal   AddedInterface  <invalid>                      multus                                               Add eth0 [10.129.2.15/23]
  Normal   Pulling         <invalid>                      kubelet, ip-10-0-147-194.us-east-2.compute.internal  Pulling image "quay.io/kubevirt/hyperconverged-cluster-operator:1.2.0"
  Normal   Pulled          <invalid>                      kubelet, ip-10-0-147-194.us-east-2.compute.internal  Successfully pulled image "quay.io/kubevirt/hyperconverged-cluster-operator:1.2.0" in 27.669655327s
  Normal   Init            <invalid>                      kubevirt-hyperconverged                              Starting the HyperConverged Pod
  Normal   Init            <invalid>                      kubevirt-hyperconverged                              Starting the HyperConverged Pod
  Warning  Unhealthy       <invalid>                      kubelet, ip-10-0-147-194.us-east-2.compute.internal  Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of b8c31b85f2d43ead0e028fd39b16efea8be1dbe9ea3d292643ddce5289efe7d1 is running failed: container process not found
  Normal   Init            <invalid>                      kubevirt-hyperconverged                              Starting the HyperConverged Pod
  Normal   Init            <invalid>                      kubevirt-hyperconverged                              Starting the HyperConverged Pod
  Warning  BackOff         <invalid> (x7 over <invalid>)  kubelet, ip-10-0-147-194.us-east-2.compute.internal  Back-off restarting failed container
  Normal   Pulled          <invalid> (x4 over <invalid>)  kubelet, ip-10-0-147-194.us-east-2.compute.internal  Container image "quay.io/kubevirt/hyperconverged-cluster-operator:1.2.0" already present on machine
  Normal   Started         <invalid> (x5 over <invalid>)  kubelet, ip-10-0-147-194.us-east-2.compute.internal  Started container hyperconverged-cluster-operator
  Normal   Created         <invalid> (x5 over <invalid>)  kubelet, ip-10-0-147-194.us-east-2.compute.internal  Created container hyperconverged-cluster-operator
  Normal   Init            <invalid>                      kubevirt-hyperconverged                              Starting the HyperConverged Pod
  Warning  Unhealthy       <invalid>                      kubelet, ip-10-0-147-194.us-east-2.compute.internal  Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ca90104aec6c9e842b187060b0467097cbe3414480b60bfbb9e6a7334b7f9288 is running failed: open /proc/66514/stat: no such file or directory: container process not found

It is complaining "Readiness probe errored".

Comment 26 yhui 2020-09-18 16:46:09 UTC
Try the procedure in the Comment 22.

1. Creating a CatalogSource that contains the HCO operator (provided by  Yuval):
cat <<EOF | kubectl create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/yuvalturg/hco-container-registry:with_webhook
  displayName: KubeVirt HyperConverged
  publisher: Red Hat
EOF

But the pod is in ImagePullBackOff since the image "quay.io/yuvalturg/hco-container-registry:with_webhook" does not exist.
[root@preserve-olm-env 1868712]# oc get pods -n openshift-marketplace
NAME                                   READY   STATUS             RESTARTS   AGE
certified-operators-kfwfx              1/1     Running            0          41m
certified-operators-thkpc              1/1     Running            0          11m
community-operators-d9xdd              1/1     Running            0          11m
community-operators-svs5z              1/1     Running            0          43m
hco-catalogsource-4rgvj                0/1     ImagePullBackOff   0          102s
marketplace-operator-75fd49579-5q7nr   1/1     Running            0          43m
qe-app-registry-64kpb                  1/1     Running            0          11m
qe-app-registry-zvprt                  1/1     Running            0          41m
redhat-marketplace-s9f5c               1/1     Running            0          11m
redhat-marketplace-wqhlp               1/1     Running            0          41m
redhat-operators-2z95d                 1/1     Running            0          11m
redhat-operators-b5wfl                 1/1     Running            0          41m

Checking the repo on quay.io, this image tag does not exist: https://quay.io/repository/yuvalturg/hco-container-registry?tab=tags.

Comment 27 Oren Cohen 2020-09-18 17:28:33 UTC
Hi @yhui,
Please try this index image for the catalog source:
registry-proxy.engineering.redhat.com/rh-osbs/iib:12786

It's the most recent and stable downstream bundle.
The HCO pod should not crash-loop in this version.

Thanks

Comment 30 Oren Cohen 2020-09-22 14:04:54 UTC
Verified the fix on 4.6.0-fc.7.
HCO and NMO pods are no longer interrupted; no new ReplicaSets are being created.

The HCO's validating webhook works as expected on DELETE, but it is now not blocking CREATE requests for the CR on other namespaces.

You can reproduce the verification by using
registry-proxy.engineering.redhat.com/rh-osbs/iib:12786
as a CatalogSource.spec.image
and applying the ICSP generated by running:
oc adm catalog mirror registry-proxy.engineering.redhat.com/rh-osbs/iib:12786 registry-proxy.engineering.redhat.com/rh-osbs --manifests-only
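With --manifests-only, the command only writes manifests locally; the generated ImageContentSourcePolicy then has to be applied (the output directory name below is illustrative):

oc apply -f <generated-manifests-dir>/imageContentSourcePolicy.yaml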

--------

[ocohen@ocohen ~]$ oc get rs
NAME                                            DESIRED   CURRENT   READY   AGE
cdi-apiserver-f5fb5fc84                         1         1         1       2m36s
cdi-deployment-845766b7f5                       1         1         1       2m36s
cdi-operator-5f7dfc7dbb                         1         1         1       60m
cdi-uploadproxy-687c79ff74                      1         1         1       2m36s
cluster-network-addons-operator-75b6b8cbd4      1         1         1       60m
hco-operator-7778bc866b                         1         1         1       60m
hostpath-provisioner-operator-6d9f764f7d        1         1         1       60m
kubemacpool-mac-controller-manager-787d89f9b5   1         1         1       2m48s
kubevirt-ssp-operator-58dc49dfdb                1         1         1       60m
nmstate-webhook-69c95dcc67                      2         2         2       2m48s
node-maintenance-operator-577548cdf9            1         1         1       60m
virt-api-c9bb8c459                              2         2         2       2m24s
virt-controller-7f5554b5c5                      2         2         2       119s
virt-operator-66fcd95b8                         2         2         2       60m
virt-template-validator-6449f67f6d              2         2         2       96s
vm-import-controller-d5fcd9646                  1         1         1       2m35s
vm-import-operator-7d448954fd                   1         1         1       60m
[ocohen@ocohen ~]$ 
[ocohen@ocohen ~]$ 
[ocohen@ocohen ~]$ oc get csv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.5.0   OpenShift Virtualization   2.5.0     kubevirt-hyperconverged-operator.v2.4.1   Succeeded
[ocohen@ocohen ~]$ 
[ocohen@ocohen ~]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.7   True        False         3h36m   Cluster version is 4.6.0-fc.7

Comment 31 Alexander Greene 2020-09-22 15:56:17 UTC
Hello Oren,

OLM is behaving as expected in accordance with its multitenancy strategy implemented with OperatorGroups [1]. When OLM installs an operator, it scopes its RBAC and webhooks to apply only to namespaces included in the OperatorGroup; given that your operator is scoped to a single namespace, this behavior is intended. If you want the webhook to intercept all resources on the cluster, the operator must only support the `AllNamespaces` installMode. A minimal sketch follows below.
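For reference, a minimal sketch of an OperatorGroup that selects all namespaces (the operator's CSV would also need to declare the AllNamespaces installMode as supported; the name below is illustrative):

cat <<EOF | kubectl create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: global-operatorgroup
  namespace: kubevirt-hyperconverged
spec: {}
EOF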

REF:
[1] https://olm.operatorframework.io/docs/advanced-tasks/operator-scoping-with-operatorgroups/

Comment 32 Jian Zhang 2020-09-23 00:56:47 UTC
*** Bug 1874938 has been marked as a duplicate of this bug. ***

Comment 33 yhui 2020-09-24 16:29:20 UTC
Version:
[root@preserve-olm-env ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-22-011738   True        False         151m    Cluster version is 4.6.0-0.nightly-2020-09-22-011738
[root@preserve-olm-env ~]# oc exec olm-operator-6f68947db5-pb2nc -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.16.1
git commit: 026fa7a609b57f740b4873522eb283f0a5f11d04

Test procedure:
1. Create a catalog source using the following bundle image:
registry-proxy.engineering.redhat.com/rh-osbs/iib:12786

[root@preserve-olm-env 1868712]# cat cs.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry-proxy.engineering.redhat.com/rh-osbs/iib:12786
  displayName: Openshift Virtualization
  publisher: grpc

But the CatalogSource pod is in ImagePullBackOff.
[root@preserve-olm-env 1868712]# oc get pods -n openshift-marketplace
NAME                                                              READY   STATUS             RESTARTS   AGE
...
hco-catalogsource-q7rgr                                           0/1     ImagePullBackOff   0          17m
...
[root@preserve-olm-env 1868712]# oc describe pods hco-catalogsource-5vlfx -n openshift-marketplace
Events:
  Type     Reason          Age                            From               Message
  ----     ------          ----                           ----               -------
  Normal   Scheduled       <invalid>                      default-scheduler  Successfully assigned openshift-marketplace/hco-catalogsource-5vlfx to rbrattai-o46g4a-8ks9s-worker-b-kqzpx.c.openshift-qe.internal
  Normal   AddedInterface  <invalid>                      multus             Add eth0 [10.129.2.5/23]
  Normal   Pulling         <invalid> (x3 over <invalid>)  kubelet            Pulling image "registry-proxy.engineering.redhat.com/rh-osbs/iib:12786"
  Warning  Failed          <invalid> (x3 over <invalid>)  kubelet            Failed to pull image "registry-proxy.engineering.redhat.com/rh-osbs/iib:12786": rpc error: code = Unknown desc = error pinging docker registry registry-proxy.engineering.redhat.com: Get "https://registry-proxy.engineering.redhat.com/v2/": dial tcp: lookup registry-proxy.engineering.redhat.com on 169.254.169.254:53: no such host
  Warning  Failed          <invalid> (x3 over <invalid>)  kubelet            Error: ErrImagePull
  Normal   BackOff         <invalid> (x4 over <invalid>)  kubelet            Back-off pulling image "registry-proxy.engineering.redhat.com/rh-osbs/iib:12786"
  Warning  Failed          <invalid> (x4 over <invalid>)  kubelet            Error: ImagePullBackOff

Maybe some environment variables need to be set. Could you please help provide the info? Thanks.

Comment 35 Oren Cohen 2020-09-24 21:21:16 UTC
Is this cluster running on the AWS cloud?
The registry server I mentioned is accessible from the RH internal network.

@Simone, could you please update the 1.2.0 HCO bundle image upstream, so @Hui can deploy HCO from the cloud cluster?
https://quay.io/repository/kubevirt/hco-container-registry?tag=latest&tab=tags

Thanks

Comment 36 yhui 2020-09-25 08:47:28 UTC
Version:
[root@preserve-olm-env 1868712]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-24-235241   True        False         3h44m   Cluster version is 4.6.0-0.nightly-2020-09-24-235241
[root@preserve-olm-env 1868712]# oc exec olm-operator-87697bdc8-4lqj8 -n openshift-operator-lifecycle-manager -- olm --version
OLM version: 0.16.1
git commit: be18debfbb25ba768921373d48aa761b951eca59



Steps to test:
1. Create a catalog source using the following bundle image:
quay.io/kubevirt/hco-container-registry:latest

[root@preserve-olm-env 1868712]# cat cs.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/kubevirt/hco-container-registry:latest
  displayName: Openshift Virtualization
  publisher: grpc
[root@preserve-olm-env 1868712]# oc apply -f cs.yaml


2. Install "KubeVirt HyperConverged Cluster Operator" in channel 1.2.0 using OperatorHub.

Check the csv, pods, rs status.
[root@preserve-olm-env 1868712]# oc get csv -n kubevirt-hyperconverged
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.2.0   KubeVirt HyperConverged Cluster Operator   1.2.0     kubevirt-hyperconverged-operator.v1.1.0   Succeeded
[root@preserve-olm-env 1868712]# oc get pods -n kubevirt-hyperconverged
NAME                                               READY   STATUS    RESTARTS   AGE
cdi-operator-7c6788b9-46fns                        1/1     Running   0          2m3s
cluster-network-addons-operator-5589c74d55-vdbfs   1/1     Running   0          2m4s
hco-operator-59b7596b46-7qxx4                      1/1     Running   3          2m4s
hostpath-provisioner-operator-87959576d-q9n4r      1/1     Running   0          2m3s
kubevirt-ssp-operator-5b4964cb9-x4dtm              1/1     Running   0          2m4s
virt-operator-5f4cbcc474-22ghm                     1/1     Running   0          83s
virt-operator-5f4cbcc474-fkvl9                     1/1     Running   0          83s
vm-import-operator-774874c566-fjdrb                1/1     Running   0          2m3s
[root@preserve-olm-env 1868712]# oc get rs -n kubevirt-hyperconverged
NAME                                         DESIRED   CURRENT   READY   AGE
cdi-operator-7c6788b9                        1         1         1       89s
cluster-network-addons-operator-5589c74d55   1         1         1       89s
hco-operator-59b7596b46                      1         1         1       89s
hostpath-provisioner-operator-87959576d      1         1         1       88s
kubevirt-ssp-operator-5b4964cb9              1         1         1       89s
virt-operator-5f4cbcc474                     2         2         2       89s
vm-import-operator-774874c566                1         1         1       88s


3. Create the "HyperConverged" CR (default settings).
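
A minimal HyperConverged CR sketch (the API version is assumed to be hco.kubevirt.io/v1beta1 for HCO 1.2.0; older bundles used v1alpha1):

$ cat hco.yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: kubevirt-hyperconverged
spec: {}
$ oc apply -f hco.yaml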


4. Watch whether the hco-operator pod gets terminated and re-created and whether new ReplicaSets are created (see the commands after the listing below):
[root@preserve-olm-env 1868712]# oc get rs -n kubevirt-hyperconverged
NAME                                            DESIRED   CURRENT   READY   AGE
cdi-apiserver-8ffb6b454                         1         1         1       3m6s
cdi-deployment-5c6d7bbf8d                       1         1         1       3m5s
cdi-operator-854dc77564                         1         1         1       9m26s
cdi-uploadproxy-77c59668cd                      1         1         1       3m5s
cluster-network-addons-operator-5d6b5cd449      1         1         1       9m26s
hco-operator-594df8cc85                         1         1         1       9m27s
hostpath-provisioner-operator-5796fb569b        1         1         1       9m25s
kubemacpool-mac-controller-manager-587b75fd5b   1         1         1       3m15s
kubevirt-ssp-operator-7989d9c5fb                1         1         1       9m26s
nmstate-webhook-7dc944bfb5                      2         2         2       3m14s
node-maintenance-operator-59b89c5468            1         1         1       9m26s
virt-api-85b755d986                             2         2         2       2m58s
virt-controller-7cf77fdd56                      2         2         2       2m24s
virt-operator-9f6f5b6d8                         2         2         2       9m26s
virt-template-validator-7bd44488c6              2         2         2       2m1s
vm-import-controller-75f8d65c6                  1         1         1       2m55s
vm-import-operator-59469c459d                   1         1         1       9m25s

There are no duplicate "dead" ReplicaSets (ReplicaSets scaled down to 0 desired replicas).
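
To watch for the churn described in this bug while the CR reconciles, and to list any ReplicaSets that have been scaled down to 0 desired replicas, something like the following works (DESIRED is the second column of the oc get rs output):

$ oc get rs -n kubevirt-hyperconverged -w
$ oc get rs -n kubevirt-hyperconverged --no-headers | awk '$2 == 0'

With the fix in place, the second command should print nothing.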

5. Check the pod status; all pods are running.
[root@preserve-olm-env 1868712]# oc get pods -n kubevirt-hyperconverged
NAME                                                  READY   STATUS    RESTARTS   AGE
bridge-marker-hmpbk                                   1/1     Running   0          7m28s
bridge-marker-hwn9j                                   1/1     Running   0          7m27s
bridge-marker-hx28n                                   1/1     Running   0          7m27s
bridge-marker-rl5m8                                   1/1     Running   0          7m27s
bridge-marker-vn2v4                                   1/1     Running   0          7m27s
bridge-marker-x9t2x                                   1/1     Running   0          7m27s
cdi-apiserver-8ffb6b454-jrhp6                         1/1     Running   0          7m18s
cdi-deployment-5c6d7bbf8d-sq8lb                       1/1     Running   0          7m18s
cdi-operator-854dc77564-cxj8s                         1/1     Running   0          13m
cdi-uploadproxy-77c59668cd-bk4jm                      1/1     Running   0          7m18s
cluster-network-addons-operator-5d6b5cd449-8g8gf      1/1     Running   0          13m
hco-operator-594df8cc85-ljpqr                         1/1     Running   0          13m
hostpath-provisioner-operator-5796fb569b-qtskl        1/1     Running   0          13m
kube-cni-linux-bridge-plugin-cncmt                    1/1     Running   0          7m29s
kube-cni-linux-bridge-plugin-fz98j                    1/1     Running   0          7m28s
kube-cni-linux-bridge-plugin-khkcg                    1/1     Running   0          7m29s
kube-cni-linux-bridge-plugin-p9ht6                    1/1     Running   0          7m29s
kube-cni-linux-bridge-plugin-rqdrf                    1/1     Running   0          7m28s
kube-cni-linux-bridge-plugin-x6z6b                    1/1     Running   0          7m28s
kubemacpool-mac-controller-manager-587b75fd5b-b4mgq   1/1     Running   0          7m28s
kubevirt-node-labeller-nmqwv                          1/1     Running   0          6m14s
kubevirt-node-labeller-wf7tl                          1/1     Running   0          6m14s
kubevirt-node-labeller-xx9gp                          1/1     Running   0          6m14s
kubevirt-ssp-operator-7989d9c5fb-9g8q9                1/1     Running   0          13m
nmstate-handler-9nznf                                 1/1     Running   0          7m26s
nmstate-handler-md2dw                                 1/1     Running   0          7m26s
nmstate-handler-n26l8                                 1/1     Running   0          7m26s
nmstate-handler-nnpq6                                 1/1     Running   0          7m26s
nmstate-handler-p4dz7                                 1/1     Running   0          7m26s
nmstate-handler-tz7pf                                 1/1     Running   0          7m26s
nmstate-webhook-7dc944bfb5-dk5pv                      1/1     Running   0          7m25s
nmstate-webhook-7dc944bfb5-tk58s                      1/1     Running   0          7m25s
node-maintenance-operator-59b89c5468-842lp            1/1     Running   0          13m
ovs-cni-amd64-2qffd                                   1/1     Running   0          7m24s
ovs-cni-amd64-d9xpb                                   1/1     Running   0          7m25s
ovs-cni-amd64-ddkzs                                   1/1     Running   0          7m24s
ovs-cni-amd64-k7ghr                                   1/1     Running   0          7m24s
ovs-cni-amd64-kjhgv                                   1/1     Running   0          7m24s
ovs-cni-amd64-mkb7h                                   1/1     Running   0          7m25s
virt-api-85b755d986-9mzfx                             1/1     Running   0          7m11s
virt-api-85b755d986-wsgsq                             1/1     Running   0          7m11s
virt-controller-7cf77fdd56-cfcqm                      1/1     Running   0          6m37s
virt-controller-7cf77fdd56-x6vjp                      1/1     Running   0          6m37s
virt-handler-5c27r                                    1/1     Running   0          6m36s
virt-handler-78h5q                                    1/1     Running   0          6m37s
virt-handler-h6w5w                                    1/1     Running   0          6m36s
virt-operator-9f6f5b6d8-6m5v9                         1/1     Running   0          11m
virt-operator-9f6f5b6d8-t62l2                         1/1     Running   0          11m
virt-template-validator-7bd44488c6-tfjrr              1/1     Running   0          6m14s
virt-template-validator-7bd44488c6-wdq2s              1/1     Running   0          6m14s
vm-import-controller-75f8d65c6-f87xx                  1/1     Running   0          7m8s
vm-import-operator-59469c459d-z56n5                   1/1     Running   0          13m


6. Check the CSV status; it is Succeeded.
[root@preserve-olm-env 1868712]# oc get csv -n kubevirt-hyperconverged
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v1.2.0   KubeVirt HyperConverged Cluster Operator   1.2.0     kubevirt-hyperconverged-operator.v1.1.0   Succeeded

Verified the bug on 4.6.
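
One additional sanity check (a sketch, assuming the hco-operator Deployment name shown above): the Deployment's rollout history should contain a single revision, whereas with the original bug each new webhook hash triggered another revision and another ReplicaSet.

$ oc rollout history deployment/hco-operator -n kubevirt-hyperconverged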

Comment 39 errata-xmlrpc 2020-10-27 16:28:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

