Description of problem:

In the HCO openshift-ci tests, we are facing several different issues. Some of them may be related to OLM.

1. An upgrade test on AWS (not happening on Azure): the test installs the previous version, updates to the PR version, and then tries to remove the namespace. This namespace is protected by the HCO webhook, so the removal should be rejected. However, the error we get is not the rejection error but this:

Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")

For example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/991/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-azure/1336649438625533952

2. Happens in several upgrade tests, in several environments: after subscribing to the new CSV, nothing happens for a long time, until the test times out. The deployment stays at its previous version.

For example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/993/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-azure/1336971505581428736
Please provide additional information on installation steps of the HCO operator.
Additionally, please provide the steps to run the CI test.
Can we separate these issues out into separate bugzillas with explicit replication steps? There's not a whole lot that is actionable here from our side. My first step for these would be to confirm that the webhook and the update catalog are configured correctly, but if we uncover underlying bugs while doing that, we would probably want to create a bz for each of them.
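For the catalog half of that configuration check, one option is to read the connection state that OLM records in the CatalogSource status. The following is only a minimal sketch: `catalog_state` and `catalog_is_ready` are hypothetical helper names, not part of the HCO test scripts, and it assumes `oc` is logged in to the cluster under test.

```shell
# Hypothetical helper: print the gRPC connection state that OLM records
# in the CatalogSource status (for example READY or CONNECTING).
# $1 = CatalogSource name, $2 = namespace
catalog_state() {
  oc get catalogsource "$1" -n "$2" \
    -o jsonpath='{.status.connectionState.lastObservedState}'
}

# Hypothetical helper: succeed only when the catalog reports READY.
catalog_is_ready() {
  [ "$(catalog_state "$1" "$2")" = "READY" ]
}
```

It could be called as `catalog_is_ready <catalogsource-name> openshift-marketplace || echo "catalog is not ready"` before creating or patching the Subscription.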
Here is where the test installs the "old" version:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L165-L177

And here is where the test updates the CSV version for upgrade:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L215
Opened a new bug for the certificate issue: https://bugzilla.redhat.com/show_bug.cgi?id=1907290
Another similar issue:

> Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": service "hco-webhook-service" not found

Example:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/997/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-aws/1338405300503318528
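To tell a missing Service apart from a Service that exists but has no backing pod, a test could check both before trying to delete the namespace. This is a rough sketch only: `service_has_endpoints` is a hypothetical helper, not something the HCO test suite is known to contain, and it assumes `oc` is logged in to the cluster under test.

```shell
# Hypothetical helper: succeed only if the Service exists and has at least
# one ready endpoint address behind it.
# $1 = Service name, $2 = namespace
service_has_endpoints() {
  # A missing Service makes "oc get" exit non-zero.
  oc get service "$1" -n "$2" >/dev/null 2>&1 || return 1
  # Collect the IPs of ready endpoint addresses; empty means no ready backend.
  addrs=$(oc get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  [ -n "$addrs" ]
}
```

For the failure above, the call would be `service_has_endpoints hco-webhook-service kubevirt-hyperconverged`.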
Script to deploy and upgrade HCO:

Please notice that the index image used in this script causes the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1907417. The issue is in the 1.3.0 channel, so it causes the upgrade to fail. To work around it, after the subscription to the new image, manually remove the annotations field from the cluster-network-addons-operator template. The current deployment in the CSV starts with (the lines to remove are marked with ========>):

>         - name: cluster-network-addons-operator
>           spec:
>             replicas: 1
>             selector:
>               matchLabels:
>                 name: cluster-network-addons-operator
>             strategy:
>               type: Recreate
>             template:
>               metadata:
========>         annotations:
========>           description: cluster-network-addons-operator manages the lifecycle of different Kubernetes network components on top of Kubernetes cluster
>                 labels:
>                   name: cluster-network-addons-operator

After the script completes in openshift-ci, one of the following happens at random:
1. The upgrade does not start. After 15 minutes, the deployments are still from the previous version.
2. Certificate issue when trying to remove the kubevirt-hyperconverged namespace: the removal should be blocked by hco-webhook. This error is described in https://bugzilla.redhat.com/show_bug.cgi?id=1907290
3. hco webhook service not found, during the same use case as in #2.

================= The script ======================
oc create ns kubevirt-hyperconverged || true

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: hco-operatorgroup
  namespace: kubevirt-hyperconverged
spec:
  targetNamespaces:
  - kubevirt-hyperconverged
EOF

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource-example
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/nunnatsa/hyperconverged-cluster-index:1.3.0
  displayName: KubeVirt HyperConverged
  publisher: Red Hat
EOF

sleep 15

# Wait for the catalog source pod and the OLM pods to become ready
HCO_CATALOGSOURCE_POD=$(oc get pods -n openshift-marketplace | grep hco-catalogsource | head -1 | awk '{ print $1 }')
oc wait pod "${HCO_CATALOGSOURCE_POD}" --for condition=Ready -n openshift-marketplace --timeout="120s"

CATALOG_OPERATOR_POD=$(oc get pods -n openshift-operator-lifecycle-manager | grep catalog-operator | head -1 | awk '{ print $1 }')
oc wait pod "${CATALOG_OPERATOR_POD}" --for condition=Ready -n openshift-operator-lifecycle-manager --timeout="120s"

PACKAGESERVER_POD=$(oc get pods -n openshift-operator-lifecycle-manager | grep packageserver | head -1 | awk '{ print $1 }')
oc wait pod "${PACKAGESERVER_POD}" --for condition=Ready -n openshift-operator-lifecycle-manager --timeout="120s"

sleep 15

# Subscribe to the previous version (channel 1.2.0)
cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: hco-subscription-example
  namespace: kubevirt-hyperconverged
spec:
  channel: 1.2.0
  name: kubevirt-hyperconverged
  source: hco-catalogsource-example
  sourceNamespace: openshift-marketplace
  config:
    selector:
      matchLabels:
        name: hyperconverged-cluster-operator
    env:
    - name: KVM_EMULATION
      value: "true"
EOF

oc wait deployment hco-operator --for condition=Available -n kubevirt-hyperconverged --timeout="1200s"

# Deploy HyperConverged CR
cat <<EOF | oc create -n kubevirt-hyperconverged -f -
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
spec:
  infra: {}
  workloads: {}
EOF

oc wait -n kubevirt-hyperconverged HyperConverged kubevirt-hyperconverged --for condition=Available --timeout=30m
oc wait -n kubevirt-hyperconverged deployment hco-operator --for condition=Available --timeout="30m"
oc wait -n kubevirt-hyperconverged deployment hco-webhook --for condition=Available --timeout="30m"

# Perform the upgrade
oc patch -n kubevirt-hyperconverged subscription hco-subscription-example -p "{\"spec\": {\"channel\": \"1.3.0\"}}" --type merge
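
To make the "upgrade does not start" case easier to detect than by watching the deployments, the script could poll the CSV that the Subscription reports as installed. This is a minimal sketch under stated assumptions: `installed_csv` and `wait_for_new_csv` are hypothetical helpers that are not part of the script above, and the attempt count and interval are arbitrary.

```shell
# Hypothetical helper: print the CSV that the Subscription reports as installed.
# $1 = Subscription name, $2 = namespace
installed_csv() {
  oc get subscription "$1" -n "$2" -o jsonpath='{.status.installedCSV}'
}

# Hypothetical helper: poll until the installed CSV differs from the old one.
# $1 = Subscription name, $2 = namespace, $3 = CSV name before the patch,
# $4 = number of attempts, $5 = seconds to sleep between attempts
wait_for_new_csv() {
  attempts=$4
  while [ "$attempts" -gt 0 ]; do
    current=$(installed_csv "$1" "$2")
    if [ -n "$current" ] && [ "$current" != "$3" ]; then
      echo "$current"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$5"
  done
  echo "installed CSV is still $3" >&2
  return 1
}
```

One way to use it would be to capture `OLD_CSV=$(installed_csv hco-subscription-example kubevirt-hyperconverged)` just before the `oc patch` command, and then run `wait_for_new_csv hco-subscription-example kubevirt-hyperconverged "${OLD_CSV}" 90 10` right after it.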
With the certificate issue now being tracked by a separate bug, circling back around to the other issue described. That appears to be a duplicate of this recently closed bug: https://bugzilla.redhat.com/show_bug.cgi?id=1907381

I'm going to close this as a duplicate. If any problems persist, please feel free to reach back out or reopen.

*** This bug has been marked as a duplicate of bug 1907381 ***