Bug 1906437 - Several openshift-ci issues that may be related to OLM
Summary: Several openshift-ci issues that may be related to OLM
Status: CLOSED DUPLICATE of bug 1907381
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: Evan Cordell
QA Contact: Jian Zhang
Depends On:
Reported: 2020-12-10 14:12 UTC by Nahshon Unna-Tsameret
Modified: 2021-04-06 04:40 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-01-11 19:25:24 UTC
Target Upstream Version:

Description Nahshon Unna-Tsameret 2020-12-10 14:12:00 UTC
Description of problem:
In the HCO openshift-ci tests, we are facing several different issues. Some of them may be related to OLM.

1. An upgrade test on AWS (not happening on Azure): the test installs the previous version, upgrades to the PR version, and then tries to remove the namespace. This namespace is protected by the HCO webhook, so the request should be rejected, but instead of the rejection error we get this:

Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")

For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/991/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-azure/1336649438625533952

2. Happens in several upgrade tests, in several environments: after subscribing to the new CSV, nothing happens for a long time, until the test times out. The deployment stays at its previous version.

For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/993/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-azure/1336971505581428736

Comment 1 Alexander Greene 2020-12-10 14:21:42 UTC
Please provide additional information on installation steps of the HCO operator.

Comment 2 Alexander Greene 2020-12-10 14:24:28 UTC
Additionally, please provide the steps to run the CI test.

Comment 3 Kevin Rizza 2020-12-10 16:17:02 UTC
Can we separate these issues out into separate bugzillas with explicit replication steps? There's not a whole lot that is actionable here from our side. My first step for these would be to confirm that the webhook and the update catalog are configured correctly, but if we uncover underlying bugs while doing that, we would probably want to create a separate BZ for each of them.

Comment 5 Nahshon Unna-Tsameret 2020-12-14 06:31:35 UTC
Opened a new bug for the certificate issue: https://bugzilla.redhat.com/show_bug.cgi?id=1907290

Comment 6 Nahshon Unna-Tsameret 2020-12-14 09:58:24 UTC
Another similar issue: 

> Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": service "hco-webhook-service" not found

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/997/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-aws/1338405300503318528
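
For triage in CI logs, the two webhook failures reported so far can be told apart by matching on the error text quoted above. This is only a minimal sketch: the function name and the labels it prints are hypothetical, and the expected rejection message from the HCO webhook is not quoted in this bug, so it falls into "other".

```shell
# Sketch: map the stderr of a failed namespace deletion to a symptom label.
# The function name and the labels are hypothetical, chosen for illustration.
classify_ns_delete_error() {
  case "$1" in
    *"certificate signed by unknown authority"*)
      echo "webhook-cert-untrusted" ;;
    *'service "hco-webhook-service" not found'*)
      echo "webhook-service-missing" ;;
    *"failed calling webhook"*)
      echo "webhook-call-failed" ;;
    *)
      # Includes the expected rejection, whose text is not quoted in this bug.
      echo "other" ;;
  esac
}
```

It could be used as: err=$(oc delete ns kubevirt-hyperconverged 2>&1 >/dev/null); classify_ns_delete_error "$err"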

Comment 7 Nahshon Unna-Tsameret 2020-12-14 15:07:07 UTC
Script to deploy and upgrade HCO:

Please note that the index image used in this script triggers the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1907417. The issue is in the 1.3.0 channel, so it causes the upgrade to fail. To work around it, after subscribing to the new version, manually remove the annotations field from the cluster-network-addons-operator template.

The current deployment in the csv starts with:
>       - name: cluster-network-addons-operator
>         spec:
>           replicas: 1
>           selector:
>             matchLabels:
>               name: cluster-network-addons-operator
>           strategy:
>             type: Recreate
>           template:
>             metadata:
========>       annotations:
========>         description: cluster-network-addons-operator manages the lifecycle of different Kubernetes network components on top of Kubernetes cluster
>               labels:
>                 name: cluster-network-addons-operator
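
The manual workaround (removing the annotations field from the cluster-network-addons-operator template) could be scripted roughly as follows. This is a sketch under assumptions: it assumes the edit is made on the CSV object itself, the helper name is hypothetical, and the CSV name has to be supplied by the caller because it is not given in this bug. The deployment index is looked up by name rather than hard-coded.

```shell
# Sketch: remove the pod-template annotations of one deployment inside a CSV.
# Hypothetical helper; usage:
#   remove_csv_template_annotations <csv-name> <deployment-name> [namespace]
remove_csv_template_annotations() {
  local csv="$1" deployment="$2" ns="${3:-kubevirt-hyperconverged}"
  local names idx

  # List the deployment names in the CSV, one per line, in array order.
  names=$(oc get csv "$csv" -n "$ns" \
    -o jsonpath='{range .spec.install.spec.deployments[*]}{.name}{"\n"}{end}')

  # Zero-based position of the requested deployment in that list.
  idx=$(printf '%s\n' "$names" | awk -v d="$deployment" '$0 == d { print NR - 1; exit }')
  if [ -z "$idx" ]; then
    echo "deployment $deployment not found in CSV $csv" >&2
    return 1
  fi

  # JSON patch "remove" fails if the annotations field does not exist.
  oc patch csv "$csv" -n "$ns" --type json \
    -p "[{\"op\": \"remove\", \"path\": \"/spec/install/spec/deployments/${idx}/spec/template/metadata/annotations\"}]"
}
```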

After the script completes in openshift-ci, one of the following happens at random:
1. The upgrade does not start. After 15 minutes, the deployments are still from the previous version.
2. A certificate error occurs when trying to remove the kubevirt-hyperconverged namespace, which should be blocked by hco-webhook. This error is described in https://bugzilla.redhat.com/show_bug.cgi?id=1907290
3. The HCO webhook service is not found, in the same use case as in #2.
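
To tell outcome 1 apart from a slow upgrade, the Subscription status can be polled for a change of the installed CSV. A minimal sketch, assuming the Subscription name and namespace used in the script in this comment; the function name, the default timeout (900 seconds, matching the 15 minutes mentioned above), and the polling interval are hypothetical choices.

```shell
# Sketch: wait until the Subscription reports an installed CSV that differs
# from the one seen before the upgrade. Hypothetical helper; usage:
#   wait_for_csv_change <old-csv-name> [timeout-seconds] [interval-seconds]
wait_for_csv_change() {
  local old_csv="$1" timeout="${2:-900}" interval="${3:-10}"
  local ns=kubevirt-hyperconverged sub=hco-subscription-example
  local waited=0 current=""

  while [ "$waited" -lt "$timeout" ]; do
    current=$(oc get subscription "$sub" -n "$ns" \
      -o jsonpath='{.status.installedCSV}')
    if [ -n "$current" ] && [ "$current" != "$old_csv" ]; then
      echo "$current"   # the upgrade has progressed to a new CSV
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done

  echo "timed out; installedCSV is still ${current:-<empty>}" >&2
  return 1
}
```

The old CSV name can be captured with the same jsonpath query before patching the channel.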

================= The script ======================

oc create ns kubevirt-hyperconverged || true

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: hco-operatorgroup
  namespace: kubevirt-hyperconverged
spec:
  targetNamespaces:
  - kubevirt-hyperconverged
EOF

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource-example
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/nunnatsa/hyperconverged-cluster-index:1.3.0
  displayName: KubeVirt HyperConverged
  publisher: Red Hat
EOF

sleep 15

HCO_CATALOGSOURCE_POD=`oc get pods -n openshift-marketplace | grep hco-catalogsource | head -1 | awk '{ print $1 }'`
oc wait pod $HCO_CATALOGSOURCE_POD --for condition=Ready -n openshift-marketplace --timeout="120s"

CATALOG_OPERATOR_POD=`oc get pods -n openshift-operator-lifecycle-manager | grep catalog-operator | head -1 | awk '{ print $1 }'`
oc wait pod ${CATALOG_OPERATOR_POD} --for condition=Ready -n openshift-operator-lifecycle-manager --timeout="120s"

PACKAGESERVER_POD=`oc get pods -n openshift-operator-lifecycle-manager | grep packageserver | head -1 | awk '{ print $1 }'`
oc wait pod ${PACKAGESERVER_POD} --for condition=Ready -n openshift-operator-lifecycle-manager --timeout="120s"

sleep 15

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: hco-subscription-example
  namespace: kubevirt-hyperconverged
spec:
  channel: 1.2.0
  name: kubevirt-hyperconverged
  source: hco-catalogsource-example
  sourceNamespace: openshift-marketplace
  config:
    selector:
      matchLabels:
        name: hyperconverged-cluster-operator
    env:
      - name: KVM_EMULATION
        value: "true"
EOF

oc wait deployment hco-operator --for condition=Available -n kubevirt-hyperconverged --timeout="1200s"

# Deploy HyperConverged CR
cat <<EOF | oc create -n kubevirt-hyperconverged -f -
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
spec:
  infra: {}
  workloads: {}
EOF

oc wait -n kubevirt-hyperconverged HyperConverged kubevirt-hyperconverged --for condition=Available --timeout=30m
oc wait -n kubevirt-hyperconverged deployment hco-operator --for condition=Available --timeout="30m"
oc wait -n kubevirt-hyperconverged deployment hco-webhook --for condition=Available --timeout="30m"

# Perform the upgrade
oc patch -n kubevirt-hyperconverged subscription hco-subscription-example -p "{\"spec\": {\"channel\": \"1.3.0\"}}"  --type merge
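
The three "find a pod by name, then wait for it" stanzas and the fixed sleeps in the script above could be folded into one polling helper. This is only a sketch of that idea, not what the CI job runs: the function name, the retry count, and the 2-second pause are hypothetical.

```shell
# Sketch: poll until a pod whose name contains <pattern> exists in <namespace>,
# then wait for it to become Ready. Hypothetical helper; usage:
#   wait_for_pod_ready <namespace> <pattern> [tries]
wait_for_pod_ready() {
  local ns="$1" pattern="$2" tries="${3:-30}" pod=""

  while [ "$tries" -gt 0 ]; do
    # "-o name" prints entries such as pod/<name>, which "oc wait" accepts.
    pod=$(oc get pods -n "$ns" -o name | grep -- "$pattern" | head -n 1)
    [ -n "$pod" ] && break
    tries=$((tries - 1))
    sleep 2
  done

  if [ -z "$pod" ]; then
    echo "no pod matching '$pattern' in namespace $ns" >&2
    return 1
  fi

  oc wait "$pod" --for=condition=Ready -n "$ns" --timeout=120s
}
```

For example: wait_for_pod_ready openshift-marketplace hco-catalogsource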

Comment 9 Kevin Rizza 2021-01-11 19:25:24 UTC
With the certificate issue now tracked in a separate bug, circling back to the other issue described here: it appears to be a duplicate of the recently closed bug https://bugzilla.redhat.com/show_bug.cgi?id=1907381, so I'm going to close this one as a duplicate. If any problems persist, please feel free to reach out or reopen.

*** This bug has been marked as a duplicate of bug 1907381 ***
