1906437 – Several openshift-ci issues that may be related to OLM

Keywords:
Status:	CLOSED DUPLICATE of bug 1907381
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Evan Cordell
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-12-10 14:12 UTC by Nahshon Unna-Tsameret
Modified:	2021-04-06 04:40 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-01-11 19:25:24 UTC
Target Upstream Version:
Embargoed:

Description Nahshon Unna-Tsameret 2020-12-10 14:12:00 UTC

Description of problem:
In the HCO openshift-ci tests, we are facing several different issues. Some of them may be related to OLM.

1. An upgrade test on AWS (not happening on Azure): the test installs the previous version, updates to the PR version, and then tries to remove the namespace. This namespace is protected by the HCO webhook, so the removal should be rejected, but the error we get is not the rejection error. Instead we get this:

Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")

For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/991/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-azure/1336649438625533952

2. Happens in several upgrade tests, in several environments: after subscribing to the new CSV, nothing happens for a long time until the test times out. The deployment stays at its previous version.

For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/993/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-azure/1336971505581428736

Comment 1 Alexander Greene 2020-12-10 14:21:42 UTC

Please provide additional information on installation steps of the HCO operator.

Comment 2 Alexander Greene 2020-12-10 14:24:28 UTC

Additionally, please provide the steps to run the CI test.

Comment 3 Kevin Rizza 2020-12-10 16:17:02 UTC

Can we separate these issues out into separate bugzillas with explicit replication steps? There's not a whole lot that is actionable here from our side. My first step for these would be to confirm that the webhook and the update catalog are configured correctly, but if while doing that we uncover underlying bugs, we would probably want to create bz's for each of those.

Comment 4 Nahshon Unna-Tsameret 2020-12-11 13:05:58 UTC

Here is where the test installs the "old" version:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L165-L177

And here is where the test updates the CSV version for upgrade:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L215

Comment 5 Nahshon Unna-Tsameret 2020-12-14 06:31:35 UTC

Opened a new bug for the certificate issue: https://bugzilla.redhat.com/show_bug.cgi?id=1907290

Comment 6 Nahshon Unna-Tsameret 2020-12-14 09:58:24 UTC

Another similar issue: 

> Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": service "hco-webhook-service" not found

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/997/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-aws/1338405300503318528

Comment 7 Nahshon Unna-Tsameret 2020-12-14 15:07:07 UTC

Script to deploy and upgrade HCO:

Please notice that the index image used in this script causes the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1907417. The issue is in the 1.3.0 channel, so it will cause the upgrade not to succeed. To work around it, after subscribing to the new image, manually remove the annotations field from the cluster-network-addons-operator template.

The current deployment in the csv starts with:
>       - name: cluster-network-addons-operator
>         spec:
>           replicas: 1
>           selector:
>             matchLabels:
>               name: cluster-network-addons-operator
>           strategy:
>             type: Recreate
>           template:
>             metadata:
========>      annotations:
========>        description: cluster-network-addons-operator manages the lifecycle of different Kubernetes network components on top of Kubernetes cluster
>              labels:
>                name: cluster-network-addons-operator
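
The manual workaround described above (removing the annotations field from the cluster-network-addons-operator template in the CSV) could be scripted roughly as follows. This is only a sketch under assumptions: the helper function names are hypothetical, it assumes the Subscription already reports the new CSV in .status.currentCSV, and it assumes the deployment template lives under .spec.install.spec.deployments in the CSV. It has not been run against the CI cluster.

```shell
#!/usr/bin/env bash
# Sketch only: scripted form of the manual workaround. Helper names are hypothetical.

NS=kubevirt-hyperconverged
DEPLOYMENT=cluster-network-addons-operator

# Print the zero-based index of a named deployment inside a CSV's install strategy.
# Prints nothing and returns non-zero if the deployment is not found.
csv_deployment_index() {
  local csv="$1" name="$2"
  oc get csv "$csv" -n "$NS" \
    -o jsonpath='{range .spec.install.spec.deployments[*]}{.name}{"\n"}{end}' |
    awk -v name="$name" '$0 == name { print NR - 1; found = 1; exit } END { exit !found }'
}

# Build a JSON Patch that removes the pod template annotations at the given index.
remove_annotations_patch() {
  printf '[{"op":"remove","path":"/spec/install/spec/deployments/%s/spec/template/metadata/annotations"}]' "$1"
}

# Look up the CSV the Subscription currently points at and patch it.
remove_cna_annotations() {
  local csv idx
  # Assumption: .status.currentCSV already names the 1.3.0 CSV at this point.
  csv=$(oc get subscription hco-subscription-example -n "$NS" \
    -o jsonpath='{.status.currentCSV}')
  idx=$(csv_deployment_index "$csv" "$DEPLOYMENT") || return 1
  oc patch csv "$csv" -n "$NS" --type json -p "$(remove_annotations_patch "$idx")"
}

# Usage, after the subscription has been switched to the 1.3.0 channel:
#   remove_cna_annotations
```

The index is looked up by name rather than hard-coded because the position of the deployment inside the CSV is not stated in this report.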


After the script completes in openshift-ci, one of the following happens at random:
1. The upgrade does not start. After 15 minutes, the deployments are still from the previous version.
2. Certificate issue when trying to remove the kubevirt-hyperconverged namespace: the removal should be blocked by hco-webhook. This error is described in https://bugzilla.redhat.com/show_bug.cgi?id=1907290
3. hco webhook service not found, during the same use case as in #2.

================= The script ======================

oc create ns kubevirt-hyperconverged || true

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: hco-operatorgroup
  namespace: kubevirt-hyperconverged
spec:
  targetNamespaces:
  - kubevirt-hyperconverged
EOF

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: hco-catalogsource-example
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/nunnatsa/hyperconverged-cluster-index:1.3.0
  displayName: KubeVirt HyperConverged
  publisher: Red Hat
EOF

sleep 15

HCO_CATALOGSOURCE_POD=`oc get pods -n openshift-marketplace | grep hco-catalogsource | head -1 | awk '{ print $1 }'`
oc wait pod $HCO_CATALOGSOURCE_POD --for condition=Ready -n openshift-marketplace --timeout="120s"

CATALOG_OPERATOR_POD=`oc get pods -n openshift-operator-lifecycle-manager | grep catalog-operator | head -1 | awk '{ print $1 }'`
oc wait pod ${CATALOG_OPERATOR_POD} --for condition=Ready -n openshift-operator-lifecycle-manager --timeout="120s"

PACKAGESERVER_POD=`oc get pods -n openshift-operator-lifecycle-manager | grep packageserver | head -1 | awk '{ print $1 }'`
oc wait pod ${PACKAGESERVER_POD} --for condition=Ready -n openshift-operator-lifecycle-manager --timeout="120s"

sleep 15

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: hco-subscription-example
  namespace: kubevirt-hyperconverged
spec:
  channel: 1.2.0
  name: kubevirt-hyperconverged
  source: hco-catalogsource-example
  sourceNamespace: openshift-marketplace
  config:
    selector:
      matchLabels:
        name: hyperconverged-cluster-operator
    env:
      - name: KVM_EMULATION
        value: "true"
EOF

oc wait deployment hco-operator --for condition=Available -n kubevirt-hyperconverged --timeout="1200s"

# Deploy HyperConverged CR
cat <<EOF | oc create -n kubevirt-hyperconverged -f -
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
spec:
  infra: {}
  workloads: {}
EOF

oc wait -n kubevirt-hyperconverged HyperConverged kubevirt-hyperconverged --for condition=Available --timeout=30m
oc wait -n kubevirt-hyperconverged deployment hco-operator --for condition=Available --timeout="30m"
oc wait -n kubevirt-hyperconverged deployment hco-webhook --for condition=Available --timeout="30m"

# Perform the upgrade
oc patch -n kubevirt-hyperconverged subscription hco-subscription-example -p "{\"spec\": {\"channel\": \"1.3.0\"}}"  --type merge
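
To make failure mode 1 (the upgrade never starts) easier to detect, the channel switch could be followed by a polling check such as the sketch below. The function name wait_for_csv_change is hypothetical, the 900-second default mirrors the 15 minutes mentioned above, and the 15-second poll interval is an arbitrary choice. It relies on the Subscription's .status.installedCSV field changing once OLM has installed the new CSV.

```shell
#!/usr/bin/env bash
# Sketch only: wait for the Subscription to report a different installed CSV.

# Arguments: old CSV name, timeout in seconds (default 900), poll interval (default 15).
# Prints the new CSV name on success; returns non-zero on timeout.
wait_for_csv_change() {
  local old="$1" timeout="${2:-900}" interval="${3:-15}" waited=0 current
  while [ "$waited" -lt "$timeout" ]; do
    current=$(oc get subscription hco-subscription-example -n kubevirt-hyperconverged \
      -o jsonpath='{.status.installedCSV}')
    if [ -n "$current" ] && [ "$current" != "$old" ]; then
      echo "$current"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  # Timed out: print OLM state that is useful to attach to a bug report.
  oc get subscription,installplan,csv -n kubevirt-hyperconverged >&2
  return 1
}

# Usage: record the installed CSV before the "oc patch" that switches the channel,
# then call the function right after it:
#   OLD_CSV=$(oc get subscription hco-subscription-example -n kubevirt-hyperconverged \
#     -o jsonpath='{.status.installedCSV}')
#   ... oc patch ... (channel 1.3.0) ...
#   wait_for_csv_change "$OLD_CSV" || echo "upgrade did not start" >&2
```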

Comment 9 Kevin Rizza 2021-01-11 19:25:24 UTC

With the certificate issue now being tracked by a separate bug, circling back around to the other issue described. That appears to be a duplicate of this recently closed bug https://bugzilla.redhat.com/show_bug.cgi?id=1907381. I'm going to close this as a duplicate. If any problems persist, please feel free to reach back out or reopen.

*** This bug has been marked as a duplicate of bug 1907381 ***