1907290 – Certificate error on operator upgrade

Summary: Certificate error on operator upgrade

Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.7
Hardware:	All
OS:	All
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Kevin Rizza
QA Contact:	Jian Zhang

Reported:	2020-12-14 06:29 UTC by Nahshon Unna-Tsameret
Modified:	2023-09-15 01:31 UTC
CC List:	13 users
Last Closed:	2022-05-26 16:01:45 UTC

Attachments
OLM operator log (807.88 KB, application/gzip) 2020-12-14 06:29 UTC, Nahshon Unna-Tsameret

Description Nahshon Unna-Tsameret 2020-12-14 06:29:59 UTC

Created attachment 1738865
OLM operator log

Description of problem:
The HCO openshift-ci job fails on AWS: after upgrading CNV, a request sent to hco-webhook fails with the following error:

> Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")

For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/991/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-azure/1336649438625533952
 

Here is where the test installs the "old" version:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L165-L177

And here is where the test updates the CSV version for upgrade:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L215


It was also reproduced in a different scenario, using this index image:

> quay.io/nunnatsa/hyperconverged-cluster-index:1.3.0 

1. Installed the 1.3.0 channel. Its CSV had to be fixed manually (a separate issue) by removing an annotations.description field from one of the templates; after that, the installation completed.
2. Uninstalled 1.3.0 (originally in order to start the upgrade scenario from 1.2.0).
3. Installed 1.2.0.
4. Tried to deploy the HyperConverged CR:
> cat <<EOF | oc create -n kubevirt-hyperconverged -f -
> apiVersion: hco.kubevirt.io/v1beta1
> kind: HyperConverged
> metadata:
>   name: kubevirt-hyperconverged
> spec:
>   infra: {}
>   workloads: {}
> EOF

At this point, we got this error:
> Error from server (InternalError): error when creating "deploy/hco.cr.yaml": Internal error occurred: failed calling webhook "validate-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/validate-hco-kubevirt-io-v1beta1-hyperconverged?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")


OLM operator log is attached.
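
Triage note: the "x509: certificate signed by unknown authority" text means the hco-webhook serving certificate did not chain to any CA the caller trusted. For an admission webhook, the API server takes that CA from the caBundle field in the webhook's clientConfig. Below is a minimal sketch for checking one certificate against one CA bundle. It assumes both are already saved as local PEM files and that openssl is installed. The function name, the file names, and the WEBHOOK_CFG placeholder are hypothetical, and nothing in this bug confirms that a stale CA bundle was the actual cause.

```shell
#!/bin/sh
# Sketch only: report whether a PEM certificate chains to a given CA bundle.
# Both arguments are local PEM files; how they are obtained from the cluster
# is left to the caller (see the commented, hypothetical usage below).
check_cert_against_bundle() {
    ca_bundle=$1
    cert=$2
    # openssl verify exits non-zero when the certificate cannot be
    # validated against the CAs in the bundle.
    if openssl verify -CAfile "$ca_bundle" "$cert" >/dev/null 2>&1; then
        echo "trusted"
    else
        echo "NOT trusted"
    fi
}

# Hypothetical usage; WEBHOOK_CFG stands in for the real object name and
# serving-cert.pem for a locally saved copy of the webhook serving cert:
#   oc get validatingwebhookconfiguration "$WEBHOOK_CFG" \
#     -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > ca.pem
#   check_cert_against_bundle ca.pem serving-cert.pem
```

If the result is "NOT trusted" for the CA bundle currently in the webhook configuration, the bundle and the serving certificate are out of sync, which would be worth recording here alongside the must-gather.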

Comment 1 Nahshon Unna-Tsameret 2020-12-14 06:53:56 UTC

Another example, this time on Azure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/997/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-azure/1338226776664444928

Comment 9 Kevin Rizza 2021-02-08 19:43:41 UTC

Moving this BZ to medium because it is not reproducible and does not block any active workflows.

Comment 10 Kevin Rizza 2021-02-22 18:35:54 UTC

Closing this due to the lack of a reproduction method. If the HCO operator CI starts seeing this problem again feel free to reopen and we can try to investigate again. Given the transient nature of this error, it's very possible this is already resolved due to an unrelated change.

Comment 15 Debarati Basu-Nag 2021-09-17 13:26:54 UTC

This was hit during a CNV 4.8.1 -> 4.8.2 upgrade.

ocp version:
===========
[cnv-qe-jenkins@infra-debug3-twzz6-executor ~]$ oc version
Client Version: 4.8.0-202109080022.p0.git.a0c12be.assembly.stream-a0c12be
Server Version: 4.8.12
Kubernetes Version: v1.21.1+d8043e1
============

Comment 19 Kevin Rizza 2022-01-05 19:06:47 UTC

We will prioritize this now that there is a must-gather, to determine if there's anything obvious we can see.

For now, moving the status back to NEW.

Comment 25 Krzysztof Majcher 2022-05-26 16:01:45 UTC

It seems we have not hit the issue in our recent upgrade tests, so I'm closing it for now.
If we hit it again, we'll reopen and preserve the must-gather so it doesn't disappear.

Comment 26 Ruth Netser 2022-06-08 16:43:32 UTC

@dbasunag If you encounter this issue again, please re-open.

Comment 27 Red Hat Bugzilla 2023-09-15 01:31:35 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days
