Bug 1907290

Summary: Certificate error on operator upgrade
Product: OpenShift Container Platform
Reporter: Nahshon Unna-Tsameret <nunnatsa>
Component: OLM
Assignee: Kevin Rizza <krizza>
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED INSUFFICIENT_DATA
Docs Contact:
Severity: medium
Priority: medium
CC: agreene, dageoffr, danken, davegord, dbasunag, dsover, jkeister, kmajcher, krizza, ocohen, rnetser, tyslaton, vdinh
Version: 4.7
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-26 16:01:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
OLM operator log none

Description Nahshon Unna-Tsameret 2020-12-14 06:29:59 UTC
Created attachment 1738865
OLM operator log

Description of problem:
HCO openshift-ci fails on AWS after upgrading CNV, while trying to send a request to hco-webhook, with the following error:

> Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")

For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/991/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-azure/1336649438625533952
 

Here is where the test installs the "old" version:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L165-L177

And here is where the test updates the CSV version for upgrade:
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L215


It was reproduced in a different scenario, using this index image:

> quay.io/nunnatsa/hyperconverged-cluster-index:1.3.0 

1. We installed the 1.3.0 channel, then had to manually fix its CSV (a separate issue) by removing an annotations.description field from one of the templates; after that, the installation completed.
2. Uninstalled 1.3.0 (originally, in order to start the upgrade scenario from 1.2.0).
3. Installed 1.2.0.
4. Tried to deploy the HyperConverged CR:
> cat <<EOF | oc create -n kubevirt-hyperconverged -f -
> apiVersion: hco.kubevirt.io/v1beta1
> kind: HyperConverged
> metadata:
>   name: kubevirt-hyperconverged
> spec:
>   infra: {}
>   workloads: {}
> EOF

At this point, we got this error:
> Error from server (InternalError): error when creating "deploy/hco.cr.yaml": Internal error occurred: failed calling webhook "validate-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/validate-hco-kubevirt-io-v1beta1-hyperconverged?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.")


OLM operator log is attached.

Comment 9 Kevin Rizza 2021-02-08 19:43:41 UTC
Moving this BZ to medium due to the fact that it is not reproducible and does not block any active workflows.

Comment 10 Kevin Rizza 2021-02-22 18:35:54 UTC
Closing this due to the lack of a reproduction method. If the HCO operator CI starts seeing this problem again feel free to reopen and we can try to investigate again. Given the transient nature of this error, it's very possible this is already resolved due to an unrelated change.

Comment 15 Debarati Basu-Nag 2021-09-17 13:26:54 UTC
This was hit during a CNV 4.8.1 -> 4.8.2 upgrade.

ocp version:
===========
[cnv-qe-jenkins@infra-debug3-twzz6-executor ~]$ oc version
Client Version: 4.8.0-202109080022.p0.git.a0c12be.assembly.stream-a0c12be
Server Version: 4.8.12
Kubernetes Version: v1.21.1+d8043e1
[cnv-qe-jenkins@infra-debug3-twzz6-executor ~]$ 
============

Comment 19 Kevin Rizza 2022-01-05 19:06:47 UTC
We will prioritize this now that there is a must-gather, to determine if there's anything obvious we can see.

For now, moving the status back to NEW

Comment 25 Krzysztof Majcher 2022-05-26 16:01:45 UTC
It seems we have not hit the issue in our recent upgrade tests, so I'm closing it for now.
If we hit it again, we'll reopen and secure the must-gather so it won't disappear.

Comment 26 Ruth Netser 2022-06-08 16:43:32 UTC
@dbasunag If you encounter this issue again, please re-open.

Comment 27 Red Hat Bugzilla 2023-09-15 01:31:35 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days