Created attachment 1738865 [details] OLM operator log Description of problem: HCO openshift-ci failes on AWS, After upgrading CNV, while trying to send a request to hco-webhook, with the following error: > Error from server (InternalError): Internal error occurred: failed calling webhook "mutate-ns-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/mutate-ns-hco-kubevirt-io?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.") For example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/991/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-azure/1336649438625533952 Here is where the test installs the "old" version: https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L165-L177 And here is where the test updates the CSV version for upgrade: https://github.com/kubevirt/hyperconverged-cluster-operator/blob/d1efad33eb3e6624e4ee337f593218239e86ea48/hack/upgrade-test.sh#L215 It was reporoduced in different scenario: using this index image: > quay.io/nunnatsa/hyperconverged-cluster-index:1.3.0 1. We installed the 1.3.0 channel, then had to fix its CSV manually (another issue) to remove an annotations.description field from one of the templates, then it completed the installation. 2. uninstall 1.3.0 (originally, in order to start the upgrade scenario from 1.2.0) 3. install 1.2.0 4. trying to deploy the HyperConverged CR: > cat <<EOF | oc} create -n kubevirt-hyperconverged -f - > apiVersion: hco.kubevirt.io/v1beta1 > kind: HyperConverged > metadata: > name: kubevirt-hyperconverged > spec: > infra: {} > workloads: {} > EOF At this point, we got this error: > Error from server (InternalError): error when creating "deploy/hco.cr.yaml": Internal error occurred: failed calling webhook "validate-hco.kubevirt.io": Post "https://hco-webhook-service.kubevirt-hyperconverged.svc:4343/validate-hco-kubevirt-io-v1beta1-hyperconverged?timeout=30s": x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "Red Hat, Inc.") OLM operator log is attached.
Another example, this time on Azure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/997/pull-ci-kubevirt-hyperconverged-cluster-operator-master-hco-e2e-upgrade-prev-azure/1338226776664444928
Moving this BZ to medium due to the fact that it is not reproducible and does not block any active workflows.
Closing this due to the lack of a reproduction method. If the HCO operator CI starts seeing this problem again feel free to reopen and we can try to investigate again. Given the transient nature of this error, it's very possible this is already resolved due to an unrelated change.
This was hit against cnv 4.8.1->4.8.2 ocp version: =========== [cnv-qe-jenkins@infra-debug3-twzz6-executor ~]$ oc version Client Version: 4.8.0-202109080022.p0.git.a0c12be.assembly.stream-a0c12be Server Version: 4.8.12 Kubernetes Version: v1.21.1+d8043e1 [cnv-qe-jenkins@infra-debug3-twzz6-executor ~]$ ============
We will prioritize this now that there is a must gather to determine if there's anything obvious we can see. For now, moving the status back to NEW
It seems we have not hit the issue in our recent upgrade tests, therefore I'm closing it for now. Once we will hit it - we'll reopen and secure the must-gather so it won't dissapear.
@dbasunag If you encounter this issue again; please re-open.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days