Description of problem:

The deployment fails with the error below while creating the kubevirt-hyperconverged object, after the Namespace, OperatorGroup, and Subscription objects have been created.

~~~
Error from server (InternalError): error when creating "hyper.yaml": Internal error occurred: failed calling webhook "validate-hco.kubevirt.io": Post "https://hco-webhook-service.openshift-cnv.svc:4343/validate-hco-kubevirt-io-v1beta1-hyperconverged?timeout=30s": service "hco-webhook-service" not found
~~~

The service hco-webhook-service is never created, and the OLM logs contain the error "could not create service hco-webhook-service: object is being deleted: services "hco-webhook-service" already exists".

~~~
2022-01-28T06:21:39.901251731Z I0128 06:21:39.901171 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.8.4", UID:"835e6eb1-5e7e-40de-b922-5a22b3361778", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"105456", FieldPath:""}): type: 'Warning' reason: 'InstallComponentFailed' install strategy failed: could not create service hco-webhook-service: object is being deleted: services "hco-webhook-service" already exists
~~~

According to the audit logs, OLM created the service, deleted it, and then tried to create it again, which failed.

Create event.
~~~ {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"029c1ae0-21aa-4d3d-80d8-ed0a06ac45bc","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services","verb":"create","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"1e38b88c-ddf2-426f-bf8c-692fce5cf4e9","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-56f69cbbbf-27t6s"],"authentication.kubernetes.io/pod-uid":["3b66a65b-d54f-487c-ac8c-96f94e21b933"]}},"sourceIPs":["10.30.1.5"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2022-01-28T06:21:31.931882Z","stageTimestamp":"2022-01-28T06:21:31.970348Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}} ~~~ Delete and create. 
~~~ {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"e3cec41d-18b4-4edf-8801-33bcceec05f7","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services/hco-webhook-service","verb":"delete","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"1e38b88c-ddf2-426f-bf8c-692fce5cf4e9","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-56f69cbbbf-27t6s"],"authentication.kubernetes.io/pod-uid":["3b66a65b-d54f-487c-ac8c-96f94e21b933"]}},"sourceIPs":["10.30.1.5"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2022-01-28T06:21:38.743079Z","stageTimestamp":"2022-01-28T06:21:38.775489Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}} 
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"c03142c9-900a-4abb-bf4d-42e511d190c0","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services","verb":"create","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"1e38b88c-ddf2-426f-bf8c-692fce5cf4e9","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-56f69cbbbf-27t6s"],"authentication.kubernetes.io/pod-uid":["3b66a65b-d54f-487c-ac8c-96f94e21b933"]}},"sourceIPs":["10.30.1.5"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"AlreadyExists","code":409}, <<<<< "requestReceivedTimestamp":"2022-01-28T06:21:38.782905Z","stageTimestamp":"2022-01-28T06:21:39.144566Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}}
~~~

This is a newly deployed cluster with no previous installation history of OpenShift Virtualization. It is also possible to create the service with the same spec after the deployment fails, and the other webhook services are created without any issues.

Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.v4.8.4

~~~
omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.17    True        False         2h2m    Cluster version is 4.8.17
~~~

How reproducible:
Always reproducible in the customer environment.

Steps to Reproduce:
1. Follow https://access.redhat.com/documentation/en-us/openshift_container_platform/4.8/html-single/openshift_virtualization/index#installing-virt-cli
2. The deployment fails while creating the kubevirt-hyperconverged object.

Actual results:
Installation of OpenShift Virtualization fails with the error: service "hco-webhook-service" not found

Expected results:
Installation of OpenShift Virtualization succeeds.

Additional info:
This appears to be a bug in OLM, which is the component that manages the webhook services used by HCO. OLM updates the webhook service every once in a while to adjust the certs, and it does so by deleting the existing service and recreating another one instead of updating it in place. When the deletion request fails, the failure is not reflected back to the caller due to a bug; I've opened a PR that intends to fix it.
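To illustrate the failure mode and the intended fix, here is a minimal sketch against a toy in-memory Services API. This is a hypothetical Python simulation, not OLM's actual Go code; FakeServices, rotate_service_buggy, rotate_service_fixed, and all other names are invented for illustration.

```python
class AlreadyExists(Exception):
    """Stand-in for a Kubernetes 409 AlreadyExists API error."""

class FakeServices:
    """Toy in-memory Services API, for illustration only."""
    def __init__(self):
        self._items = {"hco-webhook-service"}
        self.deletion_blocked = False  # e.g. a finalizer delays deletion

    def delete(self, name):
        if self.deletion_blocked:
            return False  # deletion requested but not yet complete
        self._items.discard(name)
        return True

    def create(self, name):
        if name in self._items:
            raise AlreadyExists(f'services "{name}" already exists')
        self._items.add(name)

def rotate_service_buggy(api, name):
    api.delete(name)   # delete result ignored: the reported bug
    api.create(name)   # may raise AlreadyExists (the 409 in the logs)

def rotate_service_fixed(api, name):
    if not api.delete(name):   # reflect the delete failure back
        raise RuntimeError(f"could not delete service {name}; retry later")
    api.create(name)

api = FakeServices()
api.deletion_blocked = True
try:
    rotate_service_buggy(api, "hco-webhook-service")
except AlreadyExists as e:
    print("buggy path failed:", e)
```

Under these assumptions, the buggy path reproduces the "already exists" failure whenever deletion has not finished, while the fixed path surfaces the delete failure so the caller can retry instead of racing the create.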
Thanks @ocohen, would you move this to the component it belongs to and set the state to POST?
(In reply to Dan Kenigsberg from comment #3)
> Thanks @ocohen would you move this to the component it belongs
> and set the state to POST?

Done. After the PR is merged, the fix should be backported all the way down to 4.8.z (the customer who reported the bug is using OCP 4.8.17).
Moving this back to ASSIGNED: automation moved this to POST, but the linked PR was against the upstream repository and still needs to make its way to the downstream master branch before we can start the backporting process.
Hi. Can you provide an update on the backport process? I see that the main PR has been merged for a while. Thanks.
I'm following up on this today to make sure the commit makes it to the downstream.
Hi Fabian, This has merged downstream. It's ready for QA.
Hi Per,

(In reply to Per da Silva from comment #13)
> Hi Fabian,
>
> This has merged downstream. It's ready for QA.

I don't see any backport bugs. Should there be?
Good question. Let me get some clarity on that. I'm not sure how it works when the bug came through in a downstream sync PR =S
(In reply to Per da Silva from comment #15)
> Good question. Let me get some clarity on that. I'm not sure how it works
> when the bug came through in a downstream sync PR =S

Ok, thanks. We are especially interested in 4.8.
Hey, I've opened a backport to 4.10 (to be cherry-picked all the way down to 4.6): https://github.com/openshift/operator-framework-olm/pull/272 (after cloning this ticket; cloned ticket: https://bugzilla.redhat.com/show_bug.cgi?id=2070131). However, the PR automation informs me that we need to get this ticket into a verified state.
Moved this to ON_QA. I'll reach out to Jian to ask him to verify it.
I guess the downstream PR for this bug is https://github.com/openshift/operator-framework-olm/pull/269; I've linked it to this bug and am working on it.
1. Install an OCP build with the fixed PR.
~~~
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-29-152521   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-03-29-152521
mac:~ jianzhang$ oc -n openshift-operator-lifecycle-manager exec deployment/catalog-operator -- olm --version
OLM version: 0.19.0
git commit: 60a25a504809b7bf81bc80a280192228531cd3fd
~~~

2. Create a CatalogSource that contains the affected operator.
~~~
mac:~ jianzhang$ cat cs-bug2048441.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: bug-operators
  namespace: openshift-marketplace
spec:
  displayName: Bug Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.8
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
mac:~ jianzhang$ oc create -f cs-bug2048441.yaml
catalogsource.operators.coreos.com/bug-operators created
~~~

3. Subscribe to kubevirt-hyperconverged-operator.v4.8.4.
~~~
mac:~ jianzhang$ cat sub-cnv.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: hco-operatorhub
  namespace: openshift-cnv
spec:
  source: bug-operators
  sourceNamespace: openshift-marketplace
  name: kubevirt-hyperconverged
  startingCSV: kubevirt-hyperconverged-operator.v4.8.4
  channel: "stable"
mac:~ jianzhang$ oc create -f sub-cnv.yaml
subscription.operators.coreos.com/hco-operatorhub created
mac:~ jianzhang$ oc get sub -n openshift-cnv
NAME              PACKAGE                   SOURCE          CHANNEL
hco-operatorhub   kubevirt-hyperconverged   bug-operators   stable
mac:~ jianzhang$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.4   OpenShift Virtualization   4.8.4     kubevirt-hyperconverged-operator.v4.8.3   Succeeded
mac:~ jianzhang$ oc get pods -n openshift-cnv
NAME                                               READY   STATUS    RESTARTS   AGE
cdi-operator-86f84bdb7c-hxk2w                      1/1     Running   0          3m54s
cluster-network-addons-operator-664dd989fd-njg8l   1/1     Running   0          3m56s
hco-operator-79b8748d64-kr6jp                      1/1     Running   0          3m56s
hco-webhook-7ffdb6478f-qh28d                       1/1     Running   0          3m56s
hostpath-provisioner-operator-795986f9f-2hxjd      1/1     Running   0          3m53s
node-maintenance-operator-6f5846fc8b-8bpsq         1/1     Running   0          3m54s
ssp-operator-745898f5f4-pqhlb                      1/1     Running   0          3m55s
virt-operator-574697b7f6-7mwg4                     1/1     Running   0          3m35s
virt-operator-574697b7f6-hlbwx                     1/1     Running   0          3m35s
vm-import-operator-6d77f874bc-8qkd5                1/1     Running   0          3m53s
~~~

4. Create a HyperConverged resource called kubevirt-hyperconverged.
~~~
mac:~ jianzhang$ cat kubevirt-hyperconverged.yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  certConfig:
    ca:
      duration: 48h0m0s
      renewBefore: 24h0m0s
    server:
      duration: 24h0m0s
      renewBefore: 12h0m0s
  featureGates:
    sriovLiveMigration: true
    withHostPassthroughCPU: false
  infra: {}
  liveMigrationConfig:
    bandwidthPerMigration: 64Mi
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150
  workloads: {}
mac:~ jianzhang$ oc create -f kubevirt-hyperconverged.yaml
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged created
mac:~ jianzhang$ oc get HyperConverged -n openshift-cnv
NAME                      AGE
kubevirt-hyperconverged   117s
mac:~ jianzhang$ oc get svc -n openshift-cnv|grep hco-webhook-service
hco-webhook-service   ClusterIP   172.30.85.219   <none>   4343/TCP   4m32s
mac:~ jianzhang$ oc get deployment -n openshift-cnv
NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
cdi-apiserver                        1/1     1            1           5m9s
cdi-deployment                       1/1     1            1           5m9s
cdi-operator                         1/1     1            1           10m
cdi-uploadproxy                      1/1     1            1           5m8s
cluster-network-addons-operator      1/1     1            1           10m
hco-operator                         1/1     1            1           10m
hco-webhook                          1/1     1            1           10m
hostpath-provisioner-operator        1/1     1            1           10m
kubemacpool-cert-manager             1/1     1            1           5m10s
kubemacpool-mac-controller-manager   1/1     1            1           5m10s
nmstate-cert-manager                 1/1     1            1           5m10s
nmstate-webhook                      2/2     2            2           5m10s
node-maintenance-operator            1/1     1            1           10m
ssp-operator                         1/1     1            1           10m
virt-api                             2/2     2            2           5m
virt-controller                      2/2     2            2           4m30s
virt-operator                        2/2     2            2           10m
virt-template-validator              2/2     2            2           5m10s
vm-import-controller                 1/1     1            1           4m47s
vm-import-operator                   1/1     1            1           10m
~~~
It works well.

5. Delete the HyperConverged resource, uninstall kubevirt-hyperconverged-operator.v4.8.4, and remove the openshift-cnv project.
~~~
mac:~ jianzhang$ oc delete HyperConverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io "kubevirt-hyperconverged" deleted
mac:~ jianzhang$ oc get HyperConverged -n openshift-cnv
No resources found in openshift-cnv namespace.
mac:~ jianzhang$ oc get csv -n openshift-cnv
No resources found in openshift-cnv namespace.
mac:~ jianzhang$ oc delete ns openshift-cnv
namespace "openshift-cnv" deleted
mac:~ jianzhang$ oc get ns |grep cnv
~~~
It worked as expected, so I'm marking this bug verified.
After further investigation we are pretty sure that the bug is still not properly fixed, so I'm reopening it. Looking at: https://github.com/operator-framework/operator-lifecycle-manager/blob/0fa6f2930dfd00c43e8e99c821f73f392da26378/pkg/controller/install/certresources.go#L255-L274 we see that OLM, when deploying the certs for a managed webhook, always tries to delete and recreate the service for that webhook instead of updating it in place. The delete request is asynchronous and its completion is not checked, so the code quickly tries to recreate the service; if the cluster is not fast enough, or is overloaded, the create fails with: "status":"Failure","reason":"AlreadyExists","code":409. In that case OLM marks the whole install strategy as failed and never retries it. This is a race condition that can potentially hit any OLM-managed webhook whenever the deletion of the service does not complete before the create request. It is systematically reproducible in certain setups, for instance when an external controller adds a finalizer to the service in order to perform some internal network configuration cleanup before letting the service be fully deleted: that controller watching the deletion and reacting to it makes the whole process slower, so in that case the recreation of the service systematically fails.
Some audit logs proving it: OLM sent the delete at 18:04:18.755754Z: quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/audit_logs/kube-apiserver/ocp-root-osrevd-ctrl-2-audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"166b1992-daf5-4f63-80c7-4f4c54f67471","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services/hco-webhook-service","verb":"delete","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"b1cd3872-6e2b-40f0-87ea-1ecbf34cbc69","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-67df49dbb6-m94p6"],"authentication.kubernetes.io/pod-uid":["00b4b163-2790-4086-a0ef-3242f07804a3"]}},"sourceIPs":["10.30.1.3"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2022-06-09T18:04:18.725465Z","stageTimestamp":"2022-06-09T18:04:18.755754Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}} then *without waiting for it to complete* OLM sent the create at 2022-06-09T18:04:18.756700Z: 
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/audit_logs/kube-apiserver/ocp-root-osrevd-ctrl-2-audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"93fda65c-98ee-4dd9-aebc-c6ebbe8f1b18","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services","verb":"create","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"b1cd3872-6e2b-40f0-87ea-1ecbf34cbc69","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-67df49dbb6-m94p6"],"authentication.kubernetes.io/pod-uid":["00b4b163-2790-4086-a0ef-3242f07804a3"]}},"sourceIPs":["10.30.1.3"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"AlreadyExists","code":409},"requestReceivedTimestamp":"2022-06-09T18:04:18.756700Z","stageTimestamp":"2022-06-09T18:04:18.923154Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}} and this failed due to "status":"Failure","reason":"AlreadyExists","code":409 only later on at 2022-06-09T18:04:18.936799Z the external controller removed the finalizer and so soon the service went away: 
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/audit_logs/kube-apiserver/ocp-root-osrevd-ctrl-2-audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"194d33dc-f527-41b2-8203-b35f752b542b","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services/hco-webhook-service","verb":"patch","user":{"username":"system:serviceaccount:contrail:contrail-serviceaccount","uid":"135c1de1-4814-4a8f-8bf6-20e22e205430","groups":["system:serviceaccounts","system:serviceaccounts:contrail","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["contrail-k8s-kubemanager-cd6d7c67d-ckw4d"],"authentication.kubernetes.io/pod-uid":["b7bf636b-c8da-4826-a827-f4ea083e639f"]}},"sourceIPs":["10.30.1.5"],"userAgent":"kubemanager/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2022-06-09T18:04:18.874251Z","stageTimestamp":"2022-06-09T18:04:18.936799Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"contrail-rolebinding\" of ClusterRole \"contrail-role\" to ServiceAccount \"contrail-serviceaccount/contrail\""}} but later on the OLM marked the install strategy as failed and it never tried again: must-gather.local.2144885537263116241/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/namespaces/openshift-operator-lifecycle-manager/pods/olm-operator-67df49dbb6-m94p6/olm-operator/olm-operator/logs/current.log:2022-06-09T18:04:18.966790136Z I0609 18:04:18.966703 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.1", 
UID:"b68f76bd-a6ba-4d8f-aa03-21592d7330d0", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"107826", FieldPath:""}): type: 'Warning' reason: 'InstallComponentFailed' install strategy failed: could not create service hco-webhook-service: object is being deleted: services "hco-webhook-service" already exists
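The remaining race described above can be avoided by waiting for the deletion to actually complete before recreating the service. Here is a small sketch of that pattern; this is a hypothetical Python simulation, not OLM's Go implementation, and FakeServices, delete_then_recreate, and the finalizer-delay model are invented for illustration.

```python
import time

class FakeServices:
    """Toy Services API where a finalizer delays the actual deletion."""
    def __init__(self, finalizer_delay_polls=3):
        self._items = {"hco-webhook-service"}
        self._pending = {}              # name -> polls until finalizer clears
        self._delay = finalizer_delay_polls

    def delete(self, name):
        if name in self._items:
            self._pending[name] = self._delay  # deletion starts, not yet done

    def exists(self, name):
        # each existence check also advances the simulated finalizer cleanup
        if name in self._pending:
            self._pending[name] -= 1
            if self._pending[name] <= 0:
                del self._pending[name]
                self._items.discard(name)
        return name in self._items

    def create(self, name):
        if name in self._items:
            raise RuntimeError(f'services "{name}" already exists')
        self._items.add(name)

def delete_then_recreate(api, name, timeout_polls=10, interval=0.01):
    """Request deletion, poll until the object is gone, then recreate it."""
    api.delete(name)
    for _ in range(timeout_polls):
        if not api.exists(name):
            break
        time.sleep(interval)
    else:
        raise TimeoutError(f"service {name} was not deleted in time")
    api.create(name)

api = FakeServices()
delete_then_recreate(api, "hco-webhook-service")
print("recreated:", api.exists("hco-webhook-service"))
```

Even with the finalizer delaying deletion, the create only runs once the old object is gone, so the AlreadyExists (409) race cannot occur; a timeout turns a stuck deletion into an explicit, retryable error instead of a permanently failed install strategy.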
Hello Sean,

Thanks for your patience. It seems as though this BZ was closed by errata during our transition from Bugzilla to Jira. Unfortunately, we hadn't addressed this issue because it wasn't in Jira. We can't simply reopen this BZ because it has already been released with an errata -- it's a broken BZ workflow and it probably just shouldn't be allowed by automation. I've created the Jira ticket [0] to track the work and will prioritize addressing it soon.

Ref:
[0] https://issues.redhat.com/browse/OCPBUGS-2535