Description of problem:

The deployment fails with the error below while creating the kubevirt-hyperconverged object, after the Namespace, OperatorGroup, and Subscription objects have been created.

~~~
Error from server (InternalError): error when creating "hyper.yaml": Internal error occurred: failed calling webhook "validate-hco.kubevirt.io": Post "https://hco-webhook-service.openshift-cnv.svc:4343/validate-hco-kubevirt-io-v1beta1-hyperconverged?timeout=30s": service "hco-webhook-service" not found
~~~

The service hco-webhook-service is never created, and the OLM logs contain the error "could not create service hco-webhook-service: object is being deleted: services "hco-webhook-service" already exists".

~~~
2022-01-28T06:21:39.901251731Z I0128 06:21:39.901171 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.8.4", UID:"835e6eb1-5e7e-40de-b922-5a22b3361778", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"105456", FieldPath:""}): type: 'Warning' reason: 'InstallComponentFailed' install strategy failed: could not create service hco-webhook-service: object is being deleted: services "hco-webhook-service" already exists
~~~

According to the audit logs, OLM created the service, deleted it, and then tried to create it again, which failed.

Create event.
~~~ {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"029c1ae0-21aa-4d3d-80d8-ed0a06ac45bc","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services","verb":"create","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"1e38b88c-ddf2-426f-bf8c-692fce5cf4e9","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-56f69cbbbf-27t6s"],"authentication.kubernetes.io/pod-uid":["3b66a65b-d54f-487c-ac8c-96f94e21b933"]}},"sourceIPs":["10.30.1.5"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestReceivedTimestamp":"2022-01-28T06:21:31.931882Z","stageTimestamp":"2022-01-28T06:21:31.970348Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}} ~~~ Delete and create. 
~~~ {"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"e3cec41d-18b4-4edf-8801-33bcceec05f7","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services/hco-webhook-service","verb":"delete","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"1e38b88c-ddf2-426f-bf8c-692fce5cf4e9","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-56f69cbbbf-27t6s"],"authentication.kubernetes.io/pod-uid":["3b66a65b-d54f-487c-ac8c-96f94e21b933"]}},"sourceIPs":["10.30.1.5"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Success","code":200},"requestReceivedTimestamp":"2022-01-28T06:21:38.743079Z","stageTimestamp":"2022-01-28T06:21:38.775489Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}} 
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"c03142c9-900a-4abb-bf4d-42e511d190c0","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services","verb":"create","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"1e38b88c-ddf2-426f-bf8c-692fce5cf4e9","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-56f69cbbbf-27t6s"],"authentication.kubernetes.io/pod-uid":["3b66a65b-d54f-487c-ac8c-96f94e21b933"]}},"sourceIPs":["10.30.1.5"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"AlreadyExists","code":409}, <<<<< "requestReceivedTimestamp":"2022-01-28T06:21:38.782905Z","stageTimestamp":"2022-01-28T06:21:39.144566Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}}
~~~

This is a newly deployed cluster with no previous installation history of OpenShift Virtualization. It is also possible to create the service with the same spec after the deployment fails, and the other webhook services are created without any issues.

Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.v4.8.4

~~~
omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.17    True        False         2h2m    Cluster version is 4.8.17
~~~

How reproducible:
Always reproducible in the customer environment.

Steps to Reproduce:
1. Follow https://access.redhat.com/documentation/en-us/openshift_container_platform/4.8/html-single/openshift_virtualization/index#installing-virt-cli
2. The deployment fails while creating the kubevirt-hyperconverged object.

Actual results:
Installation of OpenShift Virtualization fails with the error: service "hco-webhook-service" not found

Expected results:
Installation of OpenShift Virtualization succeeds.

Additional info:
This appears to be a bug in OLM, which is the component that manages the webhook services used by HCO. OLM updates the webhook service every once in a while to adjust the certs, and it does so by deleting the existing service and recreating another one instead of updating it in place. When the deletion request fails, the failure is not reflected back to the caller due to a bug; I've opened a PR that intends to fix it.
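To illustrate the failure mode and the intended fix, here is a minimal sketch against a toy in-memory Services API. This is a hypothetical Python simulation, not OLM's actual Go code; FakeServices, rotate_service_buggy, rotate_service_fixed, and all other names are invented for illustration.

```python
class AlreadyExists(Exception):
    """Stand-in for a Kubernetes 409 AlreadyExists API error."""

class FakeServices:
    """Toy in-memory Services API, for illustration only."""
    def __init__(self):
        self._items = {"hco-webhook-service"}
        self.deletion_blocked = False  # e.g. a finalizer delays deletion

    def delete(self, name):
        if self.deletion_blocked:
            return False  # deletion requested but not yet complete
        self._items.discard(name)
        return True

    def create(self, name):
        if name in self._items:
            raise AlreadyExists(f'services "{name}" already exists')
        self._items.add(name)

def rotate_service_buggy(api, name):
    api.delete(name)   # delete result ignored: the reported bug
    api.create(name)   # may raise AlreadyExists (the 409 in the logs)

def rotate_service_fixed(api, name):
    if not api.delete(name):   # reflect the delete failure back
        raise RuntimeError(f"could not delete service {name}; retry later")
    api.create(name)

api = FakeServices()
api.deletion_blocked = True
try:
    rotate_service_buggy(api, "hco-webhook-service")
except AlreadyExists as e:
    print("buggy path failed:", e)
```

Under these assumptions, the buggy path reproduces the "already exists" failure whenever deletion has not finished, while the fixed path surfaces the delete failure so the caller can retry instead of racing the create.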
Thanks @ocohen, would you move this to the component it belongs to and set the state to POST?
(In reply to Dan Kenigsberg from comment #3)
> Thanks @ocohen would you move this to the component it belongs
> and set the state to POST?

Done. After the PR is merged, the fix should be backported all the way down to 4.8.z (the customer who reported the bug is using OCP 4.8.17).
Moving this back to ASSIGNED: automation moved this to POST, but the linked PR was against the upstream repository and still needs to make its way to the downstream master branch before we can start the backporting process.
Hi. Can you provide an update on the backport process? I see that the main PR has been merged for a while. Thanks.
I'm following up on this today to make sure the commit makes it to the downstream.
Hi Fabian, This has merged downstream. It's ready for QA.
Hi Per,

(In reply to Per da Silva from comment #13)
> Hi Fabian,
>
> This has merged downstream. It's ready for QA.

I don't see any backport bugs. Should there be?
Good question. Let me get some clarity on that. I'm not sure how it works when the bug came through in a downstream sync PR =S
(In reply to Per da Silva from comment #15)
> Good question. Let me get some clarity on that. I'm not sure how it works
> when the bug came through in a downstream sync PR =S

Ok, thanks. We are especially interested in 4.8.
Hey, I've opened a backport to 4.10 (to be cherry-picked all the way down to 4.6): https://github.com/openshift/operator-framework-olm/pull/272 (after cloning this ticket; cloned ticket: https://bugzilla.redhat.com/show_bug.cgi?id=2070131). However, the PR automation informs me that we need to get this ticket into a verified state.
Moved this to ON_QA. I'll reach out to Jian to ask him to verify it.
I guess the downstream PR for this bug is https://github.com/openshift/operator-framework-olm/pull/269; I've linked it to this bug and am working on it.
1. Install an OCP build with the fixed PR.
~~~
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-29-152521   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-03-29-152521
mac:~ jianzhang$ oc -n openshift-operator-lifecycle-manager exec deployment/catalog-operator -- olm --version
OLM version: 0.19.0
git commit: 60a25a504809b7bf81bc80a280192228531cd3fd
~~~

2. Create a CatalogSource that contains the affected operator.
~~~
mac:~ jianzhang$ cat cs-bug2048441.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: bug-operators
  namespace: openshift-marketplace
spec:
  displayName: Bug Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.8
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
mac:~ jianzhang$ oc create -f cs-bug2048441.yaml
catalogsource.operators.coreos.com/bug-operators created
~~~

3. Subscribe to kubevirt-hyperconverged-operator.v4.8.4.
~~~
mac:~ jianzhang$ cat sub-cnv.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: hco-operatorhub
  namespace: openshift-cnv
spec:
  source: bug-operators
  sourceNamespace: openshift-marketplace
  name: kubevirt-hyperconverged
  startingCSV: kubevirt-hyperconverged-operator.v4.8.4
  channel: "stable"
mac:~ jianzhang$ oc create -f sub-cnv.yaml
subscription.operators.coreos.com/hco-operatorhub created
mac:~ jianzhang$ oc get sub -n openshift-cnv
NAME              PACKAGE                   SOURCE          CHANNEL
hco-operatorhub   kubevirt-hyperconverged   bug-operators   stable
mac:~ jianzhang$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.4   OpenShift Virtualization   4.8.4     kubevirt-hyperconverged-operator.v4.8.3   Succeeded
mac:~ jianzhang$ oc get pods -n openshift-cnv
NAME                                               READY   STATUS    RESTARTS   AGE
cdi-operator-86f84bdb7c-hxk2w                      1/1     Running   0          3m54s
cluster-network-addons-operator-664dd989fd-njg8l   1/1     Running   0          3m56s
hco-operator-79b8748d64-kr6jp                      1/1     Running   0          3m56s
hco-webhook-7ffdb6478f-qh28d                       1/1     Running   0          3m56s
hostpath-provisioner-operator-795986f9f-2hxjd      1/1     Running   0          3m53s
node-maintenance-operator-6f5846fc8b-8bpsq         1/1     Running   0          3m54s
ssp-operator-745898f5f4-pqhlb                      1/1     Running   0          3m55s
virt-operator-574697b7f6-7mwg4                     1/1     Running   0          3m35s
virt-operator-574697b7f6-hlbwx                     1/1     Running   0          3m35s
vm-import-operator-6d77f874bc-8qkd5                1/1     Running   0          3m53s
~~~

4. Create a HyperConverged resource called kubevirt-hyperconverged.
~~~
mac:~ jianzhang$ cat kubevirt-hyperconverged.yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  certConfig:
    ca:
      duration: 48h0m0s
      renewBefore: 24h0m0s
    server:
      duration: 24h0m0s
      renewBefore: 12h0m0s
  featureGates:
    sriovLiveMigration: true
    withHostPassthroughCPU: false
  infra: {}
  liveMigrationConfig:
    bandwidthPerMigration: 64Mi
    completionTimeoutPerGiB: 800
    parallelMigrationsPerCluster: 5
    parallelOutboundMigrationsPerNode: 2
    progressTimeout: 150
  workloads: {}
mac:~ jianzhang$ oc create -f kubevirt-hyperconverged.yaml
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged created
mac:~ jianzhang$ oc get HyperConverged -n openshift-cnv
NAME                      AGE
kubevirt-hyperconverged   117s
mac:~ jianzhang$ oc get svc -n openshift-cnv|grep hco-webhook-service
hco-webhook-service   ClusterIP   172.30.85.219   <none>   4343/TCP   4m32s
mac:~ jianzhang$ oc get deployment -n openshift-cnv
NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
cdi-apiserver                        1/1     1            1           5m9s
cdi-deployment                       1/1     1            1           5m9s
cdi-operator                         1/1     1            1           10m
cdi-uploadproxy                      1/1     1            1           5m8s
cluster-network-addons-operator      1/1     1            1           10m
hco-operator                         1/1     1            1           10m
hco-webhook                          1/1     1            1           10m
hostpath-provisioner-operator        1/1     1            1           10m
kubemacpool-cert-manager             1/1     1            1           5m10s
kubemacpool-mac-controller-manager   1/1     1            1           5m10s
nmstate-cert-manager                 1/1     1            1           5m10s
nmstate-webhook                      2/2     2            2           5m10s
node-maintenance-operator            1/1     1            1           10m
ssp-operator                         1/1     1            1           10m
virt-api                             2/2     2            2           5m
virt-controller                      2/2     2            2           4m30s
virt-operator                        2/2     2            2           10m
virt-template-validator              2/2     2            2           5m10s
vm-import-controller                 1/1     1            1           4m47s
vm-import-operator                   1/1     1            1           10m
~~~
It works well.

5. Delete the HyperConverged resource, uninstall kubevirt-hyperconverged-operator.v4.8.4, and remove the openshift-cnv project.
~~~
mac:~ jianzhang$ oc delete HyperConverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io "kubevirt-hyperconverged" deleted
mac:~ jianzhang$ oc get HyperConverged -n openshift-cnv
No resources found in openshift-cnv namespace.
mac:~ jianzhang$ oc get csv -n openshift-cnv
No resources found in openshift-cnv namespace.
mac:~ jianzhang$ oc delete ns openshift-cnv
namespace "openshift-cnv" deleted
mac:~ jianzhang$ oc get ns |grep cnv
~~~
It worked as expected, so I'm marking this bug verified.
After further investigation we are pretty sure that the bug is still not properly fixed, so I'm reopening it. Looking at: https://github.com/operator-framework/operator-lifecycle-manager/blob/0fa6f2930dfd00c43e8e99c821f73f392da26378/pkg/controller/install/certresources.go#L255-L274 we see that OLM, when deploying the certs for a managed webhook, always tries to delete and recreate the service for that webhook instead of updating it in place. The delete request is asynchronous and its completion is not checked, so the code quickly tries to recreate the service; if the cluster is not fast enough, or is overloaded, the create fails with: "status":"Failure","reason":"AlreadyExists","code":409. In that case OLM marks the whole install strategy as failed and never retries it. This is a race condition that can potentially hit any OLM-managed webhook whenever the deletion of the service does not complete before the create request. It is systematically reproducible in certain setups, for instance when an external controller adds a finalizer to the service in order to perform some internal network configuration cleanup before letting the service be fully deleted: that controller watching the deletion and reacting to it makes the whole process slower, so in that case the recreation of the service systematically fails.
Some audit logs proving it: OLM sent the delete at 18:04:18.755754Z: quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/audit_logs/kube-apiserver/ocp-root-osrevd-ctrl-2-audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"166b1992-daf5-4f63-80c7-4f4c54f67471","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services/hco-webhook-service","verb":"delete","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"b1cd3872-6e2b-40f0-87ea-1ecbf34cbc69","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-67df49dbb6-m94p6"],"authentication.kubernetes.io/pod-uid":["00b4b163-2790-4086-a0ef-3242f07804a3"]}},"sourceIPs":["10.30.1.3"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2022-06-09T18:04:18.725465Z","stageTimestamp":"2022-06-09T18:04:18.755754Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}} then *without waiting for it to complete* OLM sent the create at 2022-06-09T18:04:18.756700Z: 
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/audit_logs/kube-apiserver/ocp-root-osrevd-ctrl-2-audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"93fda65c-98ee-4dd9-aebc-c6ebbe8f1b18","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services","verb":"create","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","uid":"b1cd3872-6e2b-40f0-87ea-1ecbf34cbc69","groups":["system:serviceaccounts","system:serviceaccounts:openshift-operator-lifecycle-manager","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["olm-operator-67df49dbb6-m94p6"],"authentication.kubernetes.io/pod-uid":["00b4b163-2790-4086-a0ef-3242f07804a3"]}},"sourceIPs":["10.30.1.3"],"userAgent":"olm/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"AlreadyExists","code":409},"requestReceivedTimestamp":"2022-06-09T18:04:18.756700Z","stageTimestamp":"2022-06-09T18:04:18.923154Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"olm-operator-binding-openshift-operator-lifecycle-manager\" of ClusterRole \"system:controller:operator-lifecycle-manager\" to ServiceAccount \"olm-operator-serviceaccount/openshift-operator-lifecycle-manager\""}} and this failed due to "status":"Failure","reason":"AlreadyExists","code":409 only later on at 2022-06-09T18:04:18.936799Z the external controller removed the finalizer and so soon the service went away: 
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/audit_logs/kube-apiserver/ocp-root-osrevd-ctrl-2-audit.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"194d33dc-f527-41b2-8203-b35f752b542b","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-cnv/services/hco-webhook-service","verb":"patch","user":{"username":"system:serviceaccount:contrail:contrail-serviceaccount","uid":"135c1de1-4814-4a8f-8bf6-20e22e205430","groups":["system:serviceaccounts","system:serviceaccounts:contrail","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["contrail-k8s-kubemanager-cd6d7c67d-ckw4d"],"authentication.kubernetes.io/pod-uid":["b7bf636b-c8da-4826-a827-f4ea083e639f"]}},"sourceIPs":["10.30.1.5"],"userAgent":"kubemanager/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"services","namespace":"openshift-cnv","name":"hco-webhook-service","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2022-06-09T18:04:18.874251Z","stageTimestamp":"2022-06-09T18:04:18.936799Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"contrail-rolebinding\" of ClusterRole \"contrail-role\" to ServiceAccount \"contrail-serviceaccount/contrail\""}} but later on the OLM marked the install strategy as failed and it never tried again: must-gather.local.2144885537263116241/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-6066981e2b360c670ce2b872db3ae6cacb13bf9cb3b80c5235955ea5de29be14/namespaces/openshift-operator-lifecycle-manager/pods/olm-operator-67df49dbb6-m94p6/olm-operator/olm-operator/logs/current.log:2022-06-09T18:04:18.966790136Z I0609 18:04:18.966703 1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"openshift-cnv", Name:"kubevirt-hyperconverged-operator.v4.10.1", 
UID:"b68f76bd-a6ba-4d8f-aa03-21592d7330d0", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"107826", FieldPath:""}): type: 'Warning' reason: 'InstallComponentFailed' install strategy failed: could not create service hco-webhook-service: object is being deleted: services "hco-webhook-service" already exists
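The remaining race described above can be avoided by waiting for the deletion to actually complete before recreating the service. Here is a small sketch of that pattern; this is a hypothetical Python simulation, not OLM's Go implementation, and FakeServices, delete_then_recreate, and the finalizer-delay model are invented for illustration.

```python
import time

class FakeServices:
    """Toy Services API where a finalizer delays the actual deletion."""
    def __init__(self, finalizer_delay_polls=3):
        self._items = {"hco-webhook-service"}
        self._pending = {}              # name -> polls until finalizer clears
        self._delay = finalizer_delay_polls

    def delete(self, name):
        if name in self._items:
            self._pending[name] = self._delay  # deletion starts, not yet done

    def exists(self, name):
        # each existence check also advances the simulated finalizer cleanup
        if name in self._pending:
            self._pending[name] -= 1
            if self._pending[name] <= 0:
                del self._pending[name]
                self._items.discard(name)
        return name in self._items

    def create(self, name):
        if name in self._items:
            raise RuntimeError(f'services "{name}" already exists')
        self._items.add(name)

def delete_then_recreate(api, name, timeout_polls=10, interval=0.01):
    """Request deletion, poll until the object is gone, then recreate it."""
    api.delete(name)
    for _ in range(timeout_polls):
        if not api.exists(name):
            break
        time.sleep(interval)
    else:
        raise TimeoutError(f"service {name} was not deleted in time")
    api.create(name)

api = FakeServices()
delete_then_recreate(api, "hco-webhook-service")
print("recreated:", api.exists("hco-webhook-service"))
```

Even with the finalizer delaying deletion, the create only runs once the old object is gone, so the AlreadyExists (409) race cannot occur; a timeout turns a stuck deletion into an explicit, retryable error instead of a permanently failed install strategy.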
Hello Sean,

Thanks for your patience. It seems as though this BZ was closed by errata during our transition from Bugzilla to Jira. Unfortunately, we hadn't addressed this issue because it wasn't in Jira. We can't simply reopen this BZ because it has already been released with an errata -- it's a broken BZ workflow and it probably just shouldn't be allowed by automation. I've created the Jira ticket [0] to track the work and will prioritize addressing it soon.

Ref:
[0] https://issues.redhat.com/browse/OCPBUGS-2535