Bug 2010073

Summary: uninstalling and then reinstalling sriov-network-operator is not working

Product: OpenShift Container Platform
Reporter: Dave Cain <dcain>
Component: Networking
Assignee: Marcin Mirecki <mmirecki>
Networking sub component: SR-IOV
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aajayan, adsoni, arghosh, bleanhar, bnemeth, brault, cgoncalves, dbatliwa, dosmith, eglottma, hasingh, jboxman, mmirecki, openshift-bugs-escalate, pliu, rdey, sksingh, tvvcox, william.caban, zshi, zzhao
Version: 4.8
Keywords: Triaged
Target Milestone: ---
Target Release: 4.10.0
Hardware: x86_64
OS: All
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2016397 2016398 2025993 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:16:28 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 2016397    

Description Dave Cain 2021-10-03 13:35:11 UTC
Description of problem:
Partner wishes to uninstall the OpenShift SR-IOV Network Operator completely from a 4.8.12 cluster and reinstall it from scratch. All workloads using SR-IOV are stopped, and then all SR-IOV-related CRDs, CRs, etc. are removed successfully.
The namespace `openshift-sriov-network-operator` is deleted, which also succeeds. There appear to be no remaining leftovers from the previous install.

When trying to reinstall the operator by applying the following manifests:
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
  annotations:
    workload.openshift.io/allowed: management
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subscription
  namespace: openshift-sriov-network-operator
spec:
  channel: "4.8"
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace

This is met with problems.  From the namespace:

$ oc get pods -n openshift-sriov-network-operator 
NAME                                      READY   STATUS             RESTARTS   AGE
sriov-network-operator-64f7489688-bp5zr   0/1     CrashLoopBackOff   6          11m

Log messages indicate a failure calling a webhook whose backing service no longer exists:

$ oc logs sriov-network-operator-64f7489688-bp5zr 
I1003 13:29:39.732730       1 request.go:645] Throttling request took 1.021458683s, request: GET:https://172.30.0.1:443/apis/console.openshift.io/v1alpha1?timeout=32s
2021-10-03T13:29:42.695Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
I1003 13:29:49.761227       1 request.go:645] Throttling request took 3.040507987s, request: GET:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
2021-10-03T13:29:50.727Z	INFO	setup.createDefaultPolicy	Create a default SriovNetworkNodePolicy
2021-10-03T13:29:50.735Z	ERROR	setup	unable to create default SriovNetworkNodePolicy	{"error": "Internal error occurred: failed calling webhook \"operator-webhook.sriovnetwork.openshift.io\": Post \"https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s\": service \"operator-webhook-service\" not found"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
main.main
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:126
runtime.main
	/usr/lib/golang/src/runtime/proc.go:225



Version-Release number of selected component (if applicable):
OpenShift 4.8.12
sriov-network-operator.4.8.0-202109210857


Steps to Reproduce:
1. Install a 4.8.12 cluster with SR-IOV deployed and working, with workloads consuming both vfio_pci and netdevice resources.
2. Stop the workloads, remove the SR-IOV CRs, and uninstall the sriov operator.
3. Reinstall the SR-IOV operator and observe the CrashLoopBackOff behavior in the operator pod.


Expected results:
A clean uninstall of the SR-IOV Network Operator should allow a problem-free reinstallation.

Comment 1 Dave Cain 2021-10-04 22:47:08 UTC
Looking at this with Mr. Dave Critch (credit to him), it was apparent that two Mutating Webhooks and one Validating Webhook were left behind after removing the project openshift-sriov-network-operator:

sriov-operator-webhook-config (mutating)
network-resources-injector-config (mutating)
sriov-operator-webhook-config (validating)

Once these were cleared out, the openshift-sriov-network-operator could be installed again.  Should these be cleaned up by the uninstallation of the operator?  At the very least I think this should be called out specifically in docs as part of the operator's uninstallation process.
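For reference, a sketch of how the leftovers can be found and removed by hand (requires cluster-admin; the names match the list above):

```shell
# Webhook configurations are cluster-scoped, so deleting the
# openshift-sriov-network-operator namespace does not remove them.
# List any SR-IOV-related leftovers:
oc get mutatingwebhookconfigurations,validatingwebhookconfigurations -o name \
  | grep -E 'sriov|network-resources-injector'

# Delete the leftovers; --ignore-not-found makes this safe to re-run
# after a partial cleanup:
oc delete mutatingwebhookconfiguration network-resources-injector-config sriov-operator-webhook-config --ignore-not-found
oc delete validatingwebhookconfiguration sriov-operator-webhook-config --ignore-not-found
```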

Comment 2 zenghui.shi 2021-10-09 03:00:24 UTC
(In reply to Dave Cain from comment #1)
> Looking at this with Mr. Dave Critch (credit to him), it was apparent that
> two Mutating Webooks and one Validating webhook were left behind in the
> project openshift-sriov-network-operator:
> 
> sriov-operator-webhook-config
> network-resources-injector-config
> sriov-operator-webhook-config
> 

Yes, I agree this is the cause of re-installation failure.

> Once these were cleared out, the openshift-sriov-network-operator could be
> installed again.  Should these be cleaned up by the uninstallation of the
> operator?  At the very lease I think this should be called out specifically
> in docs as a part of the operators un-installation process.

We don't have a doc section for uninstallation, but I think it's worth adding.

May I know how you uninstalled the sriov operator?

Here are the steps used by QE for uninstalling the sriov subscription:

###
oc delete sriovnetwork --all -n openshift-sriov-network-operator
oc delete sriovnodenetworkpolicy --all -n openshift-sriov-network-operator

oc delete sub --all -n openshift-sriov-network-operator
oc delete csv --all -n openshift-sriov-network-operator

oc delete ds --all -n openshift-sriov-network-operator

oc delete crd sriovibnetworks.sriovnetwork.openshift.io sriovnetworknodepolicies.sriovnetwork.openshift.io sriovnetworknodestates.sriovnetwork.openshift.io sriovnetworkpoolconfigs.sriovnetwork.openshift.io sriovnetworks.sriovnetwork.openshift.io sriovoperatorconfigs.sriovnetwork.openshift.io

oc delete mutatingwebhookconfigurations network-resources-injector-config
oc delete MutatingWebhookConfiguration sriov-operator-webhook-config

oc delete ValidatingWebhookConfiguration sriov-operator-webhook-config

oc delete namespace openshift-sriov-network-operator
###

And the steps used by developers to uninstall the sriov operator from a local repo installation:

https://github.com/openshift/sriov-network-operator/blob/master/Makefile#L211

The hack/undeploy.sh deletes the webhooks at: https://github.com/openshift/sriov-network-operator/blob/master/hack/undeploy.sh#L17-L19

Comment 4 Marcin Mirecki 2021-10-20 07:19:03 UTC
We should be able to automate the removal of the webhooks by adding an ownerRef on some of the operator resources (deployment, rs, ..). Let me have a look into this.
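As a hypothetical sketch of that approach (all names and the UID below are placeholders, not the actual fix), an ownerReference on the webhook configuration would let Kubernetes garbage-collect it together with its owner. One caveat: cluster-scoped objects such as webhook configurations may only reference cluster-scoped owners, so a namespaced Deployment or ReplicaSet cannot be used directly as the owner.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sriov-operator-webhook-config
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment                  # placeholder owner; see caveat above
    name: sriov-network-operator      # hypothetical owner name
    uid: 00000000-0000-0000-0000-000000000000  # must be the live owner's UID
webhooks: []
```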

Comment 5 Marcin Mirecki 2021-10-21 12:01:24 UTC
The issue should be fixed by: https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/177
This fix sets the webhooks' failure policy to Ignore, so requests no longer fail on the missing services.

Clean up of the leftover webhooks is still in progress.
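For illustration, the relevant knob is the per-webhook failurePolicy field. A minimal sketch (webhook name, service name, and path are taken from the error log above; the remaining fields are assumed):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sriov-operator-webhook-config
webhooks:
- name: operator-webhook.sriovnetwork.openshift.io
  failurePolicy: Ignore   # Fail rejects requests when the backing service is
                          # unreachable; Ignore lets them through, avoiding
                          # the crash loop on a stale webhook configuration
  clientConfig:
    service:
      name: operator-webhook-service
      namespace: openshift-sriov-network-operator
      path: /mutating-custom-resource
  sideEffects: None
  admissionReviewVersions: ["v1"]
```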

Comment 7 Bertrand 2021-10-25 08:32:59 UTC
Hello Marcin,

I believe this BZ should depend on a BZ targeted for 4.10 so it can be properly targeted for 4.8.z.
Is there such a BZ?

Thanks,

Bertrand

Comment 9 zhaozhanqi 2021-11-19 10:46:26 UTC
Verified this bug on 4.10.0-202111170619

After uninstalling the operator and reinstalling it, the operator pod runs well.
Moving this to verified.

Comment 10 Dave Cain 2022-01-19 20:17:27 UTC
Hello, I see this defect was addressed in 4.7 via this BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=2016398

Where will this be addressed in 4.8 and 4.9?  Are there corresponding BZs for these releases?  I'm having trouble finding them and would appreciate being pointed in the right direction.

Comment 11 Carlos Goncalves 2022-01-19 20:38:02 UTC
Dave, the BZs are linked through the Blocks/Depends On fields.
This 4.10 BZ has a "Blocks" link to the 4.9 BZ, which similarly has a "Blocks" link to the 4.8 one. You could also do the reverse ("Depends On") exercise starting from the 4.7 BZ :-)

4.10: https://bugzilla.redhat.com/show_bug.cgi?id=2010073 (this BZ)
4.9: https://bugzilla.redhat.com/show_bug.cgi?id=2016397
4.8: https://bugzilla.redhat.com/show_bug.cgi?id=2025993
4.7: https://bugzilla.redhat.com/show_bug.cgi?id=2016398

Comment 28 errata-xmlrpc 2022-03-10 16:16:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 29 Marcin Mirecki 2023-01-23 09:10:51 UTC
Closed, clearing needinfo

Comment 30 Red Hat Bugzilla 2023-09-18 04:26:36 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days