Bug 2010073
| Summary: | uninstalling and then reinstalling sriov-network-operator is not working | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dave Cain <dcain> | |
| Component: | Networking | Assignee: | Marcin Mirecki <mmirecki> | |
| Networking sub component: | SR-IOV | QA Contact: | zhaozhanqi <zzhao> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | high | |||
| Priority: | high | CC: | aajayan, adsoni, arghosh, bleanhar, bnemeth, brault, cgoncalves, dbatliwa, dosmith, eglottma, hasingh, jboxman, mmirecki, openshift-bugs-escalate, pliu, rdey, sksingh, tvvcox, william.caban, zshi, zzhao | |
| Version: | 4.8 | Keywords: | Triaged | |
| Target Milestone: | --- | |||
| Target Release: | 4.10.0 | |||
| Hardware: | x86_64 | |||
| OS: | All | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | No Doc Update | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2016397 2016398 2025993 | Environment: | ||
| Last Closed: | 2022-03-10 16:16:28 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2016397 | |||
Looking at this with Mr. Dave Critch (credit to him), it was apparent that two mutating webhooks and one validating webhook were left behind in the project openshift-sriov-network-operator:

sriov-operator-webhook-config
network-resources-injector-config
sriov-operator-webhook-config

Once these were cleared out, the openshift-sriov-network-operator could be installed again. Should these be cleaned up by the uninstallation of the operator? At the very least I think this should be called out specifically in the docs as part of the operator's uninstallation process.

(In reply to Dave Cain from comment #1)
> Looking at this with Mr. Dave Critch (credit to him), it was apparent that
> two mutating webhooks and one validating webhook were left behind in the
> project openshift-sriov-network-operator:
>
> sriov-operator-webhook-config
> network-resources-injector-config
> sriov-operator-webhook-config
>

Yes, I agree this is the cause of the reinstallation failure.

> Once these were cleared out, the openshift-sriov-network-operator could be
> installed again. Should these be cleaned up by the uninstallation of the
> operator? At the very least I think this should be called out specifically
> in the docs as part of the operator's uninstallation process.

We don't have a doc section for uninstallation, but I think it's worth adding. May I ask how you uninstalled the SR-IOV operator?
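One way to confirm the diagnosis above before deleting anything is to list the webhook configurations that are still registered. Webhook configurations are cluster-scoped objects, which is why deleting the namespace does not remove them. The following is a minimal sketch, assuming a logged-in `oc` session and the three names quoted in the comment above; the function name `leftover_webhooks` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print any of the webhook configurations named in this bug that
# still exist. Assumes `oc` is on PATH and logged in to the cluster.

leftover_webhooks() {
  # Webhook configurations are cluster-scoped, so no -n flag is passed.
  # `-o name` prints one "<resource>/<name>" entry per line.
  oc get mutatingwebhookconfigurations,validatingwebhookconfigurations -o name 2>/dev/null \
    | grep -E '/(sriov-operator-webhook-config|network-resources-injector-config)$'
}

# Usage (prints nothing and returns non-zero when no leftovers are found):
#   leftover_webhooks
```

If the function prints any entries after the operator has been uninstalled, those entries are the leftovers that block reinstallation.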
Here are the steps used by QE for uninstalling the sriov subscription:

###
oc delete sriovnetwork --all -n openshift-sriov-network-operator
oc delete sriovnetworknodepolicy --all -n openshift-sriov-network-operator
oc delete sub --all -n openshift-sriov-network-operator
oc delete csv --all -n openshift-sriov-network-operator
oc delete ds --all -n openshift-sriov-network-operator
oc delete crd sriovibnetworks.sriovnetwork.openshift.io sriovnetworknodepolicies.sriovnetwork.openshift.io sriovnetworknodestates.sriovnetwork.openshift.io sriovnetworkpoolconfigs.sriovnetwork.openshift.io sriovnetworks.sriovnetwork.openshift.io sriovoperatorconfigs.sriovnetwork.openshift.io
oc delete mutatingwebhookconfigurations network-resources-injector-config
oc delete MutatingWebhookConfiguration sriov-operator-webhook-config
oc delete ValidatingWebhookConfiguration sriov-operator-webhook-config
oc delete namespace openshift-sriov-network-operator
###

And the steps used by developers to uninstall the sriov operator from a local repo installation:
https://github.com/openshift/sriov-network-operator/blob/master/Makefile#L211

The hack/undeploy.sh script deletes the webhooks at:
https://github.com/openshift/sriov-network-operator/blob/master/hack/undeploy.sh#L17-L19

We should be able to automate the removal of the webhooks by adding an ownerRef on some of the operator resources (deployment, rs, ..). Let me have a look into this.

The issue should be fixed by:
https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/177

This fix sets the webhooks to an ignore failure policy, so requests no longer fail when the webhook services are missing. Cleanup of the leftover webhooks is still in progress.

Hello Marcin,
I believe this BZ should depend on a BZ targeted for 4.10 so it can be properly targeted for 4.8.z. Is there such a BZ?
Thanks,
Bertrand

Fixed by:
https://github.com/openshift/sriov-network-operator/commit/6e7eb59af93f57608e5188e6c39697013a1f1776

Verified this bug on 4.10.0-202111170619. After uninstalling the operator and reinstalling it, the operator pod runs well. Moving this to verified.

Hello, I see this defect was addressed in 4.7 via this BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=2016398

Where will this be addressed in 4.8 and 4.9? Are there corresponding BZs for these releases? I'm having trouble finding them and would appreciate being pointed in the right direction.

Dave, there is a linked list of Blocks/Depends On BZs. This 4.10 BZ has a linked "Blocks" BZ to the 4.9 one, which similarly has a "Blocks" BZ to the 4.8 one. You could also do the reverse ("Depends On") exercise from the 4.7 BZ :-)

4.10: https://bugzilla.redhat.com/show_bug.cgi?id=2010073 (this BZ)
4.9: https://bugzilla.redhat.com/show_bug.cgi?id=2016397
4.8: https://bugzilla.redhat.com/show_bug.cgi?id=2025993
4.7: https://bugzilla.redhat.com/show_bug.cgi?id=2016398

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:0056

Closed, clearing needinfo

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
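For clusters that do not yet carry the fix, the webhook deletions from the QE uninstall steps above can be wrapped in a small script. This is only a sketch under stated assumptions: the `run` helper, the `DRY_RUN` variable and the function name `cleanup_sriov_webhooks` are illustrative and not part of the operator or of `hack/undeploy.sh`. With `DRY_RUN=1` (the default here) the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Sketch: remove the three webhook configurations left behind after the
# SR-IOV operator is uninstalled. DRY_RUN=1 (default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"          # show what would be executed
  else
    "$@"                 # execute the command as given
  fi
}

cleanup_sriov_webhooks() {
  # Names are taken from the uninstall steps in this bug. The same name is
  # used for one mutating and one validating configuration.
  run oc delete mutatingwebhookconfiguration network-resources-injector-config --ignore-not-found
  run oc delete mutatingwebhookconfiguration sriov-operator-webhook-config --ignore-not-found
  run oc delete validatingwebhookconfiguration sriov-operator-webhook-config --ignore-not-found
}

# Usage:
#   cleanup_sriov_webhooks             # dry run, prints the three commands
#   DRY_RUN=0 cleanup_sriov_webhooks   # actually deletes the configurations
```

Passing `--ignore-not-found` keeps the script safe to rerun, since a configuration that is already gone is not treated as an error.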
Description of problem:
Partner wishes to uninstall the OpenShift SRIOV network operator completely in a 4.8.12 cluster and reinstall it from scratch. All workloads using SRIOV are stopped, and then all CRDs, CRs, etc. relating to SRIOV are removed successfully. The namespace `openshift-sriov-network-operator` is deleted, which is also successful. There appear to be no remaining leftovers from the previous install. When trying to reinstall the operator through the following:

---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
  annotations:
    workload.openshift.io/allowed: management
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subscription
  namespace: openshift-sriov-network-operator
spec:
  channel: "4.8"
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace

This is met with problems. From the namespace:

$ oc get pods -n openshift-sriov-network-operator
NAME                                      READY   STATUS             RESTARTS   AGE
sriov-network-operator-64f7489688-bp5zr   0/1     CrashLoopBackOff   6          11m

Log messages indicate a problem in creating services:

$ oc logs sriov-network-operator-64f7489688-bp5zr
I1003 13:29:39.732730 1 request.go:645] Throttling request took 1.021458683s, request: GET:https://172.30.0.1:443/apis/console.openshift.io/v1alpha1?timeout=32s
2021-10-03T13:29:42.695Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
I1003 13:29:49.761227 1 request.go:645] Throttling request took 3.040507987s, request: GET:https://172.30.0.1:443/apis/snapshot.storage.k8s.io/v1?timeout=32s
2021-10-03T13:29:50.727Z INFO setup.createDefaultPolicy Create a default SriovNetworkNodePolicy
2021-10-03T13:29:50.735Z ERROR setup unable to create default SriovNetworkNodePolicy {"error": "Internal error occurred: failed calling webhook \"operator-webhook.sriovnetwork.openshift.io\": Post \"https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s\": service \"operator-webhook-service\" not found"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
main.main
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:126
runtime.main
	/usr/lib/golang/src/runtime/proc.go:225

Version-Release number of selected component (if applicable):
OpenShift 4.8.12
sriov-network-operator.4.8.0-202109210857

Steps to Reproduce:
1. Install a 4.8.12 cluster, have SRIOV deployed and working with workloads consuming both vfio_pci and netdevice resources
2. Stop workloads, remove SRIOV CRs, uninstall the sriov operator.
3. Reinstall the SRIOV operator, observe operator CrashLoopBackOff behavior

Expected results:
Clean uninstall of the SRIOV Network operator should produce a problem-free re-installation.
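When triaging a similar CrashLoopBackOff, it can help to pull the webhook name and the missing service out of the operator log, since those point at a stale webhook configuration rather than at the operator itself. The sketch below assumes the log line has the same shape as the ERROR line in this report (with backslash-escaped quotes); the function names `failing_webhook` and `missing_service` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract details from "failed calling webhook" errors read on stdin.
# Assumes the quotes in the log are escaped as \" like in this report.

failing_webhook() {
  # Prints the webhook name that the API server failed to call.
  sed -n 's/.*failed calling webhook \\"\([^"\\]*\)\\".*/\1/p'
}

missing_service() {
  # Prints the name of the service reported as "not found".
  sed -n 's/.*service \\"\([^"\\]*\)\\" not found.*/\1/p'
}

# Usage, with the pod name taken from this report:
#   oc logs sriov-network-operator-64f7489688-bp5zr \
#     -n openshift-sriov-network-operator | failing_webhook
```

A webhook that points at a service in a namespace that was just recreated is a strong hint that its configuration predates the reinstall.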