Description of problem:

The SR-IOV operator performs some cleanup when its pods are shut down (here: https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/main.go#L178). This cleanup works in almost all scenarios: when the pods are deleted directly, and when the deployment is deleted or scaled down. When the deployment is deleted, the pods are of course deleted through the ownerReference chain (deployment -> replicaset -> pods).

There is an issue, however, when the operator is installed via a Subscription and a CSV appears (with the deployment having an ownerReference on the CSV). When the CSV is deleted, the pods are deleted, but the cleanup is not invoked. It looks like the pods are not given any time for a graceful shutdown. We tried adding additional logging to debug this, but nothing is logged (it is logged when there is no CSV and we delete the deployment). In theory it should not matter whether the deletion starts at the CSV or at the deployment, since the pods are deleted via a cascade of ownerReferences. How is the pod shutdown different when we delete a CSV? How can we ensure a graceful shutdown of the pods in all cases?

Version-Release number of selected component (if applicable): 4.10/4.9/4.8

How reproducible:

Steps to Reproduce:
1. Delete the sriov-operator subscription:
   oc delete Subscription sriov-network-operator-subscription -n openshift-sriov-network-operator
2. Delete the CSV:
   oc delete clusterserviceversion sriov-network-operator.4.10.0-202111211622 -n openshift-sriov-network-operator
3. Delete the CRDs:
   oc delete crd sriovibnetworks.sriovnetwork.openshift.io sriovnetworknodepolicies.sriovnetwork.openshift.io sriovnetworknodestates.sriovnetwork.openshift.io sriovnetworkpoolconfigs.sriovnetwork.openshift.io sriovnetworks.sriovnetwork.openshift.io sriovoperatorconfigs.sriovnetwork.openshift.io

Actual results:

Check webhooks - hit the bug:

oc get mutatingwebhookconfigurations network-resources-injector-config -o yaml | grep failurePolicy
  failurePolicy: Ignore
oc get MutatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
  failurePolicy: Fail      # Should be set to Ignore during graceful shutdown
oc get ValidatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
  failurePolicy: Fail      # Should be set to Ignore during graceful shutdown

Expected results:

oc get mutatingwebhookconfigurations network-resources-injector-config -o yaml | grep failurePolicy
  failurePolicy: Ignore
oc get MutatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
  failurePolicy: Ignore
oc get ValidatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
  failurePolicy: Ignore
mac:openshift-tests-private jianzhang$ oc get sub
NAME                     PACKAGE                  SOURCE            CHANNEL
sriov-network-operator   sriov-network-operator   qe-app-registry   stable

It's strange that there is no InstallPlan:

mac:openshift-tests-private jianzhang$ oc get ip
No resources found in openshift-sriov-network-operator namespace.

mac:openshift-tests-private jianzhang$ oc get csv
NAME                                         DISPLAY                   VERSION               REPLACES   PHASE
sriov-network-operator.4.10.0-202201121612   SR-IOV Network Operator   4.10.0-202201121612              Succeeded

mac:openshift-tests-private jianzhang$ oc get pods
NAME                                      READY   STATUS    RESTARTS   AGE
network-resources-injector-77dck          1/1     Running   0          8m23s
network-resources-injector-btj2s          1/1     Running   0          8m23s
network-resources-injector-knxf9          1/1     Running   0          8m23s
operator-webhook-7pb7w                    1/1     Running   0          8m23s
operator-webhook-rsx9c                    1/1     Running   0          8m23s
operator-webhook-zksmx                    1/1     Running   0          8m23s
sriov-network-config-daemon-ckk4r         3/3     Running   0          8m23s
sriov-network-config-daemon-ffbws         3/3     Running   0          8m23s
sriov-network-config-daemon-nmbzn         3/3     Running   0          8m23s
sriov-network-operator-6b657d75bf-fq6l6   1/1     Running   0          8m56s

I just subscribed to the sriov-network-operator and didn't delete anything, and the value of `failurePolicy` is Fail too.

mac:openshift-tests-private jianzhang$ oc get mutatingwebhookconfigurations network-resources-injector-config -o yaml | grep failurePolicy
  failurePolicy: Ignore
mac:openshift-tests-private jianzhang$ oc get MutatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
  failurePolicy: Fail
mac:openshift-tests-private jianzhang$ oc get ValidatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
  failurePolicy: Fail

mac:openshift-tests-private jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-11-065245   True        False         7h23m   Cluster version is 4.10.0-0.nightly-2022-01-11-065245
This isn't an OLM issue. I've reached out to Tim Waugh (twaugh) from EDX. He's verified there's something wrong with the signature for that image and has requested someone from Cloud Distribution to get back to me. Once I hear back, hopefully we can get this to land on the right team ^^
Oh whoops - commented on the wrong bz - please ignore my previous comment
I'm not finding this operator in the 4.10 catalog. Could you share your catalog with me?
Alternatively, would you be able to verify with clusterbot whether this PR: openshift/operator-framework-olm#244 solves the issue?
(In reply to Per da Silva from comment #5)
> Alternatively, would you be able to verify with clusterbot whether this PR:
> openshift/operator-framework-olm#244 solves the issue?

How can I modify the env to include your PR? Also, the SRIOV operator is definitely available in the catalog.
> also, the SRIOV operator is definitely available in the catalog. Just checked, and there is indeed no sriov operator for 4.10
I checked the PR using the cluster bot (launch openshift/operator-framework-olm#244), but unfortunately it still fails. There is no sriov operator for 4.10 in the catalog, so I used the 4.9 version instead. I verified the fix is working by scaling the operator deployment down to 0. Below is the procedure to reproduce it, including the operator installation.

oc create namespace openshift-sriov-network-operator

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: mm-operators
  namespace: openshift-marketplace
spec:
  displayName: MM Operators
  icon:
    base64data: ''
    mediatype: ''
  image: 'registry.redhat.io/redhat/redhat-operator-index:v4.9'
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
EOF

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
EOF

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator
spec:
  channel: '4.9'
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: mm-operators
  sourceNamespace: openshift-marketplace
  startingCSV: sriov-network-operator.4.9.0-202201140920
EOF

# CHECK THE OPERATOR POD IS UP
$ oc get pod -n openshift-sriov-network-operator
NAME                                      READY   STATUS    RESTARTS   AGE
sriov-network-operator-6d4d9c66df-8cl2p   0/1     Running   0          15s

# CHECK THE DEPLOYMENT IS UP
$ oc get deployment -n openshift-sriov-network-operator
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
sriov-network-operator   0/1     1            0           16s

# CHECK THE FAILURE POLICY, IT SHOULD BE: Fail
$ oc get validatingwebhookconfigurations sriov-operator-webhook-config -oyaml | grep failure
  failurePolicy: Fail

# TRY IF THE CLEANUP WORKS WITH DEPLOYMENT SCALE DOWN:
$ oc scale deployment sriov-network-operator --replicas=0 -n openshift-sriov-network-operator

$ oc get validatingwebhookconfigurations sriov-operator-webhook-config -oyaml | grep failure
  failurePolicy: Ignore

# SCALE BACK TO 1 TO HAVE THE OPERATOR RUNNING:
$ oc scale deployment sriov-network-operator --replicas=1 -n openshift-sriov-network-operator

# THE WEBHOOK FAILUREPOLICY WILL GO BACK TO Fail
$ oc get validatingwebhookconfigurations sriov-operator-webhook-config -oyaml | grep failure
  failurePolicy: Fail

# DELETE OPERATOR BY DELETING SUBSCRIPTION
$ oc delete subscription -n openshift-sriov-network-operator sriov-network-operator
$ oc delete csv -n openshift-sriov-network-operator sriov-network-operator.4.9.0-202201140920

# CHECK THE OPERATOR POD IS REMOVED
$ oc get pod -n openshift-sriov-network-operator

# THE WEBHOOK FAILUREPOLICY WILL SADLY BE Fail:
$ oc get validatingwebhookconfigurations sriov-operator-webhook-config -oyaml | grep failure
  failurePolicy: Fail
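For reference, the flip the cleanup is expected to perform on the webhook configurations can be sketched with plain structs. This is a hypothetical helper with stand-in types; the real operator works against the admissionregistration.k8s.io API with client-go and issues an Update call:

```go
package main

import "fmt"

// Webhook is a stand-in for a webhook entry in a
// (Mutating|Validating)WebhookConfiguration; only the field relevant
// to this bug is modeled.
type Webhook struct {
	Name          string
	FailurePolicy string
}

// relaxOnShutdown flips every webhook's failurePolicy to Ignore, so
// that admission requests are not rejected while the webhook server
// behind the configuration is down.
func relaxOnShutdown(hooks []Webhook) {
	for i := range hooks {
		hooks[i].FailurePolicy = "Ignore"
	}
}

func main() {
	hooks := []Webhook{
		{Name: "sriov-operator-webhook-config", FailurePolicy: "Fail"},
		{Name: "network-resources-injector-config", FailurePolicy: "Ignore"},
	}
	relaxOnShutdown(hooks)
	for _, h := range hooks {
		fmt.Printf("%s: %s\n", h.Name, h.FailurePolicy)
	}
}
```

The procedure above shows this flip happening on deployment scale-down but not on CSV deletion, which is the bug.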
Hey, we're still trying to figure this one out.
I've tested a fix with the instructions from above. It didn't fix the problem, but, looking at the sriov operator logs:

2022-02-02T21:00:53.595Z INFO controller-runtime.manager.controller.sriovibnetwork Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovIBNetwork"}
2022-02-02T21:00:53.595Z INFO controller-runtime.manager.controller.sriovnetwork Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetwork"}
2022-02-02T21:00:53.596Z INFO controller-runtime.manager.controller.sriovnetworknodepolicy Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetworkNodePolicy"}
2022-02-02T21:00:53.597Z INFO controller-runtime.manager.controller.sriovnetworkpoolconfig Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetworkPoolConfig"}
2022-02-02T21:00:53.597Z INFO controller-runtime.manager.controller.sriovoperatorconfig Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovOperatorConfig"}
2022-02-02T21:00:53.598Z ERROR controller-runtime.manager error received after stop sequence was engaged {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z ERROR controller-runtime.manager error received after stop sequence was engaged {"error": "context canceled"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z ERROR controller-runtime.manager error received after stop sequence was engaged {"error": "context canceled"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z INFO shutdown Clearing finalizers on exit
2022-02-02T21:00:53.607Z ERROR shutdown Failed to list SriovNetworks {"error": "sriovnetworks.sriovnetwork.openshift.io is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot list resource \"sriovnetworks\" in API group \"sriovnetwork.openshift.io\" at the cluster scope"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:144
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateFinalizers
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:42
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:29
main.main
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185
runtime.main
        /usr/lib/golang/src/runtime/proc.go:225
2022-02-02T21:00:53.607Z INFO shutdown Done clearing finalizers on exit
2022-02-02T21:00:53.607Z INFO shutdown Seting webhook failure policies to Ignore on exit
2022-02-02T21:00:53.665Z ERROR shutdown Error getting webhook {"error": "validatingwebhookconfigurations.admissionregistration.k8s.io \"sriov-operator-webhook-config\" is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot get resource \"validatingwebhookconfigurations\" in API group \"admissionregistration.k8s.io\" at the cluster scope"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:144
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateValidatingWebhook
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:80
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateWebhooks
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:71
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:30
main.main
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185
runtime.main
        /usr/lib/golang/src/runtime/proc.go:225
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateValidatingWebhook(0xc000c414a0)
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:82 +0x231
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateWebhooks()
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:71 +0x8f
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown()
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:30 +0x2a
main.main()
        /go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185 +0xbf5

So, it seems the fix has allowed the shutdown process to run, but it is failing. I've tested this on CRC though.
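The panic at shutdown.go:82 is consistent with indexing Webhooks[0] on a configuration object that was never populated, because the Get immediately before it failed with "forbidden". A hedged sketch of the defensive pattern, using stand-in types rather than the real client-go ones (the actual updateValidatingWebhook also takes an argument, per the stack trace):

```go
package main

import (
	"errors"
	"fmt"
)

type hook struct{ FailurePolicy string }

// WebhookConfig is a stand-in for a ValidatingWebhookConfiguration.
type WebhookConfig struct{ Webhooks []hook }

// getConfig simulates the client Get call that failed with "forbidden"
// in the logs above: it returns an error alongside a zero-valued
// object, exactly the state that led to the index-out-of-range panic.
func getConfig() (*WebhookConfig, error) {
	return &WebhookConfig{}, errors.New("forbidden")
}

func updateValidatingWebhook() error {
	cfg, err := getConfig()
	if err != nil {
		// Returning here avoids touching cfg.Webhooks at all.
		return fmt.Errorf("error getting webhook: %w", err)
	}
	if len(cfg.Webhooks) == 0 {
		// Guard the index even on a successful Get.
		return errors.New("webhook configuration has no webhooks")
	}
	cfg.Webhooks[0].FailurePolicy = "Ignore"
	return nil
}

func main() {
	if err := updateValidatingWebhook(); err != nil {
		fmt.Println("shutdown step skipped:", err)
	}
}
```

With a guard like this the shutdown would degrade to a logged error instead of crashing mid-cleanup, though the underlying permission problem would remain.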
I've run a quick test against the current version of OLM, and I get the same log output. It seems the problem isn't related to the grace period, but rather that the shutdown procedure expects RBAC authorizations that were probably deleted together with the CSV. Resources created by the CSV are deleted in cascade, in no particular order. I'd also like to reiterate that I don't think relying on a graceful shutdown is the right approach: a node running the pod could go down at any time for any number of reasons.
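If the root cause is indeed that the ClusterRole/ClusterRoleBinding are garbage-collected along with the CSV before the pod finishes shutting down, one conceivable mitigation is RBAC that is not owned by the CSV. This is a sketch only: the object names are illustrative, the rules are derived solely from the two "forbidden" errors in the logs above, and whether standalone objects actually survive CSV deletion in a given OLM version would need verification.

```yaml
# Hypothetical standalone RBAC (not owned by the CSV) covering the
# calls that failed during shutdown. Names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sriov-operator-shutdown
rules:
- apiGroups: ["sriovnetwork.openshift.io"]
  resources: ["sriovnetworks"]
  verbs: ["list", "update"]
- apiGroups: ["admissionregistration.k8s.io"]
  resources: ["validatingwebhookconfigurations", "mutatingwebhookconfigurations"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sriov-operator-shutdown
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sriov-operator-shutdown
subjects:
- kind: ServiceAccount
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator
```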
Closing this as wontfix due to lack of activity. Please re-open if it's still an issue.