Bug 2039950
| Summary: | SRIOV-operator pods are not shut down gracefully when the SRIOV-operator CSV is deleted | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Nikita <nkononov> |
| Component: | OLM | Assignee: | Per da Silva <pegoncal> |
| OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | cgoncalves, mmirecki, pegoncal, tyslaton |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-04-01 18:22:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Nikita
2022-01-12 18:50:15 UTC
mac:openshift-tests-private jianzhang$ oc get sub
NAME                     PACKAGE                  SOURCE            CHANNEL
sriov-network-operator   sriov-network-operator   qe-app-registry   stable

It's strange that there is no InstallPlan:

mac:openshift-tests-private jianzhang$ oc get ip
No resources found in openshift-sriov-network-operator namespace.

mac:openshift-tests-private jianzhang$ oc get csv
NAME                                         DISPLAY                   VERSION               REPLACES   PHASE
sriov-network-operator.4.10.0-202201121612   SR-IOV Network Operator   4.10.0-202201121612              Succeeded

mac:openshift-tests-private jianzhang$ oc get pods
NAME                                      READY   STATUS    RESTARTS   AGE
network-resources-injector-77dck          1/1     Running   0          8m23s
network-resources-injector-btj2s          1/1     Running   0          8m23s
network-resources-injector-knxf9          1/1     Running   0          8m23s
operator-webhook-7pb7w                    1/1     Running   0          8m23s
operator-webhook-rsx9c                    1/1     Running   0          8m23s
operator-webhook-zksmx                    1/1     Running   0          8m23s
sriov-network-config-daemon-ckk4r         3/3     Running   0          8m23s
sriov-network-config-daemon-ffbws         3/3     Running   0          8m23s
sriov-network-config-daemon-nmbzn         3/3     Running   0          8m23s
sriov-network-operator-6b657d75bf-fq6l6   1/1     Running   0          8m56s

I just subscribed to the sriov-network-operator and didn't delete anything, and the value of `failurePolicy` is already Fail:

mac:openshift-tests-private jianzhang$ oc get mutatingwebhookconfigurations network-resources-injector-config -o yaml | grep failurePolicy
failurePolicy: Ignore
mac:openshift-tests-private jianzhang$ oc get MutatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
failurePolicy: Fail
mac:openshift-tests-private jianzhang$ oc get ValidatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
failurePolicy: Fail

mac:openshift-tests-private jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-11-065245   True        False         7h23m

Cluster version is 4.10.0-0.nightly-2022-01-11-065245.

This isn't an OLM issue. I've reached out to Tim Waugh (twaugh) from EDX. He's verified there's something wrong with the signature for that image and has requested someone from Cloud Distribution to get back to me. Once I hear back, hopefully we can get this to land on the right team ^^

Oh whoops - commented on the wrong bz - please ignore my previous comment.

I'm not finding this operator in the 4.10 catalog. Could you share your catalog with me? Alternatively, would you be able to verify with clusterbot whether this PR: openshift/operator-framework-olm#244 solves the issue?

(In reply to Per da Silva from comment #5)
> Alternatively, would you be able to verify with clusterbot whether this PR:
> openshift/operator-framework-olm#244 solves the issue?

How can I modify the env to include your PR? Also, the SRIOV operator is definitely available in the catalog.

> also, the SRIOV operator is definitely available in the catalog.
Just checked, and there is indeed no sriov operator for 4.10
I checked the PR using the cluster bot (launch openshift/operator-framework-olm#244), but unfortunately it still fails.
There is no sriov operator for 4.10 in the catalog, so I used the 4.9 version instead. I verified the fix is working by scaling the operator deployment down to 0.
Below is the procedure to reproduce it, including the operator installation.
oc create namespace openshift-sriov-network-operator
oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: mm-operators
  namespace: openshift-marketplace
spec:
  displayName: MM Operators
  icon:
    base64data: ''
    mediatype: ''
  image: 'registry.redhat.io/redhat/redhat-operator-index:v4.9'
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
EOF
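
# Before creating the OperatorGroup and Subscription, it can be worth confirming that
# the new catalog source is actually serving. A minimal check, assuming the mm-operators
# name above (the olm.catalogSource pod label is an assumption about how OLM labels catalog pods):
$ oc get catalogsource mm-operators -n openshift-marketplace -o jsonpath='{.status.connectionState.lastObservedState}'
$ oc get pods -n openshift-marketplace -l olm.catalogSource=mm-operators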
oc create -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
EOF
oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator
spec:
  channel: '4.9'
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: mm-operators
  sourceNamespace: openshift-marketplace
  startingCSV: sriov-network-operator.4.9.0-202201140920
EOF
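
# Once the Subscription is created, it may help to confirm that OLM actually resolved it
# before checking the workloads; an earlier comment noted a missing InstallPlan, so a quick
# check along these lines (CSV name taken from the startingCSV above) can save some confusion:
$ oc get installplan,csv -n openshift-sriov-network-operator
$ oc get csv sriov-network-operator.4.9.0-202201140920 -n openshift-sriov-network-operator -o jsonpath='{.status.phase}'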
# CHECK THE OPERATOR POD IS UP
$ oc get pod -n openshift-sriov-network-operator
NAME                                      READY   STATUS    RESTARTS   AGE
sriov-network-operator-6d4d9c66df-8cl2p   0/1     Running   0          15s
# CHECK THE DEPLOYMENT IS UP
$ oc get deployment -n openshift-sriov-network-operator
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
sriov-network-operator   0/1     1            0           16s
# CHECK THE FAILURE POLICY; IT SHOULD BE Fail
$ oc get validatingwebhookconfigurations sriov-operator-webhook-config -o yaml | grep failure
failurePolicy: Fail
# CHECK THAT THE CLEANUP WORKS WHEN THE DEPLOYMENT IS SCALED DOWN:
$ oc scale deployment --replicas=0 -n openshift-sriov-network-operator sriov-network-operator
$ oc get validatingwebhookconfigurations sriov-operator-webhook-config -o yaml | grep failure
failurePolicy: Ignore
# SCALE BACK TO 1 TO HAVE THE OPERATOR RUNNING:
$ oc scale deployment --replicas=1 -n openshift-sriov-network-operator sriov-network-operator
# THE WEBHOOK FAILUREPOLICY WILL GO BACK TO Fail
$ oc get validatingwebhookconfigurations sriov-operator-webhook-config -o yaml | grep failure
failurePolicy: Fail
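
# To watch the cleanup happen during the scale-down, it can help to follow the operator
# log in a second terminal before scaling to 0; the shutdown handler's messages (clearing
# finalizers, setting the webhook failure policies to Ignore) show up there:
$ oc logs -f deployment/sriov-network-operator -n openshift-sriov-network-operator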
# DELETE THE OPERATOR BY DELETING THE SUBSCRIPTION AND CSV
$ oc delete subscription -n openshift-sriov-network-operator sriov-network-operator
$ oc delete csv -n openshift-sriov-network-operator sriov-network-operator.4.9.0-202201140920
# CHECK THE OPERATOR POD IS REMOVED
$ oc get pod -n openshift-sriov-network-operator
# THE WEBHOOK FAILUREPOLICY WILL SADLY STILL BE Fail:
$ oc get validatingwebhookconfigurations sriov-operator-webhook-config -o yaml | grep failure
failurePolicy: Fail
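
# With the operator gone, nothing resets the policy, so webhook-gated requests in the
# namespace can start failing. A possible manual workaround is to flip the policy back by
# hand - a sketch only; it assumes the single webhook entry sits at index 0, and simply
# deleting the orphaned webhook configurations would also work:
$ oc patch validatingwebhookconfiguration sriov-operator-webhook-config \
    --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
$ oc patch mutatingwebhookconfiguration sriov-operator-webhook-config \
    --type=json -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'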
Hey, we're still trying to figure this one out. I've tested a fix using the instructions above. It didn't fix the problem, but take a look at the sriov operator logs:
2022-02-02T21:00:53.595Z INFO controller-runtime.manager.controller.sriovibnetwork Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovIBNetwork"}
2022-02-02T21:00:53.595Z INFO controller-runtime.manager.controller.sriovnetwork Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetwork"}
2022-02-02T21:00:53.596Z INFO controller-runtime.manager.controller.sriovnetworknodepolicy Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetworkNodePolicy"}
2022-02-02T21:00:53.597Z INFO controller-runtime.manager.controller.sriovnetworkpoolconfig Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetworkPoolConfig"}
2022-02-02T21:00:53.597Z INFO controller-runtime.manager.controller.sriovoperatorconfig Stopping workers {"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovOperatorConfig"}
2022-02-02T21:00:53.598Z ERROR controller-runtime.manager error received after stop sequence was engaged {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z ERROR controller-runtime.manager error received after stop sequence was engaged {"error": "context canceled"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z ERROR controller-runtime.manager error received after stop sequence was engaged {"error": "context canceled"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z INFO shutdown Clearing finalizers on exit
2022-02-02T21:00:53.607Z ERROR shutdown Failed to list SriovNetworks {"error": "sriovnetworks.sriovnetwork.openshift.io is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot list resource \"sriovnetworks\" in API group \"sriovnetwork.openshift.io\" at the cluster scope"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:144
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateFinalizers
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:42
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:29
main.main
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185
runtime.main
/usr/lib/golang/src/runtime/proc.go:225
2022-02-02T21:00:53.607Z INFO shutdown Done clearing finalizers on exit
2022-02-02T21:00:53.607Z INFO shutdown Seting webhook failure policies to Ignore on exit
2022-02-02T21:00:53.665Z ERROR shutdown Error getting webhook {"error": "validatingwebhookconfigurations.admissionregistration.k8s.io \"sriov-operator-webhook-config\" is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot get resource \"validatingwebhookconfigurations\" in API group \"admissionregistration.k8s.io\" at the cluster scope"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:144
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateValidatingWebhook
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:80
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateWebhooks
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:71
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:30
main.main
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185
runtime.main
/usr/lib/golang/src/runtime/proc.go:225
panic: runtime error: index out of range [0] with length 0
goroutine 1 [running]:
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateValidatingWebhook(0xc000c414a0)
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:82 +0x231
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateWebhooks()
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:71 +0x8f
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown()
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:30 +0x2a
main.main()
/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185 +0xbf5
So, it seems the fix has allowed the shutdown process to run, but it is failing.
I've tested this on CRC though.
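The forbidden errors above suggest the operator's RBAC had already been garbage-collected along with the CSV by the time the shutdown handler ran. One way to confirm is to impersonate the operator's service account after deleting the CSV (a sketch; resource and account names are taken from the log messages above):

# Check whether the permissions the shutdown handler needs survive CSV deletion
$ oc auth can-i list sriovnetworks.sriovnetwork.openshift.io --all-namespaces \
    --as=system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator
$ oc auth can-i get validatingwebhookconfigurations.admissionregistration.k8s.io \
    --as=system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator

If both come back "no", that would explain the panic at shutdown.go:82: the Get fails, the error is only logged, and indexing the empty Webhooks slice blows up.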
I've run a quick test against the current version of OLM, and I get the same log output. It seems the problem isn't related to the grace period, but rather that the shutdown procedure expects authorizations that were probably deleted together with the CSV. Resources created by the CSV are deleted in cascade, in no particular order. I'd also like to reiterate that I don't think relying on a graceful shutdown is the right approach: a node running the pod could crap out at any time for any number of reasons.

Closing this as WONTFIX due to lack of activity. Please re-open if it's still an issue.
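
The cascade described above can be observed directly. A sketch, assuming OLM's olm.owner label on the objects it creates for a CSV and the 4.9 CSV name from the reproduction steps; watching those objects while the CSV is deleted shows the ClusterRole/ClusterRoleBinding and the Deployment going away with no ordering guarantee:

# Watch CSV-owned objects disappear during deletion (olm.owner label is an assumption)
$ oc get clusterrole,clusterrolebinding -l olm.owner=sriov-network-operator.4.9.0-202201140920 -w &
$ oc get deployment -n openshift-sriov-network-operator -l olm.owner=sriov-network-operator.4.9.0-202201140920 -w &
$ oc delete csv -n openshift-sriov-network-operator sriov-network-operator.4.9.0-202201140920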