Bug 2039950 - SRIOV-operator pods are not shut down gracefully when the SRIOV-operator CSV is deleted
Summary: SRIOV-operator pods are not shut down gracefully when the SRIOV-operator CSV is deleted
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Per da Silva
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-12 18:50 UTC by Nikita
Modified: 2022-04-01 18:22 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-01 18:22:46 UTC
Target Upstream Version:
Embargoed:



Description Nikita 2022-01-12 18:50:15 UTC
Description of problem:

The SRIOV-operator performs some cleanup when its pods are shut down (here: https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/master/main.go#L178).
This cleanup works fine in almost all scenarios, whether the pods are deleted directly or the deployment is deleted (or scaled down). When the deployment is deleted, the pods are of course deleted via the ownerReference chain (deployment -> replicaset -> pods).
There is an issue, however, when the operator is installed via a subscription and a CSV appears (with the deployment having an ownerReference on the CSV). When the CSV is deleted, the pods are deleted, but the cleanup is not invoked. It looks like the pods are not given any time for a graceful shutdown. We tried adding additional logging to debug this, but nothing is logged (it is logged when there is no CSV and we delete the deployment). Theoretically it should not matter whether the deletion starts with the CSV or with the deployment, as the pods are deleted via a cascade of ownerReferences.
How is the pod shutdown different when we delete a CSV? How can we ensure a graceful shutdown of the pods in all cases?
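
For context, a minimal sketch of how such cleanup is typically wired in a controller-runtime based operator (an illustrative approximation, not the actual main.go): the cleanup only runs if the pod gets enough time after SIGTERM for the process to return from the manager and reach the cleanup call.

package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

// shutdownCleanup stands in for the operator's real cleanup
// (clearing finalizers, setting webhook failurePolicy to Ignore).
func shutdownCleanup() {}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}

	// SetupSignalHandler returns a context that is cancelled on SIGTERM/SIGINT;
	// Start blocks until that happens and then returns.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}

	// Only reached if the pod is allowed a graceful shutdown; if the process
	// is killed immediately, none of this runs.
	shutdownCleanup()
}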


Version-Release number of selected component (if applicable):
4.10/4.9/4.8

How reproducible:


Steps to Reproduce:
1. Delete the sriov-operator subscription:
> oc delete Subscription sriov-network-operator-subscription -n openshift-sriov-network-operator

2. Delete the CSV:
> oc delete clusterserviceversion sriov-network-operator.4.10.0-202111211622 -n openshift-sriov-network-operator

3. Delete the CRDs:
> oc delete crd sriovibnetworks.sriovnetwork.openshift.io sriovnetworknodepolicies.sriovnetwork.openshift.io sriovnetworknodestates.sriovnetwork.openshift.io sriovnetworkpoolconfigs.sriovnetwork.openshift.io sriovnetworks.sriovnetwork.openshift.io sriovoperatorconfigs.sriovnetwork.openshift.io


Actual results:
Check the webhooks - the bug is hit:

oc get mutatingwebhookconfigurations network-resources-injector-config -o yaml | grep failurePolicy

failurePolicy: Ignore 


oc get MutatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy

failurePolicy: Fail # Should be set to Ignore during graceful shutdown


oc get ValidatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy

failurePolicy: Fail  # Should be set to Ignore during graceful shutdown




Expected results:
Check the webhooks - the failurePolicy should have been set to Ignore:

oc get mutatingwebhookconfigurations network-resources-injector-config -o yaml | grep failurePolicy

failurePolicy: Ignore 


oc get MutatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy

failurePolicy: Ignore # Set to Ignore during graceful shutdown


oc get ValidatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy

failurePolicy: Ignore  # Set to Ignore during graceful shutdown
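
For reference, a minimal client-go sketch of the kind of update the shutdown routine is expected to perform (illustrative only, not the operator's actual code; the configuration name matches the one queried above):

package main

import (
	"context"
	"log"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// setFailurePolicyIgnore flips every webhook in the named
// ValidatingWebhookConfiguration to failurePolicy: Ignore, which is the
// state a graceful shutdown is expected to leave behind.
func setFailurePolicyIgnore(ctx context.Context, cs kubernetes.Interface, name string) error {
	cfg, err := cs.AdmissionregistrationV1().ValidatingWebhookConfigurations().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	ignore := admissionregistrationv1.Ignore
	for i := range cfg.Webhooks {
		cfg.Webhooks[i].FailurePolicy = &ignore
	}
	_, err = cs.AdmissionregistrationV1().ValidatingWebhookConfigurations().Update(ctx, cfg, metav1.UpdateOptions{})
	return err
}

func main() {
	restCfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(restCfg)
	if err := setFailurePolicyIgnore(context.Background(), cs, "sriov-operator-webhook-config"); err != nil {
		log.Fatal(err)
	}
}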

Comment 1 Jian Zhang 2022-01-14 09:52:27 UTC
mac:openshift-tests-private jianzhang$ oc get sub 
NAME                     PACKAGE                  SOURCE            CHANNEL
sriov-network-operator   sriov-network-operator   qe-app-registry   stable

It's strange that there is no InstallPlan:
mac:openshift-tests-private jianzhang$ oc get ip
No resources found in openshift-sriov-network-operator namespace.
mac:openshift-tests-private jianzhang$ oc get csv
NAME                                         DISPLAY                   VERSION               REPLACES   PHASE
sriov-network-operator.4.10.0-202201121612   SR-IOV Network Operator   4.10.0-202201121612              Succeeded

mac:openshift-tests-private jianzhang$ oc get pods
NAME                                      READY   STATUS    RESTARTS   AGE
network-resources-injector-77dck          1/1     Running   0          8m23s
network-resources-injector-btj2s          1/1     Running   0          8m23s
network-resources-injector-knxf9          1/1     Running   0          8m23s
operator-webhook-7pb7w                    1/1     Running   0          8m23s
operator-webhook-rsx9c                    1/1     Running   0          8m23s
operator-webhook-zksmx                    1/1     Running   0          8m23s
sriov-network-config-daemon-ckk4r         3/3     Running   0          8m23s
sriov-network-config-daemon-ffbws         3/3     Running   0          8m23s
sriov-network-config-daemon-nmbzn         3/3     Running   0          8m23s
sriov-network-operator-6b657d75bf-fq6l6   1/1     Running   0          8m56s

I just subscribed to the sriov-network-operator and didn't delete anything, yet the value of `failurePolicy` is Fail too.
mac:openshift-tests-private jianzhang$ oc get mutatingwebhookconfigurations network-resources-injector-config -o yaml | grep failurePolicy

  failurePolicy: Ignore
mac:openshift-tests-private jianzhang$ 
mac:openshift-tests-private jianzhang$ oc get MutatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
  failurePolicy: Fail
mac:openshift-tests-private jianzhang$ oc get ValidatingWebhookConfiguration sriov-operator-webhook-config -o yaml | grep failurePolicy
  failurePolicy: Fail

mac:openshift-tests-private jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-11-065245   True        False         7h23m   Cluster version is 4.10.0-0.nightly-2022-01-11-065245

Comment 2 Per da Silva 2022-01-21 17:04:38 UTC
This isn't an OLM issue. I've reached out to Tim Waugh (twaugh) from EDX. He's verified there's something wrong with the signature for that image and has requested someone from Cloud Distribution to get back to me. Once I hear back, hopefully we can get this to land on the right team ^^

Comment 3 Per da Silva 2022-01-21 18:46:31 UTC
Oh whoops - commented on the wrong bz - please ignore my previous comment

Comment 4 Per da Silva 2022-01-21 19:08:53 UTC
I'm not finding this operator in the 4.10 catalog. Could you share your catalog with me?

Comment 5 Per da Silva 2022-01-25 00:22:26 UTC
Alternatively, would you be able to verify with clusterbot whether this PR: openshift/operator-framework-olm#244 solves the issue?

Comment 6 Marcin Mirecki 2022-01-25 13:36:52 UTC
(In reply to Per da Silva from comment #5)
> Alternatively, would you be able to verify with clusterbot whether this PR:
> openshift/operator-framework-olm#244 solves the issue?

How can I modify the env to include your PR?

Also, the SRIOV operator is definitely available in the catalog.

Comment 7 Marcin Mirecki 2022-01-25 15:12:41 UTC
> also, the SRIOV operator is definitely available in the catalog.

Just checked, and there is indeed no sriov operator for 4.10

Comment 8 Marcin Mirecki 2022-01-26 13:30:38 UTC
I checked the PR using the cluster bot (launch openshift/operator-framework-olm#244), but unfortunately it still fails.
There is no sriov operator for 4.10 in the catalog, so I used the 4.9 version instead. I verified the fix is working by scaling the operator deployment down to 0.

Below is the procedure to reproduce it, including the operator installation.



oc create namespace openshift-sriov-network-operator

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: mm-operators
  namespace: openshift-marketplace
spec:
  displayName: MM Operators
  icon:
    base64data: ''
    mediatype: ''
  image: 'registry.redhat.io/redhat/redhat-operator-index:v4.9'
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
EOF

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
EOF

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator
  namespace: openshift-sriov-network-operator
spec:
  channel: '4.9'
  installPlanApproval: Automatic
  name: sriov-network-operator
  source: mm-operators
  sourceNamespace: openshift-marketplace
  startingCSV: sriov-network-operator.4.9.0-202201140920  
EOF

# CHECK THE OPERATOR POD IS UP
$oc get pod -n openshift-sriov-network-operator
NAME                                      READY   STATUS    RESTARTS   AGE
sriov-network-operator-6d4d9c66df-8cl2p   0/1     Running   0          15s

# CHECK THE DEPLOYMENT IS UP
$oc get deployment -n openshift-sriov-network-operator
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
sriov-network-operator   0/1     1            0           16s

#CHECK THE FAILURE POLICY, IT SHOULD BE: Fail
$oc get validatingwebhookconfigurations sriov-operator-webhook-config -oyaml |grep failure
 failurePolicy: Fail
 
# CHECK THAT THE CLEANUP WORKS WITH DEPLOYMENT SCALE-DOWN:
$oc scale deployment --replicas=0 -n openshift-sriov-network-operator sriov-network-operator
$oc get validatingwebhookconfigurations sriov-operator-webhook-config -oyaml |grep failure
 failurePolicy: Ignore
 
# SCALE BACK TO 1 TO HAVE THE OPERATOR RUNNING:
$oc scale deployment --replicas=1 -n openshift-sriov-network-operator sriov-network-operator
# THE WEBHOOK FAILUREPOLICY WILL GO BACK TO Fail
$oc get validatingwebhookconfigurations sriov-operator-webhook-config -oyaml |grep failure
 failurePolicy: Fail
 
# DELETE OPERATOR BY DELETING SUBSCRIPTION
$oc delete subscription -n openshift-sriov-network-operator sriov-network-operator
$oc delete csv -n openshift-sriov-network-operator sriov-network-operator.4.9.0-202201140920

# CHECK THE OPERATOR POD IS REMOVED
$oc get pod -n openshift-sriov-network-operator

# THE WEBHOOK FAILUREPOLICY WILL SADLY BE Fail:
$oc get validatingwebhookconfigurations sriov-operator-webhook-config -oyaml |grep failure
 failurePolicy: Fail

Comment 9 Per da Silva 2022-02-02 11:09:44 UTC
Hey, we're still trying to figure this one out.

Comment 10 Per da Silva 2022-02-02 21:04:30 UTC
I've tested a fix with the instructions from above. It didn't fix the problem, but looking at the sriov operator logs:

2022-02-02T21:00:53.595Z	INFO	controller-runtime.manager.controller.sriovibnetwork	Stopping workers	{"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovIBNetwork"}
2022-02-02T21:00:53.595Z	INFO	controller-runtime.manager.controller.sriovnetwork	Stopping workers	{"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetwork"}
2022-02-02T21:00:53.596Z	INFO	controller-runtime.manager.controller.sriovnetworknodepolicy	Stopping workers	{"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetworkNodePolicy"}
2022-02-02T21:00:53.597Z	INFO	controller-runtime.manager.controller.sriovnetworkpoolconfig	Stopping workers	{"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovNetworkPoolConfig"}
2022-02-02T21:00:53.597Z	INFO	controller-runtime.manager.controller.sriovoperatorconfig	Stopping workers	{"reconciler group": "sriovnetwork.openshift.io", "reconciler kind": "SriovOperatorConfig"}
2022-02-02T21:00:53.598Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "context canceled"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z	ERROR	controller-runtime.manager	error received after stop sequence was engaged	{"error": "context canceled"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:530
2022-02-02T21:00:53.598Z	INFO	shutdown	Clearing finalizers on exit
2022-02-02T21:00:53.607Z	ERROR	shutdown	Failed to list SriovNetworks	{"error": "sriovnetworks.sriovnetwork.openshift.io is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot list resource \"sriovnetworks\" in API group \"sriovnetwork.openshift.io\" at the cluster scope"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:144
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateFinalizers
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:42
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:29
main.main
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185
runtime.main
	/usr/lib/golang/src/runtime/proc.go:225
2022-02-02T21:00:53.607Z	INFO	shutdown	Done clearing finalizers on exit
2022-02-02T21:00:53.607Z	INFO	shutdown	Seting webhook failure policies to Ignore on exit
2022-02-02T21:00:53.665Z	ERROR	shutdown	Error getting webhook	{"error": "validatingwebhookconfigurations.admissionregistration.k8s.io \"sriov-operator-webhook-config\" is forbidden: User \"system:serviceaccount:openshift-sriov-network-operator:sriov-network-operator\" cannot get resource \"validatingwebhookconfigurations\" in API group \"admissionregistration.k8s.io\" at the cluster scope"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:144
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateValidatingWebhook
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:80
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateWebhooks
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:71
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:30
main.main
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185
runtime.main
	/usr/lib/golang/src/runtime/proc.go:225
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateValidatingWebhook(0xc000c414a0)
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:82 +0x231
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.updateWebhooks()
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:71 +0x8f
github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils.Shutdown()
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils/shutdown.go:30 +0x2a
main.main()
	/go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/main.go:185 +0xbf5

So, it seems the fix has allowed the shutdown process to run, but it is failing.
I've tested this on CRC though.
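
For illustration, a sketch of the kind of guard that would avoid the index-out-of-range panic above (names and structure are illustrative, not the actual pkg/utils/shutdown.go): return early when the Get fails or the configuration has no webhooks, instead of indexing Webhooks[0] unconditionally.

package shutdown

import (
	"context"
	"log"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func updateValidatingWebhook(ctx context.Context, cs kubernetes.Interface, name string) {
	cfg, err := cs.AdmissionregistrationV1().ValidatingWebhookConfigurations().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		log.Printf("error getting webhook: %v", err)
		return // the panic suggests the code continued past a failed Get
	}
	if len(cfg.Webhooks) == 0 {
		return // nothing to update; avoid indexing an empty slice
	}
	ignore := admissionregistrationv1.Ignore
	cfg.Webhooks[0].FailurePolicy = &ignore
	if _, err := cs.AdmissionregistrationV1().ValidatingWebhookConfigurations().Update(ctx, cfg, metav1.UpdateOptions{}); err != nil {
		log.Printf("error updating webhook: %v", err)
	}
}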

Comment 11 Per da Silva 2022-02-07 09:05:47 UTC
I've run a quick test against the current version of OLM, and I get the same log output. It seems the problem isn't related to the grace period, but rather that the shutdown procedure expects authorizations that were probably deleted together with the CSV.
Resources created by the CSV are deleted in cascade in no particular order. I'd also like to reiterate that I don't think relying on a graceful shutdown is the right approach. A node running the pod could crap out at any time for any number of reasons.
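
For illustration, a hypothetical helper (not part of the operator) showing how a shutdown step could tolerate the operator's RBAC being garbage-collected before its pod, treating a 403 at teardown as "nothing left to clean up" rather than a hard failure:

package shutdown

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// ignoreTeardownErrors treats Forbidden and NotFound as benign during
// shutdown: when the CSV is deleted, the operator's Role/RoleBinding can be
// removed before the pod exits, so cleanup calls may start returning 403s.
func ignoreTeardownErrors(err error) error {
	if err == nil || apierrors.IsForbidden(err) || apierrors.IsNotFound(err) {
		return nil
	}
	return err
}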

Comment 12 Per da Silva 2022-04-01 18:22:46 UTC
Closing this as wontfix due to lack of activity. Please re-open if it's still an issue.

