Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2058256

Summary: LeaseDuration for NFD Operator seems to be rather small, causing Operator restarts when running etcd defrag
Product: OpenShift Container Platform Reporter: Simon Reber <sreber>
Component: Node Feature Discovery Operator Assignee: Carlos Eduardo Arango Gutierrez <carangog>
Status: CLOSED ERRATA QA Contact: Lena Horsley <lhorsley>
Severity: high Docs Contact:
Priority: high    
Version: 4.8 CC: scuppett, sejug
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2065148 Environment:
Last Closed: 2022-08-10 10:23:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2065148    

Description Simon Reber 2022-02-24 15:27:22 UTC
Description of problem:

Running an `etcd` defrag on OpenShift Container Platform 4 with an `etcd` database larger than 6 GB causes the NFD Operator to fail and restart because its lease renewal fails.

Since a disruption of about 30 seconds is expected on the `etcd` side during the defrag activity, the Operator needs to handle this situation and offer better fault tolerance.

In https://github.com/openshift/library-go/blob/4362aa519714a4b62b00ab8318197ba2bba51cb7/pkg/config/leaderelection/leaderelection.go#L104 the value is set to 60 seconds, and the code also explains why this value was chosen.

In https://github.com/kubernetes-sigs/controller-runtime/blob/v0.11.1/pkg/manager/manager.go#L182-L184 the default value is much smaller and thus leaves little tolerance for failures.

This is especially important in environments with large `etcd` databases, because restarting the Operator causes all CSVs to be updated with the Operator status. That produces a spike in `etcd` database size instead of actually reducing it.

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.8

How reproducible:

 - Always

Steps to Reproduce:
1. Set up OpenShift Container Platform 4 with the NFD Operator
2. Load etcd with 6 GB or more in database size
3. Run the etcd defrag activity as per https://docs.openshift.com/container-platform/4.9/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks
4. Watch how the NFD Operator restarts and causes the CSVs to be updated

Actual results:

NFD Operator is restarting due to lease renewal failure

Expected results:

NFD Operator to have more fault tolerance and therefore not fail when etcd is unavailable for a short period of time due to defrag activity.

Additional info:

Comment 1 Simon Reber 2022-02-24 15:42:40 UTC
Hi all,

Just adding some details about Leader Election that may be useful.

 + https://sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#leader-election

Thanks and all the best,
Simon Reber

Comment 2 Carlos Eduardo Arango Gutierrez 2022-02-24 15:48:39 UTC
would this change help

	renewDeadline := 60 * time.Second // <--- here
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:                 scheme,
		MetricsBindAddress:     metricsAddr,
		Port:                   9443,
		HealthProbeBindAddress: probeAddr,
		LeaderElection:         enableLeaderElection,
		LeaderElectionID:       "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
		Namespace:              watchNamespace, 
		RenewDeadline:          &renewDeadline, // <--- here
	})

?

Comment 4 Simon Reber 2022-02-24 16:10:45 UTC
(In reply to Carlos Eduardo Arango Gutierrez from comment #2)
> would this change help
> 
> 	renewDeadline := 60 * time.Second // <--- here
> 	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
> 		Scheme:                 scheme,
> 		MetricsBindAddress:     metricsAddr,
> 		Port:                   9443,
> 		HealthProbeBindAddress: probeAddr,
> 		LeaderElection:         enableLeaderElection,
> 		LeaderElectionID:       "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
> 		Namespace:              watchNamespace, 
> 		RenewDeadline:          &renewDeadline, // <--- here
> 	})
> 
> ?
If that is for lease renewal, then I think this should improve the experience (please also evaluate whether this change has any impact on the Operator). Also, do you have a request timeout set for requests to the OpenShift Container Platform 4 API? That might need some tweaking as well, since we may also see timeouts in this area.

Comment 11 errata-xmlrpc 2022-08-10 10:23:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.11.0 extras and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5070