Description of problem:

Running `etcd` defrag on OpenShift Container Platform 4 with an `etcd` database size bigger than 6 GB causes the NFD Operator to fail and restart due to lease renewal failure. Since a disruption of roughly 30 seconds is expected on the `etcd` side during the defrag activity, we need to be able to deal with this kind of situation and allow better fault tolerance.

In https://github.com/openshift/library-go/blob/4362aa519714a4b62b00ab8318197ba2bba51cb7/pkg/config/leaderelection/leaderelection.go#L104 the value is set to 60 seconds, and there is also an explanation of why this value was chosen. In https://github.com/kubernetes-sigs/controller-runtime/blob/v0.11.1/pkg/manager/manager.go#L182-L184 the default value is much smaller and therefore does not provide much failure tolerance.

This is especially important in environments with large `etcd` databases, as restarting the Operator causes all CSVs to be updated with the Operator status, producing a spike in `etcd` database size instead of actually reducing it.

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.8

How reproducible:
- Always

Steps to Reproduce:
1. Set up OpenShift Container Platform 4 with the NFD Operator
2. Load etcd with 6 GB or more in database size
3. Run the etcd defrag activity as per https://docs.openshift.com/container-platform/4.9/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks
4. Watch the NFD Operator restart and cause CSVs to be updated

Actual results:
NFD Operator is restarting due to lease renewal failure.

Expected results:
NFD Operator to have more fault tolerance and therefore not fail when etcd is unavailable for a short period of time due to defrag activity.

Additional info:
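For context, a minimal sketch of where the timing gap comes from, assuming controller-runtime v0.11.x and spelling out its documented defaults explicitly (the operator's actual main.go may differ). With these values, a ~30 second etcd/API disruption exceeds the renew deadline, the leader cannot renew its lease, and the manager process exits:

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// controller-runtime defaults (see the manager.go link above):
// a ~30s etcd defrag pause is roughly three times the RenewDeadline,
// so lease renewal fails and the Operator restarts.
leaseDuration := 15 * time.Second
renewDeadline := 10 * time.Second
retryPeriod := 2 * time.Second

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	LeaderElection:   true,
	LeaderElectionID: "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
	LeaseDuration:    &leaseDuration,
	RenewDeadline:    &renewDeadline,
	RetryPeriod:      &retryPeriod,
})
if err != nil {
	// handle error
}
_ = mgr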
Hi all,

Just adding a pointer to some details about Leader Election that may be useful:

https://sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#leader-election

Thanks and all the best,
Simon Reber
would this change help

renewDeadline := 60 * time.Second // <--- here
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:                 scheme,
	MetricsBindAddress:     metricsAddr,
	Port:                   9443,
	HealthProbeBindAddress: probeAddr,
	LeaderElection:         enableLeaderElection,
	LeaderElectionID:       "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
	Namespace:              watchNamespace,
	RenewDeadline:          &renewDeadline, // <--- here
})

?
(In reply to Carlos Eduardo Arango Gutierrez from comment #2)
> would this change help
>
> renewDeadline := 60 * time.Second // <--- here
> mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
>     Scheme: scheme,
>     MetricsBindAddress: metricsAddr,
>     Port: 9443,
>     HealthProbeBindAddress: probeAddr,
>     LeaderElection: enableLeaderElection,
>     LeaderElectionID: "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
>     Namespace: watchNamespace,
>     RenewDeadline: &renewDeadline, // <--- here
> })
>
> ?

If that is the timeout for lease renewal, then yes, I think this should improve the experience (please also evaluate whether the change has any other impact on the Operator). Also, is there a request timeout set for requests towards the OpenShift Container Platform 4 API? That might need some tweaking as well, since we may also see timeouts in this area.
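One thing to keep in mind when evaluating the change: as far as I can tell, client-go's leader election requires LeaseDuration to be greater than RenewDeadline (and RenewDeadline to be greater than roughly 1.2 * RetryPeriod), so raising only RenewDeadline to 60s while leaving LeaseDuration at the controller-runtime default of 15s would be rejected at startup. A minimal sketch of a consistent set of values, extending the snippet from comment #2; the 60s renew deadline is the value proposed there, the other numbers are illustrative and should be validated, not a definitive recommendation:

leaseDuration := 90 * time.Second // illustrative; must stay larger than renewDeadline
renewDeadline := 60 * time.Second // long enough to ride out a ~30s etcd defrag pause
retryPeriod := 15 * time.Second   // illustrative; renewDeadline must exceed ~1.2 * retryPeriod

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:                 scheme,
	MetricsBindAddress:     metricsAddr,
	Port:                   9443,
	HealthProbeBindAddress: probeAddr,
	LeaderElection:         enableLeaderElection,
	LeaderElectionID:       "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
	Namespace:              watchNamespace,
	LeaseDuration:          &leaseDuration,
	RenewDeadline:          &renewDeadline,
	RetryPeriod:            &retryPeriod,
})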
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.11.0 extras and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5070