Description of problem:

Running `etcd` defrag on OpenShift Container Platform 4 with an `etcd` database size bigger than 6 GB causes the NFD Operator to fail and restart due to lease renewal failure. Since a disruption of roughly 30 seconds is expected on the `etcd` side during the defrag activity, we need to be able to deal with this kind of situation and allow better fault tolerance.

In https://github.com/openshift/library-go/blob/4362aa519714a4b62b00ab8318197ba2bba51cb7/pkg/config/leaderelection/leaderelection.go#L104 the value is set to 60 seconds, and there is also an explanation of why this value was chosen. In https://github.com/kubernetes-sigs/controller-runtime/blob/v0.11.1/pkg/manager/manager.go#L182-L184 the default value is much smaller and therefore does not provide much failure tolerance.

This is especially important in environments with large `etcd` databases, as restarting the Operator causes all CSVs to be updated with the Operator status, producing a spike in `etcd` database size instead of actually reducing it.

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.8

How reproducible:
- Always

Steps to Reproduce:
1. Set up OpenShift Container Platform 4 with the NFD Operator
2. Load etcd with 6 GB or more in database size
3. Run the etcd defrag activity as per https://docs.openshift.com/container-platform/4.9/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks
4. Watch the NFD Operator restart and cause CSVs to be updated

Actual results:
NFD Operator is restarting due to lease renewal failure.

Expected results:
NFD Operator to have more fault tolerance and therefore not fail when etcd is unavailable for a short period of time due to defrag activity.

Additional info:
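For context, a minimal sketch of where the timing gap comes from, assuming controller-runtime v0.11.x and spelling out its documented defaults explicitly (the operator's actual main.go may differ). With these values, a ~30 second etcd/API disruption exceeds the renew deadline, the leader cannot renew its lease, and the manager process exits:

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// controller-runtime defaults (see the manager.go link above):
// a ~30s etcd defrag pause is roughly three times the RenewDeadline,
// so lease renewal fails and the Operator restarts.
leaseDuration := 15 * time.Second
renewDeadline := 10 * time.Second
retryPeriod := 2 * time.Second

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	LeaderElection:   true,
	LeaderElectionID: "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
	LeaseDuration:    &leaseDuration,
	RenewDeadline:    &renewDeadline,
	RetryPeriod:      &retryPeriod,
})
if err != nil {
	// handle error
}
_ = mgr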
Hi all,

Just adding a pointer to some details about Leader Election that may be useful:

https://sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#leader-election

Thanks and all the best,
Simon Reber
would this change help

renewDeadline := 60 * time.Second // <--- here
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:                 scheme,
	MetricsBindAddress:     metricsAddr,
	Port:                   9443,
	HealthProbeBindAddress: probeAddr,
	LeaderElection:         enableLeaderElection,
	LeaderElectionID:       "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
	Namespace:              watchNamespace,
	RenewDeadline:          &renewDeadline, // <--- here
})

?
(In reply to Carlos Eduardo Arango Gutierrez from comment #2)
> would this change help
>
> renewDeadline := 60 * time.Second // <--- here
> mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
>     Scheme: scheme,
>     MetricsBindAddress: metricsAddr,
>     Port: 9443,
>     HealthProbeBindAddress: probeAddr,
>     LeaderElection: enableLeaderElection,
>     LeaderElectionID: "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
>     Namespace: watchNamespace,
>     RenewDeadline: &renewDeadline, // <--- here
> })
>
> ?

If that is the timeout for lease renewal, then yes, I think this should improve the experience (please also evaluate whether the change has any other impact on the Operator). Also, is there a request timeout set for requests towards the OpenShift Container Platform 4 API? That might need some tweaking as well, since we may also see timeouts in this area.
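One thing to keep in mind when evaluating the change: as far as I can tell, client-go's leader election requires LeaseDuration to be greater than RenewDeadline (and RenewDeadline to be greater than roughly 1.2 * RetryPeriod), so raising only RenewDeadline to 60s while leaving LeaseDuration at the controller-runtime default of 15s would be rejected at startup. A minimal sketch of a consistent set of values, extending the snippet from comment #2; the 60s renew deadline is the value proposed there, the other numbers are illustrative and should be validated, not a definitive recommendation:

leaseDuration := 90 * time.Second // illustrative; must stay larger than renewDeadline
renewDeadline := 60 * time.Second // long enough to ride out a ~30s etcd defrag pause
retryPeriod := 15 * time.Second   // illustrative; renewDeadline must exceed ~1.2 * retryPeriod

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:                 scheme,
	MetricsBindAddress:     metricsAddr,
	Port:                   9443,
	HealthProbeBindAddress: probeAddr,
	LeaderElection:         enableLeaderElection,
	LeaderElectionID:       "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
	Namespace:              watchNamespace,
	LeaseDuration:          &leaseDuration,
	RenewDeadline:          &renewDeadline,
	RetryPeriod:            &retryPeriod,
})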
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.11.0 extras and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5070