Bug 1989767

Summary:	kube-controller-manager needs to handle API server downtime gracefully in SNO
Product:	OpenShift Container Platform	Reporter:	Naga Ravi Chaitanya Elluri <nelluri>
Component:	kube-controller-manager	Assignee:	Jan Chaloupka <jchaloup>
Status:	CLOSED DEFERRED	QA Contact:	zhou ying <yinzhou>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.9	CC:	mfojtik, nelluri, wlewis
Target Milestone:	---	Flags:	mfojtik: needinfo?
Target Release:	---
Hardware:	Unspecified
OS:	Linux
Whiteboard:	chaos LifecycleStale
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-01-16 10:45:24 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1984730

Description Naga Ravi Chaitanya Elluri 2021-08-03 23:48:12 UTC

Description of problem:
kube-controller-manager leader election lease duration is set to 15 seconds which is causing it to go through leader elections and restart during the kube-apiserver rollout which currently takes around 60 seconds with shutdown-delay-duration and gracefulTerminationDuration is now set to 0 and 15 seconds ( https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104 ). kube-controller-manager leader election timeout should be set to > 60 seconds to handle the downtime gracefully in SNO.

Recommended lease duration values to be considered for reference as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183:

LeaseDuration=137s, RenewDealine=107s, RetryPeriod=26s.
These are the configurable values in k8s.io/client-go based leases and controller-runtime exposes them.
This gives us
1. clock skew tolerance == 30s
2. kube-apiserver downtime tolerance == 78s
3. worst non-graceful lease reacquisition == 163s
4. worst graceful lease reacquisition == 26s

We can see that leader lease failures in the log: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/kube-controller-manager/kube-controller-manager-pod.log and that maps to the restarts: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/kube-controller-manager/cerberus_cluster_state.log. We can disable the leader election as well given that there's no HA.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-01-132055

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger kube-apiserver rollout or outage which lasts for at least 60 seconds ( kube-apiserver rollout on a cluster built using payload with https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 should take ~60 seconds ) - $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}' where X can be 1,2...n
3. Observe the state of kube-controller-manager.

Actual results:
kube-controller-manager goes through leader election and restarts.

Expected results:
kube-controller-manager should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/kube-controller-manager/

Comment 2 Maciej Szulik 2021-08-09 14:26:18 UTC

Mike this is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1984608

Comment 3 Naga Ravi Chaitanya Elluri 2021-08-24 14:44:01 UTC

Any update on this? Thanks.

Comment 4 Mike Dame 2021-08-24 14:58:09 UTC

Hi, yes I was just talking about this with David today. From his comment on the related PR:

> This one in particular we should not change until graceful release becomes a thing for the KCM.
> 
> Without graceful release, the average case of leader changes is directly related to the lease duration. Having a tight loop there is expensive, but the cost of waiting for two minutes for every normal release is too high to be reasonable.
> 
> The KCM is special in the sense that it is difficult to gracefully release and having a leader is critical to the smooth running of the cluster.
> 
> https://github.com/kubernetes/kubernetes/pull/101379 and https://github.com/kubernetes/kubernetes/pull/101125 are in progress to make the process gracefully release
> 
> /hold until the KCM can gracefully release the lease.

So while we have the updated values and it is possible to make the change, this is dependent on the PRs he mentioned merging upstream to allow graceful lock release in KCM (ref: https://issues.redhat.com/browse/WRKLDS-261)

If this bug is a critical blocker for this release, we will need to evaluate options such as carry patches for the upstream changes in order to unblock it.

Comment 9 Maciej Szulik 2022-01-28 12:23:53 UTC

This was fixed in https://github.com/openshift/library-go/pull/1242 and picked into kcm-o in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/581

Comment 12 zhou ying 2022-01-30 06:07:07 UTC

still could reproduce the issue now :

[root@localhost roottest]# oc get clusterversion 
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-arm64-2022-01-29-162645   True        False         37m     Cluster version is 4.10.0-0.nightly-arm64-2022-01-29-162645


kubeapiserver.operator.openshift.io/cluster patched
[root@localhost roottest]# oc get co kube-apiserver -w 
NAME             VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-arm64-2022-01-29-162645   True        False         False      45m     
kube-apiserver   4.10.0-0.nightly-arm64-2022-01-29-162645   True        False         False      45m     
kube-apiserver   4.10.0-0.nightly-arm64-2022-01-29-162645   True        False         False      45m     
kube-apiserver   4.10.0-0.nightly-arm64-2022-01-29-162645   True        True          False      45m     NodeInstallerProgressing: 1 nodes are at revision 10; 0 nodes have achieved new revision 11



I0130 06:00:17.860585       1 garbagecollector.go:580] "Deleting object" object="openshift-operator-lifecycle-manager/collect-profiles-27391995-6dhnb" objectUID=0e42ac53-9bc0-422c-859e-da3b273a94b9 kind="Pod" propagationPolicy=Background
E0130 06:02:30.140259       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.0.220.57:6443: connect: connection refused
E0130 06:02:33.077150       1 resource_quota_controller.go:413] failed to discover resources: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/api": dial tcp 10.0.220.57:6443: connect: connection refused
W0130 06:02:33.080104       1 garbagecollector.go:709] failed to discover preferred resources: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/api": dial tcp 10.0.166.160:6443: connect: connection refused
I0130 06:02:33.080127       1 garbagecollector.go:181] no resources reported by discovery, skipping garbage collector sync
E0130 06:02:33.147759       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.0.153.179:6443: connect: connection refused
E0130 06:02:36.146760       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.0.220.57:6443: connect: connection refused
E0130 06:02:39.148401       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.0.166.160:6443: connect: connection refused
I0130 06:02:40.135068       1 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F0130 06:02:40.135155       1 controllermanager.go:330] leaderelection lost
[root@localhost roottest]# oc logs -f po/kube-controller-manager-ip-10-0-153-252.us-east-2.compute.internal
+ timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10257 \))" ]; do sleep 1; done'
++ ss -Htanop '(' sport = 10257 ')'
......


I0130 06:03:17.790530       1 flags.go:64] FLAG: --add-dir-header="false"
I0130 06:03:17.790622       1 flags.go:64] FLAG: --address="127.0.0.1"
I0130 06:03:17.790629       1 flags.go:64] FLAG: --allocate-node-cidrs="false"
I0130 06:03:17.790634       1 flags.go:64] FLAG: --allow-metric-labels="[]"

Comment 13 Michal Fojtik 2022-03-01 06:57:16 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 15 Jan Chaloupka 2023-01-16 10:45:24 UTC

If this is still an issue please report the bug against https://issues.redhat.com/browse/OCPBUGS