Bug 1989767
| Summary: | kube-controller-manager needs to handle API server downtime gracefully in SNO | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> |
| Component: | kube-controller-manager | Assignee: | Jan Chaloupka <jchaloup> |
| Status: | CLOSED DEFERRED | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.9 | CC: | mfojtik, nelluri, wlewis |
| Target Milestone: | --- | Flags: | mfojtik:
needinfo?
|
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | chaos LifecycleStale | ||
| Fixed In Version: | Doc Type: | No Doc Update | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-01-16 10:45:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1984730 | ||
|
Description
Naga Ravi Chaitanya Elluri
2021-08-03 23:48:12 UTC
Mike this is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1984608 Any update on this? Thanks. Hi, yes I was just talking about this with David today. From his comment on the related PR: > This one in particular we should not change until graceful release becomes a thing for the KCM. > > Without graceful release, the average case of leader changes is directly related to the lease duration. Having a tight loop there is expensive, but the cost of waiting for two minutes for every normal release is too high to be reasonable. > > The KCM is special in the sense that it is difficult to gracefully release and having a leader is critical to the smooth running of the cluster. > > https://github.com/kubernetes/kubernetes/pull/101379 and https://github.com/kubernetes/kubernetes/pull/101125 are in progress to make the process gracefully release > > /hold until the KCM can gracefully release the lease. So while we have the updated values and it is possible to make the change, this is dependent on the PRs he mentioned merging upstream to allow graceful lock release in KCM (ref: https://issues.redhat.com/browse/WRKLDS-261) If this bug is a critical blocker for this release, we will need to evaluate options such as carry patches for the upstream changes in order to unblock it. This was fixed in https://github.com/openshift/library-go/pull/1242 and picked into kcm-o in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/581 still could reproduce the issue now : [root@localhost roottest]# oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.10.0-0.nightly-arm64-2022-01-29-162645 True False 37m Cluster version is 4.10.0-0.nightly-arm64-2022-01-29-162645 kubeapiserver.operator.openshift.io/cluster patched [root@localhost roottest]# oc get co kube-apiserver -w NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE kube-apiserver 4.10.0-0.nightly-arm64-2022-01-29-162645 True False False 45m kube-apiserver 4.10.0-0.nightly-arm64-2022-01-29-162645 True False False 45m kube-apiserver 4.10.0-0.nightly-arm64-2022-01-29-162645 True False False 45m kube-apiserver 4.10.0-0.nightly-arm64-2022-01-29-162645 True True False 45m NodeInstallerProgressing: 1 nodes are at revision 10; 0 nodes have achieved new revision 11 I0130 06:00:17.860585 1 garbagecollector.go:580] "Deleting object" object="openshift-operator-lifecycle-manager/collect-profiles-27391995-6dhnb" objectUID=0e42ac53-9bc0-422c-859e-da3b273a94b9 kind="Pod" propagationPolicy=Background E0130 06:02:30.140259 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.0.220.57:6443: connect: connection refused E0130 06:02:33.077150 1 resource_quota_controller.go:413] failed to discover resources: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/api": dial tcp 10.0.220.57:6443: connect: connection refused W0130 06:02:33.080104 1 garbagecollector.go:709] failed to discover preferred resources: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/api": dial tcp 10.0.166.160:6443: connect: connection refused I0130 06:02:33.080127 1 garbagecollector.go:181] no resources reported by discovery, skipping garbage collector sync E0130 06:02:33.147759 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.0.153.179:6443: connect: connection refused E0130 06:02:36.146760 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.0.220.57:6443: connect: connection refused E0130 06:02:39.148401 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.yinzhou-bugr.qe.devcluster.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 10.0.166.160:6443: connect: connection refused I0130 06:02:40.135068 1 leaderelection.go:283] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition F0130 06:02:40.135155 1 controllermanager.go:330] leaderelection lost [root@localhost roottest]# oc logs -f po/kube-controller-manager-ip-10-0-153-252.us-east-2.compute.internal + timeout 3m /bin/bash -exuo pipefail -c 'while [ -n "$(ss -Htanop \( sport = 10257 \))" ]; do sleep 1; done' ++ ss -Htanop '(' sport = 10257 ')' ...... I0130 06:03:17.790530 1 flags.go:64] FLAG: --add-dir-header="false" I0130 06:03:17.790622 1 flags.go:64] FLAG: --address="127.0.0.1" I0130 06:03:17.790629 1 flags.go:64] FLAG: --allocate-node-cidrs="false" I0130 06:03:17.790634 1 flags.go:64] FLAG: --allow-metric-labels="[]" This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that. If this is still an issue please report the bug against https://issues.redhat.com/browse/OCPBUGS |