Bug 1874584
| Summary: | add retry for etcd errors in kube-apiserver | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Abu Kashem <akashem> |
| Component: | kube-apiserver | Assignee: | Abu Kashem <akashem> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.6 | CC: | aos-bugs, mfojtik, sttts, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:17:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Abu Kashem
2020-09-01 16:57:58 UTC
    $ oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.7.0-0.nightly-2020-10-21-001511   True        False         132m    Cluster version is 4.7.0-0.nightly-2020-10-21-001511

- Verification from v(1) log - https://github.com/openshift/kubernetes/pull/327/files#diff-1fd56c8c7e4cdae284a08cdd8ad1fe0683f904cdca76db1e1e1c212f65a67b80R201

    $ oc debug node/ip-xx-xx-132-94.us-east-2.compute.internal
    sh-4.4# chroot /host

To create some network disruption in etcd, I first tried the following script. The effect of this method is relatively slow, so you need to wait for some time.

    cat net-bother.sh
    #!/bin/bash
    set -ex
    choice=$(oc get --namespace openshift-etcd --selector etcd pods -o json | jq -r '.items[] | .spec.nodeName + " " + (.status.containerStatuses[] | select(.name=="etcd") | .containerID[8:])')
    IFS=' ' read node container_id <<< "$choice"
    pid=$(oc debug --quiet nodes/$node -- chroot /host crictl inspect -o go-template --template '{{.info.pid}}' $container_id)
    oc debug --quiet nodes/$node -- chroot /host strace -Tfe inject=fdatasync:delay_enter=800000 -e trace=fdatasync -p $pid

So I tried the following method instead:

    # cat test.sh
    # create a script that takes the master NIC down for a while
    ifconfig ens5 down
    sleep 300
    ifconfig ens5 up
    sh-4.4# chmod +x test.sh
    sh-4.4# ./test.sh &

After a while, reconnect to the master above:

    sh-4.4# cd /var/log/kube-apiserver/
    sh-4.4# grep 'etcd retry - lastErrLabel' termination.log
    I1022 05:54:18.458455 17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:rpc error: code = Unavailable desc = transport is closing
    I1022 05:54:18.995361 17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
    I1022 05:54:32.997730 17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
    ...
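When the log contains many retry entries, it can help to count them per error label rather than read the raw grep output. The following is a minimal sketch, assuming the log line format shown in the sample above (`etcd retry - lastErrLabel: <label> error:<message>`); the function name `summarize_etcd_retries` is hypothetical and not part of the fix.

```shell
#!/bin/bash
# Hypothetical helper: summarize "etcd retry" log lines by lastErrLabel.
# Assumes the line format seen in the sample log output:
#   ... etcd retry - lastErrLabel: <label> error:<message>
# Reads log text on stdin and prints one "<label> <count>" line per label.
summarize_etcd_retries() {
  grep 'etcd retry - lastErrLabel' \
    | sed -E 's/.*lastErrLabel: ([^ ]+) error:.*/\1/' \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}
```

It could be run on the node as `summarize_etcd_retries < termination.log`; for the three sample lines shown, it would print `Unavailable 3`.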
---------------------

- Verification from new prometheus counter for etcd retry - https://github.com/openshift/kubernetes/pull/327/files#diff-a05a6147def4bf5e9b21033d642749b1d0085f0556e7b42b83f09c3486fd92daR66

Also, executing the query 'etcd_request_retry_total' in Prometheus after the above test returns the etcd retry requests below:

    etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.132.94:6443",job="apiserver",namespace="default",service="kubernetes"} 10
    etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.167.249:6443",job="apiserver",namespace="default",service="kubernetes"} 59
    etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.208.134:6443",job="apiserver",namespace="default",service="kubernetes"} 75

The PR fix works as expected, so moving the bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
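The per-instance `etcd_request_retry_total` values from the Prometheus verification above can be totalled from copied query output. This is a rough sketch, assuming each sample is on its own line in the `metric{labels} value` form shown in that output and that label values contain no spaces; the function name `total_etcd_retries` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: sum etcd_request_retry_total sample values.
# Assumes one sample per line in the form 'metric{labels} value',
# with the numeric value as the last whitespace-separated field.
# Prints 0 when no matching samples are present.
total_etcd_retries() {
  awk '/^etcd_request_retry_total[{ ]/ { sum += $NF } END { print sum + 0 }'
}
```

Fed the three samples listed above (10, 59, and 75), it would print `144`.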