Add retry for etcd errors in kube-apiserver:
- Retry all non-mutating requests on error.
- Retry certain mutating requests on error, specifically the ones where we know "this action didn't do anything".

Notes:
- If this works well for us, we can propose it upstream.
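The retry policy above can be sketched as a small decision function. This is only an illustration of the idea, not the actual kube-apiserver Go implementation; the error strings used as "safe to retry" examples are hypothetical.

```shell
#!/bin/bash
# Illustrative sketch of the retry policy (not the real kube-apiserver code):
# non-mutating requests are always retried; mutating requests are retried
# only when the error guarantees the request never reached etcd.

should_retry() {
  local mutating=$1 err=$2
  if [ "$mutating" = "no" ]; then
    return 0                     # non-mutating: always safe to retry
  fi
  case "$err" in
    # Hypothetical examples of "this action didn't do anything" errors.
    "transport is closing"|"connection refused") return 0 ;;
    *) return 1 ;;               # ambiguous (e.g. timeout): may have applied
  esac
}

should_retry no  "request timed out"    && echo "retry GET"
should_retry yes "transport is closing" && echo "retry PUT (safe)"
should_retry yes "request timed out"    || echo "no retry PUT (ambiguous)"
```

The key asymmetry: a timed-out mutating request may or may not have been applied by etcd, so blindly retrying it could apply the change twice.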
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-21-001511   True        False         132m    Cluster version is 4.7.0-0.nightly-2020-10-21-001511

- Verification from the v(1) log - https://github.com/openshift/kubernetes/pull/327/files#diff-1fd56c8c7e4cdae284a08cdd8ad1fe0683f904cdca76db1e1e1c212f65a67b80R201

$ oc debug node/ip-xx-xx-132-94.us-east-2.compute.internal
sh-4.4# chroot /host

To create some network disruption in etcd, I first tried the following script, but its effect was relatively slow and required a long wait:

cat net-bother.sh
#!/bin/bash
set -ex
choice=$(oc get --namespace openshift-etcd --selector etcd pods -o json | jq -r '.items[] | .spec.nodeName + " " + (.status.containerStatuses[] | select(.name=="etcd") | .containerID[8:])')
IFS=' ' read node container_id <<< "$choice"
pid=$(oc debug --quiet nodes/$node -- chroot /host crictl inspect -o go-template --template '{{.info.pid}}' $container_id)
oc debug --quiet nodes/$node -- chroot /host strace -Tfe inject=fdatasync:delay_enter=800000 -e trace=fdatasync -p $pid

So I used the following method instead:

# cat test.sh
# Take the master's NIC down for a while, then bring it back up.
ifconfig ens5 down
sleep 300
ifconfig ens5 up

sh-4.4# chmod +x test.sh
sh-4.4# ./test.sh &

After a while, reconnect to the master above:

sh-4.4# cd /var/log/kube-apiserver/
sh-4.4# grep 'etcd retry - lastErrLabel' termination.log
I1022 05:54:18.458455      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:rpc error: code = Unavailable desc = transport is closing
I1022 05:54:18.995361      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
I1022 05:54:32.997730      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
...
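To summarize the grep output above, the retry entries can be tallied per error label. A small sketch, using sample lines copied from the termination.log output shown above:

```shell
#!/bin/bash
# Count retry log entries per lastErrLabel value.
# The sample lines mirror the termination.log output shown above.
log='I1022 05:54:18.458455 17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:rpc error: code = Unavailable desc = transport is closing
I1022 05:54:18.995361 17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
I1022 05:54:32.997730 17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out'

# Extract the label following "lastErrLabel:" and tally occurrences.
printf '%s\n' "$log" |
  sed -n 's/.*lastErrLabel: \([A-Za-z]*\).*/\1/p' |
  sort | uniq -c
# prints:       3 Unavailable
```

On a real node the same pipeline can be run against /var/log/kube-apiserver/termination.log directly.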
---------------------

- Verification from the new prometheus counter for etcd retry - https://github.com/openshift/kubernetes/pull/327/files#diff-a05a6147def4bf5e9b21033d642749b1d0085f0556e7b42b83f09c3486fd92daR66

Also, executing the query 'etcd_request_retry_total' in prometheus after the above test returns the etcd retry requests as below:

etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.132.94:6443",job="apiserver",namespace="default",service="kubernetes"} 10
etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.167.249:6443",job="apiserver",namespace="default",service="kubernetes"} 59
etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.208.134:6443",job="apiserver",namespace="default",service="kubernetes"} 75

The PR fix works as expected, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633