Bug 1874584 - add retry for etcd errors in kube-apiserver
Summary: add retry for etcd errors in kube-apiserver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-01 16:57 UTC by Abu Kashem
Modified: 2021-02-24 15:17 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:17:02 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift kubernetes pull 327 | 0 | None | closed | Bug 1874584: UPSTREAM: <carry>: retry etcd errors | 2021-01-11 03:03:34 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:17:26 UTC

Description Abu Kashem 2020-09-01 16:57:58 UTC
Add retry for etcd errors in kube-apiserver.

- retry all non-mutating requests on error.
- retry certain mutating requests on error: the ones where we know "this action didn't do anything".

Notes:
- If this works well for us, we can propose it upstream.

Comment 2 Ke Wang 2020-10-22 07:14:20 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-21-001511   True        False         132m    Cluster version is 4.7.0-0.nightly-2020-10-21-001511

- Verification from v(1) log - https://github.com/openshift/kubernetes/pull/327/files#diff-1fd56c8c7e4cdae284a08cdd8ad1fe0683f904cdca76db1e1e1c212f65a67b80R201
$ oc debug node/ip-xx-xx-132-94.us-east-2.compute.internal

sh-4.4# chroot /host

To create some disruption in etcd, I first tried the following script. Its effect is relatively slow to appear, so it requires waiting for some time.

cat net-bother.sh
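#!/bin/bash
set -ex
# List etcd pods as "<node> <container id>"; [8:] strips the 8-character runtime prefix (e.g. "cri-o://") from the container ID.
choice=$(oc get --namespace openshift-etcd --selector etcd pods -o json | jq -r '.items[] | .spec.nodeName + " " + (.status.containerStatuses[] | select(.name=="etcd") | .containerID[8:])')
# read consumes only the first line of $choice, i.e. the first etcd pod listed.
IFS=' ' read -r node container_id <<< "$choice"
# Look up the host PID of the etcd container.
pid=$(oc debug --quiet "nodes/$node" -- chroot /host crictl inspect -o go-template --template '{{.info.pid}}' "$container_id")
# Attach strace to etcd and delay each fdatasync call on entry by 800000 microseconds (0.8 s).
oc debug --quiet "nodes/$node" -- chroot /host strace -Tfe inject=fdatasync:delay_enter=800000 -e trace=fdatasync -p "$pid"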


So I tried the following method instead:
# cat test.sh # a script that takes the master's NIC down for a while
ifconfig ens5 down
sleep 300
ifconfig ens5 up

sh-4.4# chmod +x test.sh

sh-4.4# ./test.sh &

After a while, reconnect to the same master:
sh-4.4# cd /var/log/kube-apiserver/

sh-4.4# grep 'etcd retry - lastErrLabel' termination.log
I1022 05:54:18.458455      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:rpc error: code = Unavailable desc = transport is closing
I1022 05:54:18.995361      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
I1022 05:54:32.997730      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
...

---------------------
- Verification from new prometheus counter for etcd retry - https://github.com/openshift/kubernetes/pull/327/files#diff-a05a6147def4bf5e9b21033d642749b1d0085f0556e7b42b83f09c3486fd92daR66
 
Executing the query 'etcd_request_retry_total' in Prometheus after the above test also shows the etcd retry requests, as below:

etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.132.94:6443",job="apiserver",namespace="default",service="kubernetes"}	10
etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.167.249:6443",job="apiserver",namespace="default",service="kubernetes"}	59
etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.208.134:6443",job="apiserver",namespace="default",service="kubernetes"}	75

The PR fix works as expected, so moving the bug to VERIFIED.

Comment 5 errata-xmlrpc 2021-02-24 15:17:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

