
Bug 1874584

Summary: add retry for etcd errors in kube-apiserver
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.6
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Reporter: Abu Kashem <akashem>
Assignee: Abu Kashem <akashem>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, mfojtik, sttts, xxia
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-02-24 15:17:02 UTC

Description Abu Kashem 2020-09-01 16:57:58 UTC
Add retry for etcd errors in kube-apiserver.

- retry all non-mutating requests on error.
- retry certain mutating requests on errors: the ones where we know that "this action didn't do anything".

Notes:
- If this works well for us, we can propose it upstream.
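
The first rule above (retry non-mutating requests on error) is safe because repeating a read changes nothing. As a rough illustration only, and not the kube-apiserver implementation (which lives in the Go code of the linked PR), the idea can be sketched in shell. The function name `retry_readonly` and the `RETRY_DELAY` variable are hypothetical.

```shell
#!/bin/bash
# Illustrative sketch only; not the kube-apiserver retry code.
# retry_readonly MAX_ATTEMPTS CMD [ARGS...]
# Reruns a read-only command until it succeeds or MAX_ATTEMPTS is reached.
# Only use this for commands that are safe to repeat (reads).
retry_readonly() {
  local max=$1 attempt=1
  shift
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1                      # give up: every attempt failed
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-1}"       # hypothetical knob: pause between attempts
  done
}
```

For example, `retry_readonly 5 oc get clusterversion` would rerun the read up to five times. A mutating command should not be wrapped this way unless it is known that a failed attempt changed nothing.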

Comment 2 Ke Wang 2020-10-22 07:14:20 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-21-001511   True        False         132m    Cluster version is 4.7.0-0.nightly-2020-10-21-001511

- Verification from v(1) log - https://github.com/openshift/kubernetes/pull/327/files#diff-1fd56c8c7e4cdae284a08cdd8ad1fe0683f904cdca76db1e1e1c212f65a67b80R201
$ oc debug node/ip-xx-xx-132-94.us-east-2.compute.internal

sh-4.4# chroot /host

To create some network disruption in etcd, I first tried the following script. This method takes effect relatively slowly, so you need to wait for some time.

cat net-bother.sh
#!/bin/bash
set -ex
choice=$(oc get --namespace openshift-etcd --selector etcd pods -o json | jq -r '.items[] | .spec.nodeName + " " + (.status.containerStatuses[] | select(.name=="etcd") | .containerID[8:])')
IFS=' ' read node container_id <<< "$choice"
pid=$(oc debug --quiet nodes/$node -- chroot /host crictl inspect -o go-template --template '{{.info.pid}}' $container_id)
oc debug --quiet nodes/$node -- chroot /host strace -Tfe inject=fdatasync:delay_enter=800000 -e trace=fdatasync -p $pid


So I tried the following method instead:
# cat test.sh # script that takes the master's NIC down for a while
ifconfig ens5 down
sleep 300
ifconfig ens5 up

sh-4.4# chmod +x test.sh

sh-4.4# ./test.sh &
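
One caveat with test.sh as written: if it is interrupted between the `down` and the `up`, the NIC stays down. A possible variant is sketched below, assuming bash and the same `ifconfig` usage as above. The function name `nic_outage` and the `IFACE_CMD` override are hypothetical and exist only so the logic can be dry-run without root.

```shell
#!/bin/bash
# Sketch: take a NIC down for a while and bring it back up when done.
# nic_outage IFACE SECONDS
# IFACE_CMD is a hypothetical override (defaults to ifconfig) for dry runs.
nic_outage() (
  # The function body runs in a subshell so the traps stay local to it.
  iface=$1
  seconds=$2
  cmd=${IFACE_CMD:-ifconfig}
  trap '$cmd "$iface" up' EXIT      # restore the NIC when the subshell exits
  trap 'exit 1' INT TERM            # turn these signals into a normal exit
  $cmd "$iface" down
  sleep "$seconds"
)
```

The equivalent of the run above would be `nic_outage ens5 300 &`. Setting `IFACE_CMD=echo` prints the commands instead of executing them.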

After a while, reconnect to the same master:
sh-4.4# cd /var/log/kube-apiserver/

sh-4.4# grep 'etcd retry - lastErrLabel' termination.log
I1022 05:54:18.458455      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:rpc error: code = Unavailable desc = transport is closing
I1022 05:54:18.995361      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
I1022 05:54:32.997730      17 retry_etcdclient.go:201] etcd retry - lastErrLabel: Unavailable error:etcdserver: request timed out
...
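
To see at a glance which error labels triggered retries, the matching lines can be tallied by label. This is a minimal sketch, assuming the line format shown in the grep output above (`etcd retry - lastErrLabel: <label> error:<message>`). The function name `count_retry_labels` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: tally "etcd retry" log lines by lastErrLabel.
# Assumes the log line format shown in the grep output above.
count_retry_labels() {
  local logfile=$1
  # Pull out the label that follows "lastErrLabel: ", then count each label.
  sed -n 's/.*etcd retry - lastErrLabel: \([^ ]*\) error:.*/\1/p' "$logfile" \
    | sort | uniq -c | awk '{print $2, $1}'
}
```

Usage: `count_retry_labels /var/log/kube-apiserver/termination.log` prints one "label count" pair per line.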

---------------------
- Verification from new prometheus counter for etcd retry - https://github.com/openshift/kubernetes/pull/327/files#diff-a05a6147def4bf5e9b21033d642749b1d0085f0556e7b42b83f09c3486fd92daR66
 
Executing the query 'etcd_request_retry_total' in Prometheus after the above test returns the etcd retry requests below:

etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.132.94:6443",job="apiserver",namespace="default",service="kubernetes"}	10
etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.167.249:6443",job="apiserver",namespace="default",service="kubernetes"}	59
etcd_request_retry_total{apiserver="kube-apiserver",endpoint="https",error="Unavailable",instance="xx.xx.208.134:6443",job="apiserver",namespace="default",service="kubernetes"}	75
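
If the query result is saved as text, the per-instance counters can be added up with a small helper. This is a sketch, assuming each sample is on one line with the value as the last whitespace-separated field, as pasted above. The function name `sum_retry_total` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: sum etcd_request_retry_total samples read from stdin.
# Assumes one sample per line with the value as the last field.
sum_retry_total() {
  awk '/^etcd_request_retry_total/ { total += $NF } END { print total + 0 }'
}
```

Usage: `sum_retry_total < samples.txt`, where samples.txt is a hypothetical file holding the pasted lines. For the three samples above the result is 144 (10 + 59 + 75).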

The PR fix works as expected, so moving the bug to VERIFIED.

Comment 5 errata-xmlrpc 2021-02-24 15:17:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633