Bug 2006717 - [release-4.9] add retry for etcd errors in kube-apiserver
Summary: [release-4.9] add retry for etcd errors in kube-apiserver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.9.z
Assignee: Lukasz Szaszkiewicz
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 2006714
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-09-22 09:40 UTC by Lukasz Szaszkiewicz
Modified: 2021-11-01 13:44 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2006714
Environment:
Last Closed: 2021-11-01 13:44:32 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 974 0 None open Bug 2006717: etcd-client starts retrying transient errors from the etcd cluster 2021-09-22 09:43:11 UTC
Red Hat Product Errata RHBA-2021:4005 0 None None None 2021-11-01 13:44:50 UTC

Description Lukasz Szaszkiewicz 2021-09-22 09:40:43 UTC
+++ This bug was initially created as a clone of Bug #2006714 +++

The etcd client used by the API server should retry certain types of errors. The full list of retriable (codes.Unavailable) errors can be found at https://github.com/etcd-io/etcd/blob/main/api/v3rpc/rpctypes/error.go#L72.
This is already implemented by the etcd client just wasn't "enabled" in the API server


The fix seems to be real compare [1] to how it was just a few days ago [2]

[1] https://search.ci.openshift.org/?search=etcdserver%3A+leader+changed&maxAge=24h&context=1&type=junit&name=periodic.*4.10.*upgrade&excludeName=4.9%7C4.8%7Csingle-node&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://search.ci.openshift.org/?search=etcdserver%3A+leader+changed&maxAge=48h&context=1&type=junit&name=periodic.*4.10.*upgrade&excludeName=4.9%7C4.8%7Csingle-node&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Ke Wang 2021-09-27 04:28:25 UTC
This bug's PR is dev-approved and not yet merged, so I'm following issue DPTP-660 to do the pre-merge verifying for QE pre-merge verification goal of issue OCPQE-815 by using the bot to launch a cluster with the open PR.  Here is the verification steps:

Compared the results between 4.9 and 4.9 with PR fix.

-4.9 without PR fix.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-25-094414   True        False         79m     Cluster version is 4.9.0-0.nightly-2021-09-25-094414

$ masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
$ for node in $masters; do echo $node;oc debug no/$node -- chroot /host bash -c "grep -ir 'etcdserver: leader changed' /var/log/ | grep -v debug";done | grep kube-apiserver | wc -l
...
12

Snipped logs, 
/var/log/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-85f45dfb8c-qwhh5_51cdbf88-9631-40c9-afd2-381f72bcc81d/kube-apiserver-operator/1.log:2021-09-27T02:34:45.038539976+00:00 stderr F E0927 02:34:45.037714       1 base_controller.go:272] CertRotationController reconciliation failed: etcdserver: leader changed
/var/log/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-85f45dfb8c-qwhh5_51cdbf88-9631-40c9-afd2-381f72bcc81d/kube-apiserver-operator/1.log:2021-09-27T02:34:45.040668123+00:00 stderr F E0927 02:34:45.039861       1 base_controller.go:272] auditPolicyController reconciliation failed: etcdserver: leader changed
...
/var/log/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-85f45dfb8c-qwhh5_51cdbf88-9631-40c9-afd2-381f72bcc81d/kube-apiserver-operator/1.log:2021-09-27T02:34:54.301287946+00:00 stderr F E0927 02:34:54.301254       1 base_controller.go:272] StaticResourceController reconciliation failed: ["manifests/installer-sa.yaml" (string): etcdserver: leader changed, Operation cannot be fulfilled on kubeapiservers.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again]
/var/log/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-85f45dfb8c-qwhh5_51cdbf88-9631-40c9-afd2-381f72bcc81d/kube-apiserver-operator/1.log:2021-09-27T02:34:56.857449602+00:00 stderr F E0927 02:34:56.853811       1 base_controller.go:272] StaticResourceController reconciliation failed: ["assets/kube-apiserver/svc.yaml" (string): etcdserver: leader changed, Operation cannot be fulfilled on kubeapiservers.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again]


- 4.9 with PR fix.
$ oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.ci.test-2021-09-27-021440-ci-ln-8ftswck-latest   True        False         94m     Cluster version is 4.9.0-0.ci.test-2021-09-27-021440-ci-ln-8ftswck-latest

$ masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
$ for node in $masters; do echo $node;oc debug no/$node -- chroot /host bash -c "grep -ir 'etcdserver: leader changed' /var/log/ | grep -v debug";done | grep kube-apiserver | wc -l
...
/var/log/kube-apiserver/termination.log:{"level":"warn","ts":"2021-09-27T02:37:33.612Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001ac9500/#initially=[https://10.0.0.3:2379;https://10.0.0.4:2379;https://10.0.0.5:2379;https://localhost:2379]","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: leader changed"}

Above warning logs from etcd-client, not from by the API server, after applied the PR fix, won't see certain errors of etcd. So the bug is pre-merge verified. After the PR gets merged, the bug will be moved to VERIFIED by the bot automatically or, if not working, by me manually.

Comment 6 errata-xmlrpc 2021-11-01 13:44:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.5 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4005


Note You need to log in before you can comment on or make changes to this bug.