Bug 2083942 - Learner promotion can temporarily fail with rpc not supported for learner errors
Summary: Learner promotion can temporarily fail with rpc not supported for learner errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Thomas Jungblut
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 2089153
Depends On:
Blocks:
Reported: 2022-05-11 04:50 UTC by Haseeb Tariq
Modified: 2022-08-10 11:11 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:11:01 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 834 | 0 | None | open | Bug 2083942: Exclude learners from etcdclient | 2022-05-18 17:34:33 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:11:19 UTC

Description Haseeb Tariq 2022-05-11 04:50:57 UTC
The clustermember controller repeatedly attempts to promote a learner member, ignoring the error in the expected case where the learner is not yet in sync with the leader's log.
https://github.com/openshift/cluster-etcd-operator/blob/c9977ae3bd788a9e9a595001c5bc24995f9d0175/pkg/operator/clustermembercontroller/clustermembercontroller.go#L275-L288

Learner members only support endpoint status (`etcdctl endpoint status`) and serializable read (`etcdctl get --consistency="s" ...`) calls and reject all other calls including the promotion attempts.

The etcd client used by the clustermember controller does not exclude learner members from its endpoints, so a promotion call can be sent to the learner itself, which results in "etcdserver: rpc not supported for learner" errors.

This happens most commonly early on during bootstrap when there are 2 members (1 voting and 1 learner) present and the promotion calls end up being sent to the learner member.

Seen here in CI:
https://search.ci.openshift.org/?search=etcdserver%3A+rpc+not+supported+for+learner&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-windows-machine-config-operator-release-4.11-vsphere-e2e-periodic/1524177785738760192


While the promotion calls are retried and eventually succeed, they should never be sent to learner members in the first place.

The apparent fix is to exclude all learner member endpoints from the cached etcd client used by the clustermember controller.
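
A rough sketch of the endpoint filtering this implies, not the change made in the linked pull request. The `member` type is a hypothetical stand-in for an etcd member record (only the learner flag and client URLs matter here), and `votingEndpoints` and the addresses are made up for illustration:

```go
package main

import "fmt"

// member is a hypothetical, simplified stand-in for an etcd member record.
type member struct {
	Name       string
	IsLearner  bool
	ClientURLs []string
}

// votingEndpoints returns the client URLs of all non-learner members, so a
// client built from the result never sends requests to a learner.
func votingEndpoints(members []member) []string {
	var endpoints []string
	for _, m := range members {
		if m.IsLearner {
			continue // learners reject most RPCs, including promotion
		}
		endpoints = append(endpoints, m.ClientURLs...)
	}
	return endpoints
}

func main() {
	// The bootstrap situation from the description: 1 voting member, 1 learner.
	members := []member{
		{Name: "bootstrap", ClientURLs: []string{"https://10.0.0.1:2379"}},
		{Name: "master-0", IsLearner: true, ClientURLs: []string{"https://10.0.0.2:2379"}},
	}
	fmt.Println(votingEndpoints(members))
}
```

In the two-member bootstrap case only the voting member's endpoint remains, so a promotion call cannot land on the learner.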

Comment 2 Thomas Jungblut 2022-05-31 12:17:51 UTC
*** Bug 2089153 has been marked as a duplicate of this bug. ***

Comment 6 ge liu 2022-06-20 03:56:23 UTC
Ran some regression tests and did not hit the error.

Comment 7 errata-xmlrpc 2022-08-10 11:11:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

