
Bug 2083942

Summary: Learner promotion can temporarily fail with rpc not supported for learner errors
Product: OpenShift Container Platform
Component: Etcd
Version: 4.11
Target Release: 4.11.0
Status: CLOSED ERRATA
Severity: medium
Priority: high
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Haseeb Tariq <htariq>
Assignee: Thomas Jungblut <tjungblu>
QA Contact: ge liu <geliu>
CC: geliu, tjungblu, wking
Last Closed: 2022-08-10 11:11:01 UTC

Description Haseeb Tariq 2022-05-11 04:50:57 UTC
The clustermember controller repeatedly attempts to promote a learner member, ignoring the error returned in the expected case where the learner is not yet in sync with the leader's log.
https://github.com/openshift/cluster-etcd-operator/blob/c9977ae3bd788a9e9a595001c5bc24995f9d0175/pkg/operator/clustermembercontroller/clustermembercontroller.go#L275-L288
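
For readers who do not want to open the link, the retry behavior described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the operator's actual code: `promoteFunc` and `tryPromote` are hypothetical names, and the "not ready" message text is an assumption modeled on etcd's wording (only the "rpc not supported for learner" text is taken from this report).

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for what the etcd server returns.
// The "not ready" wording is an assumption; the "not supported" wording is
// the message quoted in this bug report.
var (
	errLearnerNotReady        = errors.New("etcdserver: can only promote a learner member which is in sync with leader")
	errNotSupportedForLearner = errors.New("etcdserver: rpc not supported for learner")
)

// promoteFunc is a hypothetical abstraction over the promotion RPC.
type promoteFunc func(memberID uint64) error

// tryPromote makes one promotion attempt. It reports whether the member was
// promoted. The expected "learner not in sync yet" error is swallowed so the
// caller simply retries on its next sync; any other error is returned.
func tryPromote(promote promoteFunc, memberID uint64) (bool, error) {
	err := promote(memberID)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errLearnerNotReady):
		// Expected while the learner is still catching up: not a failure.
		return false, nil
	default:
		return false, fmt.Errorf("promoting member %x: %w", memberID, err)
	}
}
```

In this sketch the "rpc not supported for learner" error falls into the default branch, which is why it surfaces as a temporary failure even though a later retry succeeds.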

Learner members support only endpoint status (`etcdctl endpoint status`) and serializable read (`etcdctl get --consistency="s" ...`) calls; they reject all other calls, including promotion attempts.

The etcd client used by the clustermember controller does not exclude learner members from its endpoints, so calls to promote a member can be sent to the learner itself, which results in "etcdserver: rpc not supported for learner" errors.

This happens most commonly early during bootstrap, when only two members (one voting, one learner) are present and the promotion calls end up being sent to the learner member.
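
The two-member bootstrap case can be modeled with a small sketch. Everything here is hypothetical and for illustration only: `endpoint`, `sendPromote`, and `promoteOutcomes` are made-up names, and the model simply assumes that a learner rejects the promotion call while a voting member accepts it.

```go
package main

import "errors"

// errRPCNotSupported carries the error text quoted in this report.
var errRPCNotSupported = errors.New("etcdserver: rpc not supported for learner")

// endpoint is a hypothetical model of one member's client endpoint.
type endpoint struct {
	URL       string
	IsLearner bool
}

// sendPromote models a promotion call arriving at the given endpoint:
// a learner rejects it, a voting member accepts it.
func sendPromote(target endpoint) error {
	if target.IsLearner {
		return errRPCNotSupported
	}
	return nil
}

// promoteOutcomes sends one promotion call to every endpoint the client
// knows about and records the result per endpoint URL.
func promoteOutcomes(endpoints []endpoint) map[string]error {
	out := make(map[string]error, len(endpoints))
	for _, ep := range endpoints {
		out[ep.URL] = sendPromote(ep)
	}
	return out
}
```

With one voting and one learner endpoint in the list, one of the two possible targets fails, which is consistent with the error showing up often during bootstrap.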

Seen here in CI:
https://search.ci.openshift.org/?search=etcdserver%3A+rpc+not+supported+for+learner&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-windows-machine-config-operator-release-4.11-vsphere-e2e-periodic/1524177785738760192


While the promotion calls are retried and eventually succeed, they should never be sent to learner members in the first place.

The fix appears to be to exclude all learner member endpoints from the cached etcd client used by the clustermember controller.
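
A minimal sketch of that filtering, assuming the member list exposes a learner flag and client URLs per member. The `member` type and `votingEndpoints` function are hypothetical stand-ins, not the operator's real types or the change that was eventually merged.

```go
package main

// member is a hypothetical stand-in for an etcd cluster member, keeping only
// the fields this sketch needs.
type member struct {
	Name       string
	ClientURLs []string
	IsLearner  bool
}

// votingEndpoints returns the client URLs of all non-learner members, in
// member order. A client built from this list can never route a promotion
// call to a learner.
func votingEndpoints(members []member) []string {
	var endpoints []string
	for _, m := range members {
		if m.IsLearner {
			continue // learners reject everything except status and serializable reads
		}
		endpoints = append(endpoints, m.ClientURLs...)
	}
	return endpoints
}
```

The result would then be used as the endpoint list of the cached client, and refreshed whenever membership changes so that a freshly promoted member is picked up.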

Comment 2 Thomas Jungblut 2022-05-31 12:17:51 UTC
*** Bug 2089153 has been marked as a duplicate of this bug. ***

Comment 6 ge liu 2022-06-20 03:56:23 UTC
Ran some regression tests and did not hit the error.

Comment 7 errata-xmlrpc 2022-08-10 11:11:01 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069