Bug 2063831 - etcd quorum pods landing on same node
Summary: etcd quorum pods landing on same node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Haseeb Tariq
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 2027744 2065454
Depends On:
Blocks: 2070783
Reported: 2022-03-14 13:36 UTC by Devan Goodwin
Modified: 2024-12-20 21:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2070783
Environment:
openshift-tests-upgrade.[sig-scheduling][Early] The openshift-etcd pods should be scheduled on different nodes [Suite:openshift/conformance/parallel]
Last Closed: 2022-08-10 10:54:06 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-etcd-operator pull 763 | 0 | None | Merged | Bug 2063831: replace quorumguard and add readyz server | 2022-05-14 06:21:24 UTC
Github | openshift cluster-etcd-operator pull 789 | 0 | None | Merged | Bug 2063831: Replace quorumguard and add readyz server | 2022-05-14 06:21:24 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:54:21 UTC

Description Devan Goodwin 2022-03-14 13:36:26 UTC
TRT recently added a test to monitor for this and it exposed that etcd quorum pods are actually landing on the same node for periods of time:

https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=openshift-tests-upgrade.[sig-scheduling][Early]%20The%20openshift-etcd%20pods%20should%20be%20scheduled%20on%20different%20nodes%20[Suite:openshift/conformance/parallel]
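
For readers unfamiliar with the test, the property it asserts can be sketched roughly as below. This is only an illustration in Python, not the actual origin test code; the function name `colocated_pods` and the pod and node names are hypothetical.

```python
from collections import defaultdict


def colocated_pods(pod_to_node):
    """Return {node: [pods]} for every node that hosts more than one pod.

    pod_to_node maps a pod name to the node it is scheduled on.
    An empty result means every pod sits on its own node.
    """
    by_node = defaultdict(list)
    for pod, node in pod_to_node.items():
        by_node[node].append(pod)
    # Keep only the nodes where two or more pods ended up together.
    return {node: sorted(pods) for node, pods in by_node.items() if len(pods) > 1}


# Hypothetical placement: two pods share master-0.
placement = {
    "etcd-guard-a": "master-0",
    "etcd-guard-b": "master-0",
    "etcd-guard-c": "master-2",
}
violations = colocated_pods(placement)
```

A non-empty `violations` mapping corresponds to the failure the test reports.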

Sample job: 

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1503258288765014016

This seems to be happening alarmingly often:

https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=48h&context=0&type=junit&name=4.11&excludeName=quorum&maxMatches=5&maxBytes=20971520&groupBy=job

Marking sev high as this has potential to cause loss of quorum. 
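
The majority arithmetic behind that concern can be sketched as follows. The three-member cluster size is an assumption (the report does not state it), and counting co-located pods as if each stood for one voting member is a simplification for illustration; the function names are hypothetical.

```python
def quorum_size(members):
    """Smallest majority of a cluster with the given number of voting members."""
    return members // 2 + 1


def survives_single_node_loss(members_per_node):
    """True if a majority remains no matter which single node goes away.

    members_per_node maps a node name to how many members it hosts.
    """
    total = sum(members_per_node.values())
    needed = quorum_size(total)
    return all(total - count >= needed for count in members_per_node.values())


# Assumed three-member cluster, one member per node: losing any node leaves 2 of 3.
spread = {"master-0": 1, "master-1": 1, "master-2": 1}
# Two members on one node: losing master-0 leaves 1 of 3, below the majority of 2.
stacked = {"master-0": 2, "master-2": 1}
```

Under these assumptions `survives_single_node_loss(spread)` holds while `survives_single_node_loss(stacked)` does not, which is why co-location is treated as a quorum risk.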

Backporting to 4.10 should probably be discussed.

Jan Chaloupka did some work to allow force-assigning PDB pods to nodes instead of relying on the scheduler; it may be a good idea to make use of this for etcd.

Comment 1 Devan Goodwin 2022-03-14 13:37:19 UTC
TRT is double checking the results to make absolutely sure the test is catching something real.

Comment 2 Ken Zhang 2022-03-14 15:10:50 UTC
I confirmed that the test is catching real problems in both the HAProxy and etcd cases. There is a bug with image-registry that is being fixed.

Comment 3 Haseeb Tariq 2022-03-14 21:20:55 UTC
Working on an update to replace the etcd-operator's quorum guard controller with the staticpod quorum guard controller.
This would also include a new readyz server sidecar on the etcd pods so that the guard controller can check pod readiness.
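
To illustrate the general shape of such a readiness endpoint, here is a minimal sketch in Python. It is not the operator's implementation (that lives in the Go code of the linked pull requests); the `/readyz` path handling, the function names, and the `is_ready` callback are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def readyz_response(is_ready):
    """Map a readiness check to an (HTTP status, body) pair.

    is_ready is a hypothetical zero-argument callable; a check that
    raises is treated the same as one that reports not ready.
    """
    try:
        ok = bool(is_ready())
    except Exception:
        ok = False
    return (200, b"ok") if ok else (503, b"not ready")


def make_readyz_handler(is_ready):
    """Build a request handler that answers GET /readyz using is_ready."""

    class ReadyzHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/readyz":
                self.send_error(404)
                return
            status, body = readyz_response(is_ready)
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet; a real sidecar would log requests

    return ReadyzHandler


def build_server(is_ready, host="127.0.0.1", port=0):
    """Create (but do not start) an HTTP server; port 0 picks a free port."""
    return HTTPServer((host, port), make_readyz_handler(is_ready))
```

Calling `build_server(check).serve_forever()` would start serving; a guard pod's readiness probe could then poll `/readyz` and go unready whenever the check fails.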

Comment 4 W. Trevor King 2022-03-21 22:17:28 UTC
*** Bug 2065454 has been marked as a duplicate of this bug. ***

Comment 11 ge liu 2022-04-21 02:45:19 UTC
Verified with 4.11.0-0.nightly-2022-04-20-045714.
The quorum guard controller has been updated, which I expect resolves this problem:
sh-4.4# crictl ps|grep etcd
cf3e865a0bcb2       d6eace900ed8aa9f2bb76d7f34981a34bf0cad1ee69ff3b05fd9b408d4645349                                                         13 minutes ago      Running             etcd-readyz                                   0                   6bead9a291bc8

Comment 13 Thomas Jungblut 2022-04-27 10:02:34 UTC
*** Bug 2027744 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2022-08-10 10:54:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 16 Red Hat Bugzilla 2023-09-15 01:52:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.