Bug 1945572
Summary: [arbiter] OCP Console fails on authentication problems during network separation of a zone
| Field | Value | Field | Value |
|---|---|---|---|
| Product | OpenShift Container Platform | Reporter | Martin Bukatovic <mbukatov> |
| Component | Etcd | Assignee | Sam Batschelet <sbatsche> |
| Status | CLOSED DEFERRED | QA Contact | ge liu <geliu> |
| Severity | medium | Docs Contact | |
| Priority | medium | | |
| Version | 4.7 | CC | aos-bugs, jokerman, spadgett |
| Target Milestone | --- | Flags | mfojtik: needinfo? |
| Target Release | --- | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | LifecycleReset | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2021-05-14 16:59:39 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1984103 | | |
| Attachments | | | |
Description
Martin Bukatovic 2021-04-01 10:03:37 UTC
(In reply to Martin Bukatovic from comment #0)
> This code is going to be placed into a separate project, and when this happens
> I will link it here.

See:

- https://gitlab.com/mbukatov/ocp-network-split
- https://mbukatov.gitlab.io/ocp-network-split/

Despite the fact that the console is down for a fair amount of time, this doesn't look like a console issue. The oauth-apiserver logs are full of errors:

```
...
2021-03-31T21:46:47.031807836Z W0331 21:46:47.031792 1 reflector.go:436] storage/cacher.go:/useridentities: watch of *user.Identity ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:47.103709983Z I0331 21:46:47.103658 1 cacher.go:405] cacher (*user.User): initialized
2021-03-31T21:46:47.104359753Z E0331 21:46:47.104316 1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:47.104376484Z W0331 21:46:47.104369 1 reflector.go:436] storage/cacher.go:/users: watch of *user.User ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:47.104431475Z I0331 21:46:47.104392 1 cacher.go:405] cacher (*oauth.OAuthClientAuthorization): initialized
2021-03-31T21:46:47.104930289Z E0331 21:46:47.104909 1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:47.104951309Z W0331 21:46:47.104940 1 reflector.go:436] storage/cacher.go:/oauth/clientauthorizations: watch of *oauth.OAuthClientAuthorization ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:48.036547242Z I0331 21:46:48.036041 1 cacher.go:405] cacher (*user.Identity): initialized
2021-03-31T21:46:48.037007785Z E0331 21:46:48.036976 1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:48.037029338Z W0331 21:46:48.037023 1 reflector.go:436] storage/cacher.go:/useridentities: watch of *user.Identity ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:48.106170485Z I0331 21:46:48.105962 1 cacher.go:405] cacher (*user.User): initialized
...
```

At first this looked like a routing issue to me, since I found quite a lot of errors and timeouts in the SDN, SDN controller, and OVS pods. But after some investigation I found that the etcd server is not responding:

```
2021-03-31T19:43:28.852583740Z I0331 19:43:28.851612 1 leaderelection.go:243] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
2021-03-31T20:59:18.984137244Z I0331 20:59:18.983316 1 leaderelection.go:243] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
2021-03-31T21:45:14.102123998Z E0331 21:45:14.098591 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:45:42.116263298Z E0331 21:45:42.116219 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:46:10.119594560Z E0331 21:46:10.119546 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:46:38.100311952Z E0331 21:46:38.100270 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:47:06.109401795Z E0331 21:47:06.109353 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
```

After checking the etcd pods I see a lot of errors there as well, so to me this looks like a good candidate for assigning the BZ to the etcd team.

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

This has been reported during a test of a failure scenario for OCS KNIP-1540. I consider this to be a valid problem which should be fixed.

The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.

> I consider this to be a valid problem which should be fixed.
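For reference (not part of the original report), the "no leader" symptom seen above can be confirmed directly against the etcd members. Below is a minimal Go sketch using the upstream go.etcd.io/etcd/client/v3 API; the endpoint URLs and the omitted TLS setup are placeholder assumptions, and on a live OCP cluster the same information is usually easier to read from `etcdctl endpoint status` inside an etcd pod.

```go
// Minimal diagnostic sketch: ask each etcd endpoint for its status and
// report whether it currently sees a raft leader. Endpoint URLs and TLS
// configuration are placeholders, not values from this bug.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{
		"https://10.0.0.1:2379", // placeholder control-plane node addresses
		"https://10.0.0.2:2379",
		"https://10.0.0.3:2379",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
		// A real cluster also needs TLS: the etcd client cert/key and CA.
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		switch {
		case err != nil:
			fmt.Printf("%s: status request failed: %v\n", ep, err)
		case st.Leader == 0:
			fmt.Printf("%s: reachable but reports no leader\n", ep)
		default:
			fmt.Printf("%s: reachable, leader member ID %x\n", ep, st.Leader)
		}
	}
}
```

During the zone split described above, the member isolated from the other zone would be expected to answer the status call while reporting no leader, which matches the `etcdserver: no leader` errors in the oauth-apiserver log.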
This is a hole in the way the client balancer works: it checks that the peer is available, not whether the peer has quorum. Changing this would be a large structural change to the client, which I don't think is going to happen short term. The other solution would be to remove the peer from the endpoint list provided to the apiserver and make that list more dynamic/reactive based on health checks. But the problem with the current implementation is that a change in endpoints would require a new revision of KAS, which in itself could be disruptive. It is a real problem.
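To make the second option concrete, the sketch below is a hypothetical illustration of the "filter the endpoint list by health" idea described above; it is not existing OpenShift or etcd code, and the function name and package are made up. It reuses the same clientv3 Status call as the diagnostic sketch to drop members that are reachable but have lost quorum.

```go
// Hypothetical package illustrating health-based endpoint filtering.
package endpointfilter

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// filterQuorateEndpoints keeps only the endpoints that answer within the
// timeout AND report a raft leader. A member that responds but has been
// partitioned away from quorum (the "etcdserver: no leader" case) is
// dropped, which is exactly the check the client balancer does not do today.
func filterQuorateEndpoints(cli *clientv3.Client, endpoints []string, timeout time.Duration) []string {
	quorate := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil || st.Leader == 0 {
			continue // unreachable, or reachable without a leader: skip it
		}
		quorate = append(quorate, ep)
	}
	return quorate
}
```

As the comment notes, the catch is that the endpoint list handed to the kube-apiserver is part of its configuration, so reacting to health checks by rewriting that list would roll out a new KAS revision each time the set changes, which is itself disruptive.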
Closing this bug and tracking it as an RFE: https://issues.redhat.com/browse/ETCD-191

This is still reproducible; retried on a vSphere LSO cluster with:

- OCP 4.9.0-0.nightly-2021-12-01-080120
- LSO 4.9.0-202111151318
- ODF 4.9.0-249.ci

Still reproducible on a vSphere LSO cluster with:

- OCP 4.10.0-0.nightly-2022-03-10-155847
- LSO 4.10.0-202202241648
- ODF 4.10.0-187