Created attachment 1768230 [details]
screenshot #1: auth server error example

Description of problem
======================

When I classify the nodes of an OCP cluster into 3 zones via the
"topology.kubernetes.io/zone" label, so that there is one master node in each
zone and only 2 zones have worker nodes (the same number in each, so the 3rd
zone has just a master node without workers), and then cut all network traffic
between one zone with workers and the other zones while keeping everything
else intact, I see that the OCP Console fails with authentication problems.

I would expect the OCP Console to survive this disruption: only one master
node out of 3 and half of the worker nodes are in the isolated zone, so the
remaining cluster majority should be able to figure out how to overcome the
disruption.

When the disruption ends, the OCP Console recovers, which is good.

The use case described here is based on the network split ab-bc failure for an
OCS arbiter stretch cluster.

Version-Release number of selected component
============================================

OCP 4.7.0-0.nightly-2021-03-30-235343

How reproducible
================

100%

Steps to Reproduce
==================

1. Install OCP on vSphere, with 3 master and 6 worker nodes.
2. Pick one master node and label it as an arbiter, e.g.:

```
$ oc label node $node topology.kubernetes.io/zone=foo-arbiter
```

3. Label one remaining master node as zone "data-a", the other as zone "data-b".
4. Label half of the worker nodes as zone "data-a", and the other half as zone
   "data-b".

Note: At this point, you have nodes labeled like this:

```
$ oc get nodes -L topology.kubernetes.io/zone
NAME              STATUS   ROLES    AGE   VERSION           ZONE
compute-0         Ready    worker   14h   v1.20.0+bafe72f   data-a
compute-1         Ready    worker   14h   v1.20.0+bafe72f   data-a
compute-2         Ready    worker   14h   v1.20.0+bafe72f   data-a
compute-3         Ready    worker   14h   v1.20.0+bafe72f   data-b
compute-4         Ready    worker   14h   v1.20.0+bafe72f   data-b
compute-5         Ready    worker   14h   v1.20.0+bafe72f   data-b
control-plane-0   Ready    master   14h   v1.20.0+bafe72f   data-a
control-plane-1   Ready    master   14h   v1.20.0+bafe72f   data-b
control-plane-2   Ready    master   14h   v1.20.0+bafe72f   foo-arbiter
```

5. Open the OCP Console and log in as kubeadmin.
6. Isolate machines in zone data-a from the other zones (foo-arbiter and
   data-b) for 15 minutes.
7. Check the OCP Console during the network split and afterwards.

Actual results
==============

The OCP Console stops responding, and after a while either fails with an
authentication error, or presents the login screen without issues but then
fails with an authentication error when I enter the kubeadmin credentials and
try to log in.

See screenshot #1: auth server error example

> The authorization server encountered an unexpected condition that prevented
> it from fulfilling the request.

The command line tool oc seems to be mostly unaffected during the network
disruption.

Expected results
================

It's ok if it takes some time for the OCP Console to react to the problem, but
I would expect it to continue operating during the disruption.

Additional info
===============

To inflict a network split on the cluster, one can tweak the settings of the
underlying network infrastructure (e.g. shut down the router of the affected
zone). Since the OCS QE infrastructure doesn't allow this, I use a firewall
script which inserts firewall rules on the appropriate nodes of the cluster,
deployed via MCO.
For details see:
https://github.com/red-hat-storage/ocs-ci/blob/0a9abea2a79d685bb92b2f26a2d2aed424dbfae1/ocs_ci/utility/networksplit/README.rst

This code is going to be placed into a separate project, and when this happens
I will link it here.
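For illustration, a minimal sketch of the kind of firewall rule such a script
could insert on each node in zone data-a to cut traffic to/from the other
zones. This is not the actual script; the OTHER_ZONE_IPS variable and the IP
addresses are hypothetical placeholders for the node IPs of zones data-b and
foo-arbiter:

```
#!/bin/bash
# Sketch only: drop all traffic between this node (zone data-a) and nodes in
# the other zones. OTHER_ZONE_IPS is a placeholder, not a real cluster value.
OTHER_ZONE_IPS="198.51.100.11 198.51.100.12 198.51.100.13"

for ip in $OTHER_ZONE_IPS; do
    iptables -I INPUT  -s "$ip" -j DROP   # drop inbound traffic from the other zones
    iptables -I OUTPUT -d "$ip" -j DROP   # drop outbound traffic to the other zones
done

# After ~15 minutes the rules are removed again (iptables -D ...) to end the split.
```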
(In reply to Martin Bukatovic from comment #0)
> This code is going to be placed into a separate project, and when this
> happens I will link it here.

See:

- https://gitlab.com/mbukatov/ocp-network-split
- https://mbukatov.gitlab.io/ocp-network-split/
Despite the fact that the console is down for a fair amount of time, this
doesn't look like a console issue.

The oauth-apiserver logs are full of errors:

```
...
2021-03-31T21:46:47.031807836Z W0331 21:46:47.031792 1 reflector.go:436] storage/cacher.go:/useridentities: watch of *user.Identity ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:47.103709983Z I0331 21:46:47.103658 1 cacher.go:405] cacher (*user.User): initialized
2021-03-31T21:46:47.104359753Z E0331 21:46:47.104316 1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:47.104376484Z W0331 21:46:47.104369 1 reflector.go:436] storage/cacher.go:/users: watch of *user.User ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:47.104431475Z I0331 21:46:47.104392 1 cacher.go:405] cacher (*oauth.OAuthClientAuthorization): initialized
2021-03-31T21:46:47.104930289Z E0331 21:46:47.104909 1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:47.104951309Z W0331 21:46:47.104940 1 reflector.go:436] storage/cacher.go:/oauth/clientauthorizations: watch of *oauth.OAuthClientAuthorization ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:48.036547242Z I0331 21:46:48.036041 1 cacher.go:405] cacher (*user.Identity): initialized
2021-03-31T21:46:48.037007785Z E0331 21:46:48.036976 1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:48.037029338Z W0331 21:46:48.037023 1 reflector.go:436] storage/cacher.go:/useridentities: watch of *user.Identity ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:48.106170485Z I0331 21:46:48.105962 1 cacher.go:405] cacher (*user.User): initialized
...
```

At first this looked like a routing issue to me, since I found quite a lot of
errors and timeouts in the SDN, SDN controller and OVS pods. But after some
investigation I found that the etcd server is not responding:

```
2021-03-31T19:43:28.852583740Z I0331 19:43:28.851612 1 leaderelection.go:243] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
2021-03-31T20:59:18.984137244Z I0331 20:59:18.983316 1 leaderelection.go:243] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
2021-03-31T21:45:14.102123998Z E0331 21:45:14.098591 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:45:42.116263298Z E0331 21:45:42.116219 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:46:10.119594560Z E0331 21:46:10.119546 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:46:38.100311952Z E0331 21:46:38.100270 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:47:06.109401795Z E0331 21:47:06.109353 1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
```

After checking the etcd pods I see a lot of errors there as well, so to me
this looks like a good candidate for reassigning the BZ.
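For reference, one way to confirm the leader/quorum loss from a surviving
control-plane node is to query etcd member health directly. A rough sketch;
the pod name, the "etcdctl" container name and the "app=etcd" label reflect my
understanding of the openshift-etcd layout and should be adjusted to the
actual cluster:

```
# List the etcd static pods first (label assumed to be app=etcd).
oc -n openshift-etcd get pods -l app=etcd

# Query endpoint status/health from a surviving member; during the split the
# isolated member should show as unreachable, and "no leader" errors persist
# until a new leader is elected by the remaining two members.
oc -n openshift-etcd rsh -c etcdctl etcd-control-plane-1 \
    etcdctl endpoint status --cluster -w table
oc -n openshift-etcd rsh -c etcdctl etcd-control-plane-1 \
    etcdctl endpoint health --cluster -w table
```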
This bug hasn't had any activity in the last 30 days. Maybe the problem got
resolved, was a duplicate of something else, or became less pressing for some
reason - or maybe it's still relevant but just hasn't been looked at yet. As
such, we're marking this bug as "LifecycleStale" and decreasing the
severity/priority.

If you have further information on the current state of the bug, please update
it, otherwise this bug can be closed in about 7 days. The information can be,
for example, that the problem still occurs, that you still want the feature,
that more information is needed, or that the bug is (for whatever reason) no
longer relevant.

Additionally, you can add LifecycleFrozen into Keywords if you think this bug
should never be marked as stale. Please consult with the bug assignee before
you do that.
This has been reported during testing of a failure scenario for OCS KNIP-1540.
I consider this to be a valid problem which should be fixed.
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
> I consider this to be a valid problem which should be fixed.

This is a hole in the way the client balancer works: it checks that the peer
is available, not whether the peer has quorum. Changing this would be a large
structural change to the client, which I don't think is going to happen short
term.

The other solution would be to remove the peer from the endpoint list provided
to the apiserver and have this list be more dynamic/reactive, based on health
checks. The problem with the current implementation is that a change in
endpoints requires a new revision of KAS, which in itself could be disruptive.

It is a real problem.
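To illustrate why an endpoint change is not cheap today: the etcd server list
is baked into the rendered kube-apiserver configuration that is rolled out as
a static revision, so dropping an unhealthy-but-reachable peer implies a new
revision. A rough way to inspect the currently configured list; the configmap
name, key and grep pattern are assumptions based on my recollection of the
layout and may differ between releases:

```
# Sketch: print the static etcd-servers list the kube-apiserver was rendered
# with (configmap openshift-kube-apiserver/config, key config.yaml assumed).
oc -n openshift-kube-apiserver get configmap config \
    -o jsonpath='{.data.config\.yaml}' | grep -o 'etcd-servers[^]]*]'
```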
Closing this bug and tracking it as an RFE: https://issues.redhat.com/browse/ETCD-191
This is still reproducible, retried on vSphere LSO cluster with:

OCP 4.9.0-0.nightly-2021-12-01-080120
LSO 4.9.0-202111151318
ODF 4.9.0-249.ci
Still reproducible on vSphere LSO cluster with:

OCP 4.10.0-0.nightly-2022-03-10-155847
LSO 4.10.0-202202241648
ODF 4.10.0-187