Bug 1945572

Summary: [arbiter] OCP Console fails on authentication problems during network separation of a zone
Product: OpenShift Container Platform
Component: Etcd
Version: 4.7
Reporter: Martin Bukatovic <mbukatov>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
CC: aos-bugs, jokerman, spadgett
Status: CLOSED DEFERRED
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Whiteboard: LifecycleReset
Flags: mfojtik: needinfo?
Type: Bug
Last Closed: 2021-05-14 16:59:39 UTC
Bug Blocks: 1984103

Attachments: screenshot #1: auth server error example

Description Martin Bukatovic 2021-04-01 10:03:37 UTC
Created attachment 1768230 [details]
screenshot #1: auth server error example

Description of problem
======================

When I classify the nodes of an OCP cluster into 3 zones via the
"topology.kubernetes.io/zone" label, so that each zone contains one master node
and only 2 of the zones contain worker nodes (an equal number each, leaving the
3rd zone with just a master node and no workers), and then cut all network
traffic between one of the worker zones and the other zones while keeping
everything else intact, I see that the OCP Console fails with authentication
problems.

I would expect the OCP Console to survive this disruption: only one of the 3
master nodes and half of the worker nodes are in the isolated zone, so the
remaining majority of the cluster should be able to work out how to overcome
the disruption.

When the disruption ends, OCP Console recovers, which is good.

The use case described here is based on the ab-bc network split failure
scenario for an OCS arbiter stretch cluster.

Version-Release number of selected component
============================================

OCP 4.7.0-0.nightly-2021-03-30-235343

How reproducible
================

100%

Steps to Reproduce
==================

1. Install OCP on vSphere, with 3 master and 6 worker nodes.

2. Pick one master node and label it as an arbiter, e.g.:

   ```
   $ oc label node $node topology.kubernetes.io/zone=foo-arbiter
   ```

3. Label one of the remaining master nodes as zone "data-a" and the other as
   zone "data-b".

4. Label half of the worker nodes as zone "data-a" and the other half as
   zone "data-b" (a consolidated labeling sketch follows these steps).

Note: At this point, the nodes are labeled like this:

```
$ oc get nodes -L topology.kubernetes.io/zone
NAME              STATUS   ROLES    AGE   VERSION           ZONE
compute-0         Ready    worker   14h   v1.20.0+bafe72f   data-a
compute-1         Ready    worker   14h   v1.20.0+bafe72f   data-a
compute-2         Ready    worker   14h   v1.20.0+bafe72f   data-a
compute-3         Ready    worker   14h   v1.20.0+bafe72f   data-b
compute-4         Ready    worker   14h   v1.20.0+bafe72f   data-b
compute-5         Ready    worker   14h   v1.20.0+bafe72f   data-b
control-plane-0   Ready    master   14h   v1.20.0+bafe72f   data-a
control-plane-1   Ready    master   14h   v1.20.0+bafe72f   data-b
control-plane-2   Ready    master   14h   v1.20.0+bafe72f   foo-arbiter
```

5. Open the OCP Console and log in as kubeadmin.

6. Isolate machines in zone data-a from other zones (foo-arbiter and data-b)
   for 15 minutes.

7. Check OCP Console during the network split and after that.
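
For convenience, steps 2-4 can be applied in one go. A minimal sketch, assuming
the node names shown in the listing above (adjust for your cluster):

```
# Arbiter zone: one master node only.
oc label node control-plane-2 topology.kubernetes.io/zone=foo-arbiter

# Zone data-a: one master plus half of the workers.
oc label node control-plane-0 compute-0 compute-1 compute-2 \
   topology.kubernetes.io/zone=data-a

# Zone data-b: the other master plus the remaining workers.
oc label node control-plane-1 compute-3 compute-4 compute-5 \
   topology.kubernetes.io/zone=data-b
```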

Actual results
==============

The OCP Console stops responding, and after a while it either fails with an
authentication problem, or it presents the login screen without issues but then
fails with an authentication error when I enter the kubeadmin credentials and
try to log in.

See screenshot #1: auth server error example

> The authorization server encountered an unexpected condition that prevented
> it from fulfilling the request.

The command line tool oc seems to be mostly unaffected during the network
disruption.

Expected results
================

It's OK if it takes some time for the OCP Console to react to the problem, but
I would expect it to continue operating during the disruption.

Additional info
===============

To inflict a network split on the cluster, one can tweak the settings of the
underlying network infrastructure (e.g. shut down the router of the affected
zone). Since the OCS QE infrastructure doesn't allow this, I use a firewall
script, deployed via MCO, which inserts firewall rules on the appropriate nodes
of the cluster. For details see:

https://github.com/red-hat-storage/ocs-ci/blob/0a9abea2a79d685bb92b2f26a2d2aed424dbfae1/ocs_ci/utility/networksplit/README.rst

This code is going to be placed into a separate project; when this happens, I
will link it here.
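
For illustration only, the core of the zone isolation is plain packet filtering
between zones. A minimal hand-rolled sketch with placeholder IPs (the linked
script does this properly, including timing and cleanup, and deploys via MCO):

```
# Hypothetical IPs of the nodes in zones data-b and foo-arbiter; run this on
# every node in zone data-a to cut it off from the rest of the cluster.
for ip in 10.1.1.11 10.1.1.12 10.1.1.13 10.1.1.14 10.1.1.20; do
    iptables -A INPUT  -s "$ip" -j DROP   # drop traffic arriving from the other zones
    iptables -A OUTPUT -d "$ip" -j DROP   # drop traffic leaving for the other zones
done

# To end the 15 minute disruption, delete the rules again:
#   iptables -D INPUT -s "$ip" -j DROP; iptables -D OUTPUT -d "$ip" -j DROP
```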

Comment 2 Martin Bukatovic 2021-04-09 16:27:04 UTC
(In reply to Martin Bukatovic from comment #0)
> This code is going to be placed into a separate project; when this happens,
> I will link it here.

See:

- https://gitlab.com/mbukatov/ocp-network-split
- https://mbukatov.gitlab.io/ocp-network-split/

Comment 3 Jakub Hadvig 2021-04-12 16:04:52 UTC
Despite the fact that the console is down for a fair amount of time, this doesn't look like a console issue.
Checking the oauth-apiserver logs, they are full of errors like:
```
...
2021-03-31T21:46:47.031807836Z W0331 21:46:47.031792       1 reflector.go:436] storage/cacher.go:/useridentities: watch of *user.Identity ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:47.103709983Z I0331 21:46:47.103658       1 cacher.go:405] cacher (*user.User): initialized
2021-03-31T21:46:47.104359753Z E0331 21:46:47.104316       1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:47.104376484Z W0331 21:46:47.104369       1 reflector.go:436] storage/cacher.go:/users: watch of *user.User ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:47.104431475Z I0331 21:46:47.104392       1 cacher.go:405] cacher (*oauth.OAuthClientAuthorization): initialized
2021-03-31T21:46:47.104930289Z E0331 21:46:47.104909       1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:47.104951309Z W0331 21:46:47.104940       1 reflector.go:436] storage/cacher.go:/oauth/clientauthorizations: watch of *oauth.OAuthClientAuthorization ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:48.036547242Z I0331 21:46:48.036041       1 cacher.go:405] cacher (*user.Identity): initialized
2021-03-31T21:46:48.037007785Z E0331 21:46:48.036976       1 watcher.go:218] watch chan error: etcdserver: no leader
2021-03-31T21:46:48.037029338Z W0331 21:46:48.037023       1 reflector.go:436] storage/cacher.go:/useridentities: watch of *user.Identity ended with: Internal error occurred: etcdserver: no leader
2021-03-31T21:46:48.106170485Z I0331 21:46:48.105962       1 cacher.go:405] cacher (*user.User): initialized
...
```

At first this all looked like a routing issue to me, since I found quite a lot of errors and timeouts in the SDN, SDN controller and OVS pods. But after some investigation I found that the etcd server is not responding:
```
2021-03-31T19:43:28.852583740Z I0331 19:43:28.851612       1 leaderelection.go:243] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
2021-03-31T20:59:18.984137244Z I0331 20:59:18.983316       1 leaderelection.go:243] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
2021-03-31T21:45:14.102123998Z E0331 21:45:14.098591       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:45:42.116263298Z E0331 21:45:42.116219       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:46:10.119594560Z E0331 21:46:10.119546       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:46:38.100311952Z E0331 21:46:38.100270       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
2021-03-31T21:47:06.109401795Z E0331 21:47:06.109353       1 leaderelection.go:325] error retrieving resource lock openshift-sdn/openshift-network-controller: etcdserver: request timed out
```

After checking the etcd pods I see a lot of errors there as well, so to me this looks like a good candidate for reassigning the BZ to the Etcd component.
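
For anyone retriaging this, a hedged sketch of how the quorum state can be
confirmed during the split, assuming the usual OCP 4.x etcd static pod and
container naming (run against a node outside the isolated zone):

```
# Show etcd members and their leader/quorum view as seen from control-plane-1.
oc -n openshift-etcd exec etcd-control-plane-1 -c etcdctl -- \
    etcdctl endpoint status --cluster -w table

# Quick health probe of every member; the isolated member should fail here.
oc -n openshift-etcd exec etcd-control-plane-1 -c etcdctl -- \
    etcdctl endpoint health --cluster
```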

Comment 4 Michal Fojtik 2021-05-12 16:16:43 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 5 Martin Bukatovic 2021-05-12 16:35:18 UTC
This was reported during a test of a failure scenario for OCS KNIP-1540.

I consider this to be a valid problem which should be fixed.

Comment 6 Michal Fojtik 2021-05-12 17:16:50 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 7 Sam Batschelet 2021-05-12 17:30:27 UTC
>  I consider this to be a valid problem which should be fixed.

This is a hole in the way the client balancer works: it checks that a peer is available, not whether the peer has quorum. Changing this would be a large structural change to the client, which I don't think is going to happen short term. The other solution would be to remove the peer from the endpoint list provided to the apiserver and make that list more dynamic/reactive based on health checks. But the problem with the current implementation is that a change in endpoints requires a new revision of KAS, which in itself could be disruptive. It is a real problem.
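
For context on why the endpoint list is awkward to change: it is rendered
statically into the kube-apiserver configuration, and every change to it rolls
out as a new static-pod revision. A sketch of where to look, with
configmap/field names assumed from a 4.7-era cluster (they may differ):

```
# The static list of etcd endpoints handed to the kube-apiserver
# (exact formatting of config.yaml may vary between versions).
oc -n openshift-kube-apiserver get configmap config -o yaml | grep etcd-servers

# Each change to that configuration is rolled out as a new revision.
oc get kubeapiserver cluster -o jsonpath='{.status.latestAvailableRevision}{"\n"}'
```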

Comment 8 Sam Batschelet 2021-05-14 16:59:39 UTC
Closing this bug and tracking as an RFE https://issues.redhat.com/browse/ETCD-191

Comment 9 Martin Bukatovic 2021-12-02 16:12:18 UTC
This is still reproducible; retried on a vSphere LSO cluster with:

OCP 4.9.0-0.nightly-2021-12-01-080120
LSO 4.9.0-202111151318
ODF 4.9.0-249.ci

Comment 10 Martin Bukatovic 2022-03-11 18:46:12 UTC
Still reproducible on a vSphere LSO cluster with:

OCP 4.10.0-0.nightly-2022-03-10-155847
LSO 4.10.0-202202241648
ODF 4.10.0-187