Bug 1883386

Summary: remaining etcd pods enter stop/start cycle when a master is shut down
Product: OpenShift Container Platform
Component: Etcd
Version: 3.11.0
Target Release: 3.11.z
Target Milestone: ---
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Keywords: Reopened
Reporter: Mitchell Rollinson <mirollin>
Assignee: Suresh Kolichala <skolicha>
QA Contact: ge liu <geliu>
CC: astedefo, dahernan, openshift-bugs-escalate, rcarrier, rhowe, sbatsche, skolicha, vlaad, wlewis
Flags: mirollin: needinfo-
Type: Bug
Last Closed: 2021-05-03 23:52:55 UTC

Description Mitchell Rollinson 2020-09-29 02:05:00 UTC
Description of problem:
Shutting down one master in a three-master cluster results in the remaining etcd pods entering a stop/start cycle until the master that was shut down rejoins the cluster.

*It doesn't matter which master - all of them are affected:

- shut down master1: etcd on master2 and master3 stops working
- shut down master2: etcd on master1 and master3 stops working
- shut down master3: etcd on master1 and master2 stops working

*No CPU constraints are reported
*No memory constraints are reported

Stopping an etcd pod (mv /etc/origin/node/pod/etcd.yaml ./) on a master does NOT result in any error, and a new etcd election takes place successfully.
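
A minimal sketch of that check, with the restore step made explicit (same manifest path as above):

```
# the kubelet stops the etcd static pod once its manifest leaves the pod dir
mv /etc/origin/node/pod/etcd.yaml ./
# ...watch the remaining members elect a new leader, then bring it back
mv ./etcd.yaml /etc/origin/node/pod/
```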

All members appear healthy:

~~~
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://x.x.x.171:2379 |   73f2759599c278 |  3.2.28 |  915 MB |     false |      1996 |  317678448 |
| https://x.x.x.100:2379 | 2c625e979f874649 |  3.2.28 |  915 MB |      true |      1996 |  317678449 |
| https://x.x.x.172:2379 | f1f7ad413df68fd1 |  3.2.28 |  915 MB |     false |      1996 |  317678450 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
~~~

'fio' benchmarks (99th percentile) appear OK
'curl' tests to etcd (node-pod) appear fine
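
For reference, a per-member health check of the kind mentioned above might look like this (a sketch; the cert paths are assumptions for a 3.11 master):

```
# ask one member's /health endpoint directly (per-member, not cluster-wide)
curl -s --cacert /etc/etcd/ca.crt \
     --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key \
     https://x.x.x.171:2379/health
# expected: {"health": "true"}
```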

Version-Release number of selected component (if applicable):

etcd 3.2.28
ocp 3.11.248

How reproducible:

Shut down one master (any of the three).

Steps to Reproduce:
1. Shut down one master in a three-master cluster.
2. Check the etcd pods on the remaining two masters.
3. Observe them enter a stop/start cycle until the downed master rejoins.

Actual results:
Cluster becomes inaccessible (etcd unstable)

Expected results:
etcd elects a new leader, and the cluster remains accessible

Additional info:

Comment 6 Sam Batschelet 2020-10-02 18:08:31 UTC
This is basically expected, unfortunately, and the exact reason why we didn't use this liveness probe in OCP 4. The liveness probe uses cluster-health to validate the health of the etcd member. The problem is that cluster-health actually checks the health of the whole cluster, not of the single member. In the case where one member goes down, the other members will continuously restart. I honestly don't know how you fix this in 3.11 without causing unexpected problems: the usage has been so ingrained in the product for so long that expectations have been built around it. More accurate would simply be a curl against /health.

I will leave this open for a sprint or so to think about possible better solutions, but I am not sure we can change this, honestly.
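
For illustration, the contrast Sam describes, as a rough sketch (the exact probe command in the 3.11 manifest is not quoted in this bug, so flags are assumptions):

```
# what the 3.11 liveness probe effectively runs: a CLUSTER-wide check, so a
# healthy member can be restarted whenever some OTHER member is unreachable
etcdctl --endpoints https://127.0.0.1:2379 cluster-health

# the more accurate per-member check suggested above
curl -s https://127.0.0.1:2379/health    # => {"health": "true"}
```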

Comment 10 Sam Batschelet 2020-10-08 13:37:55 UTC
I wanted to follow up on my reply in https://bugzilla.redhat.com/show_bug.cgi?id=1883386#c6 as it has caused some confusion. cluster-health in 3.2 works much differently than it does in 3.3+.

Here is an example of cluster-health in 3.2.28.

To test this I am using docker-compose to set up a basic 3-node etcd cluster. In the first test I am going to inject disk latency into etcd-1. The net result will be an unhealthy member.
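
A minimal sketch of such a compose file (service names and the image tag are assumptions; etcd reads its configuration from ETCD_* environment variables):

```
version: "2"
services:
  etcd-1:
    image: quay.io/coreos/etcd:v3.2.28
    environment:
      ETCD_NAME: etcd-1
      ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
      ETCD_ADVERTISE_CLIENT_URLS: http://etcd-1:2379
      ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
      ETCD_INITIAL_ADVERTISE_PEER_URLS: http://etcd-1:2380
      ETCD_INITIAL_CLUSTER: etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380,etcd-3=http://etcd-3:2380
  # etcd-2 and etcd-3: identical apart from the name/URL substitutions
```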

# fsync_stress.sh
```
#!/usr/bin/env bash
# Delay every fdatasync() made by the first compose container (etcd-1 here)
# by 1.2s (1200000 us) to simulate a slow disk.
CONTAINER_IDS=($(docker-compose ps -q))
PID=$(docker inspect --format '{{ .State.Pid }}' "${CONTAINER_IDS[0]}")
echo "injecting latency into container id ${CONTAINER_IDS[0]}"

sudo strace -Tfe inject=fdatasync:delay_enter=1200000 -e trace=fdatasync -p "$PID"
```

# stress test 1
docker-compose exec etcd-1 etcdctl cluster-health; echo "$?"
member 23ea4ab19f7e0a41 is unhealthy: got unhealthy result from http://172.21.84.41:2379
member 3424c2711072f3bf is unhealthy: got unhealthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is healthy
0

# stress test 2
docker-compose exec etcd-1 etcdctl cluster-health; echo "$?"
member 23ea4ab19f7e0a41 is unhealthy: got unhealthy result from http://172.21.84.41:2379
member 3424c2711072f3bf is healthy: got healthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is healthy
0

### member down etcd-1 stopped

# test 1 etcd-2
docker-compose exec etcd-2 etcdctl cluster-health; echo "$?"
member 23ea4ab19f7e0a41 is unreachable: [http://172.21.84.41:2379] are all unreachable
member 3424c2711072f3bf is healthy: got healthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is healthy
0

# test 2 query etcd directly
docker-compose exec etcd-2 etcdctl --endpoints http://172.21.84.41:2379 cluster-health; echo "$?"
cluster may be unhealthy: failed to list members
Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.21.84.41:2379: getsockopt: no route to host

error #0: dial tcp 172.21.84.41:2379: getsockopt: no route to host

4

So this is technically the correct net result: it returns non-zero because the command itself fails.

# conclusion tests cluster-health etcd 3.2.28

The command fails to serve its purpose because of a bug [1], which was fixed in [2]: even with a member down or unhealthy, 3.2.28 still prints "cluster is healthy" and exits 0.

Since etcd 3.3, cluster-health reports failure if one member is down, which was the assumption I used in https://bugzilla.redhat.com/show_bug.cgi?id=1883386#c6

# test 1 etcd-1 stopped etcd 3.3.23
docker-compose exec etcd-2 etcdctl cluster-health; echo "$?"
failed to check the health of member 23ea4ab19f7e0a41 on http://172.21.84.41:2379: Get http://172.21.84.41:2379/health: dial tcp 172.21.84.41:2379: i/o timeout
member 23ea4ab19f7e0a41 is unreachable: [http://172.21.84.41:2379] are all unreachable
member 3424c2711072f3bf is healthy: got healthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is degraded
5

# test 2 etcd-1 unhealthy etcd 3.3.23
docker-compose exec etcd-2 etcdctl cluster-health; echo "$?"
member 23ea4ab19f7e0a41 is unhealthy: got unhealthy result from http://172.21.84.41:2379
member 3424c2711072f3bf is healthy: got healthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is degraded
5

# conclusion tests cluster-health etcd 3.3.23

As shown by the tests, cluster-health would result in the liveness probe failing in the case of a slow member or a single member down. But as OCP 3.11 uses etcd 3.2, NOT 3.3, the failure I described in https://bugzilla.redhat.com/show_bug.cgi?id=1883386#c6 would not occur. Excuse my overly quick reply; I will review the logs in detail now.


[1] https://github.com/etcd-io/etcd/issues/8061
[2] https://github.com/etcd-io/etcd/pull/8070

Comment 13 Mitchell Rollinson 2020-10-08 22:35:40 UTC
Hi Sam,

Some additional detail for you ..

~~~
**When do you see the status change for your kube-system pods (please provide outputs)? How long after downing the master?**
~~~

~~~
> shutdown master91 with init 0
-- etcd containers on master92 and master93 are continuously restarting

> check pod status after 5min (run check command multiple times):
-- api offline:
No resources found.
The connection to the server masters.xxx:443 was refused - did you specify the right host or port?

> check status after 7min:
-- api is up again. output shows the following:
master-api master91 unknown
master-controllers master91 unknown
master-etcd master91 unknown

> check status after 8min:
-- api offline again.

> check status after 17min:
-- api still offline.

> check status after 20min:
-- api still offline.

check etcd containers on masters
master92:
CONTAINER ID        IMAGE                                                                                                                         CREATED             STATE               NAME                ATTEMPT
f66ea058d9805       ab67bc94ee69845100213c2befe7f931b2d2588544de90bc5de7ed26c3bb8391                                                              2 minutes ago       Running             controllers         34
0ad19d8c882e3       ab67bc94ee69845100213c2befe7f931b2d2588544de90bc5de7ed26c3bb8391                                                              4 minutes ago       Exited              api                 129
212410aacde7c       45b99cdb08f5ab1e86941340f48098eb17a1a31c209ff83eb30e529aa19a7ed1                                                              4 minutes ago       Exited              etcd                8

master92:
CONTAINER ID        IMAGE                                                                                                                         CREATED              STATE               NAME                ATTEMPT
f12b44bf555c1       ab67bc94ee69845100213c2befe7f931b2d2588544de90bc5de7ed26c3bb8391                                                              About a minute ago   Exited              api                 127
8b545bb7dd34f       45b99cdb08f5ab1e86941340f48098eb17a1a31c209ff83eb30e529aa19a7ed1                                                              6 minutes ago        Exited              etcd                18
211a8f6e04115       ab67bc94ee69845100213c2befe7f931b2d2588544de90bc5de7ed26c3bb8391                                                              11 minutes ago       Running             controllers         26

> test canceled after 30min.
~~~
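
(The "check command" used above is presumably something along these lines; the exact flags are an assumption:)

```
oc get pods -n kube-system -o wide   # per-master status of the master-api / controllers / etcd static pods
```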

Regards

Comment 15 Mitchell Rollinson 2020-10-12 00:56:55 UTC
Hi Sam,

Just a small correction, RE comment #13 test results

~~~
Just realized that I made a typo in my test.
**check etcd containers on masters**
> I mentioned master92 twice, but actually one of the two snippets shows the output for master93.
~~~

Regards

Comment 36 Mitchell Rollinson 2020-12-17 03:59:50 UTC
Created attachment 1739864 [details]
master191 logs

Comment 47 Mitchell Rollinson 2021-03-09 03:28:30 UTC
Hi again

Are you aware of an inventory variable that can be set to make etcd customizations permanent through an upgrade?

e.g. timeoutSeconds: 6

Editing /etc/origin/node/pod/etcd.yaml is not persistent through an upgrade.

It is possible to make the change in /usr/share/ansible/openshift-ansible/roles/etcd/files/etcd.yaml as supplied by the openshift-ansible-roles-3.11.380-1.git.0.983c5d1.el7.noarch RPM, but this is not ideal.
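
For context, the setting in question is presumably the liveness probe timeout in the etcd static pod manifest, along these lines (a hypothetical reconstruction based on comment 6's description of the probe, not the shipped file verbatim):

```
# fragment of /etc/origin/node/pod/etcd.yaml
    livenessProbe:
      exec:
        command: ["/bin/sh", "-c", "etcdctl ... cluster-health"]  # per comment 6
      timeoutSeconds: 6   # the customization mentioned above
```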

Regards

Mitch

Comment 48 Suresh Kolichala 2021-03-10 17:13:42 UTC
@mirollin Instead of the yaml file, you can consider adding it to /etc/etcd/etcd.conf
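
For reference, /etc/etcd/etcd.conf holds ETCD_* environment-style settings consumed by the etcd process itself, e.g. (values illustrative):

```
ETCD_NAME=master-0.example.com
ETCD_HEARTBEAT_INTERVAL=100   # ms
ETCD_ELECTION_TIMEOUT=1000    # ms
```

These tune the etcd process itself; how the probe timeout maps onto this file is exactly what the next comment asks.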

Comment 49 Mitchell Rollinson 2021-03-12 04:09:40 UTC
Thanks Suresh.

I was not aware it could be specified therein.

Can you advise what the parameter name & format would look like in etcd.conf ?

thanks

Comment 55 Suresh Kolichala 2021-04-27 15:59:17 UTC
Thanks Ryan.

Based on Ryan's recommendation, we suggest removing the liveness probe in 3.x installations.

Can we close this bug in deference to this recommendation?

Comment 56 Suresh Kolichala 2021-05-03 23:52:55 UTC
In 3.11 the liveness probe is just wrong: it checks the health of the cluster, not of the individual container.

Closing the bug, with the recommendation to turn off the liveness probe in 3.x installations.
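
For anyone landing here later, a minimal sketch of that recommendation, assuming the static pod manifest path quoted earlier in this bug:

```
# on each master, delete (or comment out) the livenessProbe: stanza in the
# etcd static pod manifest; the kubelet re-reads the file and restarts the pod
sudo vi /etc/origin/node/pod/etcd.yaml
# note: per comment 47, openshift-ansible re-renders this file on upgrade,
# so the edit must be reapplied (or made in the ansible role's copy)
```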