Bug 1883386
Summary: | remaining etcd pods enter stop/start cycle when master is shut down | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Mitchell Rollinson <mirollin>
Component: | Etcd | Assignee: | Suresh Kolichala <skolicha>
Status: | CLOSED WONTFIX | QA Contact: | ge liu <geliu>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 3.11.0 | CC: | astedefo, dahernan, openshift-bugs-escalate, rcarrier, rhowe, sbatsche, skolicha, vlaad, wlewis
Target Milestone: | --- | Keywords: | Reopened
Target Release: | 3.11.z | Flags: | mirollin: needinfo-, mirollin: needinfo-
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-05-03 23:52:55 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (Mitchell Rollinson, 2020-09-29 02:05:00 UTC)
This is basically expected, unfortunately, and it is the exact reason why we did not use this liveness probe in OCP 4. The liveness probe uses cluster-health to validate the health of the etcd member. The problem is that cluster-health actually checks the health of the cluster, not of the single member. In the case where one member goes down, the other members will continuously restart. I honestly don't know how to fix this in 3.11 without causing unexpected problems: the usage has been so ingrained in the product for so long that expectations have been built around it. A more accurate check would simply be a curl against /health. I will leave this open for a sprint or so to think about possible better solutions, but I am not sure we can change this, honestly.

I wanted to follow up on my reply in https://bugzilla.redhat.com/show_bug.cgi?id=1883386#c6, as it has caused some confusion. cluster-health in 3.2 works much differently than it does in 3.3+. Here is an example of cluster-health in 3.2.28. To test this I am using docker-compose to set up a basic 3-node etcd cluster. In the first test I am going to inject disk latency into etcd-1. The net result will be an unhealthy member.

# fsync_stress.sh
```
# Delay every fdatasync call in the first etcd container by 1.2 s to simulate a slow disk.
CONTAINER_IDS=($(docker-compose ps -q))
PID=$(docker inspect --format '{{ .State.Pid }}' ${CONTAINER_IDS[0]})
echo -e "injecting latency into container id ${CONTAINER_IDS[0]}"
sudo strace -Tfe inject=fdatasync:delay_enter=1200000 -e trace=fdatasync -p $PID
```

# stress test 1
```
docker-compose exec etcd-1 etcdctl cluster-health; echo "$?"
member 23ea4ab19f7e0a41 is unhealthy: got unhealthy result from http://172.21.84.41:2379
member 3424c2711072f3bf is unhealthy: got unhealthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is healthy
0
```

# stress test 2
```
$ docker-compose exec etcd-1 etcdctl cluster-health; echo "$?"
member 23ea4ab19f7e0a41 is unhealthy: got unhealthy result from http://172.21.84.41:2379
member 3424c2711072f3bf is healthy: got healthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is healthy
0
```

### member down (etcd-1 stopped)

# test 1 (from etcd-2)
```
docker-compose exec etcd-2 etcdctl cluster-health; echo "$?"
member 23ea4ab19f7e0a41 is unreachable: [http://172.21.84.41:2379] are all unreachable
member 3424c2711072f3bf is healthy: got healthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is healthy
0
```

# test 2 (query the stopped member directly)
```
docker-compose exec etcd-2 etcdctl --endpoints http://172.21.84.41:2379 cluster-health; echo "$?"
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 172.21.84.41:2379: getsockopt: no route to host
error #0: dial tcp 172.21.84.41:2379: getsockopt: no route to host
4
```
So this is technically the correct net result, and it returns non-zero because the command itself fails.

# conclusion: cluster-health tests, etcd 3.2.28

The command actually serves its purpose, but because of a bug [1], which was fixed in [2], it still exits 0 when a member is down. Since etcd 3.3, cluster-health reports failure if one member is down, which was the assumption I used in https://bugzilla.redhat.com/show_bug.cgi?id=1883386#c6.

# test 1 (etcd-1 stopped, etcd 3.3.23)
```
docker-compose exec etcd-2 etcdctl cluster-health; echo "$?"
failed to check the health of member 23ea4ab19f7e0a41 on http://172.21.84.41:2379: Get http://172.21.84.41:2379/health: dial tcp 172.21.84.41:2379: i/o timeout
member 23ea4ab19f7e0a41 is unreachable: [http://172.21.84.41:2379] are all unreachable
member 3424c2711072f3bf is healthy: got healthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is degraded
5
```

# test 2 (etcd-1 unhealthy, etcd 3.3.23)
```
dc exec etcd-2 etcdctl cluster-health; echo "$?"
member 23ea4ab19f7e0a41 is unhealthy: got unhealthy result from http://172.21.84.41:2379
member 3424c2711072f3bf is healthy: got healthy result from http://172.21.84.42:2379
member 8238fb8747162a10 is healthy: got healthy result from http://172.21.84.43:2379
cluster is degraded
5
```

# conclusion: cluster-health tests, etcd 3.3.23

As shown by the tests, cluster-health would cause the liveness probe to fail for a slow member or for a single member being down. But since OCP 3.11 uses etcd 3.2, NOT 3.3, the failure I described in https://bugzilla.redhat.com/show_bug.cgi?id=1883386#c6 would not apply. Excuse my overly quick reply; I will review the logs in detail now.

[1] https://github.com/etcd-io/etcd/issues/8061
[2] https://github.com/etcd-io/etcd/pull/8070

Hi Sam,

Some additional detail for you.

> **When do you see the status change for your kube-system pods (please provide outputs)? How long after downing the master?**

- Shut down master91 with init 0; the etcd containers on master92 and master93 restart continuously.
- Check pod status after 5 min (ran the check command multiple times): API offline. "No resources found. The connection to the server masters.xxx:443 was refused - did you specify the right host or port?"
- Check status after 7 min: API is up again. Output shows the following:
  master-api master91 unknown
  master-controllers master91 unknown
  master-etcd master91 unknown
- Check status after 8 min: API offline again.
- Check status after 17 min: API still offline.
- Check status after 20 min: API still offline.

Check etcd containers on the masters:

master92:
```
CONTAINER ID    IMAGE                                                              CREATED         STATE     NAME          ATTEMPT
f66ea058d9805   ab67bc94ee69845100213c2befe7f931b2d2588544de90bc5de7ed26c3bb8391   2 minutes ago   Running   controllers   34
0ad19d8c882e3   ab67bc94ee69845100213c2befe7f931b2d2588544de90bc5de7ed26c3bb8391   4 minutes ago   Exited    api           129
212410aacde7c   45b99cdb08f5ab1e86941340f48098eb17a1a31c209ff83eb30e529aa19a7ed1   4 minutes ago   Exited    etcd          8
```

master92:
```
CONTAINER ID    IMAGE                                                              CREATED              STATE     NAME          ATTEMPT
f12b44bf555c1   ab67bc94ee69845100213c2befe7f931b2d2588544de90bc5de7ed26c3bb8391   About a minute ago   Exited    api           127
8b545bb7dd34f   45b99cdb08f5ab1e86941340f48098eb17a1a31c209ff83eb30e529aa19a7ed1   6 minutes ago        Exited    etcd          18
211a8f6e04115   ab67bc94ee69845100213c2befe7f931b2d2588544de90bc5de7ed26c3bb8391   11 minutes ago       Running   controllers   26
```

- Test canceled after 30 min.

Regards

Hi Sam,

Just a small correction regarding the comment #13 test results: I just realized that I made a typo in my test under "check etcd containers on masters". I mentioned master92 twice, but one of the two snippets actually shows the output for master93.

Regards

Created attachment 1739864 [details]: master191 logs
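To make the distinction discussed above concrete, a minimal sketch of how the cluster-wide check used by the 3.11 liveness probe compares with a per-member check against /health, run from a surviving master while another master is down. The certificate paths and the client URL are assumptions for a default 3.11 install, not taken from this bug; adjust them to your environment.

```
# Run on a surviving master while one master is down. Certificate paths and the
# client URL below are assumptions for a default 3.11 install; adjust as needed.
ENDPOINT="https://$(hostname -i):2379"   # etcd client URL of the local member

# 1) The cluster-wide check the 3.11 liveness probe relies on. Depending on the
#    etcd version, its output and exit code reflect the state of other members,
#    not just the local one.
etcdctl --ca-file /etc/etcd/ca.crt \
        --cert-file /etc/etcd/peer.crt \
        --key-file /etc/etcd/peer.key \
        --endpoints "$ENDPOINT" cluster-health; echo "cluster-health exit=$?"

# 2) The per-member check suggested above: query the local member's /health
#    endpoint. With two of three members still up, this keeps reporting healthy.
curl --cacert /etc/etcd/ca.crt \
     --cert /etc/etcd/peer.crt \
     --key /etc/etcd/peer.key \
     -s "$ENDPOINT/health"; echo " /health exit=$?"
```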
Hi again,

Are you aware of an inventory variable that can be set to make etcd customizations permanent through an upgrade, e.g. timeoutSeconds: 6? Editing /etc/origin/node/pod/etcd.yaml is not persistent through an upgrade. It is possible to make the change in /usr/share/ansible/openshift-ansible/roles/etcd/files/etcd.yaml as supplied by the openshift-ansible-roles-3.11.380-1.git.0.983c5d1.el7.noarch RPM, but this is not ideal.

Regards, Mitch

@mirollin Instead of the yaml file, you can consider adding it to /etc/etcd/etcd.conf.

Thanks Suresh. I was not aware it could be specified there. Can you advise what the parameter name and format would look like in etcd.conf? Thanks.

Thanks Ryan. Based on Ryan's recommendation, it is suggested that we remove the liveness probe in 3.x installations. Can we close this bug in deference to this recommendation?

In 3.11 the liveness probe is simply wrong: it checks the health of the cluster, not of the individual container. Closing the bug, with the recommendation to turn off the liveness probe in 3.x installations.
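For reference, a rough sketch of what acting on that recommendation could look like on each 3.11 master. The manifest path and the shape of the livenessProbe stanza are assumptions, not taken from this bug; back up the file and review the edited YAML before relying on it.

```
# Hypothetical removal of the etcd liveness probe on a 3.11 master.
MANIFEST=/etc/origin/node/pods/etcd.yaml      # adjust to your install's static pod path
cp "$MANIFEST" /root/etcd.yaml.bak            # keep a backup
grep -n -A 8 'livenessProbe:' "$MANIFEST"     # inspect the stanza first
# Delete the stanza, assumed here to end at its timeoutSeconds key; the node
# service watches the static pod directory and recreates the pod after the edit.
sed -i '/livenessProbe:/,/timeoutSeconds:/d' "$MANIFEST"
```

As noted earlier in the thread, an edit like this does not survive an upgrade unless it is also carried in the openshift-ansible etcd role (or an equivalent inventory-level mechanism).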