Bug 1886160
| Summary: | [sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Serial] | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> |
| Component: | Etcd | Assignee: | Maru Newby <mnewby> |
| Status: | CLOSED NOTABUG | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | aos-bugs, ccoleman, gmontero, mfojtik, mnewby, sbatsche, skolicha, wking, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | tag-ci | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | job=release-openshift-origin-installer-e2e-aws-disruptive-4.6=all |
| Last Closed: | 2021-06-02 16:15:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1947705, 1954364 | | |
Description (David Eads, 2020-10-07 18:56:42 UTC)
The DR process did not change from 4.5 to 4.6. The failure, which is often the result of a timeout while waiting for machines to delete, should be looked at more carefully. I began some triage in the following PR [1] to address general concerns about possible lag from load balancer health checks. Moving to the cloud team to continue triage.

```
fail [github.com/openshift/origin/test/extended/dr/quorum_restore.go:195]: Unexpected error:
    <*errors.StatusError | 0xc001cb10e0>: {
        ErrStatus: {
            TypeMeta: {Kind: "Status", APIVersion: "v1"},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "Internal error occurred: resource quota evaluation timed out",
            Reason: "InternalError",
            Details: {
                Name: "",
                Group: "",
                Kind: "",
                UID: "",
                Causes: [
                    {
                        Type: "",
                        Message: "resource quota evaluation timed out",
                        Field: "",
                    },
                ],
                RetryAfterSeconds: 0,
            },
            Code: 500,
        },
    }
    Internal error occurred: resource quota evaluation timed out
occurred
```

[1] https://github.com/openshift/origin/pull/25585

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/459/pull-ci-openshift-cluster-etcd-operator-master-e2e-disruptive/1311384501846282240

The test in question deletes the master/control-plane machines via the machine-api. Accordingly, we drain those machines prior to removing them from the cloud. It seems the k8s API server needs to account for this (the AWS LB is slow to realize members have failed): https://bugzilla.redhat.com/show_bug.cgi?id=1869629#c5

We should continue accepting connections until X amount of time has passed, to ensure the instance has been removed from the LB. Here's where we're deleting the masters in the test in question: https://github.com/openshift/origin/blob/ec3b979387ddc58d3f78e7c92eb4cd85b70857b0/test/extended/dr/quorum_restore.go#L118

I don't see anywhere that we're applying the skip-drain annotation.
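The "resource quota evaluation timed out" failure the test hits is an HTTP 500 InternalError, i.e. a transient server-side condition rather than a permanent rejection. A minimal sketch (illustrative only, not the test's actual Go code) of retrying an operation on such transient errors until a deadline:

```python
import time


class TransientAPIError(Exception):
    """Stands in for a retryable 500 such as 'resource quota evaluation timed out'."""


def retry_until_deadline(op, timeout, interval=0.01,
                         clock=time.monotonic, sleep=time.sleep):
    """Call op() until it succeeds; retry transient errors until the
    deadline, then re-raise the last one. Non-transient errors propagate
    immediately."""
    deadline = clock() + timeout
    while True:
        try:
            return op()
        except TransientAPIError:
            if clock() >= deadline:
                raise
            sleep(interval)
```

In this shape, the machine-creation call from quorum_restore.go would sit inside `op()`; the names here are hypothetical.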
The machine-controller logs were wiped when the master it was running on went down, but we emit events that demonstrate we did indeed drain each master (one entry):

```
"message": "Node \"ip-10-0-178-1.us-west-2.compute.internal\" drained"
```

cc Clayton

After further review, it seems there is a disruption in service during scale-up of machines that we did not experience in 4.5. While this is not ideal, the problem is not blocking, as DR will complete as expected. Moving this to 4.7 to continue exploration.

```
Oct 8 16:34:27.007: INFO: Create new masters
Oct 8 16:34:27.007: INFO: Creating master sbatsche-rc0-2020-10-qg4sp-master-2
Oct 8 16:34:31.248: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:34:36.360: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:34:41.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:34:46.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:34:51.351: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:34:56.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:01.361: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:06.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:11.349: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:16.569: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:21.351: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:26.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:31.352: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:36.368: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:41.353: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:46.369: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:51.352: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:35:56.357: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:36:01.351: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:36:06.361: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:36:11.352: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:36:16.349: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:36:21.369: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:37:00.509: INFO: error creating "sbatsche-rc0-2020-10-qg4sp-master-2": error "context deadline exceeded"
Oct 8 16:37:11.839: INFO: error creating "sbatsche-rc0-2020-10-qg4sp-master-2": error "Internal error occurred: resource quota evaluation timed out"
Oct 8 16:37:26.357: INFO: error creating "sbatsche-rc0-2020-10-qg4sp-master-2": error "Internal error occurred: resource quota evaluation timed out"
Oct 8 16:37:31.369: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:37:36.355: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:37:41.359: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:37:46.357: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:37:51.375: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:37:56.362: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
Oct 8 16:38:01.369: INFO: Creating master sbatsche-rc0-2020-10-qg4sp-master-1
```

My observations on this issue: I am trying to test this manually, step by step. I brought up a cluster on GCP with the bootstrap node preserved so that it can be used as a bastion node. I deleted two machines using:

```
oc project openshift-machine-api
oc delete machine vareti-46-20201014-1-fbg92-master-1
oc delete machine vareti-46-20201014-1-fbg92-master-2
```

After some time the apiserver became unresponsive. I could not see the VM instances in the cloud console either. SSH into the remaining master node and perform the backup:

```
sudo -i /bin/bash -cx 'rm -rf /home/core/backup; /usr/local/bin/cluster-backup.sh ~core/backup'
```

Now restore the control plane from the backup:

```
sudo -i /bin/bash -cx '/usr/local/bin/cluster-restore.sh /home/core/backup'
```

Now wait until the API server is back up again. The API server was up for a brief moment and then became unavailable. I logged into master-1 to see what happened and found that etcd was restarting continuously.
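The "Waiting for old machine object ... to be deleted so we can create a new one" lines above come from a wait-then-create loop. A simplified Python sketch of that shape (illustrative only; the test's real loop is Go in quorum_restore.go, and the callable names here are hypothetical stand-ins for machine-api lookups and creates):

```python
import time


def recreate_machine(machine_exists, create_machine, timeout, interval=0.01,
                     clock=time.monotonic, sleep=time.sleep):
    """Block until the old Machine object is gone, then create its
    replacement. Raises TimeoutError if the old object never disappears."""
    deadline = clock() + timeout
    while machine_exists():
        if clock() >= deadline:
            raise TimeoutError("old machine object was never deleted")
        sleep(interval)
    return create_machine()
```

The create step itself can still fail transiently (as the "error creating" lines show), so in practice it would also be retried.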
It is not able to fetch member info from master-0. I see these two messages in the error logs:

```
2020-10-14 20:52:12.387924 W | etcdserver: could not get cluster response from https://10.0.0.5:2380: Get https://10.0.0.5:2380/members: EOF
2020-10-14 20:52:12.389328 C | etcdmain: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs
```

I logged into master-0 and checked the etcd container logs. I saw messages saying the connection was rejected:

```
2020-10-14 19:27:48.061264 I | embed: rejected connection from "10.0.0.7:55686" (error "tls: \"10.0.0.7\" does not match any of DNSNames [\"localhost\" \"10.0.0.3\"]", ServerName "", IPAddresses ["10.0.0.3"], DNSNames ["localhost" "10.0.0.3"])
```

From the bootstrap node, I could see that the etcd IPs were different:

```
2020-10-14 20:57:51.925390 I | etcdserver/membership: added member 4bba455bc1fbc5ae [https://10.0.0.3:2380] to cluster 18c4c43dc346df9a
2020-10-14 20:57:51.929270 I | etcdserver/membership: added member a03f36bbf0fbae67 [https://10.0.0.4:2380] to cluster 18c4c43dc346df9a
2020-10-14 20:57:51.932889 I | etcdserver/membership: added member e33178e1f8b8d32a [https://10.0.0.5:2380] to cluster 18c4c43dc346df9a
```

So it is possible that the change in cluster IPs is making the cluster unusable after DR. It is also possible that something else is going on, and I will try to look into it more tomorrow.

I wanted to add a few things that might help you triage further. In 4.6, if we tolerate disruption failures, the test will pass [1].

> 2020-10-14 19:27:48.061264 I | embed: rejected connection from "10.0.0.7:55686" (error "tls: \"10.0.0.7\" does not match any of DNSNames [\"localhost\" \"10.0.0.3\"]", ServerName "", IPAddresses ["10.0.0.3"], DNSNames ["localhost" "10.0.0.3"])

While this is certainly a TLS auth failure, I believe it is not the root cause, for the same reason that the test will pass.
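The rejected-connection message is etcd's TLS layer refusing a peer whose address does not appear in the presented certificate's SANs. A toy illustration of that membership check (greatly simplified; the real verification is done by the TLS stack against the certificate, not by list lookups):

```python
def peer_matches_sans(peer_addr, dns_names, ip_addresses):
    """Return True if the peer address appears in the certificate's
    DNSNames or IPAddresses SAN lists (simplified illustration)."""
    return peer_addr in dns_names or peer_addr in ip_addresses


# The certificate from the log only names localhost and 10.0.0.3,
# so a member arriving from 10.0.0.7 is rejected.
print(peer_matches_sans("10.0.0.7", ["localhost", "10.0.0.3"], ["10.0.0.3"]))
```

This is why a restored cluster whose members came back on new IPs can see peer connections rejected against certificates minted for the old IPs.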
I believe the reason this works is that we key the certificates for etcd on the node name [2]. Even if we created a machine with the same name, my understanding is that the node name will be different because the IP is different. For example, here is the list of nodes for the test:

```
Oct 23 04:08:09.796: INFO: Pod name: ds1-dcrrx Node Name: ip-10-0-244-135.us-west-1.compute.internal
Oct 23 04:08:09.796: INFO: Pod name: ds1-hbx95 Node Name: ip-10-0-184-152.us-west-1.compute.internal
Oct 23 04:08:09.796: INFO: Pod name: ds1-plckl Node Name: ip-10-0-156-141.us-west-1.compute.internal
Oct 23 04:08:09.796: INFO: Pod name: ds1-qbqkh Node Name: ip-10-0-151-105.us-west-1.compute.internal
Oct 23 04:08:09.796: INFO: Pod name: ds1-qg275 Node Name: ip-10-0-245-7.us-west-1.compute.internal
Oct 23 04:08:09.796: INFO: Pod name: ds1-szx24 Node Name: ip-10-0-149-232.us-west-1.compute.internal
```

Here is the list of control-plane nodes post-restore:

```
Oct 23 04:29:09.216: INFO: New master: "ip-10-0-151-105.us-west-1.compute.internal" IPs: [{InternalIP 10.0.151.105} {Hostname ip-10-0-151-105.us-west-1.compute.internal} {InternalDNS ip-10-0-151-105.us-west-1.compute.internal}]
Oct 23 04:29:09.216: INFO: New master: "ip-10-0-179-160.us-west-1.compute.internal" IPs: [{InternalIP 10.0.179.160} {Hostname ip-10-0-179-160.us-west-1.compute.internal} {InternalDNS ip-10-0-179-160.us-west-1.compute.internal}]
Oct 23 04:29:09.216: INFO: New master: "ip-10-0-182-63.us-west-1.compute.internal" IPs: [{InternalIP 10.0.182.63} {Hostname ip-10-0-182-63.us-west-1.compute.internal} {InternalDNS ip-10-0-182-63.us-west-1.compute.internal}]
```

So the code would roll new certs, and when etcd scales to these new nodes I would not expect them to have TLS auth issues.
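Keying certs on the node name means a replacement master (new IP, hence new node name) gets a freshly generated secret rather than colliding with the old node's certificate. A hypothetical sketch of that naming idea; the `etcd-serving-` prefix here is an assumption for illustration, not a verified constant from tlshelpers.go:

```python
def serving_secret_name(node_name):
    # Hypothetical naming scheme: one serving-cert secret per node,
    # keyed on the node name (the prefix is an assumption).
    return "etcd-serving-" + node_name


old = serving_secret_name("ip-10-0-156-141.us-west-1.compute.internal")
new = serving_secret_name("ip-10-0-179-160.us-west-1.compute.internal")
# New IP -> new node name -> distinct cert secret, so no stale-cert reuse.
print(old != new)
```

The design consequence is the one the comment draws: when etcd scales onto the replacement nodes, certs are rolled for the new names and TLS auth should not hit the old-SAN mismatch.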
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25624/pull-ci-openshift-origin-release-4.6-e2e-aws-disruptive/1319480380121681920
[2] https://github.com/openshift/cluster-etcd-operator/blob/2c4bd124cdb7954fed806d1afe668c0c52c8fac7/pkg/tlshelpers/tlshelpers.go#L33

Finally, I could reproduce this locally and made some progress. The log messages make me suspect the load balancer. In the DR test, there is a section that waits for the API server to be up after restoring the cluster. Surprisingly, the API server is responding to requests when it is clearly not ready. I could tell from these messages:

```
I1028 21:37:30.718241      19 healthz.go:258] etcd,shutdown check failed: readyz
[-]etcd failed: error getting data from etcd: context deadline exceeded
[-]shutdown failed: process is shutting down
I1028 21:37:30.769160      19 healthz.go:258] etcd,shutdown check failed: readyz
[-]etcd failed: error getting data from etcd: context deadline exceeded
[-]shutdown failed: process is shutting down
W1028 21:37:30.807299      19 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.5:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.5:2379: connect: connection timed out". Reconnecting...
```

I am also seeing these messages in the log:

```
W1028 21:25:00.095100      19 patch_genericapiserver.go:123] Request to "/api/v1/namespaces/openshift-monitoring/endpoints" (source IP 10.0.32.3:53172, user agent "Prometheus/2.21.0") before server is ready, possibly a sign for a broken load balancer setup.
W1028 21:25:00.095912      19 patch_genericapiserver.go:123] Request to "/api/v1/namespaces/openshift-service-ca-operator/services" (source IP 10.0.32.3:53172, user agent "Prometheus/2.21.0") before server is ready, possibly a sign for a broken load balancer setup.
W1028 21:25:00.096053      19 patch_genericapiserver.go:123] Request to "/api/v1/namespaces/openshift-sdn/pods" (source IP 10.0.32.3:53172, user agent "Prometheus/2.21.0") before server is ready, possibly a sign for a broken load balancer setup.
```

The etcd check fails because the kube-apiserver arguments may still contain stale etcd endpoints; kube-apiserver goes through another restart with only the etcd endpoint next time.

This bug is actively being worked on.

Siva is actively working on this.

Still working on this.

In looking at this test, there is definitely a bug in MHC interacting with stopped machines on GCP (at least):

1. Set up an MHC with a 5m condition on ready (as per the test).
2. Stop the machine via SSH with a force poweroff.
3. MHC detects the down machine and marks the machine deleted.

Expected: the stopped and deleted machine completes deletion in seconds (or at worst a minute, because no drain is necessary and the cloud knows that the machine is down).

Actual: the machine is reported as Deleting for at least 15m before the machine object is removed.

Filed https://bugzilla.redhat.com/show_bug.cgi?id=1905709 for the issue in comment 14.

Verified with 4.7.0-0.nightly-2020-12-20-031835. I checked the e2e test but couldn't get more log messages, so I tried it manually, step by step: deleted 2 master nodes (and also the machines), restored the cluster to a 1-master cluster, then added 2 machines, and the cluster recovered with the 2 new machines.
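The restore flow waits for the API server to come back by polling readiness. A minimal sketch of such a poll with a deadline, using an injected probe so it stays cluster-agnostic (illustrative only; the test's real wait lives in quorum_restore.go). The lesson from the logs above is that the probe must check readiness (e.g. a passing /readyz), not mere reachability, since the server can answer requests while readyz is still failing:

```python
import time


def wait_until_ready(probe, timeout, interval=0.01,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() (e.g. "GET /readyz returned 200") until it reports
    ready or the deadline passes. True if ready in time, else False."""
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```

A probe that merely checks "did I get an HTTP response?" would return ready too early here; checking the readyz verdict avoids that.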
```
# oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-133-229.us-east-2.compute.internal   Ready    worker   21h    v1.20.0+87544c5
ip-10-0-142-119.us-east-2.compute.internal   Ready    master   4m5s   v1.20.0+87544c5
ip-10-0-176-124.us-east-2.compute.internal   Ready    worker   21h    v1.20.0+87544c5
ip-10-0-183-19.us-east-2.compute.internal    Ready    master   21h    v1.20.0+87544c5
ip-10-0-197-107.us-east-2.compute.internal   Ready    master   3m40s  v1.20.0+87544c5
ip-10-0-198-212.us-east-2.compute.internal   Ready    worker   21h    v1.20.0+87544c5

# oc get pods
NAME                                              READY   STATUS    RESTARTS   AGE
etcd-ip-10-0-142-119.us-east-2.compute.internal   3/3     Running   0          6m4s
etcd-ip-10-0-183-19.us-east-2.compute.internal    3/3     Running   0          5m13s
etcd-ip-10-0-197-107.us-east-2.compute.internal   3/3     Running   0          6m33s
etcd-quorum-guard-5fb4bcb4dc-dsnvs                1/1     Running   0          22m
etcd-quorum-guard-5fb4bcb4dc-n7dwh                1/1     Running   0          21h
etcd-quorum-guard-5fb4bcb4dc-vqhpb                1/1     Running   0          22m
```

Back to POST; more fixes are required.

The disruptive job is on track to be fixed in master in the immediate future (https://github.com/openshift/origin/pull/25774 and https://github.com/openshift/cluster-etcd-operator/pull/540). Once the disruptive job is passing consistently against master, the etcd and origin changes required to get it there should probably be backported all the way back (4.5?) to ensure against regression.

Still in process.

This bz is in-process until https://github.com/openshift/origin/pull/25774 merges, and there is no requirement to tie completion to a specific release.

https://github.com/openshift/origin/pull/25774 has merged, which fixes the quorum restore test to follow the documented restore procedure. https://bugzilla.redhat.com/show_bug.cgi?id=1956609 tracks resolution of the CMA issue that prevents the test from passing.