Bug 1886160

Summary: [sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Serial]
Product: OpenShift Container Platform
Reporter: David Eads <deads>
Component: Etcd
Assignee: Maru Newby <mnewby>
Status: CLOSED NOTABUG
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Docs Contact:
Version: 4.6
CC: aos-bugs, ccoleman, gmontero, mfojtik, mnewby, sbatsche, skolicha, wking, xxia
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: tag-ci
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment: job=release-openshift-origin-installer-e2e-aws-disruptive-4.6=all
Last Closed: 2021-06-02 16:15:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1947705, 1954364

Description David Eads 2020-10-07 18:56:42 UTC
test:
[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Serial] 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-etcd%5C%5D%5C%5BFeature%3ADisasterRecovery%5C%5D%5C%5BDisruptive%5C%5D+%5C%5BFeature%3AEtcdRecovery%5C%5D+Cluster+should+restore+itself+after+quorum+loss+%5C%5BSerial%5C%5D


Disruptive tests show failures.  There are several tests that fail because the cluster state isn't right, but I suspect it all goes back to "etcd didn't recover very well".

The environment setting on this bug marks all failed tests in this job as the "fault" of this bug.

Comment 1 Sam Batschelet 2020-10-07 19:13:08 UTC
The DR process did not change from 4.5 to 4.6. The failure, which is often the result of a timeout waiting for machines to delete, should be looked at more carefully. I began some triage in the following PR [1] to address general concerns about possible lag from load balancer health checks. Moving to the cloud team to continue triage.

```
fail [github.com/openshift/origin/test/extended/dr/quorum_restore.go:195]: Unexpected error:
    <*errors.StatusError | 0xc001cb10e0>: {
        ErrStatus: {
            TypeMeta: {Kind: "Status", APIVersion: "v1"},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "Internal error occurred: resource quota evaluation timed out",
            Reason: "InternalError",
            Details: {
                Name: "",
                Group: "",
                Kind: "",
                UID: "",
                Causes: [
                    {
                        Type: "",
                        Message: "resource quota evaluation timed out",
                        Field: "",
                    },
                ],
                RetryAfterSeconds: 0,
            },
            Code: 500,
        },
    }
    Internal error occurred: resource quota evaluation timed outoccurred
```

[1] https://github.com/openshift/origin/pull/25585
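If it helps anyone reproducing the timeout, a quick way to watch the machine deletion while the test runs is something like the following (the label selector is the standard machine-api role label; `<stuck-machine-name>` is a placeholder):

```
# Watch control-plane machines and their phase while the test deletes them
oc get machines -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machine-role=master -o wide -w

# For a machine stuck in Deleting, the finalizers and conditions in the YAML usually explain why
oc get machine <stuck-machine-name> -n openshift-machine-api -o yaml
```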

Comment 2 Michael Gugino 2020-10-07 23:09:32 UTC
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/459/pull-ci-openshift-cluster-etcd-operator-master-e2e-disruptive/1311384501846282240

The test in question is deleting the master/control plane machines via the machine-api.  Accordingly, we drain those machines prior to removing them from the cloud.

It seems that the k8s API server needs to account for this (the AWS LB is slow to realize members have failed): https://bugzilla.redhat.com/show_bug.cgi?id=1869629#c5

We should continue accepting connections until X amount of time has passed to ensure the instance has been removed from the LB.

Here's where we're deleting the masters in the test in question: https://github.com/openshift/origin/blob/ec3b979387ddc58d3f78e7c92eb4cd85b70857b0/test/extended/dr/quorum_restore.go#L118

I don't see anywhere we're applying the skip-drain annotation.  The machine-controller logs were wiped when the master it was running on went down, but we emit events that demonstrate we did indeed drain each master (one entry): "message": "Node \"ip-10-0-178-1.us-west-2.compute.internal\" drained"
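To rule the annotation in or out, something like this should show whether any master machine carries it (the key is `machine.openshift.io/exclude-node-draining` to my recollection; treat that as an assumption):

```
# Print name + annotations for each master machine and look for exclude-node-draining
oc get machines -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machine-role=master \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}'
```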

Comment 3 Michael Gugino 2020-10-07 23:12:05 UTC
cc Clayton

Comment 4 Sam Batschelet 2020-10-08 20:46:52 UTC
After further review it seems that there is a disruption in service during scale-up of machines that we did not experience in 4.5. While this is not ideal, the problem is not blocking, as DR will complete as expected. Moving this to 4.7 to continue exploration.

```
  Oct  8 16:34:27.007: INFO: Create new masters
  Oct  8 16:34:27.007: INFO: Creating master sbatsche-rc0-2020-10-qg4sp-master-2
  Oct  8 16:34:31.248: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:34:36.360: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:34:41.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:34:46.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:34:51.351: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:34:56.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:01.361: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:06.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:11.349: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:16.569: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:21.351: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:26.354: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:31.352: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:36.368: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:41.353: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:46.369: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:51.352: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:35:56.357: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:36:01.351: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:36:06.361: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:36:11.352: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:36:16.349: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:36:21.369: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:37:00.509: INFO: error creating "sbatsche-rc0-2020-10-qg4sp-master-2": error "context deadline exceeded"
  Oct  8 16:37:11.839: INFO: error creating "sbatsche-rc0-2020-10-qg4sp-master-2": error "Internal error occurred: resource quota evaluation timed out"
  Oct  8 16:37:26.357: INFO: error creating "sbatsche-rc0-2020-10-qg4sp-master-2": error "Internal error occurred: resource quota evaluation timed out"
  Oct  8 16:37:31.369: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:37:36.355: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:37:41.359: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:37:46.357: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:37:51.375: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:37:56.362: INFO: Waiting for old machine object sbatsche-rc0-2020-10-qg4sp-master-2 to be deleted so we can create a new one
  Oct  8 16:38:01.369: INFO: Creating master sbatsche-rc0-2020-10-qg4sp-master-1
```

Comment 5 Venkata Siva Teja Areti 2020-10-14 21:05:58 UTC
My observations on this issue.

I am trying to test this manually, step by step. I brought up a cluster on GCP with the bootstrap node preserved so that it can be used as a bastion node.

Deleted two master machines using the following commands:

```
oc project openshift-machine-api
oc delete machine vareti-46-20201014-1-fbg92-master-1
oc delete machine vareti-46-20201014-1-fbg92-master-2
```

After some time the apiserver became unresponsive. I could not see the VM instances in the cloud console either.

SSH into the remaining master node and perform the backup:

```
sudo -i /bin/bash -cx 'rm -rf /home/core/backup; /usr/local/bin/cluster-backup.sh ~core/backup'
```
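(A quick sanity check that the backup actually landed; file names are timestamped, so this is only illustrative:)

```
# cluster-backup.sh should leave an etcd snapshot plus the static pod resources here,
# e.g. snapshot_<timestamp>.db and static_kuberesources_<timestamp>.tar.gz
sudo ls -lh /home/core/backup
```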
Now restore the control plane from the backup:

```
sudo -i /bin/bash -cx '/usr/local/bin/cluster-restore.sh /home/core/backup'
```

Now wait until the API server is back up again.
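A minimal way to poll for that from the bastion, assuming a working kubeconfig (purely illustrative):

```
# Retry until the API answers; keep each attempt short so the loop stays responsive
until oc --request-timeout=5s get nodes >/dev/null 2>&1; do
  echo "API not ready yet, retrying..."
  sleep 10
done
oc get nodes
```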

API server was up for a brief moment and then it became unavailable.

I logged into master-1 to see what happened and found that etcd was restarting continuously. It was not able to fetch member info from master-0. I see these two messages in the error logs:

```
2020-10-14 20:52:12.387924 W | etcdserver: could not get cluster response from https://10.0.0.5:2380: Get https://10.0.0.5:2380/members: EOF
2020-10-14 20:52:12.389328 C | etcdmain: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs
```

I logged into master-0 and checked the etcd container logs. I saw messages saying the connection was rejected:

```
2020-10-14 19:27:48.061264 I | embed: rejected connection from "10.0.0.7:55686" (error "tls: \"10.0.0.7\" does not match any of DNSNames [\"localhost\" \"10.0.0.3\"]", ServerName "", IPAddresses ["10.0.0.3"], DNSNames ["localhost" "10.0.0.3"])
```

From the bootstrap node, I could see that the registered etcd member IPs were different:

```
2020-10-14 20:57:51.925390 I | etcdserver/membership: added member 4bba455bc1fbc5ae [https://10.0.0.3:2380] to cluster 18c4c43dc346df9a
2020-10-14 20:57:51.929270 I | etcdserver/membership: added member a03f36bbf0fbae67 [https://10.0.0.4:2380] to cluster 18c4c43dc346df9a
2020-10-14 20:57:51.932889 I | etcdserver/membership: added member e33178e1f8b8d32a [https://10.0.0.5:2380] to cluster 18c4c43dc346df9a
```

So, it is possible that the change in cluster IPs is making the cluster unusable after DR.
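To double-check the SAN theory, the serving certificate the old peer presents can be inspected directly; the server sends its certificate before client auth completes, so plain openssl is enough (the IP is taken from the logs above and is only illustrative):

```
# Show the SANs on the certificate served on the etcd peer port of the surviving member
echo | openssl s_client -connect 10.0.0.3:2380 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
```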

Comment 6 Venkata Siva Teja Areti 2020-10-14 21:08:47 UTC
It is possible that something else is going on, and I will try to look into it more tomorrow.

Comment 7 Sam Batschelet 2020-10-23 12:24:23 UTC
I wanted to add a few things that might help you triage further. In 4.6, if we tolerate disruption failures, the test will pass [1].

> 2020-10-14 19:27:48.061264 I | embed: rejected connection from "10.0.0.7:55686" (error "tls: \"10.0.0.7\" does not match any of DNSNames [\"localhost\" \"10.0.0.3\"]", ServerName "", IPAddresses ["10.0.0.3"], DNSNames ["localhost" "10.0.0.3"])

While failing TLS auth is certainly a failure case, I believe it is not the root cause, for the same reason that the test will pass.

I believe the reason this works is that we key the etcd certificates on the node name [2]. Even if we created a machine with the same name, my understanding is that the nodeName will be different because the IP is different. So, for example, here is the list of nodes for the test.

```
  Oct 23 04:08:09.796: INFO: Pod name: ds1-dcrrx   Node Name: ip-10-0-244-135.us-west-1.compute.internal
  Oct 23 04:08:09.796: INFO: Pod name: ds1-hbx95   Node Name: ip-10-0-184-152.us-west-1.compute.internal
  Oct 23 04:08:09.796: INFO: Pod name: ds1-plckl   Node Name: ip-10-0-156-141.us-west-1.compute.internal
  Oct 23 04:08:09.796: INFO: Pod name: ds1-qbqkh   Node Name: ip-10-0-151-105.us-west-1.compute.internal
  Oct 23 04:08:09.796: INFO: Pod name: ds1-qg275   Node Name: ip-10-0-245-7.us-west-1.compute.internal
  Oct 23 04:08:09.796: INFO: Pod name: ds1-szx24   Node Name: ip-10-0-149-232.us-west-1.compute.internal
```
 
Here is the list of control-plane nodes post restore.
```
  Oct 23 04:29:09.216: INFO: New master: "ip-10-0-151-105.us-west-1.compute.internal" IPs: [{InternalIP 10.0.151.105} {Hostname ip-10-0-151-105.us-west-1.compute.internal} {InternalDNS ip-10-0-151-105.us-west-1.compute.internal}]
  Oct 23 04:29:09.216: INFO: New master: "ip-10-0-179-160.us-west-1.compute.internal" IPs: [{InternalIP 10.0.179.160} {Hostname ip-10-0-179-160.us-west-1.compute.internal} {InternalDNS ip-10-0-179-160.us-west-1.compute.internal}]
  Oct 23 04:29:09.216: INFO: New master: "ip-10-0-182-63.us-west-1.compute.internal" IPs: [{InternalIP 10.0.182.63} {Hostname ip-10-0-182-63.us-west-1.compute.internal} {InternalDNS ip-10-0-182-63.us-west-1.compute.internal}]
```

So the code would roll new certs, and when etcd scales to these new nodes I would not expect them to have TLS auth issues.


[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25624/pull-ci-openshift-origin-release-4.6-e2e-aws-disruptive/1319480380121681920

[2] https://github.com/openshift/cluster-etcd-operator/blob/2c4bd124cdb7954fed806d1afe668c0c52c8fac7/pkg/tlshelpers/tlshelpers.go#L33
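For reference, the per-node keying is visible in the secrets the operator manages; a listing like the one below (the secret name patterns are from memory, so treat them as an assumption) should show one peer/serving/metrics cert set per control-plane node, which is why freshly named nodes come up with fresh certs:

```
# One set of cert secrets per control-plane node name
oc get secrets -n openshift-etcd | grep -E 'etcd-(peer|serving|serving-metrics)-'
```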

Comment 8 Venkata Siva Teja Areti 2020-10-28 22:46:41 UTC
Finally I could reproduce this locally and made some progress. The log messages are making me suspect the load balancer.

In the DR test, there is a section that waits for the API server to be up after restoring the cluster. Surprisingly, the API server is responding to requests when it clearly is not ready.

I could tell from these messages

I1028 21:37:30.718241      19 healthz.go:258] etcd,shutdown check failed: readyz
[-]etcd failed: error getting data from etcd: context deadline exceeded
[-]shutdown failed: process is shutting down
I1028 21:37:30.769160      19 healthz.go:258] etcd,shutdown check failed: readyz
[-]etcd failed: error getting data from etcd: context deadline exceeded
[-]shutdown failed: process is shutting down
W1028 21:37:30.807299      19 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.5:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.5:2379: connect: connection timed out". Reconnecting...

I am also seeing these messages in the log

W1028 21:25:00.095100      19 patch_genericapiserver.go:123] Request to "/api/v1/namespaces/openshift-monitoring/endpoints" (source IP 10.0.32.3:53172, user agent "Prometheus/2.21.0") before server is ready, possibly a sign for a broken load balancer setup.
W1028 21:25:00.095912      19 patch_genericapiserver.go:123] Request to "/api/v1/namespaces/openshift-service-ca-operator/services" (source IP 10.0.32.3:53172, user agent "Prometheus/2.21.0") before server is ready, possibly a sign for a broken load balancer setup.
W1028 21:25:00.096053      19 patch_genericapiserver.go:123] Request to "/api/v1/namespaces/openshift-sdn/pods" (source IP 10.0.32.3:53172, user agent "Prometheus/2.21.0") before server is ready, possibly a sign for a broken load balancer setup.


The etcd check fails because the kube-apiserver arguments may still have stale etcd endpoints. kube-apiserver goes through another restart with only the remaining etcd endpoint the next time.
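Two quick checks that might confirm this, both illustrative; the location of the rendered etcd endpoints in the kube-apiserver `config` configmap is my assumption:

```
# Which readyz checks are currently failing
oc get --raw='/readyz?verbose'

# Which etcd endpoints the kube-apiserver was rendered with
oc get configmap config -n openshift-kube-apiserver -o jsonpath='{.data.config\.yaml}' \
  | grep -o 'https://[0-9.]*:2379' | sort -u
```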

Comment 9 Sam Batschelet 2020-11-14 17:10:48 UTC
This bug is actively being worked on.

Comment 10 Maru Newby 2020-11-16 17:37:48 UTC
Siva is actively working on this.

Comment 13 Maru Newby 2020-12-04 18:12:05 UTC
Still working on this.

Comment 14 Clayton Coleman 2020-12-08 20:39:30 UTC
In looking at this test, there is definitely a bug in MHC interacting with stopped machines on GCP (at least).

1. Set up a MHC with a 5m condition on ready (as per the test; see the sketched manifest below)
2. Stop the machine via SSH with a force poweroff
3. MHC detects down machine, marks machine deleted

Expected:

Deletion of the stopped machine completes in seconds (or at worst a minute), because no drain is necessary and the cloud knows that the machine is down.

Actual:

Machine is reported as Deleting for at least 15m before the machine object is removed.
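For completeness, here is roughly what the MHC from step 1 looks like; a sketch only, with a placeholder name and the standard master role label as the selector (the test's actual manifest may differ):

```
cat <<'EOF' | oc apply -f -
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-mhc-example   # placeholder name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: master
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 300s
  - type: Ready
    status: "False"
    timeout: 300s
EOF
```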

Comment 15 Clayton Coleman 2020-12-08 21:33:37 UTC
Filed https://bugzilla.redhat.com/show_bug.cgi?id=1905709 for the issue in comment 14

Comment 18 ge liu 2020-12-22 09:12:04 UTC
Verified with 4.7.0-0.nightly-2020-12-20-031835. I checked the e2e test but could not get more log messages, so I tried it manually step by step: deleted 2 master nodes and their machines, restored the cluster to a single-master cluster, then added 2 machines, and the cluster recovered with the 2 new machines.

# oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-133-229.us-east-2.compute.internal   Ready    worker   21h     v1.20.0+87544c5
ip-10-0-142-119.us-east-2.compute.internal   Ready    master   4m5s    v1.20.0+87544c5
ip-10-0-176-124.us-east-2.compute.internal   Ready    worker   21h     v1.20.0+87544c5
ip-10-0-183-19.us-east-2.compute.internal    Ready    master   21h     v1.20.0+87544c5
ip-10-0-197-107.us-east-2.compute.internal   Ready    master   3m40s   v1.20.0+87544c5
ip-10-0-198-212.us-east-2.compute.internal   Ready    worker   21h     v1.20.0+87544c5

# oc get pods
NAME                                                           READY   STATUS      RESTARTS   AGE
etcd-ip-10-0-142-119.us-east-2.compute.internal                3/3     Running     0          6m4s
etcd-ip-10-0-183-19.us-east-2.compute.internal                 3/3     Running     0          5m13s
etcd-ip-10-0-197-107.us-east-2.compute.internal                3/3     Running     0          6m33s
etcd-quorum-guard-5fb4bcb4dc-dsnvs                             1/1     Running     0          22m
etcd-quorum-guard-5fb4bcb4dc-n7dwh                             1/1     Running     0          21h
etcd-quorum-guard-5fb4bcb4dc-vqhpb                             1/1     Running     0          22m
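(For reference, one way to do the "add 2 machines" step is to clone a surviving master Machine definition; a rough sketch with placeholder names:)

```
# Export a surviving master Machine, strip instance-specific fields, rename it, and create it
oc get machine <surviving-master> -n openshift-machine-api -o yaml > new-master.yaml
# Edit new-master.yaml: change metadata.name and remove status, spec.providerID,
# and metadata.uid/resourceVersion/creationTimestamp, then:
oc create -f new-master.yaml
```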

Comment 19 Maru Newby 2021-01-18 15:51:35 UTC
Back to POST, more fixes required.

Comment 23 Maru Newby 2021-02-12 19:14:09 UTC
The disruptive job is on track to be fixed in master in the immediate future (https://github.com/openshift/origin/pull/25774 and https://github.com/openshift/cluster-etcd-operator/pull/540). Once the disruptive job is passing consistently against master, the etcd and origin changes required to get it there should probably be backported all the way back (4.5?) to guard against regression.

Comment 24 Maru Newby 2021-02-15 15:50:33 UTC
Still in process.

Comment 28 Maru Newby 2021-05-24 15:36:53 UTC
This bz is in-process until https://github.com/openshift/origin/pull/25774 merges, and there is no requirement to tie completion to a specific release.

Comment 29 Maru Newby 2021-06-02 16:15:35 UTC
https://github.com/openshift/origin/pull/25774 has merged, which fixes the quorum restore test to follow the documented restore procedure.

https://bugzilla.redhat.com/show_bug.cgi?id=1956609 tracks resolution of the CMA issue that prevents the test from passing.