
Bug 1850687

Summary:

[DR] etcd fails to come back after restore from automated-cluster-backups

Product:

OpenShift Container Platform

Reporter:

Neelesh Agrawal <nagrawal>

Component:

Node

Assignee:

Seth Jennings <sjenning>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Sunil Choudhary <schoudha>

Severity:

high

Docs Contact:

Priority:

high

Version:

4.5

CC:

aos-bugs, geliu, jokerman, nagrawal, rphillips, sbatsche, schoudha, skolicha, sttts

Target Milestone:

---

Flags:

zyu: needinfo-

Target Release:

4.5.z

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

1848939

Environment:

Last Closed:

2020-08-10 20:46:24 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1848939

Bug Blocks:

Attachments:

Description	Flags
kubelet log for ip-10-0-156-170.us-east-2.compute.internal	none

Comment 19 Suresh Kolichala 2020-06-26 16:35:13 UTC

While the kubelet fixes seem to improve the success rate of the restoration process, we found that it is not 100% reliable. Of the 6 tests I ran yesterday, only 3 passed. A 4th test recovered on its own after about 26 minutes, and the other two failed to recover.

Therefore, we are going to recommend restarting the kubelet service on all masters, after restoring the etcd database on one of the masters, as a workaround for this problem in the 4.5 release.

I will be working with Andrea to get the documentation modified to include the additional steps. I will also work with the QE team to get this process tested next week.
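
The workaround in this comment (restart the kubelet on every master once the etcd database has been restored on one of them) could be scripted roughly as follows. This is a minimal dry-run sketch, not the documented procedure: the host names, the `core` SSH user, and the helper name `kubelet_restart_commands` are assumptions for illustration. It only prints the commands rather than running them.

```python
import shlex


def kubelet_restart_commands(masters, user="core"):
    """Build one ssh command per master that restarts the kubelet service.

    `user` defaults to "core", which is an assumption about the node image.
    """
    remote = "sudo systemctl restart kubelet.service"
    return [["ssh", f"{user}@{host}", remote] for host in masters]


if __name__ == "__main__":
    # Hypothetical host names; substitute the real master hostnames.
    masters = ["master-0.example.com", "master-1.example.com", "master-2.example.com"]
    for cmd in kubelet_restart_commands(masters):
        # Dry run: print each command instead of executing it.
        print(shlex.join(cmd))
```

Printing first makes it easy to review the exact commands before running them by hand or passing each list to `subprocess.run`.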

Comment 23 Ted Yu 2020-06-26 22:30:43 UTC

Created attachment 1698970
kubelet log for ip-10-0-156-170.us-east-2.compute.internal

Comment 30 Ted Yu 2020-06-29 15:01:34 UTC

On the first node, where the etcd pod deletion was successful, the etcd and apiserver pods re-appear after the old database is restored.
However, the kubelet on that node completed the pod deletion prior to the restore and no longer has any knowledge of the etcd pod.

One solution is to reconcile the pod manager on the node with the current API server so that the phantom etcd pod is removed.

I am going over this fix and evaluating how best to present it (the current formulation depends on #89155).
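
To illustrate the kind of reconciliation described in this comment, here is a minimal sketch that compares the pods reported by the API server with the pods the node's pod manager knows about. It is not the kubelet's actual code or the proposed fix: the `Pod` type, the `reconcile` function, and the sample pod names and UIDs are all hypothetical. It only shows the set comparison that would expose a pod present on one side and missing on the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    uid: str
    name: str


def reconcile(api_pods, local_pods):
    """Return (pods only the API server knows, pods only the node knows)."""
    api_uids = {p.uid for p in api_pods}
    local_uids = {p.uid for p in local_pods}
    unknown_to_kubelet = [p for p in api_pods if p.uid not in local_uids]
    unknown_to_api = [p for p in local_pods if p.uid not in api_uids]
    return unknown_to_kubelet, unknown_to_api


if __name__ == "__main__":
    # Hypothetical state after a restore: the etcd pod is back in the API
    # server, but the kubelet already finished deleting it before the restore.
    api = [Pod("uid-etcd", "etcd-master-0"), Pod("uid-api", "kube-apiserver-master-0")]
    local = [Pod("uid-api", "kube-apiserver-master-0")]
    phantom, stale = reconcile(api, local)
    print("known to API server only:", [p.name for p in phantom])
    print("known to node only:", [p.name for p in stale])
```

What to do with each mismatched pod (remove it, or re-admit it on the node) is the policy question the comment leaves open; the sketch only identifies the mismatch.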

Comment 35 Seth Jennings 2020-08-10 16:05:51 UTC

Ryan is on leave

Comment 36 Seth Jennings 2020-08-10 20:46:24 UTC

This was cloned too early from bz1848939, before any cause had been determined.  Will reclone if the issue still exists and a fix is found.  If this is still an issue, please continue in bz1848939; do not reopen this one.