
Bug 1850687

Summary: [DR]etcd fail to back after restore from automated-cluster-backups
Product: OpenShift Container Platform
Reporter: Neelesh Agrawal <nagrawal>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Sunil Choudhary <schoudha>
Severity: high
Priority: high
Version: 4.5
CC: aos-bugs, geliu, jokerman, nagrawal, rphillips, sbatsche, schoudha, skolicha, sttts
Target Milestone: ---
Flags: zyu: needinfo-
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Clone Of: 1848939
Last Closed: 2020-08-10 20:46:24 UTC
Bug Depends On: 1848939
Attachments:
Description: kubelet log for ip-10-0-156-170.us-east-2.compute.internal
Flags: none

Comment 19 Suresh Kolichala 2020-06-26 16:35:13 UTC
While the kubelet fixes seem to improve the success rate of the restoration process, we found that it is not 100% reliable. Out of the 6 tests I ran yesterday, only 3 passed. A 4th test recovered on its own after about 26 minutes, and the other two failed to recover.

Therefore, as a workaround for the 4.5 release, we are going to recommend restarting the kubelet service on all masters after restoring the etcd database on one of the masters.
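
A minimal sketch of what the kubelet restart step could look like when scripted. It only builds and prints the SSH commands (a dry run) instead of executing them. The hostnames are hypothetical placeholders, and the `core` SSH user is an assumption about how the masters are reached; neither comes from this bug.

```python
import shlex

# Hypothetical master hostnames; replace with the cluster's own.
MASTERS = [
    "master-0.example.com",
    "master-1.example.com",
    "master-2.example.com",
]


def kubelet_restart_commands(hosts, user="core"):
    """Return one ssh argv list per host that restarts the kubelet service.

    The SSH user is an assumption; adjust it to match the environment.
    """
    remote = "sudo systemctl restart kubelet.service"
    return [["ssh", f"{user}@{host}", remote] for host in hosts]


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    for argv in kubelet_restart_commands(MASTERS):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before running anything against the masters.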

I will be working with Andrea to get the documentation modified to include the additional steps. I will also work with the QE team to get this process tested next week.

Comment 23 Ted Yu 2020-06-26 22:30:43 UTC
Created attachment 1698970
kubelet log for ip-10-0-156-170.us-east-2.compute.internal

Comment 30 Ted Yu 2020-06-29 15:01:34 UTC
On the first node where the etcd pod deletion was successful, the etcd and apiserver pods re-appear after the old database is restored.
However, the kubelet on that node completed the pod deletion prior to the restore and no longer has any knowledge of the etcd pod.

One solution is to reconcile the pod manager on the node with the current API server so that the phantom etcd pod is removed.

I am going over this fix and evaluating how best to present it (the current formulation depends on #89155).
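
As an illustration only, the reconciliation idea above can be modelled as a comparison of two views of the same node: the pods the API server reports and the pods the kubelet's pod manager tracks. This is a toy sketch, not kubelet code; the function name and the pod UIDs are hypothetical.

```python
def diff_pod_views(api_pods, kubelet_pods):
    """Compare pod UIDs known to the API server with those the kubelet tracks.

    Returns a tuple (only_in_api, only_in_kubelet) of sorted lists.
    """
    api, local = set(api_pods), set(kubelet_pods)
    return sorted(api - local), sorted(local - api)


if __name__ == "__main__":
    # Hypothetical state after a restore: the API server lists an etcd pod
    # that the kubelet already deleted and no longer tracks.
    api_view = {"etcd-uid-old", "apiserver-uid"}
    kubelet_view = {"apiserver-uid"}

    only_in_api, only_in_kubelet = diff_pod_views(api_view, kubelet_view)
    print("known to API server only:", only_in_api)
    print("known to kubelet only:", only_in_kubelet)
```

Pods that appear in only one view are the candidates a reconciliation pass would need to resolve.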

Comment 35 Seth Jennings 2020-08-10 16:05:51 UTC
Ryan is on leave

Comment 36 Seth Jennings 2020-08-10 20:46:24 UTC
This was cloned too early from bz1848939, before any cause had been determined. Will reclone if the issue still exists and a fix is found. If this is still an issue, continue in bz1848939; please do not reopen this one.