Bug 1771897 - [DR] Node status is not stable (switch between Ready/NotReady) after restoring to previous cluster state
Summary: [DR] Node status is not stable (switch between Ready/NotReady) after restoring t...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-13 08:27 UTC by ge liu
Modified: 2020-05-13 21:52 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-13 21:52:45 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHBA-2020:0581
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2020-05-13 21:52:46 UTC

Comment 3 ge liu 2019-11-21 10:38:27 UTC
Hi Suresh, Alay,

I tried it today with the etcd encryption setting enabled and couldn't recreate it. I will try it again tomorrow, because the original bug was hit without etcd encryption enabled. Thanks.

Comment 11 Sam Batschelet 2019-11-29 16:49:00 UTC
> 1. According to Comment 8, Alay copied snapshot.db from one master node to the other two master nodes; is that an official requirement? I usually execute the etcd backup on each master node, and there is a snapshot.db in each assets dir, so I assumed each master node would use the snapshot.db in its own assets dir.

etcdctl snapshot save takes a point-in-time copy of the etcd state machine. The request is served by the leader; if you run the command against an endpoint that is not the leader, it is forwarded to the leader. Since you are restoring to a single point in time, all members should use the same data file.

We document this here[1] in 1.a. If you find this is not clear, we can make it more explicit.

[1]: https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
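
For illustration, a rough sketch of that workflow in shell form; the endpoint, certificate paths, backup directory, and host names below are assumptions, not the product's documented backup script:

--------------------------------------------------
# Take one point-in-time snapshot; the request is served by the leader
# regardless of which endpoint receives it.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.crt \
  --cert=/etc/ssl/etcd/peer.crt \
  --key=/etc/ssl/etcd/peer.key \
  snapshot save /home/core/assets/backup/snapshot.db

# Copy the same snapshot.db to the other masters so every member
# restores from the identical data file.
for host in master-1 master-2; do
  scp /home/core/assets/backup/snapshot.db core@"$host":/home/core/assets/backup/
done
--------------------------------------------------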

>2. Regarding the error I met:
>--------------------------------------------------
>Stopping all containers..
>FATA[0000] Stopping the container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456" failed: rpc error: code = Unknown desc = failed to stop container 02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456: failed to stop container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456": failed to find process: <nil>
>--------------------------------------------------
>
>What causes it? Does it affect the restore process? Is it possible to avoid it in code? This error does not appear every time; it is intermittent. Today I re-ran this test, did not trigger the error, and the cluster was also working well after the restore. So I think it is a potential risk to the restore process.

When the stop_all_containers function fires, we first create an array with all container ids and then loop through them to stop each one. But because the list is fluid, in some circumstances the list can contain an id for a container that is no longer available and already dead.
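
To make that concrete, here is a minimal sketch of the pattern being described (an illustrative reconstruction, not the actual machine-config-operator script), using crictl and treating a failed stop as non-fatal:

--------------------------------------------------
# Snapshot the list of container ids up front; the list can go stale
# because containers may exit on their own while we iterate.
containers=( $(crictl ps -q) )

for id in "${containers[@]}"; do
  # By the time we reach an id, its container may already be gone,
  # so tolerate a failed stop instead of aborting the restore.
  crictl stop "$id" || echo "container $id already stopped or gone, skipping"
done
--------------------------------------------------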

>I got the message below on 2 of 3 master nodes today, but the restore passed on all 3 masters on the first try:

>Waiting for all containers to stop... (1/60)
>All containers are stopped.
>Backing up etcd data-dir..

This output is expected; which part of it are you concerned with?

[1]:https://github.com/openshift/machine-config-operator/commit/35db287af27bc64b1abd56d704e27ae73339d830
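
For context on the "Waiting for all containers to stop... (1/60)" lines, they correspond to a bounded polling loop. A minimal sketch of that idea follows; the 60-attempt limit comes from the quoted output, while the sleep interval and the etcd data-dir path are assumptions, not the exact script:

--------------------------------------------------
# Poll until no containers are reported running, giving up after 60 attempts.
for attempt in $(seq 1 60); do
  if [ -z "$(crictl ps -q)" ]; then
    echo "All containers are stopped."
    break
  fi
  echo "Waiting for all containers to stop... ($attempt/60)"
  sleep 10
done

# Only after everything is down is the old etcd data-dir moved aside
# (illustrative path).
echo "Backing up etcd data-dir.."
mv /var/lib/etcd /var/lib/etcd-backup
--------------------------------------------------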

Comment 13 ge liu 2019-12-03 09:50:04 UTC
(In reply to Sam Batschelet from comment #11)
> > 1. According to Comment 8, Alay copied snapshot.db from one master node to the other two master nodes; is that an official requirement? I usually execute the etcd backup on each master node, and there is a snapshot.db in each assets dir, so I assumed each master node would use the snapshot.db in its own assets dir.
> 
> etcdctl snapshot save takes a point-in-time copy of the etcd state machine.
> The request is served by the leader; if you run the command against an
> endpoint that is not the leader, it is forwarded to the leader. Since you
> are restoring to a single point in time, all members should use the same
> data file.
> 
> We document this here[1] in 1.a. If you find this is not clear, we can make
> it more explicit.
>
> [1]: https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

It's OK for me, thanks for the kind explanation.

> >2. Regarding the error I met:
> >--------------------------------------------------
> >Stopping all containers..
> >FATA[0000] Stopping the container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456" failed: rpc error: code = Unknown desc = failed to stop container 02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456: failed to stop container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456": failed to find process: <nil>
> >--------------------------------------------------
> >
> >What causes it? Does it affect the restore process? Is it possible to avoid it in code? This error does not appear every time; it is intermittent. Today I re-ran this test, did not trigger the error, and the cluster was also working well after the restore. So I think it is a potential risk to the restore process.
> 
> When the stop_all_containers function fires, we first create an array with
> all container ids and then loop through them to stop each one. But because
> the list is fluid, in some circumstances the list can contain an id for a
> container that is no longer available and already dead.

==> OK, that sounds good.
> >I got the message below on 2 of 3 master nodes today, but the restore passed on all 3 masters on the first try:
> 
> >Waiting for all containers to stop... (1/60)
> >All containers are stopped.
> >Backing up etcd data-dir..
> 
> This output is expected; which part of it are you concerned with?

No, I have no concern about it; I was just comparing it to the earlier comments about where stop_all_containers appears.

> [1]: https://github.com/openshift/machine-config-operator/commit/35db287af27bc64b1abd56d704e27ae73339d830

Comment 16 errata-xmlrpc 2020-05-13 21:52:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

