Bug 1771897 - [DR] Node status is not stable (switch between Ready/NotReady) after restoring to previous cluster state
Summary: [DR] Node status is not stable (switch between Ready/NotReady) after restoring t...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-13 08:27 UTC by ge liu
Modified: 2020-05-13 21:52 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-13 21:52:45 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Product Errata RHBA-2020:0581
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2020-05-13 21:52:46 UTC

Comment 3 ge liu 2019-11-21 10:38:27 UTC
Hi Suresh, Alay,

I tried it today with the etcd encryption setting enabled and couldn't recreate it. I will try it again tomorrow, because the original bug was hit without etcd encryption enabled. Thanks.

Comment 11 Sam Batschelet 2019-11-29 16:49:00 UTC
> 1. According to Comment 8, Alay copied snapshot.db from one master node to the other two master nodes; is that an official requirement? I usually execute the etcd backup on each master node, and there is a snapshot.db in each assets dir, so I assumed each master node would use the snapshot.db in its own assets dir.

etcdctl snapshot save takes a point-in-time copy of the etcd state machine. The request is served by the leader; if you run the command against an endpoint that is not the leader, it is forwarded to the leader. Since you are restoring to a single point in time, all members should use the same data file.

We document this here[1] in 1.a. If you find this is not clear, we can make it more explicit.

[1]: https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
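
For illustration, a rough sketch of that workflow in shell form; the endpoint, certificate paths, backup directory, and host names below are assumptions, not the product's documented backup script:

--------------------------------------------------
# Take one point-in-time snapshot; the request is served by the leader
# regardless of which endpoint receives it.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.crt \
  --cert=/etc/ssl/etcd/peer.crt \
  --key=/etc/ssl/etcd/peer.key \
  snapshot save /home/core/assets/backup/snapshot.db

# Copy the same snapshot.db to the other masters so every member
# restores from the identical data file.
for host in master-1 master-2; do
  scp /home/core/assets/backup/snapshot.db core@"$host":/home/core/assets/backup/
done
--------------------------------------------------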

>2. Regarding the error I met:
>--------------------------------------------------
>Stopping all containers..
>FATA[0000] Stopping the container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456" failed: rpc error: code = Unknown desc = failed to stop container 02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456: failed to stop container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456": failed to find process: <nil>
>--------------------------------------------------
>
>What causes it? Does it affect the restore process? Is it possible to avoid it in code? This error does not appear every time; it is intermittent. Today I re-ran this test, did not trigger the error, and the cluster was also working well after the restore. So I think it is a potential risk to the restore process.

When the stop_all_containers function fires, we first create an array with all container ids and then loop through them to stop each one. But because the list is fluid, in some circumstances the list can contain an id for a container that is no longer available and already dead.
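
To make that concrete, here is a minimal sketch of the pattern being described (an illustrative reconstruction, not the actual machine-config-operator script), using crictl and treating a failed stop as non-fatal:

--------------------------------------------------
# Snapshot the list of container ids up front; the list can go stale
# because containers may exit on their own while we iterate.
containers=( $(crictl ps -q) )

for id in "${containers[@]}"; do
  # By the time we reach an id, its container may already be gone,
  # so tolerate a failed stop instead of aborting the restore.
  crictl stop "$id" || echo "container $id already stopped or gone, skipping"
done
--------------------------------------------------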

>I got the message below on 2 of 3 master nodes today, but the restore passed on all 3 masters on the first try:

>Waiting for all containers to stop... (1/60)
>All containers are stopped.
>Backing up etcd data-dir..

This output is expected; which part of it are you concerned with?

[1]:https://github.com/openshift/machine-config-operator/commit/35db287af27bc64b1abd56d704e27ae73339d830
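
For context on the "Waiting for all containers to stop... (1/60)" lines, they correspond to a bounded polling loop. A minimal sketch of that idea follows; the 60-attempt limit comes from the quoted output, while the sleep interval and the etcd data-dir path are assumptions, not the exact script:

--------------------------------------------------
# Poll until no containers are reported running, giving up after 60 attempts.
for attempt in $(seq 1 60); do
  if [ -z "$(crictl ps -q)" ]; then
    echo "All containers are stopped."
    break
  fi
  echo "Waiting for all containers to stop... ($attempt/60)"
  sleep 10
done

# Only after everything is down is the old etcd data-dir moved aside
# (illustrative path).
echo "Backing up etcd data-dir.."
mv /var/lib/etcd /var/lib/etcd-backup
--------------------------------------------------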

Comment 13 ge liu 2019-12-03 09:50:04 UTC
(In reply to Sam Batschelet from comment #11)
> > 1. According to Comment 8, Alay copied snapshot.db from one master node to the other two master nodes; is that an official requirement? I usually execute the etcd backup on each master node, and there is a snapshot.db in each assets dir, so I assumed each master node would use the snapshot.db in its own assets dir.
> 
> etcdctl snapshot save takes a point-in-time copy of the etcd state machine.
> The request is served by the leader; if you run the command against an
> endpoint that is not the leader, it is forwarded to the leader. Since you
> are restoring to a single point in time, all members should use the same
> data file.
> 
> We document this here[1] in 1.a. If you find this is not clear, we can make
> it more explicit.
>
> [1]: https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

It's OK for me, thanks for the kind explanation.

> >2. Regarding the error I met:
> >--------------------------------------------------
> >Stopping all containers..
> >FATA[0000] Stopping the container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456" failed: rpc error: code = Unknown desc = failed to stop container 02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456: failed to stop container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456": failed to find process: <nil>
> >--------------------------------------------------
> >
> >What causes it? Does it affect the restore process? Is it possible to avoid it in code? This error does not appear every time; it is intermittent. Today I re-ran this test, did not trigger the error, and the cluster was also working well after the restore. So I think it is a potential risk to the restore process.
> 
> When the stop_all_containers function fires, we first create an array with
> all container ids and then loop through them to stop each one. But because
> the list is fluid, in some circumstances the list can contain an id for a
> container that is no longer available and already dead.

==> OK, that sounds good.
> >I got the message below on 2 of 3 master nodes today, but the restore passed on all 3 masters on the first try:
> 
> >Waiting for all containers to stop... (1/60)
> >All containers are stopped.
> >Backing up etcd data-dir..
> 
> This output is expected; which part of it are you concerned with?

No, I have no concern about it; I was just comparing it to the earlier comments about where stop_all_containers appears.

> [1]: https://github.com/openshift/machine-config-operator/commit/35db287af27bc64b1abd56d704e27ae73339d830

Comment 16 errata-xmlrpc 2020-05-13 21:52:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

