Hi Suresh, Alay, I tried it today with the etcd encryption setting enabled and could not reproduce it. I will try again tomorrow, because the original bug did not have etcd encryption enabled. Thanks.
> 1. According to Comment 8, Alay copied snapshot.db from one master node to the other two master nodes. Is that an official requirement? I usually run the etcd backup on each master node, and each one has a snapshot.db in its assets dir, so I assumed each master node would use the snapshot.db from its own assets dir.

etcdctl snapshot save takes a copy of the state machine at that point in time. The command is served by the leader; if you run it against an endpoint that is not the leader, it is forwarded to the leader. Since you are restoring to a single point in time, all members should use the same data file.

We document this here[1] in 1.a. If you find this is not clear we can make it more explicit.

[1]: https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

> 2. Regarding the error I hit:
> --------------------------------------------------
> Stopping all containers..
> FATA[0000] Stopping the container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456" failed: rpc error: code = Unknown desc = failed to stop container 02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456: failed to stop container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456": failed to find process: <nil>
> --------------------------------------------------
>
> What causes it, and does it affect the restore process? Is it possible to avoid it in code? The error does not appear every time; it is a random issue. Today I re-ran this test, did not trigger it, and the cluster was still working well after the restore, so I think it is a potential risk to the restore process.

When the stop_all_containers function fires, we first create an array with all container IDs and then loop through them to stop each one. But because that list is fluid, in some circumstances it can contain an ID for a container that is no longer available and already dead.

> I got the message below on 2/3 master nodes today, but the restore passed on all 3 masters on the first try:
> Waiting for all containers to stop... (1/60)
> All containers are stopped.
> Backing up etcd data-dir..

This output is expected; what part of it are you concerned with?

[1]: https://github.com/openshift/machine-config-operator/commit/35db287af27bc64b1abd56d704e27ae73339d830
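To illustrate the first answer above (one snapshot distributed to all members), here is a minimal sketch assuming the usual etcdctl/scp tooling; the hostnames and paths below are examples, not the exact commands from the documented procedure:

--------------------------------------------------
# Take one snapshot; it is served by the current leader no matter which
# endpoint etcdctl is pointed at.
sudo -E etcdctl snapshot save /home/core/assets/backup/snapshot.db

# Copy that same snapshot.db to the other masters so all three members
# restore from the identical point in time (hostnames are illustrative).
scp /home/core/assets/backup/snapshot.db core@master-1:/home/core/assets/backup/
scp /home/core/assets/backup/snapshot.db core@master-2:/home/core/assets/backup/
--------------------------------------------------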
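And a rough sketch of the stop_all_containers pattern described above (not the exact code in the linked commit), showing how an ID that has already gone away could be treated as non-fatal:

--------------------------------------------------
# Snapshot the running container IDs first, then stop each one.
ids=$(crictl ps -q)
for id in ${ids}; do
    # A container may exit on its own between the listing and the stop;
    # crictl stop then fails with "failed to find process", which could be
    # tolerated instead of aborting the restore.
    crictl stop "${id}" || echo "container ${id} already gone, skipping"
done
--------------------------------------------------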
(In reply to Sam Batschelet from comment #11)

> > 1. According to Comment 8, Alay copied snapshot.db from one master node to the other two master nodes. Is that an official requirement? I usually run the etcd backup on each master node, and each one has a snapshot.db in its assets dir, so I assumed each master node would use the snapshot.db from its own assets dir.
>
> etcdctl snapshot save takes a copy of the state machine at that point in time. The command is served by the leader; if you run it against an endpoint that is not the leader, it is forwarded to the leader. Since you are restoring to a single point in time, all members should use the same data file.
>
> We document this here[1] in 1.a. If you find this is not clear we can make it more explicit.
>
> [1]: https://docs.openshift.com/container-platform/4.2/backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

That is fine with me, thanks for the kind explanation.

> > 2. Regarding the error I hit:
> > --------------------------------------------------
> > Stopping all containers..
> > FATA[0000] Stopping the container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456" failed: rpc error: code = Unknown desc = failed to stop container 02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456: failed to stop container "02bc94052d6a6edd3c0fda1328e7139076249afaa5d6936463b71690c46a5456": failed to find process: <nil>
> > --------------------------------------------------
> >
> > What causes it, and does it affect the restore process? Is it possible to avoid it in code? The error does not appear every time; it is a random issue. Today I re-ran this test, did not trigger it, and the cluster was still working well after the restore, so I think it is a potential risk to the restore process.
>
> When the stop_all_containers function fires, we first create an array with all container IDs and then loop through them to stop each one. But because that list is fluid, in some circumstances it can contain an ID for a container that is no longer available and already dead.

OK, that sounds good.

> > I got the message below on 2/3 master nodes today, but the restore passed on all 3 masters on the first try:
> > Waiting for all containers to stop... (1/60)
> > All containers are stopped.
> > Backing up etcd data-dir..
>
> This output is expected; what part of it are you concerned with?

No, I have no concern about it; I was just comparing it with the earlier comments about when stop_all_containers appears.

> [1]: https://github.com/openshift/machine-config-operator/commit/35db287af27bc64b1abd56d704e27ae73339d830
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581