Description of problem:
Restoring a DB backup that was taken while HA VMs were running can cause split brain and VM corruption.
Backup time:
HA VM vm1 is running on host1.
After some time:
vm1 migrates or restarts on host2.
Restore time:
The restored engine DB says that vm1 is running on host1 while it is actually running on host2.
If the engine first finds that vm1 is up on host2, it simply updates its record; but if it first finds that vm1 is not on host1, it will try to restart it, causing a split brain.
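The order dependence described above can be illustrated with a small simulation. This is only a sketch of the reasoning, not engine code: `process_host_reports`, its arguments and the action strings are all hypothetical names made up for the example.

```python
# Hypothetical model of the race; none of these names come from the engine.
def process_host_reports(db_run_on, reports, vm="vm1"):
    """Process per-host VM lists in the given order and return the actions
    a naive engine would take for one HA VM."""
    actions = []
    run_on = db_run_on
    for host, running_vms in reports:
        if vm in running_vms:
            # The VM was found on this host: trust the host, fix the record.
            if run_on != host:
                actions.append(f"update record: {vm} runs on {host}")
                run_on = host
        elif run_on == host:
            # The DB says the VM should be here but it is not: an HA VM is
            # assumed to have died and gets restarted somewhere.
            actions.append(f"restart {vm}")
            run_on = None
    return actions


# Stale DB: vm1 is recorded on host1 but is actually running on host2.
host1 = ("host1", set())
host2 = ("host2", {"vm1"})

# host2 reports first: the record is simply updated.
print(process_host_reports("host1", [host2, host1]))
# host1 reports first: vm1 is restarted although it is still up on host2.
print(process_host_reports("host1", [host1, host2]))
```

In the second ordering the restart is issued before host2 has been seen, which is the split-brain window.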
Version-Release number of selected component (if applicable):
How reproducible:
Not systematic; it depends on the vds update order.
Steps to Reproduce:
1. create an HA vm and run it on host1
2. take a backup of the engine
3. migrate the HA vm to host2
4. restore the backup and bring the engine up
Actual results:
If the engine does not find the HA VM on host1, where the DB says it should be, it may try to restart it, causing a split brain.
Expected results:
All the HA VMs are filtered at restore time and set as down with exit_reason=Normal and run_on_vds = null. This prevents the engine from automatically restarting them: if they are already up, the engine will simply update their records, while if they are down it will be up to the user to restart them on the recovered engine.
Additional info:
For clean migrations we already recommend that the user set everything to maintenance mode; the issue potentially affects disaster recovery flows that rely on a backup taken on a live system.
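The restore-time filtering proposed above could look roughly like the following. This is a minimal sketch over plain dictionaries, not the actual engine-backup implementation: `exit_reason` and `run_on_vds` are the fields named in this report, while `neutralize_ha_vms`, `is_ha`, `status` and the row layout are assumptions made for illustration.

```python
def neutralize_ha_vms(vm_rows):
    """Return copies of the rows in which every HA VM is marked as cleanly
    shut down, so that a restored engine has no reason to restart it."""
    cleaned = []
    for row in vm_rows:
        row = dict(row)  # do not modify the caller's data
        if row["is_ha"]:  # hypothetical flag marking a highly available VM
            row["status"] = "Down"          # hypothetical status field
            row["exit_reason"] = "Normal"   # a normal exit is not auto-restarted
            row["run_on_vds"] = None        # forget the stale host assignment
        cleaned.append(row)
    return cleaned


rows = [
    {"name": "vm1", "is_ha": True, "status": "Up",
     "exit_reason": None, "run_on_vds": "host1"},
    {"name": "vm2", "is_ha": False, "status": "Up",
     "exit_reason": None, "run_on_vds": "host1"},
]
for row in neutralize_ha_vms(rows):
    print(row)
```

Non-HA VMs are left untouched in this sketch, since for them a stale record does not trigger an automatic restart.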
(In reply to Simone Tiraboschi from comment #0)
> Description of problem:
> Restoring a DB with running HA VMs could cause split brains and VM
> corruptions.
> Backup time:
> HA VM vm1 is running on host 1
> After some time:
> vm1 migrates or restarts on host 2
> Restore time:
> the restored engine DB says that vm1 is running on host1 while it's running
> on host2.
> If the engine first finds that vm1 is up on host2, it simply updates its
> record; but if it first finds that vm1 is not on host1, it will try to
> restart it, causing a split brain.
Roy - is this indeed possible? What would the engine do in such case?
Only Arik can answer this I think. I only know about the host unresponsive case where we wait for fencing. But I am not sure how fast we are when the engine restarts.
If this is true we might have an issue during plain engine restart too if the user does something manually (VM on host A, engine crashes, user migrates it using cockpit or something, engine is restarted..).
(In reply to Martin Sivák from comment #2)
This bug was opened after Simone and I talked about the described scenario (restoring a snapshot of the DB that is not necessarily up-to-date) and I was afraid that it could lead to the described split-brain. I think it is possible.
And you're right, we currently assume that VMs are not moved by manual/unmanaged user operations. It is not that critical for regular VMs, since at worst it would cause incorrect audit logs, but it could be critical for highly available VMs.
OK. So? Do we still want to patch engine-backup somehow? Or, assuming that starting/migrating VMs externally from the engine is likely/wanted/planned/already possible, do we want to fix the engine to not allow this to happen? Not sure how, though: I guess that if we have many hosts, polling each one until we know for sure that none of them runs some VM can take too much time for an HA VM. Perhaps we can do something by checking the storage (sanlock or something), since we can normally expect to have far fewer storage domains than hosts. Not sure who to ask... Arik?
Another idea: Use the flag introduced in bug 1403903 and make the engine more careful when starting HA VMs if this flag was set.
And another one: Allow the user to say somewhere (in engine-backup or elsewhere): "I now restored the engine with engine-backup, and all hosts are also dead/rebooted/whatever. Please start all HA VMs for me ASAP".
Because obviously, engine-backup restore can be used in two completely different scenarios:
1. Only engine is bad/corrupted/problematic/etc and I want to restore it from a backup I took 10 minutes ago prior to doing some test
2. Everything is dead and I am starting from scratch (or restoring on a test env on a separate network and a COW clone of the storage, etc.).
A VM lease should prevent exactly that, so HA VMs should have one by default if they don't already.
Also, VM monitoring should be more aware of HA VMs when we know that not all hosts in the cluster are up yet. Waiting for them should be enough to make sure all the VMs are reported and we know the status of the cluster, instead of acting on a partial cluster state.
I would make the engine backup tool 'smart'. The engine is the component that copes with stale data: the runtime monitoring is the actual truth and the DB is just a point in time after that. In case we restore, we should make sure the monitoring is again ahead of the DB.
typo - I *wouldn't* make engine backup tool smart
(In reply to Roy Golan from comment #5)
> VM lease should prevent that exactly so ha vms should have that by default
> if they don't already.
Yes, it will be possible to leverage VM leases for that once every highly available VM has a lease.
> Also, the vms monitoring should be more aware of ha vms, in case we know all
> host in cluster are not up yet - that should be enough to make sure we
> report all the vms and we know the status of the cluster, instead of acting
> on a partial cluster state.
Note that it may be that not all the hosts are available - so the logic should probably not be "until all hosts are up" but "until all hosts were polled/monitored".
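The "until all hosts were polled/monitored" condition can be sketched as a simple gate. This is only an illustration of the idea, not existing engine logic: `HaRestartGate` and its methods are hypothetical, and treating an unreachable host as "polled" is an assumption based on the remark that not all hosts may be available.

```python
class HaRestartGate:
    """Blocks automatic HA restarts until every known host was polled once."""

    def __init__(self, host_ids):
        # Hosts whose state has not been looked at since the engine started.
        self._pending = set(host_ids)

    def mark_polled(self, host_id):
        # Assumption: a host counts as polled as soon as one monitoring cycle
        # finished for it, whether it answered or was found to be unreachable.
        self._pending.discard(host_id)

    def may_restart_ha_vms(self):
        # Only act once the engine has a complete view of the cluster.
        return not self._pending


gate = HaRestartGate(["host1", "host2"])
gate.mark_polled("host1")
print(gate.may_restart_ha_vms())  # False: host2 has not been polled yet
gate.mark_polled("host2")
print(gate.may_restart_ha_vms())  # True: every host was polled at least once
```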
> I would make the engine backup tool 'smart'. The engine is the component
> that cope with stale data - the runtime monitoring is the actual truth and
> the db is just a point in time after that. In case we restore, we should
> make sure the monitoring is again ahead of the db.
I wouldn't do it either, changed the component accordingly.
You might want to talk to the infra team as well: there is a 5-minute grace period for any host fencing after engine restart. You could follow the same logic for HA restarts in this case (all hosts are either up or fenced after five minutes, so you do not have to wait indefinitely).
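A bounded wait along these lines might be sketched as follows. The five-minute value comes from the comment above; `ha_restart_allowed`, its parameters and the idea of passing timestamps in explicitly are assumptions for the sake of a self-contained example, not the fencing code itself.

```python
GRACE_PERIOD_SECONDS = 5 * 60  # the fencing grace period mentioned above


def ha_restart_allowed(engine_start, now, unresolved_hosts,
                       grace=GRACE_PERIOD_SECONDS):
    """Decide whether HA VMs may be restarted automatically.

    engine_start and now are timestamps in seconds; unresolved_hosts is the
    collection of hosts whose state is still unknown (hypothetical input).
    """
    if not unresolved_hosts:
        # Every host is accounted for, no need to wait any longer.
        return True
    # Otherwise wait, but never longer than the grace period.
    return now - engine_start >= grace


print(ha_restart_allowed(0, 60, {"host2"}))   # False: still inside the grace period
print(ha_restart_allowed(0, 60, set()))       # True: nothing left to wait for
print(ha_restart_allowed(0, 300, {"host2"}))  # True: the grace period has elapsed
```

The timestamps are passed in rather than read from a clock so that the decision stays easy to test.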
> Note that it may be that not all the hosts are available - so the logic
> should probably not be "until all hosts are up" but "until all hosts were
> polled/monitored".
That's more refined and correct. Don't think we mark that in any way atm.
If there is no change in where the VM runs, this situation won't happen.
If you move the VM, we can't really know what happened and will attempt to restart it. VM leases will be the ultimate safety net for these cases; in the meantime we won't be able to address this.
*** This bug has been marked as a duplicate of bug 1441322 ***