Description of problem:
Our customer is testing HA. When they place more than 25-30 VMs on two hosts and shoot one host to force evacuation of the VMs to the other host, failures start to appear above this number of VMs. The hosts are beefy enough to handle 100 VMs each of this flavor without trouble (and the VMs are idle, as this is for testing purposes). The last try was with 100 VMs, and 47 of them failed to migrate. Nova reports those VMs in error state, but they are pingable and they exist on the target host (at the libvirt/virsh level); nova still reports them in error state on the source host, not the target one. Resetting their state did nothing, as did tagging them as 'evacuable'. The only solution left is manual database modification, which we do not want to become a standard process in the production environment. We have the logs from the 100-VM move. Additionally, the VMs are freshly created for each test, so this is not the source->target->source issue we are already tracking for RHOSP13 (bug 1567606).

Version-Release number of selected component (if applicable):

How reproducible:
Create 50-60 VMs on a host, kill it, and let the VMs be evacuated to another node.

Steps to Reproduce:
1. Create 50-60 VMs on a two-compute-node deployment (the customer will test with 4)
2. Shoot one node (power, network, whatever)
3. VMs will be evacuated

Actual results:
Random VMs end up in error state. They are technically migrated, but nova considers them still on the dead host, while libvirt has started them on the target and they are live (i.e. virsh console, ping, etc.).

Expected results:
All VM database entries are in sync with the VMs' current position.

Case files are yanked on supportshell.

Additional info:
Customer will go live next week...
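For reference, a minimal sketch of the mismatch check and the reset attempt described above, assuming a working overcloudrc and a hypothetical instance UUID (`$VM_UUID`) and host names; this is illustrative, not a workaround:

```shell
#!/bin/sh
# Hypothetical instance UUID from one of the VMs stuck in ERROR state.
VM_UUID=00000000-0000-0000-0000-000000000000

# What nova believes: status and the host the instance is recorded on.
# In the failure case this still shows the dead source host.
openstack server show "$VM_UUID" -c status -c OS-EXT-SRV-ATTR:host

# What libvirt on the *target* compute actually shows: the domain is
# defined and running there, and the guest answers pings.
ssh heat-admin@target-compute sudo virsh list --all

# The reset attempt that did not help here: force the instance back
# to ACTIVE in the nova database without touching the hypervisor.
nova reset-state --active "$VM_UUID"
```

These commands require a live deployment, so the sketch cannot be run standalone; it only documents the checks referenced in the report.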
(In reply to Pierre-Andre MOREY from comment #0)
> Steps to Reproduce:
> 1. Create 50-60 vms on a two compute nodes deployment (customer will test
> with 4)
> 2. Shoot one node (power, network, whatever)
> 3. Vms will be evacuated

This smells like https://bugzilla.redhat.com/show_bug.cgi?id=1567606 *if* the customer is allowing the original compute to power back on before all of the instances have been rebuilt on the destination compute. Pierre, can you confirm whether that is the case?
*** Bug 1649253 has been marked as a duplicate of this bug. ***