Description of problem:
We saw VMs in 'unknown' status with no run_on_vds. In this state there is no way to do anything with these VMs, so we must prevent it from happening. It seems to be a race between VM monitoring and the non-responsive treatment.

Version-Release number of selected component (if applicable):

How reproducible:
rarely

Steps to Reproduce:
1. have a running VM
2. disconnect the host the VM is running on

Actual results:
The monitoring might detect the VM as missing (no longer reported by VDSM) while it is being set to unknown, and because of a race we end up with: status=UNKNOWN & run_on_vds=null

Expected results:
If the VM was detected as missing then it should be DOWN.
Otherwise the VM should be UNKNOWN and running on the host it ran on before.

Additional info:
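To make the expected results concrete, here is a minimal sketch of the invariant in Python. It is not the engine's code: the names `Vm`, `on_vm_missing`, `on_host_non_responsive` and `is_consistent` are hypothetical, and the per-VM lock is only an assumption about how the two flows could be serialized. It shows that if both updates run under the same lock and the non-responsive treatment skips VMs that monitoring already cleared, the combination status=UNKNOWN & run_on_vds=null cannot be produced in either order.

```python
import threading
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class VmStatus(Enum):
    UP = "Up"
    DOWN = "Down"
    UNKNOWN = "Unknown"


@dataclass
class Vm:
    # Hypothetical model of the two fields involved in the race.
    status: VmStatus
    run_on_vds: Optional[str]
    lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )


def on_vm_missing(vm: Vm) -> None:
    """Monitoring path: the VM is no longer reported, so it goes DOWN."""
    with vm.lock:
        vm.status = VmStatus.DOWN
        vm.run_on_vds = None


def on_host_non_responsive(vm: Vm) -> None:
    """Non-responsive treatment: set UNKNOWN, but only if still on a host."""
    with vm.lock:
        if vm.run_on_vds is None:
            # Monitoring already handled this VM; leave it DOWN.
            return
        vm.status = VmStatus.UNKNOWN


def is_consistent(vm: Vm) -> bool:
    """The state reported in this bug is the only one rejected here."""
    return not (vm.status is VmStatus.UNKNOWN and vm.run_on_vds is None)
```

Calling the two handlers in either order on a VM that starts as UP on some host leaves `is_consistent` true: the VM ends up DOWN with no host, or UNKNOWN with its previous host.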
Verified with rhevm-4.0.2.3-0.1.el7ev.noarch.

Ran the following flow several times, each time stopping the host's network service at a different time:
1. Start vm with os installed (wait until it's up).
2. Shut down vm.
3. Wait for some time.
4. Stop network service on the host.

Results:
Host becomes non responsive after engine fails soft fencing (hard fencing is disabled in the env).
In the majority of the runs the vm became unknown but still existed on the host (still running in libvirt). After restarting the host's network and vdsm, the host went up again and the vm's shutdown continued and succeeded.
In one case I stopped the host's network a second or two before the shutdown process ended (as reflected in engine), but the vm went down on the host (process was killed) probably a few milliseconds before the network stopped, so the vm went down before the host became non responsive.