Bug 1354494 - VMs in unknown status and no run_on_vds
Summary: VMs in unknown status and no run_on_vds
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 3.6.6
Hardware: Unspecified
OS: Unspecified
Importance: high
Target Milestone: ovirt-4.0.2
Target Release: 4.0.2
Assignee: Arik
QA Contact: sefi litmanovich
Depends On:
Reported: 2016-07-11 12:30 UTC by Arik
Modified: 2016-08-12 14:31 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2016-08-12 14:31:45 UTC
oVirt Team: Virt
rule-engine: ovirt-4.0.z+
rule-engine: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+


System ID Private Priority Status Summary Last Updated
oVirt gerrit 60447 0 master MERGED core: switch multiple vms to unknown in one update 2016-07-13 06:04:38 UTC
oVirt gerrit 60522 0 master MERGED core: prevent floating vms in unknown status 2016-07-13 11:41:19 UTC
oVirt gerrit 60652 0 ovirt-engine-4.0 MERGED core: switch multiple vms to unknown in one update 2016-07-14 09:12:32 UTC
oVirt gerrit 60653 0 ovirt-engine-4.0 MERGED core: prevent floating vms in unknown status 2016-07-14 09:12:42 UTC

Description Arik 2016-07-11 12:30:10 UTC
Description of problem:
We saw VMs in 'unknown' status with no run_on_vds. In this state there is no way to do anything with these VMs, so we must prevent it from happening. It appears to be a race between VM monitoring and non-responsive host treatment.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. have a running VM
2. disconnect the host the VM is running on

Actual results:
The VM might be detected by the monitoring as missing (it stopped being reported by VDSM) while it is being set to unknown, and because of a race we end up with: status=UNKNOWN & run_on_vds=null

Expected results:
If the VM was detected as missing, it should be DOWN
Otherwise the VM should be UNKNOWN and still associated with the host it ran on before

Additional info:
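The merged patches prevent the floating state by guarding the transition to UNKNOWN so it only happens while the VM is still pinned to a host. The following is a minimal sketch of that idea, not the actual engine code (which does this in Java/SQL with a single batched database update); the class and method names here are hypothetical.

```python
import threading


class VmDynamic:
    """Minimal stand-in for a VM's runtime state in the engine (hypothetical)."""

    def __init__(self, status, run_on_vds):
        self.status = status
        self.run_on_vds = run_on_vds
        self._lock = threading.Lock()

    def mark_down(self):
        # VM monitoring path: VDSM stopped reporting the VM, so it is gone.
        # Both fields are updated under one lock so other paths see a
        # consistent (status, run_on_vds) pair.
        with self._lock:
            self.status = "Down"
            self.run_on_vds = None

    def mark_unknown_if_still_on_host(self):
        # Non-responsive-host path: flag the VM as Unknown only while it is
        # still attached to a host. This guarded, compare-and-set style
        # update is what rules out status=UNKNOWN & run_on_vds=null.
        with self._lock:
            if self.run_on_vds is not None:
                self.status = "Unknown"
                return True
            return False  # monitoring already cleared the VM; keep it Down
```

Whichever order the two paths run in, the VM ends up either DOWN with no host, or UNKNOWN still bound to its last host, matching the expected results above.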

Comment 2 sefi litmanovich 2016-08-02 12:02:13 UTC
Verified with rhevm-

Ran the following flow several times, each time stopping the host's network service at a different point:

1. Start vm with os installed (wait until it's up).
2. Shut down vm.
3. Wait for some time.
4. Stop network service on the host.

The host becomes non-responsive after the engine fails soft fencing (hard fencing is disabled in the env).
In the majority of the runs the vm became unknown but still existed on the host (still running in libvirt). After restarting the host's network and vdsm, the host went up again and the vm's shutdown continued and succeeded.

In one case I stopped the host's network a second or two before the shutdown process ended (as reflected in the engine), but the vm went down on the host (the process was killed) probably a few milliseconds before the network stopped, so the vm went down before the host became non-responsive.
