1354494 – VMs in unknown status and no run_on_vds

Summary: VMs in unknown status and no run_on_vds

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Virt
Sub Component:
Version:	3.6.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.0.2
Target Release:	4.0.2
Assignee:	Arik
QA Contact:	sefi litmanovich
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-07-11 12:30 UTC by Arik
Modified:	2016-08-12 14:31 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-08-12 14:31:45 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.0.z+ rule-engine: planning_ack+ michal.skrivanek: devel_ack+ mavital: testing_ack+

Links
System	ID	Priority	Status	Summary	Last Updated
oVirt gerrit	60447	master	MERGED	core: switch multiple vms to unknown in one update	2016-07-13 06:04:38 UTC
oVirt gerrit	60522	master	MERGED	core: prevent floating vms in unknown status	2016-07-13 11:41:19 UTC
oVirt gerrit	60652	ovirt-engine-4.0	MERGED	core: switch multiple vms to unknown in one update	2016-07-14 09:12:32 UTC
oVirt gerrit	60653	ovirt-engine-4.0	MERGED	core: prevent floating vms in unknown status	2016-07-14 09:12:42 UTC

Description Arik 2016-07-11 12:30:10 UTC

Description of problem:
We saw VMs in 'unknown' status with no run_on_vds. Nothing can be done with a VM in this state, so we must prevent it from happening. It seems to be a race between VMs-monitoring and the non-responsive host treatment.

Version-Release number of selected component (if applicable):


How reproducible:
rarely

Steps to Reproduce:
1. have a running VM
2. disconnect the host the VM is running on
3.

Actual results:
The VM might be detected by the monitoring as missing (it stopped being reported by VDSM) while it is being set to unknown, and because of a race we end up with: status=UNKNOWN & run_on_vds=null
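
One interleaving that could produce this state can be sketched as a small, deterministic model. This is not the engine code: the `Vm` class, the function names and the "stale snapshot" mechanism are all hypothetical, chosen only to illustrate how a status write based on an outdated view of the host can land on a VM that the monitoring has already detached from it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vm:
    # Hypothetical, simplified stand-in for a VM's dynamic data.
    name: str
    status: str
    run_on_vds: Optional[str]


def snapshot_vms_on_host(vms: List[Vm], host: str) -> List[Vm]:
    # The non-responsive treatment collects the VMs it believes run on the host.
    return [vm for vm in vms if vm.run_on_vds == host]


def monitoring_marks_missing(vm: Vm) -> None:
    # Monitoring no longer sees the VM reported, so it detaches it from the host.
    # (Assumption: a VM handled this way ends up DOWN with run_on_vds cleared.)
    vm.status = "DOWN"
    vm.run_on_vds = None


def set_unknown_unguarded(stale: List[Vm]) -> None:
    # Writes the status without re-checking where the VM runs now.
    for vm in stale:
        vm.status = "UNKNOWN"


def set_unknown_guarded(stale: List[Vm], host: str) -> None:
    # Only touches VMs that are still attached to the non-responsive host.
    for vm in stale:
        if vm.run_on_vds == host:
            vm.status = "UNKNOWN"


def run_interleaving(guarded: bool) -> Vm:
    vm = Vm("vm1", "UP", "host1")
    stale = snapshot_vms_on_host([vm], "host1")  # step 1: treatment reads
    monitoring_marks_missing(vm)                 # step 2: monitoring writes
    if guarded:                                  # step 3: treatment writes
        set_unknown_guarded(stale, "host1")
    else:
        set_unknown_unguarded(stale)
    return vm
```

With `run_interleaving(guarded=False)` the model ends in status=UNKNOWN & run_on_vds=None; with `guarded=True` the VM stays DOWN. In a real system the check and the write would have to happen atomically (for example in a single conditional update) for the guard to close the race.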

Expected results:
If the VM was detected as missing then it should be DOWN
Otherwise the VM should be UNKNOWN and running on the host it ran on before
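
The two expected outcomes can be written down as a small decision function plus a check for the forbidden combination. This is a sketch only; `expected_vm_state` and `is_floating` are hypothetical names, and clearing run_on_vds for a DOWN VM is an assumption that the report does not state explicitly.

```python
from typing import Optional, Tuple


def expected_vm_state(detected_missing: bool,
                      previous_host: str) -> Tuple[str, Optional[str]]:
    """Return the (status, run_on_vds) pair the report expects."""
    if detected_missing:
        # Monitoring saw the VM disappear: it should simply be DOWN.
        # Assumption: a DOWN VM is not attached to any host.
        return ("DOWN", None)
    # Otherwise the VM is UNKNOWN but still tied to the host it ran on.
    return ("UNKNOWN", previous_host)


def is_floating(status: str, run_on_vds: Optional[str]) -> bool:
    """True for the broken state described above: UNKNOWN with no host."""
    return status == "UNKNOWN" and run_on_vds is None
```

Neither expected outcome is "floating": `is_floating(*expected_vm_state(True, "host1"))` and `is_floating(*expected_vm_state(False, "host1"))` are both false.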

Additional info:

Comment 2 sefi litmanovich 2016-08-02 12:02:13 UTC

Verified with rhevm-4.0.2.3-0.1.el7ev.noarch.

Ran the following flow several times, each time stopping the host's network service at a different time:

1. Start vm with os installed (wait until it's up).
2. Shut down vm.
3. Wait for some time.
4. Stop network service on the host.

Results:
Host becomes non responsive after engine fails soft fencing (hard fencing is disabled in the env).
In the majority of the runs the vm became unknown but still existed on the host (still running in libvirt). After restarting the host's network and vdsm, host went up again and the vm's shutdown continued and succeeded.

In one case I stopped the host's network a second or two before the shutdown process ended (as reflected in the engine), but the vm went down on the host (the process was killed) probably a few milliseconds before the network stopped, so the vm went down before the host became non responsive.