Bug 1354494

Summary: VMs in unknown status and no run_on_vds
Product: [oVirt] ovirt-engine
Component: BLL.Virt
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 3.6.6
Target Milestone: ovirt-4.0.2
Target Release: 4.0.2
Hardware: Unspecified
OS: Unspecified
Reporter: Arik <ahadas>
Assignee: Arik <ahadas>
QA Contact: sefi litmanovich <slitmano>
Docs Contact:
CC: bugs
Flags: rule-engine: ovirt-4.0.z+
rule-engine: planning_ack+
michal.skrivanek: devel_ack+
mavital: testing_ack+
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-12 14:31:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Arik 2016-07-11 12:30:10 UTC
Description of problem:
We saw VMs in 'unknown' status with no run_on_vds. In this state there is no way to do anything with these VMs, so we must prevent it from happening. It appears to be a race between the VMs monitoring and the non-responsive host treatment.

Version-Release number of selected component (if applicable):


How reproducible:
rarely

Steps to Reproduce:
1. have a running VM
2. disconnect the host the VM is running on
3.

Actual results:
The VM might be detected by the monitoring as missing (it stopped being reported by VDSM) while it is being set to unknown, and because of this race we end up with: status=UNKNOWN & run_on_vds=null

Expected results:
If the VM was detected as missing, it should be DOWN.
Otherwise the VM should be UNKNOWN and still associated with the host it ran on before.

Additional info:
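To make the expected behavior concrete, here is a minimal sketch (not the actual engine code; `VmRecord` and the `handle_*` methods are hypothetical) of the invariant the fix should enforce: the two concurrent paths must decide atomically, so the VM is either DOWN with no host, or UNKNOWN with run_on_vds preserved, but never UNKNOWN with run_on_vds=null.

```python
from enum import Enum
from threading import Lock

class VmStatus(Enum):
    UP = "Up"
    UNKNOWN = "Unknown"
    DOWN = "Down"

class VmRecord:
    """Toy model of the engine's per-VM state (status + run_on_vds)."""

    def __init__(self, status, run_on_vds):
        self.status = status
        self.run_on_vds = run_on_vds
        self._lock = Lock()  # stands in for per-VM locking in the engine

    def handle_host_non_responsive(self):
        """Non-responsive treatment: mark the VM UNKNOWN, keep run_on_vds."""
        with self._lock:
            if self.status is not VmStatus.DOWN:
                self.status = VmStatus.UNKNOWN
                # run_on_vds is intentionally preserved

    def handle_missing_from_report(self):
        """Monitoring: the VM stopped being reported by VDSM."""
        with self._lock:
            if self.status is VmStatus.UNKNOWN:
                # Non-responsive treatment already ran: do NOT clear
                # run_on_vds, the VM may still be running on that host.
                return
            self.status = VmStatus.DOWN
            self.run_on_vds = None
```

Whichever handler acquires the lock second sees the other's decision, so the broken combination (UNKNOWN with run_on_vds=null) cannot be produced in either interleaving.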

Comment 2 sefi litmanovich 2016-08-02 12:02:13 UTC
Verified with rhevm-4.0.2.3-0.1.el7ev.noarch.

Ran the following flow several times each time stopping the host's network service in a different time:

1. Start vm with os installed (wait until it's up).
2. Shut down vm.
3. Wait for some time.
4. Stop network service on the host.

Results:
The host becomes non-responsive after the engine fails soft fencing (hard fencing is disabled in this environment).
In the majority of the runs the VM became unknown but still existed on the host (still running in libvirt). After restarting the host's network and VDSM, the host went up again and the VM's shutdown continued and succeeded.

In one case I stopped the host's network a second or two before the shutdown process ended (as reflected in the engine), but the VM went down on the host (the process was killed) probably a few milliseconds before the network stopped, so the VM went down before the host became non-responsive.