Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1361860

Summary: VMs stuck in shutting down
Product: [oVirt] ovirt-engine
Component: BLL.Virt
Version: 4.0.2.1
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Ori Ben Sasson <obensass>
Assignee: Arik <ahadas>
QA Contact: Ori Ben Sasson <obensass>
CC: bugs, gklein, mburman, mgoldboi, michal.skrivanek, sbonazzo
Target Milestone: ovirt-4.0.4
Target Release: 4.0.4
Keywords: Automation
Flags: rule-engine: ovirt-4.0.z+, mgoldboi: planning_ack+, michal.skrivanek: devel_ack+, myakove: testing_ack+
Hardware: x86_64   
OS: Linux   
Last Closed: 2016-09-26 12:31:55 UTC
Type: Bug
oVirt Team: Virt
Attachments:
- logs (flags: none)
- Screenshot of the VMs (flags: none)

Description Ori Ben Sasson 2016-07-31 10:37:40 UTC
Created attachment 1186006 [details]
logs

VMs stuck in shutting down
Description of problem:
VMs are stuck in the process of shutting down, and some of them are stuck in migration;
looking at the VDSM side, there are no running VMs.
This happens when running automation test port mirroring case 1.

engine log:
2016-07-28 13:05:18,821 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (org.ovirt.thread.pool-6-thread-18) [6516501d] FINISH, DestroyVmVDSCommand, log id: 708b78be
2016-07-28 13:05:18,822 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (org.ovirt.thread.pool-6-thread-18) [6516501d] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist, code = 1 (Failed with error noVM and code 1)
2016-07-28 13:05:18,915 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-18) [6516501d] Correlation ID: 6516501d, Job ID: f46256b7-de3d-450c-91dd-e85f49c26c1f, Call Stack: null, Custom Event ID: -1, Message: Failed to power off VM golden_env_mixed_virtio_0 (Host: host_mixed_2, User: admin@internal-authz).
2016-07-28 13:05:21,531 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler7) [7c0c1323] Fetched 0 VMs from VDS '517133cf-6e17-443c-ac55-996854e7a238'
2016-07-28 13:05:36,813 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler2) [3b18b712] Fetched 0 VMs from VDS '517133cf-6e17-443c-ac55-996854e7a238'

Version-Release number of selected component (if applicable):
4.0.2.1-0.1.el7ev

How reproducible:
sometimes

Steps to Reproduce:
1. Create 3 vNIC profiles for each VM (4 VMs); configure VM1 to have port mirroring enabled on the first NIC.
2. Start the 4 VMs.
3. Migrate all VMs to another host.
4. Migrate all VMs back to the original host.
5. Shut down the VMs.

Actual results:
Failed to power off the VMs.

Expected results:
The VMs should be down.

Additional info:

Comment 1 Ori Ben Sasson 2016-07-31 10:41:20 UTC
Created attachment 1186007 [details]
Screenshot of the VMs

Comment 2 Arik 2016-08-01 07:59:29 UTC
The VMs stay in PoweringDown because the hosts they were running on are not monitored.

In the two occurrences of this issue I saw the following pattern:
- VM devices monitoring fails to save the updated devices because a deadlock is detected.
- 15 seconds later, the last monitoring cycle for the host is done.

The monitoring stops because VmDevicesMonitoring does not unlock the VMs due to that deadlock exception, so the next cycle is blocked (the exception is raised before the hashes are saved, so the next monitoring cycle would try to update the devices again).

Ensuring that the lock is released will prevent the monitoring from being blocked, so the test will succeed, but the main issue here is the deadlock.
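
For illustration only, a minimal sketch of the lock-release point above, with hypothetical names (this is not the actual VmDevicesMonitoring code): releasing the per-VM lock in a finally block means a failed device update, e.g. one aborted by a database deadlock, cannot leave the VM locked and block every later monitoring cycle.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not oVirt code: always release the per-VM lock,
// even when saving the devices throws (e.g. a DB deadlock exception),
// so the next monitoring cycle is never blocked behind a stale lock.
public class DevicesMonitoringSketch {
    private final Map<UUID, ReentrantLock> vmLocks = new ConcurrentHashMap<>();

    public void refreshVmDevices(UUID vmId) {
        ReentrantLock lock = vmLocks.computeIfAbsent(vmId, id -> new ReentrantLock());
        if (!lock.tryLock()) {
            return; // a previous cycle still holds this VM; skip this round
        }
        try {
            saveUpdatedDevices(vmId); // may throw if the DB detects a deadlock
            saveDeviceHash(vmId);     // hash is saved only after a successful update
        } finally {
            lock.unlock();            // released on success and on failure alike
        }
    }

    private void saveUpdatedDevices(UUID vmId) { /* placeholder for the DB write */ }
    private void saveDeviceHash(UUID vmId) { /* placeholder for the hash update */ }
}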

If it is indeed a regression, the fix for bz 1315100 might be the reason.

Comment 3 Arik 2016-08-01 15:58:44 UTC
To summarize the theory I currently hold:

1. It seems the deadlock is between VmDevicesMonitoring and ActivateDeactivateVmNic (see the sketch after this list):
- ActivateDeactivateVmNic updates the boot order of all devices in a transaction.
- VmDevicesMonitoring tries to remove unmanaged devices in a transaction (updates are not in a transaction, and adds are irrelevant).

2. The deadlock is unrelated to the migrations or to host-devices.

3. If that is the case, the most correct solution would probably be not to update the boot order of unmanaged devices.
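
To make the suspected interaction concrete, here is a small self-contained Java analogue (hypothetical names, not oVirt code) of a lock-order deadlock: two writers touch the same two device rows in opposite order, and each ends up waiting for a row lock the other already holds; the tryLock timeout stands in for the database's deadlock detector.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical analogue of the suspected deadlock, not oVirt code: the two
// locks stand in for row locks the database takes on two vm_device rows,
// and the two threads stand in for the two commands.
public class BootOrderDeadlockSketch {
    private static final ReentrantLock managedNicRow = new ReentrantLock();
    private static final ReentrantLock unmanagedDeviceRow = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        // ActivateDeactivateVmNic: updates the boot order of all devices
        Thread activateNic = new Thread(
                () -> lockBoth("ActivateDeactivateVmNic", managedNicRow, unmanagedDeviceRow));
        // VmDevicesMonitoring: removes unmanaged devices, then touches the rest
        Thread monitoring = new Thread(
                () -> lockBoth("VmDevicesMonitoring", unmanagedDeviceRow, managedNicRow));
        activateNic.start();
        monitoring.start();
        activateNic.join();
        monitoring.join();
    }

    private static void lockBoth(String who, ReentrantLock first, ReentrantLock second) {
        first.lock();
        try {
            Thread.sleep(100); // widen the race so both threads hold their first row
            // tryLock with a timeout stands in for the DB deadlock detector
            if (second.tryLock(1, TimeUnit.SECONDS)) {
                second.unlock();
            } else {
                System.out.println(who + " timed out waiting for the second row: lock-order deadlock");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }
}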

Comment 4 Michal Skrivanek 2016-08-03 07:48:41 UTC
After inspecting the code, it does not look like a regression but rather a timing issue/race that was only detected now.
The problem is resolved by restarting the engine service, hence reducing the severity.


We discussed several approaches for a fix; the best seems to be to remove the boot order update from ActivateDeactivateVmNic and let the monitoring trigger it when it detects the change in devices (hotplug), as sketched below.
This is closer to the asynchronous nature of hotplug actions in general.
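
A rough sketch of that direction only, with a hypothetical DAO and names (not the actual patch): once the monitoring cycle notices that the devices reported by the host differ from what the engine has stored, it recomputes and persists the boot order itself, so ActivateDeactivateVmNic no longer has to write those rows.

import java.util.List;
import java.util.UUID;

// Hypothetical sketch, not the actual fix: the boot order is refreshed
// from the monitoring side when a device change (hotplug) is detected,
// instead of synchronously from ActivateDeactivateVmNic.
public class HotplugBootOrderSketch {

    interface VmDeviceDao {
        List<UUID> getStoredDeviceIds(UUID vmId);
        void updateBootOrder(UUID vmId, List<UUID> orderedDeviceIds);
    }

    private final VmDeviceDao dao;

    HotplugBootOrderSketch(VmDeviceDao dao) {
        this.dao = dao;
    }

    // Called from the monitoring cycle with the devices reported by the host.
    void onDevicesReported(UUID vmId, List<UUID> reportedDeviceIds) {
        List<UUID> stored = dao.getStoredDeviceIds(vmId);
        if (!stored.equals(reportedDeviceIds)) {
            // A hotplug (or unplug) happened: recompute and persist the boot
            // order in one place, inside the monitoring flow, so there is no
            // second writer competing for the same device rows.
            dao.updateBootOrder(vmId, reportedDeviceIds);
        }
    }
}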