Bug 1361860
| Summary: | VMs stuck in shutting down | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Ori Ben Sasson <obensass> |
| Component: | BLL.Virt | Assignee: | Arik <ahadas> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Ori Ben Sasson <obensass> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.0.2.1 | CC: | bugs, gklein, mburman, mgoldboi, michal.skrivanek, sbonazzo |
| Target Milestone: | ovirt-4.0.4 | Keywords: | Automation |
| Target Release: | 4.0.4 | Flags: | rule-engine: ovirt-4.0.z+, mgoldboi: planning_ack+, michal.skrivanek: devel_ack+, myakove: testing_ack+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-09-26 12:31:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Screenshot of the VMs; logs | | |
Created attachment 1186007 [details]
Screenshot of the VMs
The VMs stay in PoweringDown because the hosts they were running on are no longer monitored. In both occurrences of this issue I saw the same pattern: VM monitoring fails to save the updated devices because a deadlock is detected. The last monitoring cycle for the host runs 15 seconds later. Monitoring then stops because VmDevicesMonitoring does not unlock the VMs after that deadlock exception, so the next cycle is blocked (the exception is raised before the hashes are saved, so the next monitoring cycle would try to update the devices again).

Ensuring that the lock is released will prevent the monitoring from being blocked, so the test will succeed, but the main issue here is the deadlock. If it is indeed a regression, the fix for bz 1315100 might be the reason.

To summarize the theory I currently hold:

1. The deadlock seems to be between VmDevicesMonitoring and ActivateDeactivateVmNic:
   - ActivateDeactivateVmNic updates the boot order of all devices in a transaction.
   - VmDevicesMonitoring tries to remove unmanaged devices in a transaction (updates are not in a transaction, adds are irrelevant).
2. The deadlock is unrelated to the migrations or to host devices.
3. The most correct solution would probably be not to update the boot order of unmanaged devices, if that is the case.

After inspecting the code, it does not look like a regression, but rather a timing issue/race that was only detected now. The problem is resolved by restarting the engine service, hence reducing the severity.

We discussed several approaches for a fix; the best seems to be to remove the boot order update from ActivateDeactivateVmNic and let monitoring trigger it when it detects the change in devices (hotplug). This is closer to the asynchronous nature of hotplug actions in general.
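The "ensure the lock is released" part of the analysis can be illustrated with a minimal Java sketch. This is not the actual oVirt engine code; the class, method, and lock names below are hypothetical and only show the general idea that the per-VM lock taken by the device-monitoring cycle must be released even when the transactional save throws (for example, on a database deadlock), otherwise every later monitoring cycle for that VM stays blocked.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical names; this is not the real VmDevicesMonitoring implementation.
class VmDevicesMonitoringSketch {
    private final ReentrantLock vmDevicesLock = new ReentrantLock();

    void runDevicesUpdateCycle(Runnable saveDevicesInTransaction) {
        if (!vmDevicesLock.tryLock()) {
            // A previous update for this VM still holds the lock; skip this cycle.
            return;
        }
        try {
            // May throw, e.g. when the database detects a deadlock with a concurrent
            // transaction such as ActivateDeactivateVmNic updating the boot order
            // of the same device rows.
            saveDevicesInTransaction.run();
        } finally {
            // Releasing the lock in a finally block is what keeps subsequent
            // monitoring cycles from being blocked after a deadlock exception.
            vmDevicesLock.unlock();
        }
    }
}
```

The sketch only covers the unlock-on-exception part; the deadlock itself is addressed by the chosen fix of dropping the boot order update from ActivateDeactivateVmNic, as described above.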
Created attachment 1186006 [details]
logs

Description of problem:
VMs are stuck in the process of shutting down, and some of the VMs are stuck in migration; looking at the vdsm side, there are no running VMs. This happens when running the automation test for port mirroring, case 1.

engine log:
2016-07-28 13:05:18,821 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (org.ovirt.thread.pool-6-thread-18) [6516501d] FINISH, DestroyVmVDSCommand, log id: 708b78be
2016-07-28 13:05:18,822 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (org.ovirt.thread.pool-6-thread-18) [6516501d] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine does not exist, code = 1 (Failed with error noVM and code 1)
2016-07-28 13:05:18,915 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-18) [6516501d] Correlation ID: 6516501d, Job ID: f46256b7-de3d-450c-91dd-e85f49c26c1f, Call Stack: null, Custom Event ID: -1, Message: Failed to power off VM golden_env_mixed_virtio_0 (Host: host_mixed_2, User: admin@internal-authz).
2016-07-28 13:05:21,531 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler7) [7c0c1323] Fetched 0 VMs from VDS '517133cf-6e17-443c-ac55-996854e7a238'
2016-07-28 13:05:36,813 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler2) [3b18b712] Fetched 0 VMs from VDS '517133cf-6e17-443c-ac55-996854e7a238'

Version-Release number of selected component (if applicable):
4.0.2.1-0.1.el7ev

How reproducible:
Sometimes

Steps to Reproduce:
1. Create 3 vNIC profiles for each VM (4 VMs); configure VM1 to have port mirroring enabled on the first NIC.
2. Start the 4 VMs.
3. Migrate all VMs to another host.
4. Migrate all VMs back to the original host.
5. Shut down the VMs.

Actual results:
Failed to power off the VMs.

Expected results:
The VMs should be down.

Additional info: