Bug 1142776
Summary: rhev-m stops syncing the VM statuses after massive live VM migration which fails

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.4.1-1
Status: CLOSED ERRATA
Severity: urgent
Priority: medium
Hardware: All
OS: Linux
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Fixed In Version: vt11
Keywords: ZStream
Doc Type: Bug Fix
Reporter: Roman Hodain <rhodain>
Assignee: Francesco Romani <fromani>
QA Contact: Israel Pinto <ipinto>
CC: ahadas, bazulay, danken, gklein, jas, lpeer, lsurette, mavital, michal.skrivanek, nbarcet, rbalakri, Rhev-m-bugs, rhodain, sherold, yeylon, ykaul
Flags: mavital: needinfo+
Doc Text: Simultaneous migration of many virtual machines could create a deadlock between the threads that monitor the hosts. As a result, the hosts were not monitored, and the statuses of their virtual machines were not updated by Red Hat Enterprise Virtualization Manager. The monitoring code has been updated to prevent the deadlock.
Clones: 1146908, 1174813
Bug Blocks: 1164311, 1174813
oVirt Team: Virt
Type: Bug
Last Closed: 2016-03-09 19:25:26 UTC
Description
Roman Hodain 2014-09-17 11:41:55 UTC
The source host logs are cut before the calls to the migration verbs; please attach full logs.

(In reply to Omer Frenkel from comment #7)
> the source host logs are cut before the calls to the migration verbs..
> please attach full logs

I have repeated the test, and here is the process:
- start: 2014-09-29 11:32; the UI does not refresh the statuses (engine_2014-09-29_1323.tar)
- engine restart: 2014-09-29 13:23; the situation is still the same (engine_2014-09-29_1552.tar, engine_2014-09-29_1643.tar)
- I restarted the vdsms on both of the hypervisors; the engine now shows the VMs as down (engine_2014-09-30_0821.tar)

The logs and DB dumps are attached.

Thoughts?

Roman, it would help if you are able to reproduce it and add the following information:
1. Name the VMs that were 'stuck' in UP/Migrate statuses at the end of the process.
2. Attach the output of the 'list' verb of vdsm on both hosts: vdsClient 0 list
Can you please reproduce it again and add this information along with the engine+vdsm logs?

Created attachment 954960 [details]
logs_20141107_144809
I have repeated the test and uploaded the logs from the hypervisors and the engine to the bugzilla.
The time is synced on all the systems (UTC).
The test was as follows:
- 40 VMs were started across two hypervisors.
- Migration of all the VMs was triggered.
- After a while (some of the VMs had already migrated and were in the UP state, some of them not), I killed all the qemu processes:
for i in `pidof qemu-kvm`; do kill -9 $i; done
- I collected the statuses of the VMs from the hypervisors a couple of times; there is always a timestamp in the files (a sketch of how such snapshots can be collected follows this list).
- I restarted the vdsms on the hypervisors and collected the list of VMs again. There was no VM listed; the result is in the same files, but obviously there is just the timestamp, as no data was returned.
I have then collected all the logs.
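
As an aside, timestamped status snapshots like the ones described above can be gathered with a small script. The following is only an illustrative sketch, not part of this report: the host names, the output file layout, and passwordless ssh access to the hypervisors are assumptions; the vdsClient invocation is the one requested earlier in the thread.

    # Illustrative sketch only -- hostnames and file layout are hypothetical.
    # Collects a timestamped `vdsClient 0 list` snapshot from both
    # hypervisors over ssh, mirroring the files attached to this bug.
    import subprocess
    import time

    HOSTS = ["hyper1.example.com", "hyper2.example.com"]  # hypothetical names

    def snapshot_vm_statuses():
        stamp = time.strftime("%Y-%m-%d_%H%M%S")
        for host in HOSTS:
            result = subprocess.run(
                ["ssh", host, "vdsClient", "0", "list"],
                capture_output=True, text=True,
            )
            with open(f"vm_status_{host}_{stamp}.txt", "w") as f:
                f.write(f"timestamp: {stamp}\n")
                f.write(result.stdout or result.stderr)

    if __name__ == "__main__":
        snapshot_vm_statuses()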
Regarding:
> 1. Name the VMs that were 'stuck' in UP/Migrate statuses at the end of the process
None of the statuses are synced: all the VMs were killed, yet the UI shows all of them as UP or migrating, apart from the two VMs (TapeTest and Win7) which were not part of the test and have never been started.
(In reply to Michal Skrivanek from comment #11)

Roman, thanks, now it's clear.

There's a deadlock between the two monitoring (VURTI) threads. Each of them holds the lock on its host (the one it monitors). Since both of them detect that one of their VMs is down while being migrated, on the source and on the destination, they both try to kill those VMs on the 'other' host. In order to kill the VM on a host, one has to lock that host. So each of them holds the lock on its own host and tries to lock the other host => deadlock. (A minimal sketch of this lock-ordering pattern follows the attachment below.)

Created attachment 965736 [details]
Bug-1142776-logs
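
To make the deadlock pattern described above concrete, here is a minimal, self-contained sketch. It is not VDSM or engine code: the lock names, thread names, and the timeout-based back-off are illustrative assumptions. With a plain blocking acquire, both threads would hang forever; the timeout only makes the failure visible instead of hanging the demo.

    # Minimal sketch of the lock-ordering deadlock -- NOT actual VDSM code.
    # Two "monitoring" threads each hold their own host's lock, then try to
    # lock the peer host to destroy a VM that died mid-migration.
    import threading
    import time

    lock_host1 = threading.Lock()  # hypothetical per-host monitoring locks
    lock_host2 = threading.Lock()

    def monitor(own_lock, peer_lock, name):
        with own_lock:          # the monitoring cycle holds its host's lock
            time.sleep(0.1)     # window in which the peer thread grabs its lock
            # A migrating VM is down; destroying it on the peer host
            # requires taking the peer host's lock as well.
            if peer_lock.acquire(timeout=2):
                try:
                    print(f"{name}: destroyed stale VM on the peer host")
                finally:
                    peer_lock.release()
            else:
                # With a blocking acquire, this is where both threads hang.
                print(f"{name}: peer lock held -- this is the deadlock")

    threads = [
        threading.Thread(target=monitor, args=(lock_host1, lock_host2, "VURTI-1")),
        threading.Thread(target=monitor, args=(lock_host2, lock_host1, "VURTI-2")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

The standard cures are acquiring the locks in a globally consistent order, or releasing the thread's own lock before blocking on the peer's; which approach the actual vt11 fix takes is not stated in this report.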
I reproduced it with vt12. The scenario is:

Setup: 2 hosts, 10 VMs
1. Start migrating from Host1 to Host2.
2. After 10 seconds, kill the qemu processes of all VMs (with: for i in `pidof qemu-kvm`; do kill -9 $i; done).

I see that 6 of the 10 VMs in the pool (Pool_test_2) are not responding, e.g. "VM Pool_test_2-2 is not responding." Francesco Romani has investigated it and found a problem in VDSM.

3.6 is done. Verified it with:
Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.13.master.el6
VDSM: vdsm-4.17.5-1.el7ev

The scenario is:
Setup: 2 hosts, 10 VMs
1. Start migrating from Host1 to Host2.
2. After 10 seconds, kill the qemu processes of all VMs (with: for i in `pidof qemu-kvm`; do kill -9 $i; done).
All VMs are down.

I posted the following to the ovirt users list on Friday, and got no response:
https://www.mail-archive.com/users@ovirt.org/msg28962.html

After doing some experimentation, I can see that I can migrate one VM between two hosts back and forth, and there's no problem whatsoever. However, if I do the same with 3 hosts at the same time, I run into a problem whereby, near the end of migration, the engine says that the migration failed, and the VM is DOWN. I'm not 100% sure it's this bug, but I'm 99% sure it is. That's pretty bad, because I can make it happen by simply switching a host into maintenance mode, or, with the balancing policy, after I bring the host back up and the engine starts to migrate VMs back to the machine. I haven't been using 3.5 for *so* long. Ironically, the release notes for 3.5 say that this bug was fixed, but I don't think it is, and I think it's pretty serious, since I've had to disable balancing and now need to be super careful about putting a host into maintenance mode unless I want my VMs to shut down. In addition, I'm seeing this happen with far fewer hosts than have been reported in this report. Just a FYI: the last time I tried it with 3 hosts, 2 migrated fine, the last was almost done, and then it happened.

Sigh, maybe it's a different bug. I just migrated 5 VMs from one host to another, one at a time, and didn't see the problem. When I migrated the 6th VM: "Migration failed", and the VM is down. I threw my engine, virt1, virt2, and virt3 logs in http://www.eecs.yorku.ca/~jas/ovirt-debug/11072015 if someone can have a look. The VM in question was called "webapp" and moved from virt3 to virt1.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html