Created attachment 1351548 [details]
screenshot

Description of problem:
A 'Cannot run VM without at least one bootable disk' error appears although I have one bootable disk. This happens after trying to start a VM that for some reason couldn't find a bootable disk, although the disk appears to be up and the storage domain is accessible and up.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch

How reproducible:
happens on my hosted engine

Steps to Reproduce:
1. start a VM (if this does not fail, check the console to see whether it actually found a bootable volume)
2. start the VM again
3.

Actual results:
if step 1 couldn't find a bootable volume, step 2 fails with 'Cannot run VM without at least one bootable disk'

Expected results:
VM is started

Additional info:
Which type of run is it? Run or RunOnce?
This is a regular Run.
And which run is it? There are 5 for the same VM and several others with the error you described. Some also seem to be launched through API instead of GUI.
Ah, sorry, I didn't understand your question. It happened for all of those run attempts; some of the REST API RunOnce requests may have actually succeeded, but the VM didn't find a bootable disk afterwards anyway.
Just managed to reproduce this on a completely different engine (a regular engine, not hosted) with 4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos
So after some investigation we found out that the first failed run caused the disk to go down. After activating the disk, the VM can be started but still won't find any bootable volume, although there certainly is (was) one. After powering the VM off, the disk is up as it should be, so the down status happened only the first time; maybe there was some other error, I'm not sure.
After getting my hands on a reproduction setup (thank you Petr!) this is what I found:
- on run, the engine generates the libvirt XML including the disk
- vdsm receives this in the create call
- but vdsm then for some reason removes the disk from the libvirt XML and creates a domain without it (this is the issue)
- the engine then receives an event that the status changed, updates the devices and finds out that the VM no longer has the disk which, according to the DB, is a managed disk, so it assumes it has been unplugged and sets it accordingly (this is correct)

Attaching some logs.

Francesco, any idea?
Created attachment 1358521 [details] libvirt xml sent from engine
Created attachment 1358522 [details] vm create in vdsm
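For anyone trying to confirm the same behaviour on their own setup: a rough sketch of how to inspect, from the host, the domain that vdsm actually created and check whether the disk survived the create call. This assumes the libvirt Python bindings are installed on the host; the VM name 'myvm' is a placeholder, not anything from this bug.

    # Hedged sketch: dump the running domain's XML and look for <disk device='disk'>
    # entries, to see whether the disk sent by the engine made it into the domain.
    import xml.etree.ElementTree as ET

    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')   # read-only connection to the host's libvirt
    dom = conn.lookupByName('myvm')                 # 'myvm' is a placeholder VM name
    root = ET.fromstring(dom.XMLDesc(0))            # full domain XML as libvirt sees it

    disks = root.findall("./devices/disk[@device='disk']")
    if not disks:
        print('No <disk> devices in the running domain - matches the reported behaviour')
    for disk in disks:
        source = disk.find('source')
        boot = disk.find('boot')                    # per-device boot order, if set
        print(source.attrib if source is not None else '<no source>',
              'boot order:', boot.get('order') if boot is not None else 'unset')

    conn.close()

If the list comes back empty while the engine-side XML (first attachment) contains the disk, that matches the analysis above.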
Hi Petr, the vdsm build is almost two months old - ancient in 4.2 terms - and there have been numerous changes to XML handling since then, especially on the storage side. Please retest with a recent 4.2 build, with matching engine and vdsm versions.
As expected with ovirt-engine-4.2.0-0.0.master.20171122115834.git3549ed1.el7.centos.noarch and vdsm-4.20.7-1.el7.centos.x86_64 this works correctly.
there were actual fixes, just somewhere along the way....:)
I am also affected by this catastrophic failure.

ovirt: 4.2.1.7-1.el7.centos

All the VMs that were running before the last reboot of the hypervisor are now affected by the issue and will not boot anymore. Other VMs that were powered off are still ok.

The only workaround I found is to perform the following steps (VM by VM):
- Edit VM
- Remove all disks using the "-" (minus) symbol (do NOT remove permanently!)
- Remove the network adapter using the "-" symbol
- Save the VM without disks
- Edit the VM
- Add the disks using the "+" (plus) symbol and attach the disks previously used. Make sure the root disk has the "boot" flag (OS) set.
- Add the network adapter and connect it to the correct network
- Save the VM
- Run the VM

If there are problems with the network: shut down, Edit VM, remove the network adapter, save the VM, Edit the VM, add the network adapter again and save.

Maybe there is a better, more automated way to rescue the VMs? Any ideas?
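One possible way to script at least the disk part instead of going VM by VM in the UI. This is only a rough sketch assuming ovirt-engine-sdk-python (ovirtsdk4); the engine URL, credentials and the VM name 'rescue-me' are placeholders, so adjust before trying it:

    # Hedged sketch: re-activate all disk attachments of one VM through the REST API.
    # URL, credentials and the VM name are placeholders, not values from this bug.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        ca_file='ca.pem',
    )

    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=rescue-me')[0]
    vm_service = vms_service.vm_service(vm.id)

    attachments_service = vm_service.disk_attachments_service()
    for attachment in attachments_service.list():
        if not attachment.active:
            # Equivalent of selecting the disk and clicking 'Activate' in the UI
            attachments_service.attachment_service(attachment.id).update(
                types.DiskAttachment(active=True)
            )
            print('Activated disk attachment', attachment.id)

    connection.close()

The same loop could be wrapped over vms_service.list() to cover every affected VM, but test it on one VM first.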
I have some comments on this. I updated my 2-host cluster, both hosts able to run the hosted engine, from 4.1.1.9 (unsure) to 4.2.1.7 yesterday. I will write out some details, as I may have made some mistakes which caused the VMs not to start, but I have found a workaround.

The path I followed:
- Go into global maintenance
- Update the hosted engine to 4.2
- Disable global maintenance (may have been a mistake)
- Put one host into local maintenance
- Update the node to 4.2
- Disable local maintenance on the host
- Put the other host into maintenance (I had some errors here, so I had to shut down the hosted engine using 'hosted-engine', start it on the updated host, and put the other one into maintenance - I had to shut down some VMs as well)
- Finish updates
- Update cluster to version 4.2
- Update datacenter to version 4.2

After doing this, the VMs showed the 'up' icon, indicating a shutdown/startup is necessary to make them compatible with 4.2. After shutting down the VMs, I found that I could not start them again:
- Pressing Run: 'Cannot start VM' without any visible explanation in the UI or the logs (I checked vdsm, engine, libvirt, messages)
- Pressing Run Once: 'Bootable disk not found' - yet the disk seemed to be attached to the VM.

Workaround:
- Click on the VM's name (takes you to the detail page of the VM)
- Go to storage, click on the hard disk, select 'Activate'
- Go to networking; the NICs seemed unplugged, so change them to plugged (this may be unnecessary)
- Click 'Run Once', click OK; the engine complains that there is no suitable host
- Click 'Run', and IT WORKS

I did this on many VMs and it seemed to work every time. I have Ubuntu/CentOS/FreeBSD guests; all of them had the issue and this worked with all of them. This way I didn't have to remove storage or NICs (I didn't want to save the MACs and re-enter them, as there are DHCP reservations on them).

I have also noticed that after the upgrade:
- VMs had the 'random device' attached; it wasn't attached before the update
- VMs had cloud-init enabled; it wasn't before
- VMs had a random ISO attached from the domain, e.g. a Windows ISO was attached to VMs with CentOS installed

Hope this is helpful to someone.
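For the NIC part of the workaround above, the same kind of scripting should also work. Again only a rough sketch assuming ovirtsdk4, with placeholder URL, credentials and VM name:

    # Hedged sketch: plug all NICs of a VM back in - the scripted equivalent of
    # switching them to 'plugged' in the UI. Same placeholder assumptions as above.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        ca_file='ca.pem',
    )
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=rescue-me')[0]
    nics_service = vms_service.vm_service(vm.id).nics_service()

    for nic in nics_service.list():
        if not nic.plugged:
            # Only the 'plugged' flag is changed, so the existing MAC (and any DHCP
            # reservation tied to it) is left alone.
            nics_service.nic_service(nic.id).update(types.Nic(plugged=True))
            print('Plugged NIC', nic.name)

    connection.close()

Since updating only the plugged flag keeps the MAC, this should avoid the DHCP-reservation concern mentioned above.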
First, Aral, THANK YOU. I had this problem with the latest rhvm-4.2.8.5-0.1.el7ev.noarch. I had to do what Aral suggests: re-activate the storage and networking, then Run Once, then Run, and all was fine. This was after updating from 4.1 to 4.2. I have to say I was very worried for a bit, because the VMs that suffered this were pretty important.