Created attachment 1351548 [details]
screenshot

Description of problem:
A 'Cannot run VM without at least one bootable disk' error appears although I have one bootable disk. This happens after trying to start a VM that for some reason couldn't find a bootable disk, although the disk appears to be up and the storage domain is accessible and up.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch

How reproducible:
happens on my hosted engine

Steps to Reproduce:
1. start a VM (if this does not fail, check the console to see whether it actually found a bootable volume)
2. start the VM again
3.

Actual results:
if step 1 couldn't find a bootable volume, step 2 fails with 'Cannot run VM without at least one bootable disk'

Expected results:
VM is started

Additional info:
Which type of run is it? Run or RunOnce?
This is a regular Run.
And which run is it? There are 5 for the same VM and several others with the error you described. Some also seem to be launched through API instead of GUI.
Ah, sorry, I didn't understand your question. It happened for all of those run attempts; some of the REST API RunOnce requests may have actually succeeded, but the VM didn't find a bootable disk afterwards anyway.
Just managed to reproduce this on a completely different engine (a regular engine, not hosted) with 4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos
So after some investigation we found out that the first failed run caused the disk to go down. After activating the disk, the VM can be started but still won't find any bootable volume, although there certainly is (was) one. After powering the VM off, the disk is up as it should be, so the down status happened only the first time; maybe there was some other error, I'm not sure.
After getting my hands on a reproduction setup (thank you Petr!) this is what I found:
- on run, the engine generates the libvirt XML including the disk
- vdsm receives this in the create call
- but vdsm then for some reason removes the disk from the libvirt XML and creates a domain without it (this is the issue)
- the engine then receives an event that the status changed, updates the devices and finds out that the VM no longer has the disk which, according to the DB, is a managed disk, so it assumes it has been unplugged and sets it accordingly (this is correct)

Attaching some logs.

Francesco, any idea?
Created attachment 1358521 [details] libvirt xml sent from engine
Created attachment 1358522 [details] vm create in vdsm
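For anyone trying to confirm the same behaviour on their own setup: a rough sketch of how to inspect, from the host, the domain that vdsm actually created and check whether the disk survived the create call. This assumes the libvirt Python bindings are installed on the host; the VM name 'myvm' is a placeholder, not anything from this bug.

    # Hedged sketch: dump the running domain's XML and look for <disk device='disk'>
    # entries, to see whether the disk sent by the engine made it into the domain.
    import xml.etree.ElementTree as ET

    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')   # read-only connection to the host's libvirt
    dom = conn.lookupByName('myvm')                 # 'myvm' is a placeholder VM name
    root = ET.fromstring(dom.XMLDesc(0))            # full domain XML as libvirt sees it

    disks = root.findall("./devices/disk[@device='disk']")
    if not disks:
        print('No <disk> devices in the running domain - matches the reported behaviour')
    for disk in disks:
        source = disk.find('source')
        boot = disk.find('boot')                    # per-device boot order, if set
        print(source.attrib if source is not None else '<no source>',
              'boot order:', boot.get('order') if boot is not None else 'unset')

    conn.close()

If the list comes back empty while the engine-side XML (first attachment) contains the disk, that matches the analysis above.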
Hi Petr, the vdsm build is almost two months old - ancient in 4.2 terms - and there have been numerous changes to XML handling since then, especially on the storage side. Please retest with a recent 4.2 build, with matching engine and vdsm versions.
As expected with ovirt-engine-4.2.0-0.0.master.20171122115834.git3549ed1.el7.centos.noarch and vdsm-4.20.7-1.el7.centos.x86_64 this works correctly.
there were actual fixes, just somewhere along the way....:)
I am also affected by this catastrophic failure.

ovirt: 4.2.1.7-1.el7.centos

All the VMs that were running before the last reboot of the hypervisor are now affected by the issue and will not boot anymore. Other VMs that were powered off are still ok.

The only workaround I found is to perform the following steps (VM by VM):
- Edit VM
- Remove all disks using the "-" (minus) symbol (do NOT remove permanently!)
- Remove the network adapter using the "-" symbol
- Save the VM without disks
- Edit the VM
- Add the disks using the "+" (plus) symbol and attach the disks previously used. Make sure the root disk has the "boot" flag (OS) set.
- Add the network adapter and connect it to the correct network
- Save the VM
- Run the VM

If there are problems with the network: shut down, Edit VM, remove the network adapter, save the VM, Edit the VM, add the network adapter again and save.

Maybe there is a better, more automated way to rescue the VMs? Any ideas?
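One possible way to script at least the disk part instead of going VM by VM in the UI. This is only a rough sketch assuming ovirt-engine-sdk-python (ovirtsdk4); the engine URL, credentials and the VM name 'rescue-me' are placeholders, so adjust before trying it:

    # Hedged sketch: re-activate all disk attachments of one VM through the REST API.
    # URL, credentials and the VM name are placeholders, not values from this bug.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        ca_file='ca.pem',
    )

    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=rescue-me')[0]
    vm_service = vms_service.vm_service(vm.id)

    attachments_service = vm_service.disk_attachments_service()
    for attachment in attachments_service.list():
        if not attachment.active:
            # Equivalent of selecting the disk and clicking 'Activate' in the UI
            attachments_service.attachment_service(attachment.id).update(
                types.DiskAttachment(active=True)
            )
            print('Activated disk attachment', attachment.id)

    connection.close()

The same loop could be wrapped over vms_service.list() to cover every affected VM, but test it on one VM first.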
I have some comments on this. I updated my 2-host cluster, both hosts able to run the hosted engine, from 4.1.1.9 (unsure) to 4.2.1.7 yesterday. I will write out some details, as I may have made some mistakes which caused the VMs not to start, but I have found a workaround.

The path I followed:
- Go into global maintenance
- Update the hosted engine to 4.2
- Disable global maintenance (may have been a mistake)
- Put one host into local maintenance
- Update the node to 4.2
- Disable local maintenance on the host
- Put the other host into maintenance (I had some errors here, so I had to shut down the hosted engine using 'hosted-engine', start it on the updated host, and put the other one into maintenance - I had to shut down some VMs as well)
- Finish updates
- Update cluster to version 4.2
- Update datacenter to version 4.2

After doing this, the VMs showed the 'up' icon, indicating a shutdown/startup is necessary to make them compatible with 4.2. After shutting down the VMs, I found that I could not start them again:
- Pressing Run: 'Cannot start VM' without any visible explanation in the UI or the logs (I checked vdsm, engine, libvirt, messages)
- Pressing Run Once: 'Bootable disk not found' - yet the disk seemed to be attached to the VM.

Workaround:
- Click on the VM's name (takes you to the detail page of the VM)
- Go to storage, click on the hard disk, select 'Activate'
- Go to networking; the NICs seemed unplugged, so change them to plugged (this may be unnecessary)
- Click 'Run Once', click OK; the engine complains that there is no suitable host
- Click 'Run', and IT WORKS

I did this on many VMs and it seemed to work every time. I have Ubuntu/CentOS/FreeBSD guests; all of them had the issue and this worked with all of them. This way I didn't have to remove storage or NICs (I didn't want to save the MACs and re-enter them, as there are DHCP reservations on them).

I have also noticed that after the upgrade:
- VMs had the 'random device' attached; it wasn't attached before the update
- VMs had cloud-init enabled; it wasn't before
- VMs had a random ISO attached from the domain, e.g. a Windows ISO was attached to VMs with CentOS installed

Hope this is helpful to someone.
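For the NIC part of the workaround above, the same kind of scripting should also work. Again only a rough sketch assuming ovirtsdk4, with placeholder URL, credentials and VM name:

    # Hedged sketch: plug all NICs of a VM back in - the scripted equivalent of
    # switching them to 'plugged' in the UI. Same placeholder assumptions as above.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='secret',
        ca_file='ca.pem',
    )
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=rescue-me')[0]
    nics_service = vms_service.vm_service(vm.id).nics_service()

    for nic in nics_service.list():
        if not nic.plugged:
            # Only the 'plugged' flag is changed, so the existing MAC (and any DHCP
            # reservation tied to it) is left alone.
            nics_service.nic_service(nic.id).update(types.Nic(plugged=True))
            print('Plugged NIC', nic.name)

    connection.close()

Since updating only the plugged flag keeps the MAC, this should avoid the DHCP-reservation concern mentioned above.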
First, Aral, THANK YOU. I had this problem with the latest rhvm-4.2.8.5-0.1.el7ev.noarch. I had to do what Aral suggests: re-activate the storage and networking, then Run Once, then Run, and all was fine. This was after updating from 4.1 to 4.2. I have to say I was very worried for a bit, because the VMs that suffered this were pretty important.