Bug 1240649 - Unable to start VM (qemu-kvm: Unknown migration flags: 0)
Summary: Unable to start VM (qemu-kvm: Unknown migration flags: 0)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: ---
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nobody
QA Contact: meital avital
URL:
Whiteboard: virt
Depends On:
Blocks:
 
Reported: 2015-07-07 12:21 UTC by Christopher Pereira
Modified: 2016-01-15 11:46 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-15 11:46:38 UTC
oVirt Team: ---
Embargoed:
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments:

Description Christopher Pereira 2015-07-07 12:21:53 UTC
After upgrading from alpha-1 to alpha-2 and restarting services, the VM won't resume.

The reason is that QEMU is executed with the -incoming flag (used for resuming a paused VM), but the passed fd seems to be invalid.

How reproducible: I didn't validate the exact steps, but once the problem is triggered, the VM won't start anymore, always throwing the same QEMU error.

Steps to Reproduce:
1. Suspend VM.
2. Set host into maintenance mode, upgrade and restart services (vdsmd, glusterd, etc).
3. Resume VM (failed because of some other problem and was retried several times)
4. Resume VM again (after fixing other problem)

Actual results:

QEMU command line is executed with "-incoming" flag and throws this error:

Domain id=5 is tainted: hook-script
2015-07-07T07:39:27.215579Z qemu-kvm: Unknown migration flags: 0
qemu: warning: error while loading state section id 2
2015-07-07T07:39:27.215681Z qemu-kvm: load of migration failed: Invalid argument

Expected results: VM should resume.

Additional info: Another VM was not affected (I probably didn't try to resume it while the other issues were occurring, so it didn't get into an invalid state).

Removing the "-incoming" flag from the QEMU command line works fine.

Comment 1 Christopher Pereira 2015-07-07 13:55:49 UTC
'qemu-kvm-ev-2.1.2-23.el7_1.3.1.x86_64' was being used for both suspending and resuming (this package was not upgraded).

Comment 2 Christopher Pereira 2015-07-07 21:25:42 UTC
BTW, I can reproduce on my (affected) system and help with debugging if necessary.

Comment 3 Christopher Pereira 2015-07-08 08:00:33 UTC
Here are the QEMU logs from when the problem first started to occur (QEMU received SIGTERM, but was "shutting down").

((null):28066): Spice-Warning **: reds.c:2824:reds_handle_ssl_accept: SSL_accept failed, error=5
((null):28066): Spice-Warning **: reds.c:2824:reds_handle_ssl_accept: SSL_accept failed, error=5
((null):28066): Spice-Warning **: reds.c:2824:reds_handle_ssl_accept: SSL_accept failed, error=5
((null):28066): Spice-Warning **: reds.c:2824:reds_handle_ssl_accept: SSL_accept failed, error=5

2015-07-07 05:09:56.817+0000: shutting down

qemu: terminating on signal 15 from pid 2084

2015-07-07 05:31:32.792+0000: starting up
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name suseso-chile -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Nehalem -m 4096 -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -uuid 8c9437e4-5514-4c85-a52b-da33d9ab6061 -smbios type=1,manufacturer=oVirt,product=oVirt Node,version=7-1.1503.el7.centos.2.8,serial=32393735-3733-5355-4532-303957525946,uuid=8c9437e4-5514-4c85-a52b-da33d9ab6061 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/suseso-chile.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2015-07-07T02:31:32,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/1963d7ab-a65b-4a70-8749-f45aceba2393/ebd94ac1-84df-47da-be87-ca49f7bffdcf/images/7fba2829-772b-43e0-9d47-0b164b2ac975/7b2102e5-5f97-4185-9c71-618187c6dee9,if=none,id=drive-virtio-disk0,format=qcow2,serial=7fba2829-772b-43e0-9d47-0b164b2ac975,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:01:54,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/8c9437e4-5514-4c85-a52b-da33d9ab6061.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/8c9437e4-5514-4c85-a52b-da33d9ab6061.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice port=5900,tls-port=5901,addr=0,disable-ticketing,x509-dir=/etc/pki/vdsm/libvirt-spice,seamless-migration=on -vnc 0:2 -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vgamem_mb=16,bus=pci.0,addr=0x2 -incoming fd:26 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
Domain id=9 is tainted: hook-script
2015-07-07T05:31:39.798212Z qemu-kvm: Unknown migration flags: 0
qemu: warning: error while loading state section id 2
2015-07-07T05:31:39.798293Z qemu-kvm: load of migration failed: Invalid argument
2015-07-07 05:47:55.386+0000: shutting down
2015-07-07 05:48:22.825+0000: starting up

Comment 4 Christopher Pereira 2015-07-08 08:02:57 UTC
Is there any way to tell oVirt not to try reading the state file during the next startup?

Comment 5 Christopher Pereira 2015-07-18 07:39:49 UTC
It's not clear what is triggering this bug.

It's also not clear how oVirt can be tweaked to start the VM without trying to resume the broken snapshot (i.e. without adding the -incoming flag to QEMU).

A discussion was started on the devel list but then turned into a new topic (QEMU backwards compatibility) that doesn't apply in this case because QEMU was not upgraded:
http://lists.ovirt.org/pipermail/devel/2015-July/010955.html

Can anyone please point me to the source code that is trying to resume the VM instead of restarting it?

I have a VM that is not starting and would like to tweak oVirt to discard the corrupted (?) snapshot.

Comment 6 Michal Skrivanek 2015-07-22 12:12:01 UTC
Christopher, it does look like an environmental issue.
I haven't seen it happen anywhere else and it does look very suspicious; I would suspect a conflict between versions of qemu, libvirt or another system package. It might be difficult to troubleshoot - would it be possible to try on a new, fresh host?

Comment 7 Christopher Pereira 2015-07-22 17:08:20 UTC
(In reply to Michal Skrivanek from comment #6)
> Christopher, it does look like an environmental issue.
> I haven't seen it happening anywhere else and it does look very suspicious,
> I would suspect conflict in versions fo qemu, libvirt or other system
> package...it might be difficult to troubleshoot - would it be possible to
> try on a new fresh host?

Hi Michal,

I confirmed QEMU and libvirt were not upgraded.

This issue has occurred for me twice, on different installations. One was a nightly build and the other was alpha-1.
I will continue reporting here if I get more info.

At the moment, I would like to be able to start those VMs from oVirt again, since I'm currently running them from virsh directly.

Can anyone please tell me where in the sources oVirt or VDSM tells libvirt to resume the snapshot?
I would like to tweak oVirt's DB to avoid the VM being started with the "-incoming" switch (i.e. to stop it from trying to resume the failed snapshot).

Thanks.

Comment 8 Michal Skrivanek 2015-07-23 09:10:25 UTC
you can simply shut down that VM; that will clear out the snapshot if there are issues resuming from it

if you can reproduce it without any upgrade anywhere - a simple Run VM -> Suspend -> Resume that doesn't work - then there is a clear problem.

Comment 9 Christopher Pereira 2015-07-23 15:06:00 UTC
(In reply to Michal Skrivanek from comment #8)
> you can simply shut down that VM, that will clear out the snapshot if there
> are issues resuming from it

I shut down and restarted, but the VM is still trying to resume the failed snapshot (it is still started with the "-incoming" switch).

Seems that something else is failing.
How can I debug?

Comment 10 Michal Skrivanek 2015-07-27 06:36:52 UTC
(In reply to Christopher Pereira from comment #9)
> (In reply to Michal Skrivanek from comment #8)
> > you can simply shut down that VM, that will clear out the snapshot if there
> > are issues resuming from it
> 
> I shutdown and restart, but the VM is still trying to resume the failed
> snapshot (it is still started with the "-incoming" switch).
> 
> Seems that something else is failing.
> How can I debug?

Hmm, weird. Can you get the relevant part of engine.log and the corresponding vdsm.log? Maybe even the stop failed.

Comment 11 Michal Skrivanek 2015-08-12 14:24:05 UTC
Ping? Otherwise I'm closing this, as we don't have any other reports and it works OK on the beta build.

Comment 12 Christopher Pereira 2015-08-17 14:51:42 UTC
The engine logs from the VM shutdown are not available anymore.
Closing.

Comment 13 Christopher Pereira 2015-09-24 19:40:39 UTC
Reopening...

Today an oVirt host rebooted due to a power failure.
When trying to restart the VMs, 1 of 3 VMs failed to start because an "-incoming" flag was passed to QEMU.
Logs are available, but it's not clear why the VM is started with the "-incoming" flag (maybe due to a failed operation before the reboot).
It would be useful to at least be able to tell oVirt not to start the VM with the "-incoming" flag.

BTW, after the VM fails to start, virsh is blocked until libvirtd is restarted (dangerous).

I can reproduce everything except the action that marks the VM so that oVirt passes the "-incoming" flag every time the VM tries to start.

Comment 14 Michal Skrivanek 2015-09-30 13:58:14 UTC
(In reply to Christopher Pereira from comment #13)
> Reopening...
> 
> Today a oVirt host rebooted due to a power failure.
> When trying to restart the VMs, 1 of 3 VM failed to start, because a
> "-incoming" flag is passed to QEMU.
> Logs are available, but it's not clear why the VM is started with the
> "-incoming" flag (maybe a failed operation before the reboot).

this shouldn't happen unless it somehow *thinks* it is supposed to resume them

> It would be useful to at least be able to tell oVirt to not start the VM
> with the "-incoming" flag.
> 
> BTW, after the VM failed to start, virsh is blocked until libvirtd is
> restarted (dangerous).

you shouldn't really touch the VMs (other than read-only) while vdsm/oVirt is managing them

> I can reproduce everything, but the action that marks the VM so that oVirt
> passes the "-incoming" flag everytime the VM tries to start.

Logs would really help a lot, both engine & vdsm from the time of the event. What do you mean you can reproduce it except for -incoming? So during reproduction they still fail to start, for some other reason?

Comment 15 Christopher Pereira 2015-09-30 18:14:28 UTC
> this shouldn't happen unless it somehow *thinks* it is supposed to resume
> them

I know. I would really like to know how to tell oVirt to not resume this VM so it can be started normally.

> you shouldn't really touch the VMs (other than read-only) while vdsm/oVirt
> is touching them

I don't like this idea either, but it's currently the only option to start the VM... and it's an in-production one.

> What do you mean you can reproduce it except -incoming?
> So during reproduction they still fail to start, for other reason?

Every time I start the VM, oVirt appends the -incoming flag to the QEMU command line, startup fails, and libvirtd hangs (no more virsh commands can be executed and new operations fail until libvirtd is restarted).

What I haven't been able to reproduce yet is getting the VM into a state where it tries to resume with the -incoming flag each time I start it from oVirt (i.e. marking the VM to be resumed).

My guess is that oVirt shouldn't try to resume after it has failed to resume once (the VM should be marked to not resume), but this is not happening (the "resume flag" is not reset), which could be related to the fact that libvirtd is also hanging after the VM tries to resume.

Can you tell me where and how to reset this "resume flag" internally?

Comment 16 Michal Skrivanek 2015-10-05 08:55:35 UTC
(In reply to Christopher Pereira from comment #15)
> > this shouldn't happen unless it somehow *thinks* it is supposed to resume
> > them
> 
> I know. I would really like to know how to tell oVirt to not resume this VM
> so it can be started normally.
> 
> > you shouldn't really touch the VMs (other than read-only) while vdsm/oVirt
> > is touching them
> 
> I don't like this idea either, but it's currently the only option to start
> the VM...and it's a in-production one.
> 
> > What do you mean you can reproduce it except -incoming?
> > So during reproduction they still fail to start, for other reason?
> 
> Every time I start the VM, it appends the -incoming flag to QEMU, startup
> fails and libvirtd hangs (no more virsh commands can be executed and new
> operations fail until libvirtd is restarted).
> 
> What I wasn't able to reproduce yet, is to get the VM in a state so that it
> tries to resume with the -incoming flag each time I start it from oVirt (to
> mark the VM to be resumed).
> 
> My guess is that oVirt shouldn't try to resume after it failed resuming once
> (the VM should be marked to not resume), but this is not happening (the
> "resume flag" is not reset), which could be related with the fact that
> libvirtd is also hanging after the VM tries to resume.
> 
> Can you tell me where how to reset this "resume flag" internally?

I suppose the state is wrong in the engine db - it thinks you've suspended the VMs even when you didn't. Do all the operations work ok on those VMs when they are running? It does sound like a specific issue with those particular VMs only - is that right?
Can you try to suspend and resume them? That may resolve the problem in the db - the resume should succeed, clearing things up.

The relevant part of engine.log might help to troubleshoot further.

Comment 17 Christopher Pereira 2015-10-05 11:41:52 UTC
(In reply to Michal Skrivanek from comment #16)

> I suppose it's wrong in the engine db, that it thinks you've suspended the
> VMs even when you didn't.

Ok. I will try to work around it at the DB level next time.

> Do all the operations work ok on those VMs when
> they are running? It does sound like a specific issue with those particular
> VMs only - is that right?
> Can you try to suspend and resume them? That may resolve the problem in the
> db - the resume should succeed clearing things up

Yes, the VM works fine when running from virsh.
I cannot suspend or resume, because this bug hangs libvirt and the VM stays in an unknown state where no actions can be performed (even virsh hangs until libvirt is restarted).

I believe this bug is caused by this sequence:

1) Suspend a VM (maybe during storage problems preventing the snapshot from being created)
2) Reset the host or services without shutting down the VM
3) Recover services and try to start the VM

I guess the Engine gets into an invalid state.
Note that the Engine requires the user to shut down suspended VMs before being able to detach a Storage Domain.

Also note that when attaching the Storage Domain and importing the VM into another Data Center, the "suspended flag" is gone and the VM can be started normally, so the flag is probably stored in the DB.

Next time I encounter this problem, I will try clearing the "suspended flag" in the DB and post my results here.

I'm not sure about the priority of this issue, because it seems to be an uncommon one, but on the other hand, the consequences of not being able to start a VM and having to restart libvirtd in order to recover control over other VMs may be critical to general users.

I will close this until I'm able to reproduce it.

Comment 18 Sven Kieske 2015-10-09 15:25:13 UTC
Hi,

you may want to restart the ovirt-engine service on your management server.

this often clears stale/wrong data in the database or in ovirt-engine's in-RAM cache.

so if you can afford to restart the engine, this might help clean up the wrong information.

HTH

Sven

PS: But this might lead to a situation where you are no longer able to reproduce the issue, and as a consequence the bug might not get fixed if that data is still needed.

Comment 19 Christopher Pereira 2015-10-11 08:04:43 UTC
(In reply to Sven Kieske from comment #18)
> Hi,
> 
> you may want to restart ovirt-engine service on your management server.

Hi, restarting the engine doesn't help. The VM still tries to resume the failed snapshot (with the "-incoming" switch).

As Michal pointed out, the resume flag is probably present at the DB level.

Comment 20 Christopher Pereira 2015-10-12 19:12:42 UTC
I analyzed this in more detail and these are the facts:

1) The engine stores in the DB (in the field [snapshots].[memory_volume]) the id of the memory volume that is created when the VM is suspended.
This is displayed in the Snapshots tab as a checked checkbox in the "Memory" column.
But for some reason, the memory snapshot didn't get created (see my previous comments for some hypotheses).

2) Then, when trying to start the VM, QEMU fails badly (qemu-kvm: Unknown migration flags: 0) and libvirtd hangs until it is restarted (the host becomes unusable).

3) I was able to recover the VM by setting memory_volume = NULL in the DB.

If you want to reproduce issue 2 (hanging libvirtd and losing control of the host), just set an invalid value in the "memory_volume" field for a current (active) snapshot and start the VM.
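
For reference, a minimal sketch of the inspection and workaround described above, run against the engine database. It assumes the default 'engine' database and DB user; only the [snapshots].[memory_volume] field is confirmed above - the other column names (snapshot_id, vm_id, snapshot_type) and the <vm_guid> placeholder are assumptions, so check them against your schema and back up the DB first:

# Sketch only: list snapshots that still reference a memory volume
psql -U engine -d engine -c "SELECT snapshot_id, vm_id, snapshot_type, memory_volume FROM snapshots WHERE memory_volume IS NOT NULL AND memory_volume <> '';"

# Sketch only: clear the stale reference for the affected VM (replace <vm_guid> with the VM's id)
psql -U engine -d engine -c "UPDATE snapshots SET memory_volume = NULL WHERE vm_id = '<vm_guid>' AND snapshot_type = 'ACTIVE';"

This is how the recovery in point 3 was done.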

Comment 21 Red Hat Bugzilla Rules Engine 2015-10-18 08:21:22 UTC
Fixed bug tickets must have version flags set prior to fixing them. Please set the correct version flags and move the bugs back to the previous status after this is corrected.

Comment 23 Dr. David Alan Gilbert 2015-12-04 19:59:59 UTC
Hi Christopher,
  Have you experienced this again recently? There's a theory that this bug is triggered by a kernel bug that was fixed in mid July, close to when you reported it.

(see https://bugzilla.redhat.com/show_bug.cgi?id=1238320 )

Note, the -incoming flag is perfectly normal for restarting from a snapshot; snapshots are just like migrations into a file.
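
As a rough illustration of that (a sketch using plain libvirt rather than the exact oVirt/VDSM code path; the domain name and file path below are placeholders):

# Suspend-to-disk writes a migration stream into a file:
virsh save example-vm /var/tmp/example-vm.sav
# Restoring reads that stream back; libvirt starts qemu-kvm with -incoming pointing at the file:
virsh restore /var/tmp/example-vm.sav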

Comment 24 Christopher Pereira 2015-12-23 17:46:39 UTC
Hi David, 

No, but I will continue reporting here once I have more details.

Comment 26 Michal Skrivanek 2016-01-15 11:46:38 UTC
In the meantime, I'm going to close it again. The most likely suspect is an initial failure at snapshot creation which corrupted the DB.
Feel free to reopen once it is reproducible.

