Description of problem:
During the live migrations triggered by putting a RHEL 7.2 hypervisor into maintenance mode, one of the Windows 2003 VMs crashed and was not restarted on any other active hypervisor. A core file was generated on the source hypervisor.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.3.0-31.el7_2.10.x86_64
rhevm-3.5.8-0.1.el6ev.noarch

How reproducible:
Occurred only once, at the customer site.

Steps to Reproduce:
1. Put the host into maintenance so that VM auto-migration starts.
2. Migration fails because the VM's 'qemu-kvm' process gets killed.
3.

Actual results:
Sometimes the VM process gets killed during migration.

Expected results:
Migration should complete successfully; at the very least, the underlying qemu-kvm process should not crash.

Additional info:
Migration failed with the following traceback in the vdsm logs.

Thread-96776::ERROR::2016-04-07 20:45:11,461::migration::260::vm.Vm::(run) vmId=`22e0e8cd-7260-4e23-8460-77ec0a89fb67`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 246, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 335, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/virt/vm.py", line 709, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 119, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1825, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
libvirtError: internal error: early end of file from monitor: possible problem:
2016-04-07T18:45:08.064325Z qemu-kvm: load of migration failed: Input/output error
(In reply to Sachin Raje from comment #0)
> libvirtError: internal error: early end of file from monitor: possible
> problem:
> 2016-04-07T18:45:08.064325Z qemu-kvm: load of migration failed: Input/output
> error

This error message isn't too descriptive, but it usually happens when there's a device mismatch (after hotplug operations).

Were any hotplug/unplug operations performed on the VM prior to migration? What were the qemu command lines on the src and dest machines?
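One way to look for such a mismatch is to compare the `-device` arguments of the two qemu command lines. The following is only a minimal Python sketch, assuming both command lines have been copied into strings (for example from the per-domain qemu logs); the helper names `device_args` and `device_diff` and the sample command lines are made up for illustration and are not taken from this bug.

```python
import shlex


def device_args(cmdline):
    """Return the set of values passed to -device on a qemu command line."""
    tokens = shlex.split(cmdline)
    return {tokens[i + 1] for i, tok in enumerate(tokens[:-1]) if tok == "-device"}


def device_diff(src_cmdline, dst_cmdline):
    """Return (only_on_src, only_on_dst) as sorted lists of -device values."""
    src, dst = device_args(src_cmdline), device_args(dst_cmdline)
    return sorted(src - dst), sorted(dst - src)


# Hypothetical command lines, for illustration only.
SRC = "qemu-kvm -name vm1 -device virtio-net-pci,id=net0 -device usb-tablet,id=input0"
DST = "qemu-kvm -name vm1 -device virtio-net-pci,id=net0"

only_src, only_dst = device_diff(SRC, DST)
print("only on src: ", only_src)
print("only on dest:", only_dst)
```

A device that shows up on only one side would be a candidate explanation for a "load of migration failed" error, although an empty diff does not rule out other causes.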
The core dump doesn't seem to belong to the crashed VM (or the qemu version that produced the dump is different from the one that's mentioned). Running gdb on the core, I don't get a proper backtrace; in fact, some call sites are shown to be in TCG (i.e. non-KVM) code, so something is definitely amiss here.

Can you check the qemu version that produced this crash?

Also, this crash was on the src host, right? So the VM was lost during migration? Did QEMU output any messages when it crashed? Logs from the src qemu and libvirt could provide clues.
So one thing I see from the provided qemu versions is that the src is on the 7_2.10 version, and the dest is on 7_2.4. Since the 7_2.10 binary doesn't produce a valid gdb backtrace, I gave 7_2.4 a try, and it does work. So it looks like the src host was in fact running 7_2.4 when the crash happened.

Backtrace is:

(gdb) bt
#0  timer_del (ts=0x2020202020202020) at qemu-timer.c:401
#1  0x00007f435e0ece41 in spice_server_vm_stop (s=<optimized out>) at reds.c:4615
#2  0x00007f4364e5c234 in qemu_spice_display_stop () at ui/spice-core.c:930
#3  vm_change_state_handler (opaque=<optimized out>, running=<optimized out>, state=<optimized out>) at ui/spice-core.c:639
#4  0x00007f4364d72082 in vm_state_notify (running=running@entry=0, state=state@entry=RUN_STATE_FINISH_MIGRATE) at vl.c:1517
#5  0x00007f4364cac8b2 in do_vm_stop (state=RUN_STATE_FINISH_MIGRATE) at /usr/src/debug/qemu-2.3.0/cpus.c:603
#6  vm_stop (state=RUN_STATE_FINISH_MIGRATE) at /usr/src/debug/qemu-2.3.0/cpus.c:1297
#7  0x00007f4364cac916 in vm_stop_force_state (state=state@entry=RUN_STATE_FINISH_MIGRATE) at /usr/src/debug/qemu-2.3.0/cpus.c:1305
#8  0x00007f4364e33832 in migration_thread (opaque=0x7f4365330fa0 <current_migration.34315>) at migration/migration.c:806
#9  0x00007f43637b6dc5 in start_thread (arg=0x7f414a3fe700) at pthread_create.c:308
#10 0x00007f435d19721d in lseek64 () at ../sysdeps/unix/syscall-template.S:81
#11 0x0000000000000000 in ?? ()

This looks like a use-after-free in spice-server. Re-assigning the bug to Marc-Andre for further investigation.

Can you let us know what the spice-server version is on the src?

Attaching the full backtrace, as the core file is too large to be downloaded in a reasonable time.
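The `ts=0x2020202020202020` argument in frame #0 is worth a second look: every byte of that value is 0x20, the ASCII space character. That pattern suggests the timer pointer was read from memory that had since been overwritten with text, which would be consistent with a use-after-free, though it does not prove it. A small Python sketch of that check follows; the helper name `pointer_as_ascii` is made up for illustration.

```python
def pointer_as_ascii(value, width=8):
    """Interpret a pointer value as little-endian bytes.

    Returns the bytes as text if every byte is printable ASCII,
    otherwise None.
    """
    raw = value.to_bytes(width, "little")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


ts = 0x2020202020202020            # argument of timer_del in frame #0
frame1_pc = 0x00007F435E0ECE41     # return address from frame #1, for contrast

print(repr(pointer_as_ascii(ts)))
print(repr(pointer_as_ascii(frame1_pc)))
```

A pointer that decodes cleanly as printable text is a common hint that the structure holding it was freed and reused for string data, whereas an ordinary code or heap address such as the one in frame #1 does not decode this way.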
Created attachment 1152147 [details] full backtrace
Closing this one, as it seems to be addressed in bug #1281455, with fixes in spice-0.12.4-17.el7.

*** This bug has been marked as a duplicate of bug 1281455 ***