Created attachment 409652 [details]
kernel panic in the VM when restoring VM

Description of problem:
We are evaluating RHEL 5.5 KVM for saving a VM on one host and then restoring it on another host. This feature is very important for us. We have hit a problem where, if the state file is over 2 GB, the restore sometimes fails; the larger the state file, the more likely the failure. We are not sure whether this is a KVM problem or just an environment problem, and we need some assistance.

Version-Release number of selected component (if applicable):
RHEL 5.5 official release

How reproducible:
Very easy to reproduce. I was able to do it on two different machines, Intel and AMD.

Steps to Reproduce:
For example, with a 4 GB VM running a 1 GB memory-intensive application, save and restore seem to be reliable:

[root@hb06b11 XM]# /usr/bin/time virsh save 1 STATE
Domain 1 saved to STATE
0.00user 0.00system 0:44.22elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+539minor)pagefaults 0swaps
[root@hb06b11 XM]# ls -lh
total 1.3G
-rw------- 1 root root 1.3G Apr 23 14:39 STATE
[root@hb06b11 XM]# /usr/bin/time virsh restore STATE
Domain restored from STATE
0.00user 0.00system 0:05.56elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+536minor)pagefaults 0swaps

But with a 3.5 GB application in the same 4 GB VM, the restore fails:

[root@hb06b11 XM]# /usr/bin/time virsh save 3 STATE
Domain 3 saved to STATE
0.00user 0.00system 1:57.00elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+540minor)pagefaults 0swaps
[root@hb06b11 XM]# ls -lh
total 3.6G
-rw------- 1 root root 3.6G Apr 23 14:49 STATE
[root@hb06b11 XM]# /usr/bin/time virsh restore STATE
error: Failed to restore domain from STATE
error: operation failed: failed to start VM
Command exited with non-zero status 1
0.00user 0.00system 0:10.12elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+545minor)pagefaults 0swaps

The failure pattern seems to be:
- The size of the state file on disk seems to be the key variable. If the size is over 2 GB, the restore fails.

We have tried migration (cold and live) and both seem to be reliable. It is only the save and restore actions that have trouble.

After some memory usage in a VM, the virsh restore action causes a kernel panic in the VM and produces a kernel oops; see the attachment. Not sure if this is related.

Additional info:
Here is our test environment:
HP ProLiant BL465c G5:
CPU: Quad-core AMD Opteron(tm) Processor 2382 (8 cores)
RAM: DDR2 800 MHz, 16 GB (2 GB * 8)
Network: 1 GbE
RHEL 5.5 official release

Here is the test program that consumes 3.5 GB of memory:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>

#define SIZE 3500000000L

int main()
{
    long ix, i;
    char *A = malloc(SIZE);
    struct timeval last, curr;
    double d;

    assert(A);
    gettimeofday(&last, NULL);
    for (i = 0;; i++) {
        for (ix = 0; ix < SIZE; ix++) {
            A[ix] = (char)random();
        }
        gettimeofday(&curr, NULL);
        d = (curr.tv_sec - last.tv_sec);
        d += (((double)curr.tv_usec - (double)last.tv_usec) / 1000000.);
        printf("%ld %lf\n", i, d);
        last = curr;
    }
    return 0;
}
Can you check by running qemu directly? I'd like to see the specific error message. Thanks
Can you provide the qemu command line? Thanks
The error message is "migration failed". Here is what I did to reproduce it:

- Start a VM with the following qemu command line:

/usr/libexec/qemu-kvm \
   -S \
   -M rhel5.4.0 \
   -m 3000 \
   -smp 1 \
   -name vm0 \
   -uuid fc8b3336-5b4d-024c-fa8f-f60ee9fd235f \
   -pidfile /var/run/libvirt/qemu//vm0.pid \
   -boot c \
   -drive file=/dev/MyVolGroup/vm0,if=virtio,index=0,boot=on,cache=none \
   -serial pty \
   -parallel none \
   -usb \
   -k en-us

- When it boots, start the memjob program.

- Save the VM to a file with the monitor commands:
  - stop
  - migrate "exec:cat > STATEFILE"

- Restore the VM with the qemu command line:

/usr/libexec/qemu-kvm \
   -S \
   -M rhel5.4.0 \
   -m 3000 \
   -smp 1 \
   -name vm0 \
   -uuid fc8b3336-5b4d-024c-fa8f-f60ee9fd235f \
   -pidfile /var/run/libvirt/qemu//vm0.pid \
   -boot c \
   -drive file=/dev/MyVolGroup/vm0,if=virtio,index=0,boot=on,cache=none \
   -serial pty \
   -parallel none \
   -usb \
   -k en-us \
   -incoming "exec:cat < STATEFILE"

- The VM starts and memjob continues to run.

- Try again to save the VM:
  - stop
  - migrate "exec: cat > STATEFILE2"

- The qemu monitor prints the error message "migration failed".
I have some questions that might help figure out what is happening:
- If you run the same load but use live migration, does it fail?
- Does the host swap (check vmstat 1)?
- What happens if you do not use the -M flag?
- Are you using the latest RHEL 5.5 (or even the RHEL 5.6 candidate code)?
Hey Dor,
- Live migration works OK.
- There is no swapping during the save or restore (swap si/so are all zero).
- The problem still happens if I omit the -M flag.
- The problem also happens with RHEL 6 beta. (Is that the same thing as the RHEL 5.6 candidate code?) I haven't applied any updates to the base RHEL 5.5 installation.
I installed the debuginfo rpms for KVM, attached GDB, and set a breakpoint in do_migrate(). The code fails in popen(); errno is 12 (ENOMEM). This is surprising to me. Here are the details:

#0  exec_start_outgoing_migration (command=0xeacfd5 "cat > STATE2", bandwidth_limit=33554432, async=0) at migration-exec.c:65
#1  0x000000000046b4d3 in do_migrate (detach=0, uri=0xeacfd0 "exec:cat > STATE2") at migration.c:66
#2  0x00000000004107eb in monitor_handle_command (opaque=<value optimized out>, cmdline=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/monitor.c:2705
#3  monitor_handle_command1 (opaque=<value optimized out>, cmdline=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/monitor.c:3076
#4  0x0000000000464212 in readline_handle_byte (ch=<value optimized out>) at readline.c:398
#5  0x000000000040ecff in term_read (opaque=<value optimized out>, buf=0x2000000 <Address 0x2000000 out of bounds>, size=1) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/monitor.c:3069
#6  0x0000000000465841 in kbd_send_chars (opaque=<value optimized out>) at console.c:1098
#7  0x00000000004659c3 in kbd_put_keysym (keysym=<value optimized out>) at console.c:1151
#8  0x000000000047dac1 in sdl_refresh (ds=0xb4bce0) at sdl.c:439
#9  0x00000000004081e4 in gui_update (opaque=0xeacfd5) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:3684
#10 0x00000000004071bc in qemu_run_timers (ptimer_head=0xb38e00, current_time=1188340664) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:1271
#11 0x0000000000409577 in main_loop_wait (timeout=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4021
#12 0x00000000004ff1ea in kvm_main_loop () at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/qemu-kvm.c:596
#13 0x000000000040e425 in main_loop (argc=15, argv=0x7fffffffe8b8, envp=<value optimized out>) at /usr/src/debug/kvm-83-maint-snapshot-20090205/qemu/vl.c:4040

    f = popen(command, "w");
    if (f == NULL) {
        dprintf("Unable to popen exec target\n");
        goto err_after_alloc;
    }

(gdb) p f
$12 = (FILE *) 0x0
(gdb) p errno
$13 = 12
Try with:

  -incoming exec:"cat<file"

i.e. the opening quote should come after "exec:", not before it.
The result is the same. I tried without the memory-intensive job and I also got the same error.

I ran qemu under strace. Here is the instance that worked:

read(20, 0x7fff7fa583a0, 128) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {1117600, 410184995}) = 0
select(15, [14], NULL, NULL, {0, 0}) = 1 (in [14], left {0, 0})
ioctl(14, FIONREAD, [32]) = 0
read(14, "\2$\342\0\241n\201\23Z\1\0\0\3\0\340\5\r\0\340\5\201\3\30\1\363\1u\0\20\0\1\0", 32) = 32
select(15, [14], NULL, NULL, {0, 0}) = 0 (Timeout)
write(15, "H\2\206\0\r\0\340\5\16\0\340\5\10\0\20\0(\0010\0\0\30\1\0\0\0\0\377\0\0\0\377"..., 536) = 536
write(15, "H\2\206\0\r\0\340\5\16\0\340\5\10\0\20\0\0\0@\0\0\30\1\0\252\252\252\377\252\252\252\377"..., 536) = 536
pipe([22, 23]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fb4e7f020) = 19783
close(22) = 0
fcntl(23, F_SETFD, 0x800 /* FD_??? */) = 0
clock_gettime(CLOCK_MONOTONIC, {1117600, 540088995}) = 0
ioctl(8, 0x4020ae46, 0x7fff7fa57a70) = 0
ioctl(8, 0x4020ae46, 0x7fff7fa57a70) = 0
ioctl(8, 0x4020ae46, 0x7fff7fa57a70) = 0
ioctl(8, 0x4020ae46, 0x7fff7fa57a70) = 0
ioctl(8, 0x4020ae46, 0x7fff7fa57a70) = 0
ioctl(8, 0x4020ae46, 0x7fff7fa57a70) = 0

And here is the save after the restore; the call to clone fails:

20128 clock_gettime(CLOCK_MONOTONIC, {1117850, 528054995}) = 0
20128 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0
20128 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0
20128 clock_gettime(CLOCK_MONOTONIC, {1117850, 528156995}) = 0
20128 select(15, [14], NULL, NULL, {0, 0}) = 1 (in [14], left {0, 0})
20128 ioctl(14, FIONREAD, [32]) = 0
20128 read(14, "\2$\244\0\237?\205\23Z\1\0\0\3\0\340\5\r\0\340\5\256\3\277\1\261\1\342\0\20\0\1\0", 32) = 32
20128 select(15, [14], NULL, NULL, {0, 0}) = 0 (Timeout)
20128 write(15, "H\2\206\0\r\0\340\5\16\0\340\5\10\0\20\0000\0010\0\0\30\1\0\0\0\0\377\0\0\0\377"..., 536) = 536
20128 write(15, "H\2\206\0\r\0\340\5\16\0\340\5\10\0\20\0\0\0@\0\0\30\1\0\252\252\252\377\252\252\252\377"..., 536) = 536
20128 pipe([22, 23]) = 0
20128 clone( <unfinished ...>
20159 <... rt_sigtimedwait resumed> {si_signo=SIGALRM, si_code=SI_TIMER, si_pid=0, si_uid=0, si_value={int=0, ptr=0}}, 0, 8) = 14
20159 write(21, "\16\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
20159 rt_sigtimedwait([ALRM IO],  <unfinished ...>
20128 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2acf7b348020) = -1 ENOMEM (Cannot allocate memory)
20128 close(22) = 0
20128 close(23) = 0
20128 write(15, "H\2\206\0\r\0\340\5\16\0\340\5\10\0\20\0\0\0@\0\0\30\1\0\0\0\0\377\0\0\0\377"..., 536) = 536
20128 write(15, "H\2\206\0\r\0\340\5\16\0\340\5\10\0\20\0\0\0@\0\0\30\1\0\0\0\0\377\0\0\0\377"..., 536) = 536
20128 write(15, "H\2\206\0\r\0\340\5\16\0\340\5\10\0\20\0\10\0@\0\0\30\1\0\0\0\0\377\0\0\0\377"..., 536) = 536
20128 write(15, "H\2\206\0\r\0\340\5\16\0\340\5\10\0\20\0\20\0@\0\0\30\1\0\0\0\0\377\0\0\0\377"..., 536) = 536

The hypervisor has 4 GB and the VM is 3 GB. /var/log/messages doesn't show anything interesting.
Jun 29 13:08:43 delint06 gconfd (root-18385): Resolved address "xml:readonly:/etc/gconf/gconf.xml.defaults" to a read-only configuration source at position 2
Jun 29 13:09:46 delint06 kernel: kvm: 18440: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
Jun 29 13:09:46 delint06 kernel: kvm: 18440: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffe18f0a
Jun 29 13:09:46 delint06 kernel: kvm: 18440: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079
Jun 29 13:15:05 delint06 kernel: device tap0 entered promiscuous mode
Jun 29 13:15:07 delint06 kernel: br0: topology change detected, propagating
Jun 29 13:15:07 delint06 kernel: br0: port 2(tap0) entering forwarding state
Jun 29 13:15:56 delint06 kernel: kvm: 19182: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
Jun 29 13:15:56 delint06 kernel: kvm: 19182: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffe18f0a
Jun 29 13:15:56 delint06 kernel: kvm: 19182: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x530079

Is there any way to determine why clone() failed?
clone() returned -ENOMEM. This means there is not enough memory on the host. Please use a larger host or increase the swap space. Sending /proc/meminfo would help too.
I can also reproduce this with a 3 GB VM on a 4 GB hypervisor, if the VM is running a large memory job. It just seems odd that the system can't fork() when there is still over 300 MB of available RAM.

[root@delint06 ~]# free
             total       used       free     shared    buffers     cached
Mem:       4043172    3729836     313336          0      30344     285684
-/+ buffers/cache:    3413808     629364
Swap:      2096472        136    2096336

[root@delint06 ~]# ps -eF | grep qemu
root 29911 12071 38 847246 3169380 0 10:21 pts/3 00:02:44 /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 3072 -boot c -drive file=/dev/MyVolGroup/vm0,if=virtio,index=0,boot=on,cache=none -net nic -net tap,ifname=tap0,script=no,downscript=no
root 30802 14924 0 15295 732 2 10:28 pts/0 00:00:00 grep qemu

[root@delint06 ~]# cat /proc/meminfo
MemTotal: 4043172 kB
MemFree: 314536 kB
Buffers: 30636 kB
Cached: 285764 kB
SwapCached: 0 kB
Active: 3254860 kB
Inactive: 292572 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 4043172 kB
LowFree: 314536 kB
SwapTotal: 2096472 kB
SwapFree: 2096336 kB
Dirty: 12 kB
Writeback: 0 kB
AnonPages: 3230984 kB
Mapped: 21784 kB
Slab: 109264 kB
PageTables: 12220 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 4118056 kB
Committed_AS: 3602380 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 270924 kB
VmallocChunk: 34359467403 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
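A rough reading of the numbers above (this is only a sketch of the default Linux overcommit heuristic at fork() time, not the exact kernel calculation):

  qemu-kvm anonymous memory (mostly guest RAM):   AnonPages ~= 3.1 GB
  roughly what the heuristic counts as available: MemFree + Cached + SwapFree
                                                  ~= 314536 + 285764 + 2096336 kB
                                                  ~= 2.6 GB

clone()/fork() has to be able to commit a copy of the parent's writable address space (~3.1 GB), which exceeds the ~2.6 GB estimate, so the kernel refuses it with ENOMEM even though ~300 MB of RAM is free. That is also why adding swap, or avoiding the duplication altogether (see the MADV_DONTFORK discussion further down), helps.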
Citrix Xen (probably RH xen also) can save/restore the same configuration. 4G hypervisor, 3G VM, large memory job.
Dor,

Does popen() use fork() or vfork()?

Mike C
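For what it's worth, glibc's popen() is implemented with fork() + exec; the clone() flags in the strace above (CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD) are the ones glibc passes for fork(), not vfork(). So the whole qemu-kvm address space has to be duplicable at that moment. A small standalone sketch (not qemu code; BIG and the command string are just illustrative) that should show the same failure mode on a similarly sized host:

/* Sketch: fault in a large anonymous allocation, then call popen() the
 * way migration-exec.c does, to see whether the fork() inside popen()
 * fails with ENOMEM on this host. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BIG 3000000000L   /* pick something close to the guest RAM size */

int main(void)
{
    char *a = malloc(BIG);
    long i;
    FILE *f;

    if (a == NULL) {
        perror("malloc");
        return 1;
    }
    for (i = 0; i < BIG; i += 4096)   /* touch every page so it is really committed */
        a[i] = 1;

    errno = 0;
    f = popen("cat > /dev/null", "w");
    if (f == NULL) {
        printf("popen failed: %s (errno=%d)\n", strerror(errno), errno);
        return 1;
    }
    printf("popen succeeded\n");
    pclose(f);
    free(a);
    return 0;
}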
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.
> This request was erroneously denied for the current release of
> Red Hat Enterprise Linux. The error has been fixed and this
> request has been re-proposed for the current release.

Does this mean the bug has been fixed in the latest 6.0 release?

Chong
No, all comment 15 does is cancel out a mistake made in comment 14. And this BZ is for RHEL 5 not RHEL 6.
*** Bug 647189 has been marked as a duplicate of this bug. ***
> No, all comment 15 does is cancel out a mistake made in comment 14. And this
> BZ is for RHEL 5 not RHEL 6.

[Chong] So, are you saying that the latest RHEL 6 does not have this problem?
(In reply to comment #20)
> > No, all comment 15 does is cancel out a mistake made in comment 14. And this
> > BZ is for RHEL 5 not RHEL 6.
>
> [Chong] So, are you saying if the latest RHEL 6 does not have this problem?

No, it may or may not work on RHEL 6. The comment above just corrects an automatic bot that changed the bugzilla state.
Tested in kvm-83-224.el5 with the following steps; cannot reproduce it.

Steps:
1. Start a VM with 3.5 GB of memory on a 4 GB host:
# /usr/libexec/qemu-kvm -rtc-td-hack -no-hpet -M rhel5.6.0 -m 3500 -smp 1 -name rhel56-64 -uuid `uuidgen` -monitor stdio -drive file=rhel56-64-virtio.qcow2,if=virtio,boot=on,format=qcow2,cache=none -net nic,macaddr=20:20:20:14:56:18,model=virtio,vlan=0 -net tap,script=/etc/qemu-ifup,vlan=0 -usb -vnc :1

2. Run the 3.5 GB memory-consuming program provided in the Description, then in the guest:
# free -lm
             total       used       free     shared    buffers     cached
Mem:          3359       3338         20          0          7        181
Low:          3359       3338         20
High:            0          0          0
-/+ buffers/cache:       3150        209
Swap:         4959        293       4666

3. Save the VM to a file with the monitor commands:
  - stop
  - migrate "exec:cat > STATEFILE"

4. Shut down the guest.

5. Restore the VM with:
# /usr/libexec/qemu-kvm -rtc-td-hack -no-hpet -M rhel5.6.0 -m 3500 -smp 1 -name rhel56-64 -uuid `uuidgen` -monitor stdio -drive file=rhel56-64-virtio.qcow2,if=virtio,boot=on,format=qcow2,cache=none -net nic,macaddr=20:20:20:14:56:18,model=virtio,vlan=0 -net tap,script=/etc/qemu-ifup,vlan=0 -usb -vnc :1 -incoming "exec:cat < statefile"

Actual result:
Did save/restore 2 times; no failure found.

michen --> Chong Chen: can you still hit this problem? Can you give any suggestion on how to reproduce it? Thanks.
Can you try save/restore through the virsh interface? It may be a libvirt problem rather than a qemu/KVM one.

Chong
Can reproduce with:
kvm-83-164.el5
kvm-qemu-img-83-164.el5
kmod-kvm-83-164.el5

Where can I get kvm-224?
(In reply to comment #24)
> Can reproduce with:
> kvm-83-164.el5
> kvm-qemu-img-83-164.el5
> kmod-kvm-83-164.el5
>
> Where can I get kvm-224?

Upgrade to RHEL 5.6.
I am able to reproduce on RHEL 5.6.

[root@hb06b07 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
[root@hb06b07 ~]# rpm -qa | grep kvm
etherboot-zroms-kvm-5.4.4-13.el5
kvm-83-224.el5
kmod-kvm-83-224.el5

Hypervisor:
[root@hb06b07 ~]# free
             total       used       free     shared    buffers     cached
Mem:       4045524     458848    3586676          0      30020     221716
-/+ buffers/cache:     207112    3838412
Swap:      2097144         16    2097128

VM:
[root@hb06b07 ~]# cat startVM.sh
/usr/libexec/qemu-kvm \
   -S \
   -M rhel5.4.0 \
   -m 3000 \
   -smp 1 \
   -boot c \
   -drive file=/dev/vmvg/rhel55tmpl4kvm,if=virtio,index=0,boot=on,cache=none \
   -net nic,macaddr=DE:AD:BE:EF:26:8F,model=virtio -net tap,script=/root/qemu-ifup

The memory-intensive program is the same, with SIZE set as:
#define SIZE 3000000000L

After starting memjob, wait for it to print out the result of the first iteration, so the hypervisor allocates all the memory. It took about 5 minutes in my environment.

Memory on the hypervisor before trying to save:

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM   TIME+  COMMAND
31660 root  25   0 4752m 3.4g 4388 R 100.2 87.6 13:10.10 qemu-kvm

[root@hb06b07 ~]# cat /proc/meminfo
MemTotal: 4045524 kB
MemFree: 30828 kB
Buffers: 31764 kB
Cached: 226164 kB
SwapCached: 16 kB
Active: 3714808 kB
Inactive: 155596 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 4045524 kB
LowFree: 30828 kB
SwapTotal: 2097144 kB
SwapFree: 2097128 kB
Dirty: 28 kB
Writeback: 0 kB
AnonPages: 3612456 kB
Mapped: 26092 kB
Slab: 92728 kB
PageTables: 12816 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 4119904 kB
Committed_AS: 5028648 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 264380 kB
VmallocChunk: 34359473783 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB

Memory in the VM:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 2990 root  25   0 2864m 2.8g  380 R  8.2 97.1  8:44.26 memjob

[root@localhost ~]# cat /proc/meminfo
MemTotal: 3016480 kB
MemFree: 13096 kB
Buffers: 724 kB
Cached: 10284 kB
SwapCached: 1924 kB
Active: 2520996 kB
Inactive: 437892 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 3016480 kB
LowFree: 13096 kB
SwapTotal: 2096472 kB
SwapFree: 2084076 kB
Dirty: 156 kB
Writeback: 0 kB
AnonPages: 2946676 kB
Mapped: 8912 kB
Slab: 15136 kB
PageTables: 9488 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 3604712 kB
Committed_AS: 3141632 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1868 kB
VmallocChunk: 34359736491 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB

strace of the kvm process when I try to save:

pipe([11, 12]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2af3d7874b90) = -1 ENOMEM (Cannot allocate memory)
close(11) = 0
close(12) = 0
Is this the lack of MADV_DONTFORK in qemu-kvm? I added it in RHEL 6 but not yet in RHEL 5, so it shouldn't happen on RHEL 6. If we don't want to mess with exec.c and we ignore qemu tcg, it's enough to remove "&& !kvm_has_sync_mmu()" in qemu-kvm.c:kvm_setup_guest_memory to fix it.
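For readers following along, a hedged sketch of what that fix amounts to (illustrative names; this is not the actual RHEL 5 qemu-kvm source): with a synchronized in-kernel MMU, guest RAM can be marked MADV_DONTFORK so it is not duplicated into children such as the one popen() creates for "exec:" migration, which keeps the address-space commitment of fork() small.

/* Illustrative sketch only -- function name and error handling are
 * assumptions, not the actual qemu-kvm code. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

static void setup_guest_memory(void *ram, size_t size)
{
    /* Exclude guest RAM from fork(): the children spawned for "exec:"
     * migration never touch it, and without this the kernel must be
     * able to commit another copy of the whole guest RAM at clone()
     * time. */
    if (madvise(ram, size, MADV_DONTFORK) != 0)
        perror("madvise(MADV_DONTFORK)");   /* fall back: RAM stays inheritable */
}

With guest RAM excluded from fork(), the child only needs to commit the comparatively small remainder of qemu's address space, so the save-side popen() should no longer hit ENOMEM.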
(In reply to comment #28)
> Is this lack of MADV_DONTFORK in qemu-kvm? I added it in RHEL6, but not yet on
> RHEL5. So it shouldn't happen on RHEL6. If we don't want to mess with exec.c
> and we ignore qemu tcg, it's enough to remove "&& !kvm_has_sync_mmu()" in
> qemu-kvm.c:kvm_setup_guest_memory to fix.

Go ahead and try it, looks like you nailed it.
I tried the same test with RHEL 6 as host _and_ guest OS. Surprisingly, the RES of the memjob process in the guest was much smaller than anticipated.

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 1524 root  20   0 2864m 781m  220 D 17.3 78.3  1:09.88 memjob

Save also fails.

[root@hb06b07 ~]# virsh save 3 STATEFILE
error: Failed to save domain 3 to STATEFILE
error: operation failed: Migration unexpectedly failed

I tried starting the VM with qemu-kvm, but the SDL window doesn't come up, and the network didn't work as it did in RHEL 5 (I started the VM the same way as shown above, but there was no eth0 in the VM). Any ideas? I'll keep working on it.
This bug could be related: https://bugzilla.redhat.com/show_bug.cgi?id=639305

The later comments there focus on the save side, but the restore also fails (see the first comment).
A 16 GB hypervisor with a 10 GB VM also fails. RHEL 5.5: a 4 GB VM on the same host saves OK.

top - 12:45:33 up 9 days, 18:03,  1 user,  load average: 1.17, 0.97, 0.76
Tasks: 130 total,   1 running, 129 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.1%us, 35.3%sy,  0.0%ni, 63.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16508580k total, 11374800k used,  5133780k free,    90784k buffers
Swap:  2048276k total,   657500k used,  1390776k free,   929596k cached

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM   TIME+  COMMAND
20545 root  15   0 10.1g 8.9g 2516 S 138.4 56.5  6:26.29 qemu-kvm
I increased the swap on the machine to 32GB and then was able to save a 15GB VM. Also, after applying the change in #639305 I was able to restore the 15GB VM.
From my perspective this bug can be closed.
1) The SAVE problem was fixed by adding more swap.
2) The RESTORE problem was fixed in RHEL 5.6.

We found three more problems with RHEL 5.6; I'll log them as separate issues.