Bug 639305 - virsh restore command fails when restoring guests with 16G of RAM
Summary: virsh restore command fails when restoring guests with 16G of RAM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: libvirt
Version: 5.5
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Jiri Denemark
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 647189
Blocks:
 
Reported: 2010-10-01 12:17 UTC by Humble Chirammal
Modified: 2018-10-27 11:50 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-09-29 14:45:50 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Qemu driver timeout patch (510 bytes, patch)
2010-10-01 12:26 UTC, Humble Chirammal
no flags

Description Humble Chirammal 2010-10-01 12:17:16 UTC
Hi,

Description of problem:


The "virsh save/restore" commands, which save a VM's memory state to a checkpoint file and restore from it, have always worked until we tried them on a large-memory VM with 16GB of RAM.

The error from the command line is:
[root@node5 resumeTest]# virsh restore checkpoint
error: Failed to restore domain from checkpoint
error: operation failed: failed to start VM
The qemu log shows an error:
cat: write error: Broken pipe


The libvirtd log shows:

11:25:27.030: error : internal error Timed out while reading monitor startup output
11:25:27.030: error : internal error unable to start guest: char device redirected to /dev/pts/3
11:25:27.031: error : operation failed: failed to start VM
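
Presumably these two log snippets show the same failure from different sides: libvirtd gives up waiting for the qemu monitor and tears qemu down, and the cat process that was still streaming the saved image into qemu's stdin is left writing into a pipe with no reader, which is the "Broken pipe" in the qemu log. The small standalone program below (just a demonstration of that pipe behaviour, not libvirt code) produces the same write error:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    pid_t pid;
    char buf[4096] = {0};

    /* Report EPIPE as an error return instead of dying from SIGPIPE. */
    signal(SIGPIPE, SIG_IGN);

    if (pipe(fds) < 0)
        return 1;

    pid = fork();
    if (pid < 0)
        return 1;
    if (pid == 0) {
        /* Child: stands in for a qemu process that was killed after the
         * monitor-startup timeout; it exits without reading anything. */
        close(fds[0]);
        close(fds[1]);
        _exit(0);
    }

    close(fds[0]);
    waitpid(pid, NULL, 0);

    /* Parent: stands in for the process streaming the saved image into
     * qemu's stdin; writing after the reader is gone fails with EPIPE. */
    if (write(fds[1], buf, sizeof(buf)) < 0)
        fprintf(stderr, "write error: %s\n", strerror(errno)); /* Broken pipe */

    close(fds[1]);
    return 0;
}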

How to reproduce this?
On a RHEL-5.5 host, start a 64-bit VM (RHEL 5.5) with 4 CPUs and 16GB of RAM. The key here is to actually use a large amount of memory; otherwise the checkpoint file will be small and the restore will still work.

The C code below was written to make sure the memory is actually allocated:

[root@vrstorm ~]# cat largeAllocate.c
#include <stdio.h>
#include <stdlib.h>

//#define SIZE 2000000000 // 2G * 8 bytes = 16GB
#define SIZE 1500000000   // 1.5G * 8 bytes = 12GB

int main(void)
{
    printf("hello, large memory\n");

    long *pts = (long *)malloc(sizeof(long) * SIZE); // 8 bytes per long
    if (pts == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }

    // Touch every element so the pages are actually allocated
    long i;
    for (i = 0; i < SIZE; i++)
        pts[i] = i;

    // Spin forever so the memory stays resident while the VM is saved
    printf("entering dead loop\n");
    while (1)
    {
    }

    free(pts); // never reached
    return 0;
}


Build this code with gcc on the VM and run it.
It will consume 12GB of memory (check with top or free).
Then, on the RHEL-5.5 host, issue the command
virsh save <vmname> checkpoint
This will take a while, and the resulting checkpoint file takes up 12GB of disk space. Now restore it with "virsh restore checkpoint" and you will see the error reported above.
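
For completeness, the same save/restore cycle can also be driven through the libvirt C API rather than virsh; a rough sketch is below (the domain name and checkpoint path are placeholders, adjust them to your setup):

#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    /* Placeholder names; adjust to the actual guest and paths. */
    const char *dom_name = "rhel55-16g";
    const char *checkpoint = "/root/resumeTest/checkpoint";

    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (conn == NULL) {
        fprintf(stderr, "failed to connect to the hypervisor\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(conn, dom_name);
    if (dom == NULL) {
        fprintf(stderr, "domain %s not found\n", dom_name);
        virConnectClose(conn);
        return 1;
    }

    /* Equivalent of "virsh save <vmname> checkpoint" */
    if (virDomainSave(dom, checkpoint) < 0)
        fprintf(stderr, "save failed\n");

    /* Equivalent of "virsh restore checkpoint"; this is the step that
     * fails for the 16GB guest. */
    if (virDomainRestore(conn, checkpoint) < 0)
        fprintf(stderr, "restore failed\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}

It should build with something like "gcc repro.c -o repro -lvirt".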

Version-Release number of selected component (if applicable):


kvm-tools-83-164.el5_5.15
kvm-qemu-img-83-164.el5_5.15
etherboot-zroms-kvm-5.4.4-13.el5
libvirt-python-0.6.3-33.el5_5.3
kvm-83-164.el5_5.23
libvirt-0.6.3-33.el5_5.3
kmod-kvm-83-164.el5_5.15



Steps to Reproduce:

As described in the problem description above.
  
Actual results:

virsh restore fails with the above-mentioned error.

Expected results:

virsh restore should not fail for a guest with 16GB of RAM.

--Humble

Comment 2 Humble Chirammal 2010-10-01 12:26:55 UTC
Created attachment 450994 [details]
Qemu driver timeout patch

This patch increases the timeout value in the qemu driver.
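
For context, the "Timed out while reading monitor startup output" error in the libvirtd log means the qemu driver gave up waiting for the monitor while qemu was still busy loading the 12GB state file; the patch simply raises that limit. A rough illustration of this kind of bounded wait is below (the constant and helper names are made up for the example; this is not the actual driver code):

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical limit for illustration only; the real qemu driver has its
 * own constant, which the attached patch raises. */
#define MONITOR_STARTUP_TIMEOUT_MS (3 * 1000)

/* Wait until the monitor fd becomes readable or the timeout expires. */
static int wait_for_monitor(int monfd)
{
    struct pollfd pfd = { .fd = monfd, .events = POLLIN };
    int ret = poll(&pfd, 1, MONITOR_STARTUP_TIMEOUT_MS);

    if (ret > 0 && (pfd.revents & POLLIN))
        return 0;  /* monitor produced output in time */

    /* Timed out (or failed): loading a 12GB state file can easily take
     * longer than a limit sized for small guests, which is what shows up
     * as "Timed out while reading monitor startup output". */
    return -1;
}

int main(void)
{
    int fds[2];

    /* A pipe nobody writes to stands in for a monitor that is still busy
     * loading the saved state. */
    if (pipe(fds) < 0)
        return 1;

    if (wait_for_monitor(fds[0]) < 0)
        fprintf(stderr, "gave up waiting for the monitor\n");

    close(fds[0]);
    close(fds[1]);
    return 0;
}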

Comment 6 Jiri Denemark 2010-11-22 13:27:25 UTC
> Even a guest with 1G that keeps dirtying its pages will create the same timeout
> for you. So the above BZ is not a blocker here.

I can't reproduce it even with a 4GB guest where almost all pages were made dirty. The result was a 4.2GB state file and I could resume from it without any issues.

Comment 7 Jiri Denemark 2010-11-22 13:41:35 UTC
> I can't reproduce it even with a 4GB guest where almost all pages were made
> dirty. The result was a 4.2GB state file and I could resume from it without
> any issues.

Ah, but I was able to reproduce it with the 0.6.3-based libvirt from RHEL-5.5. It seems the rebase fixed this issue.

Humble, could you try with the most recent libvirt packages for RHEL-5? The current version is libvirt-0.8.2-12.el5.

Comment 8 Dor Laor 2010-11-22 13:48:56 UTC
(In reply to comment #6)
> > Even a guest with 1G that keeps dirtying its pages will create the same timeout
> > for you. So the above BZ is not a blocker here.
> 
> I can't reproduce it even with a 4GB guest where almost all pages were made
> dirty. The result was a 4.2GB state file and I could resume from it without
> any issues.

My fault; savevm/loadvm are not live migration into a file, so it doesn't matter what the guest does since it is paused.

Comment 12 Michael Closson 2011-05-16 16:38:55 UTC
I was able to reproduce this bug with RHEL 5.5. I can also confirm that RHEL 5.6 fixes it.

