Bug 511135

Summary: Xen live migrations fail with "Error Internal error .."
Product: Red Hat Enterprise Linux 5 Reporter: Franco M. Bladilo <bladilo>
Component: xen    Assignee: Michal Novotny <minovotn>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: low    
Version: 5.3    CC: areis, clalance, espen, jwojcik, leiwang, lilu, llim, minovotn, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: xen-3.0.3-115.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-01-13 22:17:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 514498    
Attachments (Description / Flags):
Xend log of receiver dom0 (none)
Xend log of sender dom0 (none)
Xen config receiver dom0 (none)
Xen configuration of sender dom0 (none)
Fix restore handling in XenD (none)
Check for enough memory on restore and silently change read-only IDE disks to read-write (none)
Patch v5 (none)
xend.log (none)
xend.log of receiver (none)

Description Franco M. Bladilo 2009-07-13 20:37:17 UTC
Created attachment 351517 [details]
Xend log of receiver dom0

Description of problem:

Xen live migrations fail with a not very descriptive error:

[2009-07-13 12:18:41 xend 30281] DEBUG (XendCheckpoint:198) restore:shadow=0x0, _static_max=0x400, _static_min=0x400, 
[2009-07-13 12:18:41 xend 30281] DEBUG (balloon:143) Balloon: 1059468 KiB free; need 1048576; done.
[2009-07-13 12:18:41 xend 30281] DEBUG (XendCheckpoint:215) [xc_restore]: /usr/lib64/xen/bin/xc_restore 22 15 1 2 0 0 0
[2009-07-13 12:18:41 xend 30281] INFO (XendCheckpoint:351) xc_domain_restore start: p2m_size = 40800
[2009-07-13 12:18:41 xend 30281] INFO (XendCheckpoint:351) Reloading memory pages:   0%
[2009-07-13 12:18:51 xend 30281] INFO (XendCheckpoint:351) ERROR Internal error: Error when reading batch size
[2009-07-13 12:18:52 xend 30281] INFO (XendCheckpoint:351) Restore exit with rc=1

What's interesting about this problem is that it only happens when I'm migrating from a dom0 that has a higher amount of RAM than the receiving dom0.
I'm running a beta RHEL5u4 kernel (2.6.18-155.el5xen) that was suggested in a similar bugzilla report about failures related to memory fragmentation. That kernel did fix those issues, but this one still remains.

Version-Release number of selected component (if applicable):

kernel-xen-2.6.18-155.el5
xen-3.0.3-80.el5_3.3

How reproducible:

Always.

Steps to Reproduce:

1. Install RHEL5u3 and the latest xen on the dom0s.
2. Try to live migrate a VM from a dom0 with a higher amount of RAM than the receiving dom0. In my case, a Dell PE2970 with 8GB has problems receiving live migrations from a PE2950 with 16GB of RAM. Homogeneous hardware configurations do not exhibit this problem.
  
Actual results:

Failure to migrate.

Expected results:

The live migration completes successfully.

Additional info:

Comment 1 Franco M. Bladilo 2009-07-13 20:37:55 UTC
Created attachment 351518 [details]
Xend log of sender dom0

Comment 2 Franco M. Bladilo 2009-07-13 20:38:31 UTC
Created attachment 351519 [details]
Xen config receiver dom0

Comment 3 Franco M. Bladilo 2009-07-13 20:39:39 UTC
Created attachment 351520 [details]
Xen configuration of sender dom0

Comment 5 Michal Novotny 2010-03-22 13:00:08 UTC
Well, the error "Internal error: Error when reading batch size" (at least according to the code) means that the size of the next batch of data could not be read from the I/O descriptor at that point. The descriptor itself is valid, since it has been used before without any error; it looks like the data stream ends earlier than expected, i.e. the I/O descriptor was closed, which results in this error. According to the title of this BZ and comment #0 it happened while doing a migration; weren't you having some network issues while the migration was in progress?
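
Purely as an illustration of the failure mode described above (this is not the actual libxc C code; the stream handling below is a simplified assumption), the error corresponds to a short read on the stream that carries the batch count:

    # Illustrative Python sketch of the failure mode, not the real xc_domain_restore
    # implementation: the restore side expects a fixed-size batch count, and a short
    # read (because the sender's side of the stream was closed early) aborts the restore.
    import struct

    def read_batch_size(stream):
        data = stream.read(4)      # restore expects a small fixed-size integer here
        if len(data) != 4:         # short read: the stream ended earlier than expected
            raise IOError("Error when reading batch size")
        return struct.unpack("=i", data)[0]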

Could you please retest with the latest version of xen package and kernel-xen package, i.e. kernel-xen-2.6.18-192.el5 and xen-3.0.3-105.el5 ?

Thanks,
Michal

Comment 6 Michal Novotny 2010-03-22 13:01:25 UTC
Also, does this happen only with live migration, or with non-live migration as well? Could you please try both kinds of migration in a row and provide the test results with the latest packages?

Thanks,
Michal

Comment 7 Michal Novotny 2010-06-18 16:58:16 UTC
Well, I was thinking about when this happens. According to comment #0 it only happens when you try to migrate a VM to some other host that doesn't have enough memory to run the guest, so we should disallow migration and restore on a machine that doesn't have enough memory, right?

Michal

Comment 9 Michal Novotny 2010-06-22 13:47:03 UTC
Created attachment 425924 [details]
Fix restore handling in XenD

Well, this is the patch to disallow migration/restore for guests that are
trying to use more memory than the host machine has available. The information
is read using libxc at the start of the migration, when the restore calls are
hit, which means that if the domain is using some read-only IDE disks or the
host machine (dom0) doesn't have enough memory for the guest creation, it
fails immediately with an error message printed to xend.log. Both the required
and the available memory are printed to xend.log as well, and if the guest's
migration/restore is already in progress, maxmem_kb is used instead, since
mem_kb only shows the current memory allocation of the restoring/migrating
domain. The available memory is calculated as total_memory minus the memory of
all running guests minus the dom0-min-mem setting.
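
A minimal sketch of that calculation, assuming illustrative names (the xc
handle and the 'total_memory'/'mem_kb' keys below are assumptions, not the
exact XendCheckpoint code):

    # Rough sketch of the memory check described above, not the actual XenD patch.
    # Units: 'total_memory' is assumed to be in MiB, 'mem_kb' in KiB.
    def available_memory_kib(xc, dom0_min_mem_mib):
        total_kib = xc.physinfo()['total_memory'] * 1024             # whole host
        guests_kib = sum(d['mem_kb'] for d in xc.domain_getinfo())   # all running domains
        return total_kib - guests_kib - dom0_min_mem_mib * 1024

    def check_restore_memory(xc, required_kib, dom0_min_mem_mib):
        if required_kib > available_memory_kib(xc, dom0_min_mem_mib):
            raise RuntimeError("Host machine doesn't have enough memory "
                               "to restore the guest")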

It's been tested on 2 RHEL-5 host machines, doing a restore on the first
machine while also migrating some other guest from the second machine, i.e. a
sequence of 2 parallel migrations; it was working fine and the calculations
were correct as well. Also, the read-only IDE disk handling check has been
fixed in this patch.

Michal

Comment 11 Michal Novotny 2010-06-23 15:42:48 UTC
Created attachment 426296 [details]
Check for enough memory on restore and silently change read-only IDE disks to read-write

This is the fix to check for enough memory on domain restore, and also to silently change read-only IDE disks to read-write IDE disks in order to preserve the old, although incorrect, behaviour.
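
Conceptually, the read-only IDE handling amounts to something like the following sketch (the 'dev'/'mode' key names on the device configuration are illustrative assumptions, not the exact patch):

    # Illustrative sketch only: silently flip a read-only IDE disk to read-write,
    # preserving the old (incorrect but expected) behaviour on restore.
    import logging
    log = logging.getLogger("xend")

    def force_ide_read_write(disk_cfg):
        # disk_cfg is assumed to look like {'dev': 'hda', 'mode': 'r'}
        if disk_cfg.get('dev', '').startswith('hd') and disk_cfg.get('mode') == 'r':
            log.warning("IDE disk %s is read-only; changing it to read-write",
                        disk_cfg['dev'])
            disk_cfg['mode'] = 'w'
        return disk_cfg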

Michal

Comment 12 Michal Novotny 2010-07-29 12:55:30 UTC
Created attachment 435268 [details]
Patch v5

This version of the patch also takes dom0-min-mem set to 0 into account; in that case it uses dom0's current memory allocation instead.
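
In terms of the sketch from comment 9, the fallback described here would look roughly like this (names are again illustrative assumptions):

    # Sketch of the dom0-min-mem fallback: when dom0-min-mem is configured as 0,
    # reserve dom0's current memory allocation instead of the configured minimum.
    def dom0_reserved_kib(dom0_min_mem_mib, dom0_current_alloc_kib):
        if dom0_min_mem_mib > 0:
            return dom0_min_mem_mib * 1024
        return dom0_current_alloc_kib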

Michal

Comment 16 Linqing Lu 2010-08-31 10:40:33 UTC
Created attachment 442150 [details]
xend.log of sender

Tested on xen-3.0.3-115.el5, kernel-xen-2.6.18-214.el5,
using two x86_64 machines with different CPUs and amounts of memory.

Hit an error while migrating guests between rhel5.5_64 and rhel5.5_32,
but migrating between rhel5.5_64 and rhel5.5_64 on the same machines succeeded.

Please check the xend logs in attachments.

Comment 17 Linqing Lu 2010-08-31 10:41:01 UTC
Created attachment 442151 [details]
xend.log of receiver

Comment 18 Michal Novotny 2010-09-07 11:21:19 UTC
Well, this appears to be something different, since I can see this message in the receiver's log:

[2010-08-31 17:30:32 xend 4006] DEBUG (XendCheckpoint:101) Available memory: 5758 MiB, guest requires: 1024 MiB

This means that the memory the guest needs *is* available on the remote host. The restore itself is called there, but the check passes (no surprise when host B has 5.6 GiB of RAM and the guest requires only 1 GiB), so this has nothing to do with this bug. If migration between i386 and x86_64 hosts is a well-reproducible issue, you can file a new bug about it. For testing this one, you should also try to restore a guest that was saved on a host with more RAM, e.g. saved on an 8G host while the guest itself is using 6G. If you transfer that saved image to a 4G host and try to restore it there, it should now fail, since the host won't have enough memory to restore the guest.
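
Plugging purely hypothetical numbers for that scenario into the sketch from comment 9 (the 512 MiB dom0 reservation and the absence of other guests are assumptions):

    # Hypothetical values for the verification scenario described above.
    required_kib = 6 * 1024 * 1024                  # guest saved with 6G of memory
    available_kib = (4 * 1024 - 512) * 1024         # 4G host, ~512M reserved for dom0, no other guests
    assert required_kib > available_kib             # so the new check rejects the restore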

Michal

Comment 19 Linqing Lu 2010-09-08 06:10:13 UTC
(In reply to comment #18)
> Well, this appears to be something different, since I can see this message in
> the receiver's log:
> 
> [2010-08-31 17:30:32 xend 4006] DEBUG (XendCheckpoint:101) Available memory:
> 5758 MiB, guest requires: 1024 MiB
> 
> This means that the memory the guest needs *is* available on the remote host.
> The restore itself is called there, but the check passes (no surprise when
> host B has 5.6 GiB of RAM and the guest requires only 1 GiB), so this has
> nothing to do with this bug. If migration between i386 and x86_64 hosts is a
> well-reproducible issue, you can file a new bug about it. For testing this
> one, you should also try to restore a guest that was saved on a host with
> more RAM, e.g. saved on an 8G host while the guest itself is using 6G. If you
> transfer that saved image to a 4G host and try to restore it there, it should
> now fail, since the host won't have enough memory to restore the guest.
> 
> Michal

Hi Michal,

I've tested as described in your last comment, migrating a 6G (memory) guest from a 7G host to a 5G host. It fails as you said; the output is:
   Error: /usr/lib64/xen/bin/xc_save 20 3 0 0 0 failed
The guest still worked well after the failed migration. Then I shut it down with xm shutdown; the output is:
   Name                                      ID Mem(MiB) VCPUs State   Time(s)
   Domain-0                                   0     1977     4 r-----     72.2
   Zombie-pv-test                             3     6000     1 --ps-d     22.2

The xend.log will be attached in the next comment.

I'm not sure if this is what we are expecting for this bug, because the original description did not say that the receiving dom0 has less memory than the migrated guest, only that the sending dom0 has more memory than the receiving one.

So could you please confirm that this is expected? Thanks.

Comment 20 Linqing Lu 2010-09-08 06:11:03 UTC
Created attachment 445860 [details]
xend.log

Comment 21 Linqing Lu 2010-09-08 06:15:06 UTC
Created attachment 445862 [details]
xend.log of receiver

We can see the warning "Host machine doesn't have enough memory to restore the guest" in the receiver's xend.log.

Comment 22 Michal Novotny 2010-09-08 08:38:15 UTC
(In reply to comment #19)
> Hi Michal,
> 
> I've tested as described in your last comment, migrating a 6G (memory) guest
> from a 7G host to a 5G host. It fails as you said; the output is:
>    Error: /usr/lib64/xen/bin/xc_save 20 3 0 0 0 failed
> The guest still worked well after the failed migration. Then I shut it down
> with xm shutdown; the output is:
>    Name                                      ID Mem(MiB) VCPUs State   Time(s)
>    Domain-0                                   0     1977     4 r-----     72.2
>    Zombie-pv-test                             3     6000     1 --ps-d     22.2
> 
> The xend.log will be attached in the next comment.
> 
> I'm not sure if this is what we are expecting for this bug, because the
> original description did not say that the receiving dom0 has less memory than
> the migrated guest, only that the sending dom0 has more memory than the
> receiving one.
> 
> So could you please confirm that this is expected? Thanks.

That's expected behaviour, since some data should be coming from the event channel but it is missing after the resume operation. Kernel-xen bug 589123 has already been filed about this.

Michal

Comment 23 Michal Novotny 2010-09-08 08:48:24 UTC
(In reply to comment #21)
> Created attachment 445862 [details]
> xend.log of receiver
> 
> We can see the warning that "Host machine doesn't have enough memory to restore
> the guest" from receiver's xend.log

I've been looking at the logs; since the guest kept working after the failed migration and the error message was printed to the receiver's xend.log file, this is the designed behaviour. Of course, it applies to the restore operation as well.

As mentioned in comment 22, the shutdown is not possible and leaves a zombie domain because the event channel does not provide the proper data to finish the domain cleanup (bug 589123 in the kernel-xen space), so we can't do anything about this in the user-space stack.

Michal

Comment 24 Linqing Lu 2010-09-08 08:52:54 UTC
(In reply to comment #23)
> I've been looking at the logs; since the guest kept working after the failed
> migration and the error message was printed to the receiver's xend.log file,
> this is the designed behaviour. Of course, it applies to the restore operation
> as well.
> 
> As mentioned in comment 22, the shutdown is not possible and leaves a zombie
> domain because the event channel does not provide the proper data to finish
> the domain cleanup (bug 589123 in the kernel-xen space), so we can't do
> anything about this in the user-space stack.
> 
> Michal

Thanks Michal.

Verified on xen-3.0.3-116.el5, kernel-xen-2.6.18-216.el5

Comment 26 errata-xmlrpc 2011-01-13 22:17:45 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html