Bug 511135
Summary: | Xen live migrations fail with "Error Internal error .." | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Franco M. Bladilo <bladilo>
Component: | xen | Assignee: | Michal Novotny <minovotn>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | high | Docs Contact: |
Priority: | low | Version: | 5.3
CC: | areis, clalance, espen, jwojcik, leiwang, lilu, llim, minovotn, xen-maint | |
Target Milestone: | rc | Target Release: | ---
Hardware: | x86_64 | OS: | Linux
Whiteboard: | | Fixed In Version: | xen-3.0.3-115.el5
Doc Type: | Bug Fix | Doc Text: |
Story Points: | --- | Clone Of: |
Environment: | | Last Closed: | 2011-01-13 22:17:45 UTC
Type: | --- | Regression: | ---
Mount Type: | --- | Documentation: | ---
CRM: | | Verified Versions: |
Category: | --- | oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | ---
Target Upstream Version: | | Embargoed: |
Bug Depends On: | | Bug Blocks: | 514498
Attachments: | | |
Created attachment 351518 [details]
Xend log of sender dom0
Created attachment 351519 [details]
Xen config receiver dom0
Created attachment 351520 [details]
Xen configuration of sender dom0
Well, the error "Internal error: Error when reading batch size" means (at least according to the code) that the batch size cannot be read from the I/O descriptor at that point. The descriptor itself is valid, since it is used earlier without any error, so it looks like the data ends earlier than expected, i.e. the I/O descriptor was closed, which resulted in this error. According to the title of this BZ and comment #0 it happened while doing a migration; were you having network issues while the migration was in progress? Could you please retest with the latest versions of the xen and kernel-xen packages, i.e. kernel-xen-2.6.18-192.el5 and xen-3.0.3-105.el5?

Thanks,
Michal

Also, does this happen only with live migration, or with both kinds of migration? Could you please try both migrations in a row and provide us the test results with the latest packages?

Thanks,
Michal

Well, I was thinking about when this happens. According to comment #0 it occurs only when you try to migrate the VM to another host that does not have enough memory to run the guest, so we should disallow migration and restore on a machine that does not have enough memory, right?

Michal

Created attachment 425924 [details]
Fix restore handling in XenD
Well, this is the patch to disallow migration/restore for guests that try to use more memory than the host machine has available. The information appears to be read via libxc at the start of the migration, when the restore calls are hit, which means that if the domain uses read-only IDE disks, or the host machine (dom0) does not have enough memory for the guest creation, it fails immediately with an error message printed to xend.log. Both the required and the available memory are printed to xend.log as well. If a migration/restore of the guest is already in progress, maxmem_kb is used instead, since mem_kb shows only the current memory allocation of the restoring/migrating domain. The available memory is calculated as total_memory minus the memory of all running guests minus the dom0-min-mem setting.

It has been tested on 2 RHEL-5 host machines, doing a restore on the first machine while also migrating another guest from the second machine, i.e. a sequence of 2 parallel migrations; it worked fine and the calculations were correct as well. Also, the read-only IDE disk handling check has been fixed in this patch.

Michal
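For illustration, the available-memory rule described above can be sketched as follows (a minimal sketch with hypothetical function and parameter names, not the actual XenD code):

```python
def available_memory_kb(total_kb, running_guests_kb, dom0_min_mem_kb):
    # Available memory = total memory - memory of all running guests
    #                    - the dom0-min-mem setting (all values in KiB).
    return total_kb - sum(running_guests_kb) - dom0_min_mem_kb

def may_restore(guest_mem_kb, total_kb, running_guests_kb, dom0_min_mem_kb):
    # Disallow migration/restore when the host lacks enough free memory,
    # so the operation fails immediately instead of dying mid-restore.
    return guest_mem_kb <= available_memory_kb(
        total_kb, running_guests_kb, dom0_min_mem_kb)
```

Per the comment above, for a migration or restore already in progress the guest's maxmem_kb rather than mem_kb would be passed as guest_mem_kb.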
Created attachment 426296 [details]
Check for enough memory on restore and silently change read-only IDE disks to read-write
This is the fix to check for enough memory on domain restore, and also to silently change read-only IDE disks to read-write IDE disks, preserving the old, albeit incorrect, behaviour.
Michal
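The silent read-only-to-read-write change could look roughly like this (an illustrative sketch only; the disk-record shape and function name are assumptions, not the real XenD data structures):

```python
def normalize_ide_disks(disks):
    # Silently turn read-only IDE disks ('hd*' devices) into read-write
    # ones, preserving the old (admittedly incorrect) behaviour.
    # Each disk is assumed to be a dict like {"dev": "hda", "mode": "r"}.
    for disk in disks:
        if disk["dev"].startswith("hd") and disk["mode"] == "r":
            disk["mode"] = "w"
    return disks
```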
Created attachment 435268 [details]
Patch v5
This is the patch that also takes dom0-min-mem set to 0 into account and, in that case, uses dom0's current memory allocation instead.
Michal
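The dom0-min-mem handling described above might be sketched like this (hypothetical names; only the fallback rule itself comes from the comment):

```python
def dom0_reserve_kb(dom0_min_mem_kb, dom0_current_alloc_kb):
    # When dom0-min-mem is set to 0, fall back to dom0's current
    # memory allocation; otherwise honour the configured minimum.
    if dom0_min_mem_kb == 0:
        return dom0_current_alloc_kb
    return dom0_min_mem_kb
```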
Created attachment 442150 [details]
xend.log of sender
Tested on xen-3.0.3-115.el5, kernel-xen-2.6.18-214.el5, using two x86_64 machines with different CPU and memory.
Hit an error while migrating guests between rhel5.5_64 and rhel5.5_32, but migrating between rhel5.5_64 and rhel5.5_64 with the same machines succeeded.
Please check the xend logs in the attachments.
Created attachment 442151 [details]
xend.log of receiver
Well, this appears to be something different, since I can see this message in the receiver's log:

[2010-08-31 17:30:32 xend 4006] DEBUG (XendCheckpoint:101) Available memory: 5758 MiB, guest requires: 1024 MiB

This means that the guest is using memory that *is* available on the remote host. The restore itself is called there, but the check passes (no surprise when host B has 5.6 GiB of RAM and the guest requires only 1 GiB), so this has nothing to do with this bug. If migration between i386 and x86_64 hosts is a well-reproducible issue, you can file a new bug about it. For testing this one, you should also try to restore a guest that was saved on a host with more RAM, e.g. saved on an 8G host while the guest itself uses 6G. If you transfer that saved image to a 4G host and try to restore it there, it should fail, since it should now hit that condition (the host won't have enough memory to restore the guest).

Michal

(In reply to comment #18)
> Well, this appears that it's something different since I can see this message
> in the receiver's log: [...]
> If you transfer that saved image to 4G host and try to restore there
> it should fail since it should fail on that condition now (host won't be having
> enough memory to restore the guest).
>
> Michal

Hi Michal,

I've tested as in your last comment, migrating a 6G (memory) guest from a 7G host to a 5G host. It fails as you said; the output is:

Error: /usr/lib64/xen/bin/xc_save 20 3 0 0 0 failed

The guest still worked well after the migration failed. I then xm-shutdown it; the output is:

Name            ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0         0      1977      4  r-----     72.2
Zombie-pv-test   3      6000      1  --ps-d     22.2

The xend.log will be attached in the next comment.

I'm not sure if this is what we are expecting for this bug, because the original description did not mention that the receiver dom0's memory is less than the migrated guest's, only that the receiver dom0's memory is larger than the sender's. So could you please confirm? Thanks.

Created attachment 445860 [details]
xend.log
Created attachment 445862 [details]
xend.log of receiver
We can see the warning "Host machine doesn't have enough memory to restore the guest" in the receiver's xend.log.
(In reply to comment #19)
> Hi Michal,
>
> I've tested as your last comment, with migrating a 6G(memory) guest from a 7G
> host to a 5G host. It fails as you said [...]
>
> So could you please make sure about it? Thanks.

That's expected behaviour: some data should be coming from the event channel, but it is missing after the resume operation. A kernel-xen bug 589123 has already been filed about this.

Michal

(In reply to comment #21)
> Created attachment 445862 [details]
> xend.log of receiver
>
> We can see the warning that "Host machine doesn't have enough memory to restore
> the guest" from receiver's xend.log

I've been looking at the logs, and since the guest kept working fine after the failed migration and the error message was printed to the receiver's xend.log file, this is the designed behaviour. Of course, it applies to the restore operation as well.

As mentioned in comment 22, the shutdown is not possible and leaves a zombie domain behind, because the event channel is not delivering the proper data to finish the domain cleanup (bug 589123 in the kernel-xen space), so we can't do anything about this in the user-space stack.

Michal

(In reply to comment #23)
> [...] so we can't do anything about this in
> the user-space stack.

Thanks Michal. Verified on xen-3.0.3-116.el5, kernel-xen-2.6.18-216.el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html
Created attachment 351517 [details]
Xend log of receiver dom0

Description of problem:
Xen live migrations fail with a not very descriptive error:

[2009-07-13 12:18:41 xend 30281] DEBUG (XendCheckpoint:198) restore:shadow=0x0, _static_max=0x400, _static_min=0x400,
[2009-07-13 12:18:41 xend 30281] DEBUG (balloon:143) Balloon: 1059468 KiB free; need 1048576; done.
[2009-07-13 12:18:41 xend 30281] DEBUG (XendCheckpoint:215) [xc_restore]: /usr/lib64/xen/bin/xc_restore 22 15 1 2 0 0 0
[2009-07-13 12:18:41 xend 30281] INFO (XendCheckpoint:351) xc_domain_restore start: p2m_size = 40800
[2009-07-13 12:18:41 xend 30281] INFO (XendCheckpoint:351) Reloading memory pages: 0%
[2009-07-13 12:18:51 xend 30281] INFO (XendCheckpoint:351) ERROR Internal error: Error when reading batch size
[2009-07-13 12:18:52 xend 30281] INFO (XendCheckpoint:351) Restore exit with rc=1

What's interesting about this problem is that it only happens when I'm migrating from a dom0 that has a higher amount of RAM. I'm running a beta RHEL5u4 kernel (2.6.18-155.el5xen) that was suggested in a similar bugzilla report describing failures related to memory fragmentation. That kernel fixed those issues, but this one remains.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-155.el5
xen-3.0.3-80.el5_3.3

How reproducible:
Always.

Steps to Reproduce:
1. Load dom0s with RHELu3 and the latest xen
2. Try to live migrate a VM from a dom0 with a higher amount of RAM than the receiving dom0. In my case one Dell PE2970 with 8GB has problems receiving live migrations from a PE2950 with 16GB of RAM. Homogeneous hardware configurations do not exhibit this problem.

Actual results:
Failure to migrate.

Expected results:

Additional info:
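For context on the "Error when reading batch size" message above: the restore side reads the migration stream in batches, each prefixed by a size word, and the error fires when that read comes up short because the stream ended early. A minimal sketch of such a check (illustrative only, not the actual libxc code):

```python
import struct

def read_batch_size(stream):
    # Each batch in the stream is assumed to be prefixed by a 4-byte size
    # word; a short read here means the sender closed the stream early,
    # which matches the "Error when reading batch size" failure mode above.
    data = stream.read(4)
    if len(data) != 4:
        raise IOError("Internal error: Error when reading batch size")
    return struct.unpack("<i", data)[0]
```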