Bug 1372153

Summary: migration failed from rhel7.3 to rhel7.0 when guest with numa setting
Product: Red Hat Enterprise Linux 7 Reporter: yafu <yafu>
Component: libvirt Assignee: Martin Kletzander <mkletzan>
Status: CLOSED WONTFIX QA Contact: zhe peng <zpeng>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.3 CC: dyuan, fjin, jsuchane, mzhan, rbalakri, xuzhang, yafu, yanqzhan, zpeng
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-12 14:49:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
  libvirtd.log and qemu.log both on source and target host
  The guest XML

Description yafu 2016-09-01 04:04:40 UTC
Description of problem:
Migration fails from rhel7.3 to rhel7.0 when the guest has a NUMA setting.


Version-Release number of selected component (if applicable):
Source:
libvirt-2.0.0-6.el7.x86_64
qemu-kvm-rhev-2.6.0-22.el7.x86_64

target:
libvirt-1.1.1-29.el7_0.7.x86_64
qemu-kvm-rhev-1.5.3-60.el7_0.10.x86_64

How reproducible:
100%

Steps to reproduce:
1. Start a guest with NUMA settings:
  #virsh dumpxml mig1
   ...
   <cpu>
     ...
     <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
    ...
  </cpu>
  ...

2. Migrate to the target host:
# virsh migrate mig1 qemu+ssh://10.66.144.76/system --live --verbose
root.144.76's password:
error: operation failed: migration job: unexpectedly failed

Actual results:
Migration failed.

Expected results:
Migration completes correctly.

Additional info:
1.Error in the qemu log on the target host:
 #cat /var/log/libvirt/qemu/mig1.log
 ...
 Unknown ramblock "/objects/ram-node0", cannot accept migration
qemu: warning: error while loading state for instance 0x0 of device 'ram'
load of migration failed
 ...
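
For comparison, one way to confirm which memory layout each side was started with (a suggestion, not part of the original report): libvirt records the full qemu command line at the top of this same per-domain log, so the relevant arguments can be pulled out with something like:

 # grep -oE -e '-object memory-backend-file[^ ]*|-mem-path [^ ]*' /var/log/libvirt/qemu/mig1.log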




Comment 2 yafu 2016-09-05 03:03:40 UTC
Correction to the guest XML setting in the description; the error was caused by <numatune>:
#virsh dumpxml mig1
   ...
   <cpu>
     ...
     <numa>
      <cell id='0' cpus='0-1' memory='512000' unit='KiB'/>
      <cell id='1' cpus='2-3' memory='512000' unit='KiB'/>
    </numa>
    ...
  </cpu>
...
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
...

Comment 3 yafu 2016-09-05 03:04:42 UTC
Created attachment 1197748 [details]
libvirtd.log and qemu.log both on source and target host

Comment 4 Martin Kletzander 2016-09-23 13:07:36 UTC
Do you have some matrix of migrations from/to which work and which don't?  I'm guessing if this doesn't work, then 7.0 -> 7.3 doesn't work either, also 7.2 <-> 7.3 is broken both ways, right?  Make sure you have (at minimum):

<memoryBacking>
  <hugepages/>
</memoryBacking>
<cpu>
  <numa>
    <cell .../>
  </numa>
</cpu>

but no <numatune>, neither nodeset= in <hugepages/>.

Comment 5 Martin Kletzander 2016-09-29 13:44:57 UTC
Fixed upstream with commit v2.3.0-rc1-10-gff3112f3dc2c:

commit ff3112f3dc2c276a7e387ff7bb86f4fbbdf7bf2c
Author: Martin Kletzander <mkletzan>
Date:   Fri Sep 23 11:31:30 2016 +0200

    qemu: Only use memory-backend-file with NUMA if needed

Comment 6 yafu 2016-10-08 08:55:26 UTC
(In reply to Martin Kletzander from comment #4)
> Do you have some matrix of migrations from/to which work and which don't? 
> I'm guessing if this doesn't work, then 7.0 -> 7.3 doesn't work either, also
> 7.2 <-> 7.3 is broken both ways, right?  Make sure you have (at minimum):
> 
> <memoryBacking>
>   <hugepages/>
> </memoryBacking>
> <cpu>
>   <numa>
>     <cell .../>
>   </numa>
> </cpu>
> 
> but no <numatune>, neither nodeset= in <hugepages/>.

Sorry for the late reply; I just came back from holiday.

With the following settings, but no <numatune> and no nodeset= in <hugepages/>:
<memoryBacking>
  <hugepages/>
</memoryBacking>
<cpu>
  <numa>
    <cell .../>
  </numa>
</cpu>

Test results are as follows:
1. Migration fails from rhel7.3 to rhel7.0, since the qemu command line uses "memory-backend-file" on rhel7.3 but "-mem-prealloc -mem-path /dev/hugepages/libvirt/qemu" on rhel7.0 (see the illustrative command-line fragments below).
2. Migration works well from rhel7.0 to rhel7.3; both the source and target hosts use "-mem-prealloc -mem-path /dev/hugepages/libvirt/qemu".
3. Migration works well between rhel7.2 and rhel7.3, since both use "memory-backend-file".
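
For illustration, the two layouts correspond roughly to the following qemu command-line fragments (a sketch reconstructed from the error message and the hugepage path above, not copied from the logs; the exact arguments libvirt generates may differ):

rhel7.3 source, per-node file-backed memory objects (these produce the ramblock names "/objects/ram-node0", "/objects/ram-node1" seen in the target error):
  -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu,size=500M \
  -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
  -object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu,size=500M \
  -numa node,nodeid=1,cpus=2-3,memdev=ram-node1

rhel7.0 target, legacy anonymous RAM with a single hugepage-backed region (no per-node ramblock names):
  -mem-prealloc -mem-path /dev/hugepages/libvirt/qemu \
  -numa node,nodeid=0,cpus=0-1,mem=500M \
  -numa node,nodeid=1,cpus=2-3,mem=500M

Since the incoming stream names its RAM blocks after the memory-backend-file objects, the older qemu on the target cannot match them and fails with the "Unknown ramblock" error.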

Comment 7 yafu 2016-10-08 08:57:06 UTC
Created attachment 1208308 [details]
The guest XML

Comment 8 yafu 2016-10-08 08:57:44 UTC
Please see the guest XML in the attachment.

Comment 9 Martin Kletzander 2016-10-10 13:47:02 UTC
(In reply to yafu from comment #6)
You are saying that 7.0 <-> 7.2 doesn't work either?  Would you mind checking 7.0 <-> 7.1 as well?  Thanks a lot in advance.

Comment 10 yafu 2016-10-11 06:23:14 UTC
(In reply to Martin Kletzander from comment #9)
> (In reply to yafu from comment #6)
> You are saying that 7.0 <-> 7.2 doesn't work either?  Would you mind
> checking 7.0 <-> 7.1 as well?  Thanks a lot in advance.

1. rhel7.2->rhel7.0 works well now, since Bug 1266856 (Migration from 7.0 to 7.2 failed with numa+hugepage settings) is fixed.
2. rhel7.1->rhel7.0 fails with the same error as rhel7.3->rhel7.0.

Comment 11 Martin Kletzander 2016-10-11 10:37:17 UTC
(In reply to yafu from comment #10)
Oh, my bad, I just figured it out.  So Bug 1266856 fixed the scenario with:

<memoryBacking>
  <hugepages/>
</memoryBacking>
<cpu>
  <numa>
    <cell .../>
  </numa>
</cpu>

but what we need to fix here is:

<memoryBacking>
  <hugepages size='...'/>
</memoryBacking>
<cpu>
  <numa>
    <cell .../>
  </numa>
</cpu>

That takes a different code path, and hence it might be beneficial to test both approaches in the migration matrix, I guess.
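
For reference, the per-page-size variant is written in the domain XML with a <page> child element rather than a size attribute on <hugepages> itself; roughly (values illustrative, the shorthand above expanded):

<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
</memoryBacking>
<cpu>
  <numa>
    <cell .../>
  </numa>
</cpu>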

Comment 12 Martin Kletzander 2017-05-12 14:49:28 UTC
Any easy fix provided now could actually break newer migration scenarios (rhel7.2 -> rhel7.3).  Since this is a corner case and was not reported by any customer, I'm closing this as WONTFIX.  The reasoning is simply that we will have fewer broken things this way than if we "fixed" this particular scenario.  This bug only affects migration from rhel7.0 to newer releases, I believe.