Bug 1459831

Summary: Migration fails with --rdma-pin-all option
Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.4
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: high
Keywords: Regression
Target Milestone: rc
Target Release: ---
Reporter: Dan Zheng <dzheng>
Assignee: Dr. David Alan Gilbert <dgilbert>
QA Contact: xianwang <xianwang>
CC: chayang, dgilbert, dzheng, fjin, juzhang, knoel, michen, qzhang, virt-maint, xianwang, yafu, yanqzhan, zpeng
Doc Type: If docs needed, set a value
Last Closed: 2017-06-09 09:51:45 UTC
Type: Bug
Attachments:
  source host libvirtd
  remote host libvirtd
  remote qemu log
  local host qemu
  guest xml

Description Dan Zheng 2017-06-08 10:08:37 UTC
Created attachment 1286096 [details]
source host libvirtd

Description of problem:
Migration fails with --rdma-pin-all option.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.9.0-8.el7.x86_64
libvirt-3.2.0-7.el7.x86_64
kernel-3.10.0-679.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start guest and do migration without --rdma-pin-all. 
   Migration succeeds.

2. Start guest and do migration with --rdma-pin-all.

# virsh migrate --live --migrateuri rdma://192.168.0.2 setusertest --listen-address 0 qemu+ssh://192.168.0.2/system --verbose --rdma-pin-all

error: internal error: qemu unexpectedly closed the monitor: 2017-06-08T09:56:14.612417Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/3 (label charserial0)
dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (2) Ethernet
Failed to register local dest ram block!
: Cannot allocate memory
2017-06-08T09:56:14.737291Z qemu-kvm: rdma migration: error dest registering ram blocks
2017-06-08T09:56:14.737301Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2017-06-08T09:56:14.737461Z qemu-kvm: Early error. Sending error.
2017-06-08T09:56:40.666831Z qemu-kvm: load of migration failed: Operation not permitted


Actual results:
See above

Expected results:
Migration with --rdma-pin-all is successful.

Additional info:

Logs are attached.

Comment 2 Dan Zheng 2017-06-08 10:09:15 UTC
Created attachment 1286097 [details]
remote host libvirtd

Comment 3 Dan Zheng 2017-06-08 10:09:53 UTC
Created attachment 1286098 [details]
remote qemu log

Comment 4 Dan Zheng 2017-06-08 10:10:24 UTC
Created attachment 1286100 [details]
local host qemu

Comment 5 Dan Zheng 2017-06-08 10:10:52 UTC
Created attachment 1286101 [details]
guest xml

Comment 6 Dan Zheng 2017-06-08 10:13:14 UTC
This is a regression; the problem does not exist in RHEL 7.3.

Comment 8 Dr. David Alan Gilbert 2017-06-08 10:50:38 UTC
Hi Dan,
  Can you tell me how much RAM your destination host has and whether it's running any other VMs?
  Does increasing the 'hard_limit' value in the XML help?
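
  For reference, that value is the <hard_limit> element under <memtune> in the domain XML; a minimal sketch with a placeholder value, not a recommendation:

  <memtune>
    <hard_limit unit='KiB'>3145728</hard_limit>
  </memtune>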

Comment 9 Dan Zheng 2017-06-09 09:51:04 UTC
OK. I have to apologize: this is a user error caused by too small a hard_limit.
With the section below, the migration succeeds.

  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>3145728</hard_limit>
    <swap_hard_limit unit='KiB'>4194304</swap_hard_limit>
  </memtune>
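
For reference, the same limit can also be queried and adjusted at runtime with virsh memtune instead of editing the XML. A minimal sketch, assuming the domain name used in this report (values in KiB):

# virsh memtune setusertest
# virsh memtune setusertest --hard-limit 3145728 --live

The first command prints the current hard_limit / soft_limit / swap_hard_limit values; the second applies the 3 GiB hard limit to the running domain.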

# virsh migrate --live --migrateuri rdma://192.168.100.2 setusertest --listen-address 0 qemu+ssh://192.168.100.2/system --verbose --rdma-pin-all
root@192.168.100.2's password: 
Migration: [100 %]

Comment 10 Dr. David Alan Gilbert 2017-06-09 10:34:01 UTC
Hi Dan,
  Thanks for trying that;  can you tell me whether on 7.3 the original hard_limit value worked?

Comment 11 Dan Zheng 2017-06-12 03:41:58 UTC
Hi David,
Currently I have no RDMA machines with RHEL 7.3 for testing. I will try once I get them ready.

Comment 12 Dan Zheng 2017-06-26 10:10:55 UTC
Hi, David,

After confirmation: the configuration in the original XML, shown below, works on RHEL 7.3. Thanks.

  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>2097152</hard_limit>
    <swap_hard_limit unit='KiB'>2097152</swap_hard_limit>
  </memtune>

Comment 13 Dr. David Alan Gilbert 2017-06-29 17:46:11 UTC
(In reply to Dan Zheng from comment #12)
> Hi, David,
> 
> After confirmation: the configuration in the original XML, shown below,
> works on RHEL 7.3. Thanks.
> 
>   <memory unit='KiB'>2097152</memory>
>   <currentMemory unit='KiB'>2097152</currentMemory>
>   <memtune>
>     <hard_limit unit='KiB'>2097152</hard_limit>
>     <swap_hard_limit unit='KiB'>2097152</swap_hard_limit>
>   </memtune>


Hi Dan,
  Can you try to figure out which component causes the change? For example, for me it still fails with a 7.4 install and a 7.3 qemu. So please try combinations such as a 7.3 kernel with a 7.4 qemu, etc., and see which component causes the change.
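
  One possible way to bisect, sketched with hypothetical package versions (the 7.3-era qemu-kvm-rhev build below is an assumption; substitute the real 7.3 build):

# rpm -q kernel qemu-kvm-rhev libvirt
# yum downgrade qemu-kvm-rhev-2.6.0\*

  i.e. swap one component at a time between the 7.3 and 7.4 builds and retest the migration.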

Thanks.

Comment 14 Yanqiu Zhang 2017-07-06 07:41:27 UTC
Hi David,

This is a known change: the hard_limit requirement differs between RHEL 7.3 and RHEL 7.4. On RHEL 7.4, in our testing experience, the memory hard_limit needs to be about 2 GiB larger than the guest memory. There are related bugs on this, such as BZ1373783. Btw, it is not easy for us to prepare the specific machines and set up the environment.

For more detail on the changes, please confirm with the QEMU QEs; that should be faster.
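
As a rough worked example of that rule of thumb (the ~2 GiB overhead figure is from our testing, not a documented requirement): for the 2 GiB guest in comment 12, the hard_limit on RHEL 7.4 would be 2097152 KiB + 2097152 KiB = 4194304 KiB, e.g.

  <memtune>
    <hard_limit unit='KiB'>4194304</hard_limit>
  </memtune>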


Thanks.

Comment 15 Dr. David Alan Gilbert 2017-07-06 08:17:31 UTC
(In reply to yanqzhan from comment #14)
> Hi David,
> 
> This is a known change: the hard_limit requirement differs between RHEL 7.3
> and RHEL 7.4. On RHEL 7.4, in our testing experience, the memory hard_limit
> needs to be about 2 GiB larger than the guest memory. There are related bugs
> on this, such as BZ1373783. Btw, it is not easy for us to prepare the
> specific machines and set up the environment.
> 
> For more detail on the changes, please confirm with the QEMU QEs; that
> should be faster.

I can't find any more details about it; if the requirement has doubled, we need to understand why. bz 1373783 is just a documentation bug, so it doesn't help. Can you please provide some more information about what is known here?

> 
> 
> Thanks.

Comment 16 Yanqiu Zhang 2017-07-06 13:29:49 UTC
Sorry, there was a misunderstanding in our group discussion. The larger hard_limit requirement also exists in previous products; refer to BZ1160997 and BZ1046833. The XML configuration in comment 12 (where memory equals the hard limit) may need further confirmation.

Comment 17 Dr. David Alan Gilbert 2017-07-06 13:36:34 UTC
OK, that's fine. I'm only worried if it's a regression where the amount needed suddenly increases a lot.