Bug 1368431

Summary: rdma migration/cxgb4/timeout
Product: Red Hat Enterprise Linux 7
Reporter: Dr. David Alan Gilbert <dgilbert>
Component: qemu-kvm-rhev
Assignee: Dr. David Alan Gilbert <dgilbert>
Status: CLOSED DEFERRED
QA Contact: Li Xiaohui <xiaohli>
Severity: unspecified
Docs Contact:
Priority: low
Version: 7.4
CC: chayang, dgilbert, dzheng, jinzhao, juzhang, virt-maint
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-07-22 20:29:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Dr. David Alan Gilbert 2016-08-19 11:23:41 UTC
Description of problem:
RDMA migration on a Chelsio T520-CR device times out with rdma-pin-all=on
but works with rdma-pin-all=off

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
[root@rdma-dev-13 ~]$ ./rdma-test 
Starting src
  PID TTY          TIME CMD
 3678 pts/0    00:00:00 qemu-kvm
Starting dst
  PID TTY          TIME CMD
 3689 pts/0    00:00:00 qemu-kvm
QEMU 2.6.0 monitor - type 'help' for more information
(qemu) info status
VM status: running
Found: VM status: running
QEMU 2.6.0 monitor - type 'help' for more information
(qemu) info status
VM status: paused (inmigrate)
Found: VM status: paused (inmigrate)
Good - both qemu's running
(qemu) migrate_set_speed 100G
(qemu) migrate rdma:172.31.50.43:4444
source_resolve_host RDMA Device opened: kernel name cxgb4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/cxgb4_0, transport: (2) Ethernet
dest_init RDMA Device opened: kernel name cxgb4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/cxgb4_0, transport: (2) Ethernet
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off 
Migration status: completed
Found: Migration status: completed
(qemu) info status
VM status: running
Found: VM status: running
passed pin_all=false
qemu-kvm: terminating on signal 15 from pid 3669
qemu-kvm: terminating on signal 15 from pid 3669
Starting src
  PID TTY          TIME CMD
 3763 pts/0    00:00:00 qemu-kvm
Starting dst
  PID TTY          TIME CMD
 3774 pts/0    00:00:00 qemu-kvm
QEMU 2.6.0 monitor - type 'help' for more information
(qemu) info status
VM status: running
Found: VM status: running
QEMU 2.6.0 monitor - type 'help' for more information
(qemu) info status
VM status: paused (inmigrate)
Found: VM status: paused (inmigrate)
Good - both qemu's running
(qemu) migrate_set_speed 100G
(qemu) migrate_set_capability rdma-pin-all on
source_resolve_host RDMA Device opened: kernel name cxgb4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/cxgb4_0, transport: (2) Ethernet
dest_init RDMA Device opened: kernel name cxgb4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/cxgb4_0, transport: (2) Ethernet
(qemu) migrate rdma:172.31.50.43:4444
Timeout waiting for Migration status: completed
qemu-kvm: terminating on signal 15 from pid 3669
qemu-kvm: terminating on signal 15 from pid 3669
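
The "Found: ..." and "Timeout waiting for ..." lines above come from the test script polling the monitor. The rdma-test script itself is not attached, so the following is only a minimal sketch of that kind of poll-with-timeout loop; `wait_for`, `query` and the default timeout are hypothetical names and values, not taken from the real script.

```python
import time


def wait_for(query, expected, timeout=60.0, interval=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll query() until its reply contains `expected`.

    query    -- hypothetical callable returning monitor output as a string
                (e.g. the reply to 'info migrate')
    expected -- substring to look for, e.g. 'Migration status: completed'
    Returns True on a match, False once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        reply = query()
        if expected in reply:
            print(f"Found: {expected}")
            return True
        if clock() >= deadline:
            print(f"Timeout waiting for {expected}")
            return False
        sleep(interval)
```

With a loop like this, a migration that never leaves the active state shows up only as the timeout message, which is why the backtrace below is needed to see where the source is actually blocked.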

Looks like the source qemu is at:
#0  0x00002b3928cc349d in read () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b392757c063 in ibv_get_cq_event () from /lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002b391e26a8f8 in qemu_rdma_block_for_wrid ()
No symbol table info available.
#3  0x00002b391e26cf8f in qemu_rdma_registration_stop ()
No symbol table info available.
#4  0x00002b391e266a4b in ram_control_after_iterate ()
No symbol table info available.
#5  0x00002b391e0dbb0e in ram_save_iterate ()
No symbol table info available.
#6  0x00002b391e0e0733 in qemu_savevm_state_iterate ()
No symbol table info available.

(Not fixed by yee-oldee rdma-race-fix)
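
Frames #0 and #1 of the backtrace show the thread sitting in read() underneath ibv_get_cq_event(), which by default blocks on the completion channel's file descriptor until an event arrives. The sketch below only illustrates the general pattern of waiting on such a descriptor with a timeout instead of blocking forever; it uses an ordinary pipe as a stand-in for the completion channel, and `wait_for_event` is a hypothetical helper, not QEMU or libibverbs code.

```python
import os
import select


def wait_for_event(fd, timeout):
    """Return the bytes read from fd, or None if nothing arrives in time.

    A plain os.read(fd, ...) here would block indefinitely when no event
    is ever delivered, which is the situation the backtrace shows.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        return None
    return os.read(fd, 4096)


# Stand-in for a completion channel: a pipe with no writer activity yet.
r, w = os.pipe()
print(wait_for_event(r, 0.1))   # no event pending, so this gives up
os.write(w, b"cq-event")
print(wait_for_event(r, 0.1))   # an event is pending, so this returns it
os.close(r)
os.close(w)
```

This does not explain why the completion never arrives with rdma-pin-all=on; it only shows how a bounded wait would turn the hang into a reportable error.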

Comment 1 Qianqian Zhu 2016-08-26 07:32:44 UTC
Hi David,

Just to confirm, is it a device specific issue? Since rdma works well with mlx5 card according to QE's test, test env see https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4.

Thanks,
Qianqian

Comment 2 Dr. David Alan Gilbert 2016-09-05 11:21:36 UTC
(In reply to qianqianzhu from comment #1)
> Hi David,
> 
> Just to confirm, is it a device specific issue? Since rdma works well with
> mlx5 card according to QE's test, test env see
> https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4.
> 
> Thanks,
> Qianqian

Yes, this bug is cxgb4 specific.

While you say mlx5 works for QE, I have a reliable test case of mlx5 failing, that is why I'm keeping bz 1356959 open.

Dave

Comment 3 Qianqian Zhu 2016-09-08 01:40:50 UTC
(In reply to Dr. David Alan Gilbert from comment #2)
> (In reply to qianqianzhu from comment #1)
> > Hi David,
> > 
> > Just to confirm, is it a device specific issue? Since rdma works well with
> > mlx5 card according to QE's test, test env see
> > https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4.
> > 
> > Thanks,
> > Qianqian
> 
> Yes, this bug is cxgb4 specific.
> 
> While you say mlx5 works for QE, I have a reliable test case of mlx5
> failing, that is why I'm keeping bz 1356959 open.
> 
> Dave

Thanks David. Sorry that I was not clear: I meant that mlx5 works well on x86; bz 1356959 is ppc only.

Comment 4 Qianqian Zhu 2016-09-08 03:18:02 UTC
(In reply to qianqianzhu from comment #3)
> (In reply to Dr. David Alan Gilbert from comment #2)
> > (In reply to qianqianzhu from comment #1)
> > > Hi David,
> > > 
> > > Just to confirm, is it a device specific issue? Since rdma works well with
> > > mlx5 card according to QE's test, test env see
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4.
> > > 
> > > Thanks,
> > > Qianqian
> > 
> > Yes, this bug is cxgb4 specific.
> > 
> > While you say mlx5 works for QE, I have a reliable test case of mlx5
> > failing, that is why I'm keeping bz 1356959 open.
> > 
> > Dave
> 
> Thanks David, Sorry that I was not saying it clearly, I mean mlx5 works well
> for x86, bz1356959 is ppc only.

I double-checked https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4. QE's test result should read: x86 passed rdma with mlx4, and ppc failed rdma with mlx5.

Comment 5 Dr. David Alan Gilbert 2016-12-01 13:22:47 UTC
Hmm, the latest test run is showing it failing with various timeouts with rdma-pin-all=on across all cards - but with different errors on different cards.