Description of problem:
RDMA migration on a Chelsio T520-CR device times out with rdma-pin-all=on, but works with pin-all off.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

[root@rdma-dev-13 ~]$ ./rdma-test
Starting src
  PID TTY          TIME CMD
 3678 pts/0    00:00:00 qemu-kvm
Starting dst
  PID TTY          TIME CMD
 3689 pts/0    00:00:00 qemu-kvm
QEMU 2.6.0 monitor - type 'help' for more information
(qemu) info status
VM status: running
Found: VM status: running
QEMU 2.6.0 monitor - type 'help' for more information
(qemu) info status
VM status: paused (inmigrate)
Found: VM status: paused (inmigrate)
Good - both qemu's running
(qemu) migrate_set_speed 100G
(qemu) migrate rdma:172.31.50.43:4444
source_resolve_host RDMA Device opened: kernel name cxgb4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/cxgb4_0, transport: (2) Ethernet
dest_init RDMA Device opened: kernel name cxgb4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/cxgb4_0, transport: (2) Ethernet
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off
Migration status: completed
Found: Migration status: completed
(qemu) info status
VM status: running
Found: VM status: running
passed pin_all=false
qemu-kvm: terminating on signal 15 from pid 3669
qemu-kvm: terminating on signal 15 from pid 3669
Starting src
  PID TTY          TIME CMD
 3763 pts/0    00:00:00 qemu-kvm
Starting dst
  PID TTY          TIME CMD
 3774 pts/0    00:00:00 qemu-kvm
QEMU 2.6.0 monitor - type 'help' for more information
(qemu) info status
VM status: running
Found: VM status: running
QEMU 2.6.0 monitor - type 'help' for more information
(qemu) info status
VM status: paused (inmigrate)
Found: VM status: paused (inmigrate)
Good - both qemu's running
(qemu) migrate_set_speed 100G
(qemu) migrate_set_capability rdma-pin-all on
source_resolve_host RDMA Device opened: kernel name cxgb4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/cxgb4_0, transport: (2) Ethernet
dest_init RDMA Device opened: kernel name cxgb4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/cxgb4_0, transport: (2) Ethernet
(qemu) migrate rdma:172.31.50.43:4444
Timeout waiting for Migration status: completed
qemu-kvm: terminating on signal 15 from pid 3669
qemu-kvm: terminating on signal 15 from pid 3669

The hang looks like it is at:

#0  0x00002b3928cc349d in read () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00002b392757c063 in ibv_get_cq_event () from /lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002b391e26a8f8 in qemu_rdma_block_for_wrid ()
No symbol table info available.
#3  0x00002b391e26cf8f in qemu_rdma_registration_stop ()
No symbol table info available.
#4  0x00002b391e266a4b in ram_control_after_iterate ()
No symbol table info available.
#5  0x00002b391e0dbb0e in ram_save_iterate ()
No symbol table info available.
#6  0x00002b391e0e0733 in qemu_savevm_state_iterate ()
No symbol table info available.

(Not fixed by the old rdma-race fix.)
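For context on the hang: the backtrace ends in ibv_get_cq_event(), which blocks in read() on the completion channel fd until the HCA delivers a completion event. Below is a minimal sketch of that wait pattern - not QEMU's actual code; 'cq' and 'comp_chan' are assumed to have been created earlier with ibv_create_cq()/ibv_create_comp_channel(), and error handling is trimmed. If the expected work request never completes, the caller never returns from this wait, which is what the pin-all=on case looks like here.

/* Minimal sketch of a blocking completion-queue wait, assuming 'cq' and
 * 'comp_chan' were created elsewhere; error handling trimmed. */
#include <infiniband/verbs.h>

static int wait_for_one_completion(struct ibv_cq *cq,
                                   struct ibv_comp_channel *comp_chan)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    /* Arm the CQ so the next completion generates an event. */
    if (ibv_req_notify_cq(cq, 0)) {
        return -1;
    }

    /* Blocks in read() on the completion channel fd until an event
     * arrives - frames #0/#1 of the backtrace above. */
    if (ibv_get_cq_event(comp_chan, &ev_cq, &ev_ctx)) {
        return -1;
    }
    ibv_ack_cq_events(ev_cq, 1);

    /* Drain the completion(s) that triggered the event. */
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            return -1;
        }
    }
    return 0;
}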
Hi David,

Just to confirm, is this a device-specific issue? RDMA works well with an mlx5 card in QE's testing; for the test environment, see https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4.

Thanks,
Qianqian
(In reply to qianqianzhu from comment #1)
> Hi David,
>
> Just to confirm, is this a device-specific issue? RDMA works well with an
> mlx5 card in QE's testing; for the test environment, see
> https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4.
>
> Thanks,
> Qianqian

Yes, this bug is cxgb4 specific.

While you say mlx5 works for QE, I have a reliable test case of mlx5 failing; that is why I'm keeping bz 1356959 open.

Dave
(In reply to Dr. David Alan Gilbert from comment #2)
> (In reply to qianqianzhu from comment #1)
> > Hi David,
> >
> > Just to confirm, is this a device-specific issue? RDMA works well with an
> > mlx5 card in QE's testing; for the test environment, see
> > https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4.
> >
> > Thanks,
> > Qianqian
>
> Yes, this bug is cxgb4 specific.
>
> While you say mlx5 works for QE, I have a reliable test case of mlx5
> failing; that is why I'm keeping bz 1356959 open.
>
> Dave

Thanks David. Sorry I wasn't clear: I mean mlx5 works well on x86; bz 1356959 is ppc only.
(In reply to qianqianzhu from comment #3)
> (In reply to Dr. David Alan Gilbert from comment #2)
> > (In reply to qianqianzhu from comment #1)
> > > Hi David,
> > >
> > > Just to confirm, is this a device-specific issue? RDMA works well with
> > > an mlx5 card in QE's testing; for the test environment, see
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4.
> > >
> > > Thanks,
> > > Qianqian
> >
> > Yes, this bug is cxgb4 specific.
> >
> > While you say mlx5 works for QE, I have a reliable test case of mlx5
> > failing; that is why I'm keeping bz 1356959 open.
> >
> > Dave
>
> Thanks David. Sorry I wasn't clear: I mean mlx5 works well on x86;
> bz 1356959 is ppc only.

Double-confirmed against https://bugzilla.redhat.com/show_bug.cgi?id=1356959#c4: QE's test result is actually that x86 passed RDMA with mlx4, and ppc failed RDMA with mlx5.
Hmm, the latest test run is showing it failing with various timeouts with rdma-pin-all=on across all cards, but with different errors on different cards.