Bug 1285044 - migration/RDMA: Race condition
Status: ON_QA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Dr. David Alan Gilbert
Depends On:
Blocks: 1475751
Reported: 2015-11-24 13:04 EST by Dr. David Alan Gilbert
Modified: 2017-10-03 04:41 EDT
CC List: 11 users

See Also:
Fixed In Version: qemu-kvm-rhev-2.10.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1475751
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Dr. David Alan Gilbert 2015-11-24 13:04:27 EST
Description of problem:
There's a race condition in RDMA startup; it can be triggered by adding a ~2ms usleep before the post_recv_control(rdma, RDMA_WRID_READY) call near the end of qemu_rdma_connect.

Version-Release number of selected component (if applicable):
Tested on upstream 2.5-rc.

How reproducible:

Steps to Reproduce:

Actual results:

ibv_poll_cq wc.status=13 RNR retry counter exceeded!

This occurs in the poll waiting for data in qemu_rdma_get_buffer.

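For reference, status 13 in the error above matches IBV_WC_RNR_RETRY_EXC_ERR in libibverbs' enum ibv_wc_status: the sender kept retrying because the receiver had no receive work request posted, and it ran out of retries. A minimal sketch of decoding that value follows. The enum is a local mirror of three values and the toy_ names are made up for illustration; real code would normally include <infiniband/verbs.h> and call ibv_wc_status_str() instead.

```c
/* Local mirror of a few values from libibverbs' enum ibv_wc_status.
 * The TOY_/toy_ names are hypothetical; only the numbers follow verbs.h. */
enum toy_wc_status {
    TOY_WC_SUCCESS = 0,
    TOY_WC_RETRY_EXC_ERR = 12,     /* transport retry counter exceeded */
    TOY_WC_RNR_RETRY_EXC_ERR = 13  /* receiver-not-ready retry counter exceeded */
};

/* Return a short description for the statuses this sketch knows about. */
static const char *toy_wc_status_str(int status)
{
    switch (status) {
    case TOY_WC_SUCCESS:
        return "success";
    case TOY_WC_RETRY_EXC_ERR:
        return "transport retry counter exceeded";
    case TOY_WC_RNR_RETRY_EXC_ERR:
        return "RNR retry counter exceeded";
    default:
        return "unknown (not covered by this sketch)";
    }
}
```

Calling toy_wc_status_str(13) returns the same "RNR retry counter exceeded" text that appears in the log line.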
Expected results:
No error

Additional info:
I think this is because the destination does an rdma_post_send_control, sending a WRID_SEND_CONTROL dest->source, before the source has done its post_recv_control. There needs to be some interlock there to ensure these happen in the right order.
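
The ordering problem described above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions: it is not QEMU or libibverbs code, every toy_ name is hypothetical, and the retry loop stands in for an RNR timer that expires before the late receive is posted.

```c
#include <stdbool.h>

/* Toy model of one side's receive queue; not QEMU or libibverbs code. */
struct toy_endpoint {
    int posted_recvs;  /* receive buffers currently posted */
};

/* Stand-in for post_recv_control: make one receive buffer available. */
static void toy_post_recv(struct toy_endpoint *ep)
{
    ep->posted_recvs++;
}

/* Stand-in for a control send from the peer. Returns true if delivered;
 * false models "RNR retry counter exceeded" after max_retries retries. */
static bool toy_send_to(struct toy_endpoint *peer, int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (peer->posted_recvs > 0) {
            peer->posted_recvs--;  /* the send consumes one receive */
            return true;
        }
        /* Receiver not ready: real hardware would wait and retry here. */
    }
    return false;
}

/* Racy order: the destination sends before the source posts its receive. */
static bool toy_racy_startup(void)
{
    struct toy_endpoint source = { 0 };
    bool delivered = toy_send_to(&source, 3);
    toy_post_recv(&source);  /* too late, the send has already failed */
    return delivered;
}

/* Interlocked order: the receive is posted before the peer can send. */
static bool toy_ordered_startup(void)
{
    struct toy_endpoint source = { 0 };
    toy_post_recv(&source);
    return toy_send_to(&source, 3);
}
```

In this model toy_racy_startup() reports a failed delivery while toy_ordered_startup() succeeds, which is the kind of interlock the paragraph above asks for: the receive must be in place before the other side is able to send.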
Comment 1 Dr. David Alan Gilbert 2017-06-20 10:35:21 EDT
Possibly the race we're hitting on rxe?
Comment 2 Dr. David Alan Gilbert 2017-06-30 12:22:47 EDT
Not the race we're hitting on rxe.

Posted fix for this one upstream:
   migration/rdma: Fix race on source
Comment 3 Dr. David Alan Gilbert 2017-07-19 10:38:06 EDT
Merged upstream:

9cf2bab2edca1e651eef migration/rdma: Fix race on source
3a0f2ceaedcf70ff79b6 migration: Close file on failed migration load
0b3c15f09715acd78063 migration/rdma: fix qemu_rdma_block_for_wrid error paths
9c98cfbe72b21d9d84b9 migration/rdma: Allow cancelling while waiting for wrid
482a33c53cbc9d2b0c47 migration/rdma: Safely convert control types
32bce196344772df8d68 migration/rdma: Send error during cancelling

(Note: if we backport this to a 2.9 world, we don't want 'Close file on failed migration load'.)