Bug 2063615 - Fail to migrate via RDMA uri: ERROR: result not equal to event_addr_resolved RDMA_CM_EVENT_ADDR_ERROR
Summary: Fail to migrate via RDMA uri: ERROR: result not equal to event_addr_resolved R...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Nitesh Narayan Lal
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-14 03:36 UTC by Han Han
Modified: 2023-04-16 12:28 UTC
CC List: 13 users

Fixed In Version: upstream 8.0.0rc1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-14 16:42:53 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
The logs of virtqemud (126.48 KB, application/gzip), 2022-03-14 03:36 UTC, Han Han
The logs of comment2 (130.24 KB, application/gzip), 2022-03-15 03:56 UTC, Han Han


Links
Red Hat Issue Tracker RHELPLAN-115422 (last updated 2022-03-14 06:16:50 UTC)

Description Han Han 2022-03-14 03:36:10 UTC
Created attachment 1865816 [details]
The logs of virtqemud

Description of problem:
As subject

Version-Release number of selected component (if applicable):
libvirt-8.0.0-6.el9.x86_64
qemu-kvm-6.2.0-11.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare hosts with RDMA:
dst host:
7: enp175s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 50:6b:4b:e3:f4:26 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.3/24 scope global enp175s0np0
       valid_lft forever preferred_lft forever
CA 'mlx5_1'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.25.1020
        Hardware version: 0
        Node GUID: 0x506b4b0300e3f426
        System image GUID: 0x506b4b0300e3f426
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x526b4bfffee3f426
                Link layer: Ethernet

src host:
7: enp175s0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 50:6b:4b:d4:2c:4c brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.4/24 scope global enp175s0np0
       valid_lft forever preferred_lft forever
CA 'mlx5_1'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.24.1000
        Hardware version: 0
        Node GUID: 0x506b4b0300d42c4c
        System image GUID: 0x506b4b0300d42c4c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x526b4bfffed42c4c
                Link layer: Ethernet

2. Start a VM on the src host
3. Migrate it to dst host:
~ virsh migrate --live --migrateuri rdma://192.168.100.3 test --listen-address 0.0.0.0 qemu+ssh://XXXX/system --verbose --p2p   
error: internal error: unable to execute QEMU command 'migrate': RDMA ERROR: result not equal to event_addr_resolved RDMA_CM_EVENT_ADDR_ERROR

Actual results:
As above

Expected results:
Migration finishes.
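
A quick way to rule out the fabric itself, independent of libvirt and QEMU, is rping from librdmacm-utils (a sketch, assuming the package is installed; addresses taken from step 1):

# On the dst host (192.168.100.3), start an RDMA CM server:
rping -s -a 192.168.100.3 -v -C 10
# On the src host, connect to it; if this succeeds, RDMA CM address
# resolution works at the fabric level:
rping -c -a 192.168.100.3 -v -C 10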

Additional info:
1. There are SELinux denials during the migration (https://bugzilla.redhat.com/show_bug.cgi?id=2063612), but it still fails after setting SELinux to permissive (https://bugzilla.redhat.com/show_bug.cgi?id=2063612#c3).
2. Migration passes with pure QEMU commands:
1) Start src guest with
/usr/libexec/qemu-kvm \
-name guest=test,debug-threads=on \
-S \
-machine pc-i440fx-rhel7.6.0,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-accel kvm \
-cpu Skylake-Server-IBRS,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,clflushopt=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rsba=on,skip-l1dfl-vmentry=on,pschange-mc-no=on \
-m 1024 \
-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":1073741824}' \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid bbd8dc20-176f-4e51-8689-620e53b779f5 \
-qmp tcp:localhost:4444,server,nowait

2) Start dst guest with
/usr/libexec/qemu-kvm \
-name guest=test,debug-threads=on \
-S \
-machine pc-i440fx-rhel7.6.0,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-accel kvm \
-cpu Skylake-Server-IBRS,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,clflushopt=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rsba=on,skip-l1dfl-vmentry=on,pschange-mc-no=on \
-m 1024 \
-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":1073741824}' \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid bbd8dc20-176f-4e51-8689-620e53b779f5 \
-qmp tcp:localhost:4444,server,nowait \
-incoming defer

3) Execute migrate-incoming on dst:
{"execute":"migrate-incoming","arguments":{"uri":"rdma:0.0.0.0:49152"}}

4) Execute migrate on src:
{"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"rdma:192.168.100.4:49152"}}

Step 4 finished without error.
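
For anyone retracing steps 3 and 4: a QMP server accepts no commands until the client negotiates capabilities, so each session against the tcp:localhost:4444 socket has to start with qmp_capabilities. A minimal sketch using nc, though any QMP client works:

# Connect to the socket opened by -qmp tcp:localhost:4444,server,nowait:
nc localhost 4444
{"execute":"qmp_capabilities"}
{"execute":"migrate-incoming","arguments":{"uri":"rdma:0.0.0.0:49152"}}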
So I guess it is something wrong with libvirt.

3. See the logs of virtqemud in attachment
4. Passed on libvirt-7.6.0-2.el9
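
The attached virtqemud logs from item 3 were presumably captured with debug logging enabled; a minimal sketch of the relevant settings in /etc/libvirt/virtqemud.conf (values illustrative):

log_filters="1:qemu 1:libvirt"
log_outputs="1:file:/var/log/libvirt/virtqemud.log"

followed by systemctl restart virtqemud before re-running the migration.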

Comment 1 Dr. David Alan Gilbert 2022-03-14 10:14:47 UTC
Can you try setting the listen-address to the IP of the dest host, i.e. --listen-address 192.168.100.3?

I've seen some RDMA cards that don't like broadcast.

(I've not tried RDMA on RHEL9)

Comment 2 Han Han 2022-03-15 03:55:51 UTC
(In reply to Dr. David Alan Gilbert from comment #1)
> Can you try setting the listen-address to the IP of the dest host, i.e.
> --listen-address 192.168.100.3?
It doesn't work either. The error is different (SELinux is permissive)
➜  ~ virsh migrate --live --migrateuri rdma://192.168.124.3 test --listen-address 192.168.124.3 qemu+ssh://root.lab.eng.bos.redhat.com/system --verbose --p2p
error: operation failed: migration out job: unexpectedly failed
> 
> I've seen some RDMA cards that don't like broadcast.
> 
> (I've not tried RDMA on RHEL9)

Comment 3 Han Han 2022-03-15 03:56:37 UTC
Created attachment 1865942 [details]
The logs of comment2

Comment 4 Dr. David Alan Gilbert 2022-03-15 09:32:12 UTC
OK, I don't see the source log in there, but that might have a little more info.
Still, this needs investigation.

Comment 5 Han Han 2022-03-15 10:26:16 UTC
(In reply to Dr. David Alan Gilbert from comment #4)
> OK, I don't see the source log in there; but that might have a little more
> info.
> Still, needs investigation.

The qemu log from src host:
2022-03-15 03:10:53.576+0000: starting up libvirt version: 8.0.0, package: 6.el9 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2022-03-03-10:55:38, ), qemu version: 6.2.0qemu-kvm-6.2.0-11.el9, kernel: 5.14.0-70.el9.x86_64, hostname: dell-per740-03.dell2.lab.eng.bos.redhat.com
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
HOME=/var/lib/libvirt/qemu/domain-1-test \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-test/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-test/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-test/.config \
/usr/libexec/qemu-kvm \
-name guest=test,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-test/master-key.aes"}' \
-machine pc-i440fx-rhel7.6.0,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-accel kvm \
-cpu Skylake-Server-IBRS,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,clflushopt=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rsba=on,skip-l1dfl-vmentry=on,pschange-mc-no=on \
-m 1024 \
-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":1073741824}' \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid bbd8dc20-176f-4e51-8689-620e53b779f5 \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=22,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global PIIX4_PM.disable_s3=1 \
-global PIIX4_PM.disable_s4=1 \
-boot strict=on \
-device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x4.0x7 \
-device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x4 \
-device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x4.0x1 \
-device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x4.0x2 \
-blockdev '{"driver":"file","filename":"/nfs/test.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device virtio-blk-pci,bus=pci.0,addr=0x5,drive=libvirt-1-format,id=virtio-disk0,bootindex=1 \
-netdev tap,fd=23,id=hostnet0 \
-device e1000,netdev=hostnet0,id=net0,mac=52:54:00:d8:f7:b4,bus=pci.0,addr=0x3 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-audiodev '{"id":"audio1","driver":"none"}' \
-vnc 127.0.0.1:0,audiodev=audio1 \
-device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 \
-incoming defer \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
char device redirected to /dev/pts/4 (label charserial0)
dest_init RDMA Device opened: kernel name mlx5_1 uverbs device name uverbs1, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs1, infiniband class device path /sys/class/infiniband/mlx5_1, transport: (2) Ethernet
2022-03-15 03:10:54.556+0000: shutting down, reason=failed
2022-03-15T03:10:54.557830Z qemu-kvm: terminating on signal 15 from pid 5509 (/usr/sbin/virtqemud)
2022-03-15T03:10:54.615097Z qemu-kvm: receive cm event, cm event is 10
2022-03-15T03:10:54.615121Z qemu-kvm: rdma migration: recv polling control error!
2022-03-15T03:10:54.615142Z qemu-kvm: RDMA is in an error state waiting migration to abort!
2022-03-15T03:10:54.615149Z qemu-kvm: Not a migration stream
2022-03-15T03:10:54.615157Z qemu-kvm: load of migration failed: Invalid argument

Comment 6 Han Han 2022-03-15 10:29:19 UTC
BTW, the cgroup_device_acl of /etc/libvirt/qemu.conf was set as follows on both hosts:
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
    "/dev/rtc","/dev/hpet", "/dev/vfio/vfio",
    "/dev/infiniband/rdma_cm",
    "/dev/infiniband/issm0",
    "/dev/infiniband/issm1",
    "/dev/infiniband/umad0",
    "/dev/infiniband/umad1",
    "/dev/infiniband/uverbs0",
    "/dev/infiniband/uverbs1",
]
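
Note that qemu.conf is read only at daemon startup, so after editing cgroup_device_acl the daemon must be restarted on both hosts (assuming the modular virtqemud daemon used on RHEL 9):

# systemctl restart virtqemud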

Comment 7 Jiri Denemark 2022-03-15 12:01:36 UTC
Moving to qemu-kvm for investigation as the error is reported there. We'll see
if any libvirt work is needed once we know the root cause of this bug.

Comment 8 Fangge Jin 2022-03-15 12:02:54 UTC
See also: Bug 1822518 - RDMA migration succeeds but there is audit error "AVC denied qemu-kvm create netlink_rdma_socket"
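
To check whether such denials fire around a migration attempt (assuming auditd is running):

# ausearch -m avc -ts recent | grep qemu-kvm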

Comment 9 yafu 2022-09-09 03:30:47 UTC
Reproduced with:
 libvirt-daemon-8.5.0-6.el9.x86_64
 qemu-kvm-7.0.0-12.el9.x86_64

# virsh migrate --live --migrateuri rdma://192.168.100.2 avocado-vt-vm1 --listen-address 192.168.100.2 qemu+ssh://192.168.100.2/system --verbose --rdma-pin-all
root@192.168.100.2's password:
error: Domain not found: no domain with matching uuid '5b0c05cd-705a-491c-b0c4-7965fdd0434c' (avocado-vt-vm1)

Check the error in qemu log:
dest_init RDMA Device opened: kernel name mlx5_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx5_0, transport: (2) Ethernet
2022-09-06T07:13:59.227077Z qemu-kvm: receive cm event, cm event is 10
2022-09-06T07:13:59.227132Z qemu-kvm: rdma migration: recv polling control error!
2022-09-06T07:13:59.227186Z qemu-kvm: RDMA is in an error state waiting migration to abort!
2022-09-06T07:13:59.227220Z qemu-kvm: Not a migration stream
2022-09-06T07:13:59.227257Z qemu-kvm: load of migration failed: Invalid argument
2022-09-06 07:13:59.630+0000: shutting down, reason=crashed

Comment 10 Leonardo Bras 2023-03-07 16:13:53 UTC
I spoke to David about this one.
Since very few people use RDMA migration, Nitesh and I agreed to reduce prio/sev from high to medium.

Comment 11 Dr. David Alan Gilbert 2023-03-09 12:55:35 UTC
I've just done a very basic test with qemu on its own and it seems fine; this is on a 9.2 host on rdma-dev-19 and rdma-dev-20:

[root@rdma-dev-19 ~]$ /usr/libexec/qemu-kvm -M q35 -nographic -cpu host

[root@rdma-dev-20 ~]$ /usr/libexec/qemu-kvm -M q35 -nographic -cpu host -incoming rdma:172.31.1.120:4444

(qemu) migrate -d rdma:172.31.1.120:4444
(qemu) info migrate
.....
Migration status: completed

(qemu) info status

The machine calls those interfaces mlx5_ib1, so I think it's the same type of device.

Comment 12 Nitesh Narayan Lal 2023-03-09 13:14:54 UTC
Thanks, Dave. Maybe the issue is already resolved.
Han, can you please try reproducing this issue with the latest 9.2 builds and share if this is still reproducible?
Thanks

Comment 13 Dr. David Alan Gilbert 2023-03-09 15:57:40 UTC
I've just tried this with libvirt and I agree it's failing:

virsh migrate --live --verbose --migrateuri rdma://172.31.1.120 --listen-address 172.31.1.120 --desturi qemu+ssh://rdma-dev-20/system  --domain rhel9.1
error: operation failed: job 'migration out' unexpectedly failed

2023-03-09T15:56:10.156477Z qemu-kvm: terminating on signal 15 from pid 53491 (<unknown process>)
2023-03-09T15:56:10.202064Z qemu-kvm: receive cm event, cm event is 10
2023-03-09T15:56:10.202083Z qemu-kvm: rdma migration: recv polling control error!
2023-03-09T15:56:10.202110Z qemu-kvm: RDMA is in an error state waiting migration to abort!
2023-03-09T15:56:10.202115Z qemu-kvm: Not a migration stream
2023-03-09T15:56:10.202127Z qemu-kvm: load of migration failed: Invalid argument

and I have:

cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
   "/dev/infiniband/rdma_cm",
   "/dev/infiniband/issm0",
   "/dev/infiniband/issm1",
   "/dev/infiniband/issm2",
   "/dev/infiniband/issm3",
   "/dev/infiniband/umad0",
   "/dev/infiniband/umad1",
   "/dev/infiniband/umad2",
   "/dev/infiniband/umad3",
   "/dev/infiniband/uverbs0",
   "/dev/infiniband/uverbs1",
   "/dev/infiniband/uverbs2",
   "/dev/infiniband/uverbs3"
]


so something is going on.

Comment 14 Han Han 2023-03-13 08:31:13 UTC
(In reply to Dr. David Alan Gilbert from comment #13)
> I've just tried this with libvirt and I agree it's failing; 
> 
> virsh migrate --live --verbose --migrateuri rdma://172.31.1.120
> --listen-address 172.31.1.120 --desturi qemu+ssh://rdma-dev-20/system 
> --domain rhel9.1
> error: operation failed: job 'migration out' unexpectedly failed
> 
> 2023-03-09T15:56:10.156477Z qemu-kvm: terminating on signal 15 from pid
> 53491 (<unknown process>)
> 2023-03-09T15:56:10.202064Z qemu-kvm: receive cm event, cm event is 10
> 2023-03-09T15:56:10.202083Z qemu-kvm: rdma migration: recv polling control
> error!
> 2023-03-09T15:56:10.202110Z qemu-kvm: RDMA is in an error state waiting
> migration to abort!
> 2023-03-09T15:56:10.202115Z qemu-kvm: Not a migration stream
> 2023-03-09T15:56:10.202127Z qemu-kvm: load of migration failed: Invalid
> argument
> 
> and I have:
> 
> cgroup_device_acl = [
>     "/dev/null", "/dev/full", "/dev/zero",
>     "/dev/random", "/dev/urandom",
>     "/dev/ptmx", "/dev/kvm",
>    "/dev/infiniband/rdma_cm",
>    "/dev/infiniband/issm0",
>    "/dev/infiniband/issm1",
>    "/dev/infiniband/issm2",
>    "/dev/infiniband/issm3",
>    "/dev/infiniband/umad0",
>    "/dev/infiniband/umad1",
>    "/dev/infiniband/umad2",
>    "/dev/infiniband/umad3",
>    "/dev/infiniband/uverbs0",
>    "/dev/infiniband/uverbs1",
>    "/dev/infiniband/uverbs2",
>    "/dev/infiniband/uverbs3"
> ]
> 
> 
> so something is going on.

Yes, the same error for the tests on RHEL 9.2:
# virsh migrate --live --migrateuri rdma://192.168.100.3 test2 --listen-address 0 qemu+ssh://192.168.100.3/system --verbose --rdma-pin-all
root@192.168.100.3's password:
error: operation failed: job 'migration out' unexpectedly failed

Comment 15 Dr. David Alan Gilbert 2023-03-13 19:58:02 UTC
Using upstream qemu we get the very recently added extra errors on the source:

2023-03-13T19:51:15.304821Z qemu-system-x86_64: RDMA control channel input is not set
2023-03-13T19:51:15.304870Z qemu-system-x86_64: RDMA control channel input is not set
2023-03-13T19:51:15.304876Z qemu-system-x86_64: RDMA control channel input is not set
2023-03-13T19:51:15.306533Z qemu-system-x86_64: RDMA control channel output is not set


I don't think that's a change in behaviour - just telling us a bit about what's going wrong.

Comment 16 Dr. David Alan Gilbert 2023-03-14 13:08:22 UTC
OK, I see what's going on.

libvirt is enabling the 'return-path' capability and that doesn't work with RDMA.
(There's an interesting question whether it has ever worked with RDMA: I tried 7.0 and 6.0
and it still fails, but I can see a patch in qemu just after 6.0 that claims to fix some case, which confuses me - that's
44bcfd45e9806c78d9d526d69b0590227d215a78 in qemu.)

Although qemu has had the return-path capability for ages, libvirt only enabled it in libvirt 8.0
(libvirt commit v7.10.0-309-g877d1c2478 - 877d1c2). I think that landed in 8.6.0 and 9.x.
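
The capability state of a running QEMU can be inspected through the monitor; a sketch using the domain name from this bug (note libvirt only applies its capability set when a migration starts, so an idle domain shows the defaults):

# virsh qemu-monitor-command test --pretty '{"execute":"query-migrate-capabilities"}'

and look for {"capability": "return-path", "state": true} in the output.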

Comment 17 Jiri Denemark 2023-03-14 14:53:48 UTC
Oops, it would be nice if QEMU reported an error in such a case :-)

Anyway, I guess we should disable return-path for RDMA migration then. Is this
the only case where it doesn't work, or should we check for more things before
enabling the capability?
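
For reference, the QMP knob in question looks like this; a sketch only, since libvirt re-applies its own capability set when it starts a migration, so flipping it by hand is illustrative rather than a workaround:

# virsh qemu-monitor-command test '{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"return-path","state":false}]}}'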

Comment 18 Dr. David Alan Gilbert 2023-03-14 17:01:36 UTC
Hang on, I've got a qemu fix for this. It turns out the RDMA code does have return-path support, but it is only enabled when postcopy is enabled; migration only breaks when you run with postcopy disabled.

Comment 19 Dr. David Alan Gilbert 2023-03-14 17:16:54 UTC
Posted upstream:
migration/rdma: Fix return-path case

