Bug 2101375
| Summary: | passt: failed to transfer file through ipv4 from host to guest | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Quan Wenli <wquan> |
| Component: | passt | Assignee: | Stefano Brivio <sbrivio> |
| Status: | CLOSED ERRATA | QA Contact: | Lei Yang <leiyang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 9.1 | CC: | aadam, leiyang, lvivier, sbrivio, yalzhang |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | passt-0^20230222.g4ddbcb9-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-05-09 07:43:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 2122788 | ||
| Bug Blocks: | |||
Could you please collect a guest-side packet capture of the failing case by starting passt with the -p (--pcap) argument, such as "-p file_transfer.pcap"?

(In reply to Stefano Brivio from comment #1)
> Could you please collect a guest-side packet capture of the failing case by
> starting passt with the -p (--pcap) argument, such as "-p
> file_transfer.pcap"?

Please check the file_transfer.pcap. I reduced the file to 2M, otherwise it's too large to upload in bugzilla.

(In reply to Quan Wenli from comment #3)
> (In reply to Stefano Brivio from comment #1)
> > Could you please collect a guest-side packet capture of the failing case by
> > starting passt with the -p (--pcap) argument, such as "-p
> > file_transfer.pcap"?
>
> Please check the file_transfer.pcap. I reduced the file to 2M, otherwise
> it's too large to upload in bugzilla.

Thanks! And sorry for the delay, I finally had a chance to look at the captured traffic.

In the initial part of the file transfer I couldn't spot any particular issue with sequences, acknowledgements, windows, etc. I guess there might be rather a problem toward the end of the transfer, with the connection not closing or something like that. Two requests from my side:

- you can reproduce this only using the version (I guess) I installed on your test system (under /usr/local/bin) a while ago. I don't remember exactly what I was trying to debug with that. Note that the package would install binaries under /usr/bin, not /usr/local/bin. Could you try to see if this still happens with the latest RPM I published on my Copr repository?

- if yes, would it be possible for you to share the last part of the capture (the last two megabytes or so), instead of the beginning of it?

(In reply to Stefano Brivio from comment #4)
> (In reply to Quan Wenli from comment #3)
> > (In reply to Stefano Brivio from comment #1)
> > > Could you please collect a guest-side packet capture of the failing case by
> > > starting passt with the -p (--pcap) argument, such as "-p
> > > file_transfer.pcap"?
> >
> > Please check the file_transfer.pcap. I reduced the file to 2M, otherwise
> > it's too large to upload in bugzilla.
>
> Thanks! And sorry for the delay, I finally had a chance to look at the
> captured traffic.
>
> In the initial part of the file transfer I couldn't spot any particular
> issue with sequences, acknowledgements, windows, etc. I guess there might be
> rather a problem toward the end of the transfer, with the connection not
> closing or something like that. Two requests from my side:
>
> - you can reproduce this only using the version (I guess) I installed on
> your test system (under /usr/local/bin) a while ago. I don't remember
> exactly what I was trying to debug with that. Note that the package would
> install binaries under /usr/bin, not /usr/local/bin. Could you try to see if
> this still happens with the latest RPM I published on my Copr repository?
>

Yes, I can still reproduce it with the latest passt-0.git.2022_07_20.9af2e5d-0.el9.x86_64

> - if yes, would it be possible for you to share the last part of the capture
> (the last two megabytes or so), instead of the beginning of it?

I found no problem with a 1M file transfer, but the problem still occurs with a 2M file transfer. Since passt was started with " -p file_transfer.pcap_1 ", I am not sure how to reduce the capture file.

Checking the new

(In reply to Quan Wenli from comment #6)
> Created attachment 1902781 [details]
> 2M file transfer capture

Actually I see more than 4 MiB transferred there. If that's the full capture, it ends (frame #187) with an ACK segment from the guest, and nothing else.

Thinking about it, I wonder if the reason is an issue in virtio-net or qemu (I couldn't really find out) I'm currently working around with the -x-txburst parameter passed for the virtio-net-pci device. Specifically, the issue I observed was that at some point packets would stop being queued by qemu, until virtio_net is unloaded and reloaded. I haven't filed a ticket about that, yet.

There's a simple check that should tell us if that's the case: when you hit this failure, you could check if the guest is now completely unable to reach the network (for example with a ping). If it is, and:

rmmod virtio_net; modprobe virtio_net

restores connectivity, I guess that might be the case. The workaround is not entirely robust, but I observed that with higher values of x-txburst, for example:

-device virtio-net-pci,netdev=hostnet0,x-txburst=524288

failures are quite unlikely to occur. Could you please give that a try?

By the way,

(In reply to Quan Wenli from comment #5)
> since " -p file_transfer.pcap_1 " with passt command, I am not sure how to
> reduce capture file.

you can do it once the capture is ready with a tool such as editcap(1). As I never remember the syntax, actually, I open the capture file with Wireshark, select a part of the packets, and save only that as a separate file (File -> Export Specified Packets).

(In reply to Stefano Brivio from comment #7)
> Checking the new (In reply to Quan Wenli from comment #6)
> > Created attachment 1902781 [details]
> > 2M file transfer capture
>
> Actually I see more than 4 MiB transferred there. If that's the full
> capture, it ends (frame #187) with an ACK segment from the guest, and
> nothing else.
>
> Thinking about it, I wonder if the reason is an issue in virtio-net or qemu

But as noted in the "Additional info" part of comment #0, I cannot reproduce this with upstream passt, so I think it's not a qemu issue.

> (I couldn't really find out) I'm currently working around with the
> -x-txburst parameter passed for the virtio-net-pci device. Specifically, the
> issue I observed was that at some point packets would stop being queued by
> qemu, until virtio_net is unloaded and reloaded. I haven't filed a ticket
> about that, yet.
>
> There's a simple check that should tell us if that's the case: when you hit
> this failure, you could check if the guest is now completely unable to reach
> the network (for example with a ping). If it is, and:
>
> rmmod virtio_net; modprobe virtio_net
>
> restores connectivity, I guess that might be the case. The workaround is not
> entirely robust, but I observed that with higher values of x-txburst, for
> example:
>
> -device virtio-net-pci,netdev=hostnet0,x-txburst=524288
>

Yes, I tried, but I can still reproduce the issue.

> failures are quite unlikely to occur. Could you please give that a try?
>
> By the way,
>
> (In reply to Quan Wenli from comment #5)
> > since " -p file_transfer.pcap_1 " with passt command, I am not sure how to
> > reduce capture file.

Ok, I split it into 4 files like file_transfer_00000_xxx.pcap_1_output

> you can do it once the capture is ready with a tool such as editcap(1). As I
> never remember the syntax, actually, I open the capture file with Wireshark,
> select a part of the packets, and save only that as a separate file (File ->
> Export Specified Packets).

Wenli,

BZ 2122788 (Depends-on) has been fixed in qemu-kvm-7.2.0-1.el9

Could you re-test?

(In reply to Laurent Vivier from comment #13)
> Wenli,
>
> BZ 2122788 (Depends-on) has been fixed in qemu-kvm-7.2.0-1.el9
>
> Could you re-test?

I have tested it with the packages below, following the steps in comment 0, and the issue can still be reproduced.

qemu-kvm-7.2.0-7.el9.x86_64
passt-0^20221110.g4129764-1.el9.x86_64

I found that the data transfer only happens in the first several seconds and then stops. The file on the guest stays at 99M:
# ll test_big.bin
-rw-r--r--. 1 root root 103588817 Feb 7 19:10 test_big.bin
file on host:
# ll /tmp/tmp.jPM9PLbjTq
-rw-r--r--. 1 test test 104857600 Feb 7 05:31 /tmp/tmp.jPM9PLbjTq

It works well with two QEMUs back-to-back, so I think the problem is really in passt.

Should we move it to RHEL 9.3.0?

==> Reproduced this bug on passt-0^20221110.g4129764-1.el9.x86_64
Test Version:
passt-0^20221110.g4129764-1.el9.x86_64
qemu-kvm-7.2.0-9.el9.x86_64
kernel-5.14.0-281.el9.x86_64
libvirt-9.0.0-6.el9.x86_64
edk2-ovmf-20221207gitfff6d81270b5-6.el9.noarch
Test Steps:
1. Start passt with a PID file
# passt -f -t 10001 -u 10001 -P passt.pid
Outbound interface (IPv4): switch
Outbound interface (IPv6): switch
MAC:
host: ec:2a:72:30:86:32
DHCP:
assign: 10.73.212.78
mask: 255.255.254.0
router: 10.73.213.254
DNS:
10.72.17.5
10.68.5.26
DNS search list:
lab.eng.pek2.redhat.com
NDP/DHCPv6:
assign: 2620:52:0:49d4:daa3:c9ff:27c7:4e06
router: fe80::52c7:903:543b:88e1
our link-local: fe80::c3e8:6ed9:7dbc:ba55
DNS search list:
lab.eng.pek2.redhat.com
UNIX domain socket bound at /tmp/passt_1.socket
You can now start qemu (>= 7.2, with commit 13c6be96618c):
kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket
or qrap, for earlier qemu versions:
./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio
accepted connection from PID 3038
NDP: received NS, sending NA
DHCP: ack to request
from 9a:36:de:f2:81:a1
NDP: received RS, sending RA
DHCPv6: received SOLICIT, sending ADVERTISE
DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY
2. Boot a guest
$ PATH=$PATH:/usr/libexec
$ qrap 5 qemu-kvm -m 16059 -smp 6 -blockdev '{"node-name": "file_ovmf_code", "driver": "file", "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd", "auto-read-only": true, "discard": "unmap"}' -blockdev '{"node-name": "drive_ovmf_code", "driver": "raw", "read-only": true, "file": "file_ovmf_code"}' -blockdev '{"node-name": "file_ovmf_vars", "driver": "file", "filename": "/home/test/avocado-vt-vm1_rhel920-64-virtio-scsi_qcow2_filesystem_VARS.fd", "auto-read-only": true, "discard": "unmap"}' -blockdev '{"node-name": "drive_ovmf_vars", "driver": "raw", "read-only": false, "file": "file_ovmf_vars"}' -machine q35,memory-backend=mem-machine_mem,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars -device '{"id": "pcie-root-port-0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x1", "chassis": 1}' -device '{"id": "pcie-pci-bridge-0", "driver": "pcie-pci-bridge", "addr": "0x0", "bus": "pcie-root-port-0"}' -nodefaults -device '{"driver": "VGA", "bus": "pcie.0", "addr": "0x2"}' -m 62464 -object '{"size": 65498251264, "id": "mem-machine_mem", "qom-type": "memory-backend-ram"}' -smp 28,maxcpus=28,cores=14,threads=1,dies=1,sockets=2 -cpu 'Icelake-Server',ds=on,ss=on,dtes64=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,avx512ifma=on,sha-ni=on,rdpid=on,fsrm=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,mpx=off,intel-pt=off,kvm_pv_unhalt=on -device '{"id": "pcie-root-port-2", "port": 2, "driver": "pcie-root-port", "addr": "0x1.0x2", "bus": "pcie.0", "chassis": 3}' -device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pcie-root-port-2", "addr": "0x0"}' -blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/test/rhel920-64-virtio-scsi.qcow2", "cache": {"direct": true, "no-flush": false}}' -blockdev 
'{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' -device '{"driver": "scsi-hd", "id": "image1", "drive": "drive_image1", "write-cache": "on"}' -device '{"id": "pcie-root-port-3", "port": 3, "driver": "pcie-root-port", "addr": "0x1.0x3", "bus": "pcie.0", "chassis": 4}' -device '{"driver": "virtio-net-pci", "mac": "9a:36:de:f2:81:a1", "id": "net0", "netdev": "hostnet0", "x-txburst": 16384, "bus": "pcie-root-port-3", "addr": "0x0"}' -netdev socket,fd=5,id=hostnet0 -boot menu=off,order=cdn,once=c,strict=off -vnc :0 -boot menu=off,order=cdn,once=c,strict=off -monitor stdio
3. Disable firewall on both guest and host
# systemctl stop firewalld.service || service iptables stop || iptables -F || nft flush ruleset
4. Create a 100M file
on host:
$ dd if=/dev/urandom bs=1M count=100 > /tmp/tmp.jPM9PLbjTq
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.347883 s, 301 MB/s
5. Transfer file from host to guest
on guest:
# nc -l 10001 > test_big.bin
on host:
$ cat /tmp/tmp.jPM9PLbjTq |nc 127.0.0.1 10001
6. Wait for more than 10 minutes: the nc command does not exit. Data is only transferred during the first several seconds, then the transfer stops.
# ll test_big.bin
-rw-r--r--. 1 root root 103758237 Feb 23 08:45 test_big.bin
==>So reproduced this problem on passt-0^20221110.g4129764-1.el9.x86_64
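The stall seen in step 6 can also be spotted without waiting ten minutes, by polling the size of the received file on the guest. The following is only a minimal sketch, assuming a POSIX shell; `detect_stall` and its arguments are hypothetical names used for illustration and are not part of passt, nc, or the test suite.

```shell
# Hypothetical helper: poll a file's size and report whether it is still growing.
# Usage: detect_stall <file> <interval_seconds> <max_polls>
# Prints "stalled at N bytes" and returns 1 as soon as two consecutive
# polls see the same size; otherwise prints "still growing: N bytes".
detect_stall() {
    file=$1
    interval=$2
    polls=$3
    # Arithmetic expansion strips any padding that wc -c may add.
    last=$(( $(wc -c < "$file") ))
    i=0
    while [ "$i" -lt "$polls" ]; do
        sleep "$interval"
        now=$(( $(wc -c < "$file") ))
        if [ "$now" -eq "$last" ]; then
            echo "stalled at $now bytes"
            return 1
        fi
        last=$now
        i=$((i + 1))
    done
    echo "still growing: $last bytes"
}
```

On the guest, something like `detect_stall test_big.bin 5 120` would check every five seconds for up to ten minutes. A completed transfer also stops growing, so the reported size still has to be compared with the size of the source file on the host.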
==>Update the passt version to the latest version: passt-0^20230222.g4ddbcb9-1.el9.x86_64
Test Version:
passt-0^20230222.g4ddbcb9-1.el9.x86_64
qemu-kvm-7.2.0-9.el9.x86_64
kernel-5.14.0-281.el9.x86_64
libvirt-9.0.0-6.el9.x86_64
edk2-ovmf-20221207gitfff6d81270b5-6.el9.noarch
Test Steps:
1. Update the passt version to the latest version: passt-0^20230222.g4ddbcb9-1.el9.x86_64
# yum -y install passt-0^20230222.g4ddbcb9-1.el9.x86_64.rpm
# reboot (on the host)
2. After the host has booted, repeat the above test steps. The data is transferred within seconds and the nc command exits.
# ll test_big.bin
-rw-r--r--. 1 root root 104857600 Feb 23 09:08 test_big.bin
==> So based on the above test result this bug has been fixed very well on passt-0^20230222.g4ddbcb9-1.el9.x86_64.
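The pass/fail criterion used in both runs is a comparison between the byte count of the source file on the host and that of the received file on the guest. As a rough sketch of that comparison, assuming a POSIX shell and that the two sizes have already been read from the `ll` output on each side (`check_transfer` is a hypothetical name, not an existing tool):

```shell
# Hypothetical helper: compare expected and received byte counts.
# Usage: check_transfer <expected_bytes> <received_bytes>
# Returns 0 when the sizes match, 1 otherwise.
check_transfer() {
    expected=$1
    received=$2
    if [ "$expected" -eq "$received" ]; then
        echo "complete: $received bytes"
    else
        echo "incomplete: $received of $expected bytes ($((expected - received)) missing)"
        return 1
    fi
}
```

With the sizes reported above, `check_transfer 104857600 103758237` reports an incomplete transfer for passt-0^20221110.g4129764-1.el9, and `check_transfer 104857600 104857600` reports a complete one for passt-0^20230222.g4ddbcb9-1.el9. Matching sizes do not prove the contents are identical; comparing checksums on both sides would be a stricter check.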
Moving to modified, thanks Lei Yang for checking! Yes, we found a separate TCP stall issue which is now fixed in passt-0^20230222.g4ddbcb9-1.el9.

Based on the Comment 18 test result, move to "VERIFIED".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (passt bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2292
Description of problem:
Failed to transfer file through ipv4 from host to guest with passt

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. [test@dell-per440-18 ~]$ /usr/local/bin/passt -f -t 10001 -u 10001 -P passt.pid
Outbound interface: eno1
ARP:
address: 2c:ea:7f:71:b6:ee
DHCP:
assign: 10.73.114.95
mask: 255.255.254.0
router: 10.73.115.254
DNS:
10.73.2.107
10.73.2.108
10.66.127.10
DNS search list:
lab.eng.pek2.redhat.com
NDP/DHCPv6:
assign: 2620:52:0:4972:2eea:7fff:fe71:b6ee
router: fe80::cee1:9402:8b35:be41
our link-local: fe80::2eea:7fff:fe71:b6ee
DNS search list:
lab.eng.pek2.redhat.com
UNIX domain socket bound at /tmp/passt_1.socket
You can now start qrap:
./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio
or directly qemu, patched with:
qemu/0001-net-Allow-also-UNIX-domain-sockets-to-be-used-as-net.patch
as follows:
kvm ... -net socket,connect=/tmp/passt_1.socket -net nic,model=virtio
DHCP: ack to request
from 52:54:00:12:34:56
NDP: received RS, sending RA
DHCPv6: received SOLICIT, sending ADVERTISE
DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY
2. boot up guest
[test@dell-per440-18 ~]$ PATH=$PATH:/usr/libexec
[test@dell-per440-18 ~]$ qrap 5 qemu-kvm -m 16059 -cpu host -smp 6 -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=./rhel900-64-virtio.qcow2 -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0 -nographic -serial stdio -nodefaults -device virtio-net-pci,netdev=hostnet0,x-txburst=16384 -netdev socket,fd=5,id=hostnet0
3. Disable firewall on both guest and host
on guest:
[root@dell-per440-18 ~]# systemctl stop firewalld.service || service iptables stop || iptables -F || nft flush ruleset
on host:
[test@dell-per440-18 ~]$ systemctl stop firewalld.service || service iptables stop || iptables -F || nft flush ruleset
4.
on host:
[test@dell-per440-18 ~]$ dd if=/dev/urandom bs=1M count=100 > /tmp/tmp.jPM9PLbjTq
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.632863 s, 166 MB/s
on guest:
[root@dell-per440-18 ~]# nc -l 10001 > test_big.bin
on host:
[test@dell-per440-18 ~]$ cat /tmp/tmp.jPM9PLbjTq |nc 127.0.0.1 10001
5. Wait for more than 10 minutes: the nc command does not exit.

Actual results:

Expected results:

Additional info:
1. There is no problem with upstream "$ make valgrind"
1.1 It passes with $ valgrind --max-stackframe=4194304 --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f -t 10001 -u 10001 -P passt.pid
1.2 It passes with "./passt -f -t 10001 -u 10001 -P passt.pid"
2. The problem exists with upstream "$ make"