Bug 2189684

Summary: RHEL 8.8 XDP xdp_tools hangs on xdp-tools-dump-native-*
Product: Red Hat Enterprise Linux 8
Reporter: Jon Trossbach <jtrossba>
Component: xdp-tools
Assignee: Toke Høiland-Jørgensen <thoiland>
Status: NEW ---
QA Contact: Christian Trautman <ctrautma>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 8.8
CC: jbenc, kzhang
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jon Trossbach 2023-04-25 21:46:06 UTC
Setting as urgent as this is happening in spotcheck. Will update with verbose outputs when I have them.


Description of problem:
 xdp_tools mlx5-cx5 hangs on xdp-tools-dump-native-promis -- promiscuous mode testcase

Version-Release number of selected component (if applicable):
xdp-tools       x86_64       1.2.10-1.el8

How reproducible:
Always

Steps to Reproduce:
:: [ 22:45:00 ] :: [   LOG    ] :: Start recording CPU usage
:: [ 22:45:00 ] :: [  BEGIN   ] :: Running './CpuReporter -f xdp_tools_dump_native_snap.html &'
:: [ 22:45:00 ] :: [   PASS   ] :: Command './CpuReporter -f xdp_tools_dump_native_snap.html &' (Expected 0,1, got 0)
:: [ 22:45:00 ] :: [  BEGIN   ] :: Running 'sleep 10'
:: [ 22:45:10 ] :: [   PASS   ] :: Command 'sleep 10' (Expected 0, got 0)
x86_64
:: [ 22:45:10 ] :: [  BEGIN   ] :: Running 'xdp_tools_dump_native_snap'

Actual results:

8: ens1f0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
8: ens1f0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
wait for ens1f0 sec 0
8: ens1f0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:de:ad:de:ad:01 brd ff:ff:ff:ff:ff:ff permaddr 0c:42:a1:9d:04:52
    prog/xdp id 83 name xdp_dispatcher tag 94d5f00c20184d17 jited 
    altname enp59s0f0
Wait 0 secs until port becomes UP
SYNC_NC: sync_set client snapshot_and_write_test_base
SYNC_NC: sent "snapshot_and_write_test_base" to netqe1.knqe.lab.eng.bos.redhat.com
SYNC_NC: sync_wait client snapshot_and_write_test_base
SYNC_NC: waiting "netqe1.knqe.lab.eng.bos.redhat.com"
SYNC_NC: got "snapshot_and_write_test_base" from netqe1.knqe.lab.eng.bos.redhat.com
listening on ens1f0, ingress XDP program ID 90 func xdpfilt_alw_all, capture mode entry, capture size 20 bytes


Expected results:
Testcase passes

Additional info:
https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/04/77650/7765031/13770441/159149284/taskout.log

Comment 2 Jon Trossbach 2023-05-09 15:04:01 UTC
Okay, weird update on this: it could be my machines. They sometimes have trouble with xdp-tools-dump-native-promis on different cards, and the hang is not reliably reproducible. Can't rule out a general x86 issue now.

Changing title as I have now seen this on i40e: https://beaker.engineering.redhat.com/jobs/7829687

And what seems like a related xdp-tools-dump-native-use-pcap issue: https://beaker.engineering.redhat.com/recipes/13865268#task159874824

Taken together, these point to a wider xdp-tools dumping issue, possibly architecture-related. The machines are PowerEdge R740.
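
Since the hang is intermittent, it may help to bound the capture step with a timeout so that a stuck run is recorded as a hang instead of blocking the whole job. The following is only a sketch of that idea, not part of the existing test suite: the helper name run_with_timeout and the timeout values are hypothetical, and the command to run is whatever the testcase actually invokes.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) and wait at most timeout_s seconds.

    Returns ("done", returncode) if the command exits in time,
    or ("hung", None) if it had to be killed after the timeout.
    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(cmd)
    try:
        rc = proc.wait(timeout=timeout_s)
        return ("done", rc)
    except subprocess.TimeoutExpired:
        # The command did not finish in time: kill it and reap the process
        # so it does not linger as a zombie.
        proc.kill()
        proc.wait()
        return ("hung", None)
```

For example, run_with_timeout(["xdpdump", "-i", "ens1f0"], 60) would report "hung" if the capture on ens1f0 never returns within 60 seconds. The exact xdpdump arguments used by the testcase are not shown in the log above, so that command line is an assumption.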