Description of problem:
When running the DPDK checkup, there are some nodes that, when the traffic generator and the VM are scheduled on them, cause the checkup to end with packet loss.

Version-Release number of selected component (if applicable):
CNV 4.13.0
container-native-virtualization-kubevirt-dpdk-checkup-rhel9:v4.13.0-37

How reproducible:
Most of the time (on specific nodes).

Steps to Reproduce:
1. Create a namespace for the job and switch the context to the new namespace:
$ oc create ns dpdk-checkup-ns
$ oc project dpdk-checkup-ns
2. Label the worker nodes with the "worker-dpdk" label.
3. Apply the resource manifests in the attached file in their numeric order:
$ oc apply -f 1-dpdk-checkup-resources.yaml
$ oc apply -f 2-dpdk-checkup-scc.yaml
...
Change the resources according to your cluster.
Please note: Due to https://bugzilla.redhat.com/show_bug.cgi?id=2193235, you cannot set which nodes will be used for scheduling the VM and the traffic generator. Therefore, you must work around it by either cordoning 2 workers and leaving only one as schedulable, or removing the "dpdk-workers" label from 2 nodes and keeping it on only one node (see the workaround command sketch after the Additional info section).
4. After the job is completed, check the ConfigMap:
$ oc get cm dpdk-checkup-config -o yaml
...
status.failureReason: 'not all generated packets had reached DPDK VM: Sent from traffic generator: 480000000; Received on DPDK VM: 110323573'
status.result.DPDKRxPacketDrops: "0"
status.result.DPDKRxTestPackets: "110323573"
status.result.DPDKTxPacketDrops: "0"
status.result.DPDKVMNode: cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com
status.result.trafficGeneratorInErrorPackets: "0"
status.result.trafficGeneratorNode: cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com
status.result.trafficGeneratorOutputErrorPackets: "0"
status.result.trafficGeneratorTxPackets: "480000000"
status.startTimestamp: "2023-05-08T09:49:24Z"
status.succeeded: "false"

Actual results:
The checkup fails with packet loss. Note these fields:
status.failureReason: 'not all generated packets had reached DPDK VM: Sent from traffic generator: 480000000; Received on DPDK VM: 110323573'
status.succeeded: "false"

Expected results:
Successful job, no packet loss.

Additional info:
1. The diff between Tx bytes and Rx bytes can be seen in the job log:
$ oc logs dpdk-checkup-8nhz9
...
2023/05/08 10:08:47 GetPortStats JSON: {
  "id": "a7mhi4qm",
  "jsonrpc": "2.0",
  "result": {
    "ibytes": 0,
    "ierrors": 0,
    "ipackets": 0,
    "m_cpu_util": 0.0,
    "m_total_rx_bps": 0.0,
    "m_total_rx_pps": 0.0,
    "m_total_tx_bps": 4063406080.0,
    "m_total_tx_pps": 7469495.5,
    "obytes": 32640000000,
    "oerrors": 0,
    "opackets": 480000000
  }
}
2023/05/08 10:08:48 GetPortStats JSON: {
  "id": "ntnu7u0h",
  "jsonrpc": "2.0",
  "result": {
    "ibytes": 30720000000,
    "ierrors": 844,
    "ipackets": 480000000,
    "m_cpu_util": 0.0,
    "m_total_rx_bps": 1902393984.0,
    "m_total_rx_pps": 3715611.0,
    "m_total_tx_bps": 0.0,
    "m_total_tx_pps": 0.0,
    "obytes": 0,
    "oerrors": 0,
    "opackets": 0
  }
}
(Compare the obytes in the first summary with the ibytes in the second summary.)
2. The issue was found on 2 separate clusters: bm01-cnvqe2-rdu2 and bm02-cnvqe2-rdu2.
On bm01-cnvqe2 the problematic node is cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com.
On bm02-cnvqe2 the checkup cannot currently run, so I'm not sure which node(s) are problematic.
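Workaround command sketch for step 3 (node names are placeholders; the label key should match whatever was actually applied in step 2 — the checkup config and this report use "worker-dpdk"/"dpdk-workers" inconsistently):
$ oc adm cordon <worker-node-2>
$ oc adm cordon <worker-node-3>
or, alternatively, remove the DPDK label from the nodes that should not be used:
$ oc label node <worker-node-2> worker-dpdk-
$ oc label node <worker-node-3> worker-dpdk-
After the run, the nodes can be restored with "oc adm uncordon <node>" or by re-applying the label.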
I think we should put this bug on hold until we resolve https://bugzilla.redhat.com/2196459. Perhaps some nodes have more workload running on them, so we are more likely to land on a shared CPU with other processes. Correct me if my assumption is wrong.
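If it helps to check the shared-CPU assumption, something like the following could be run on the suspect node and compared against a healthy one (a sketch only; the virt-launcher pod name is a placeholder):
$ oc get pods -A -o wide --field-selector spec.nodeName=cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com    # what else is scheduled on the node
$ oc debug node/cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com -- chroot /host cat /var/lib/kubelet/cpu_manager_state    # kubelet's exclusive CPU assignments
$ oc exec <virt-launcher-pod> -- cat /sys/fs/cgroup/cpuset.cpus.effective    # CPUs visible to the launcher pod (cgroup v2 path; cgroup v1 uses /sys/fs/cgroup/cpuset/cpuset.cpus)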