Bug 2196224

Summary: [DPDK checkup] Packet loss when running VM/traffic generator on specific nodes
Product: Container Native Virtualization (CNV)
Component: Networking
Version: 4.13.0
Target Release: 4.14.0
Status: ON_QA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Yossi Segev <ysegev>
Assignee: Petr Horáček <phoracek>
QA Contact: Nir Rozen <nrozen>
CC: ralavi
Type: Bug
Doc Type: If docs needed, set a value
Bug Depends On: 2196459

Description Yossi Segev 2023-05-08 10:32:54 UTC
Description of problem:
When running the DPDK checkup, scheduling the traffic generator and the VM on certain nodes causes the checkup to end with packet loss.


Version-Release number of selected component (if applicable):
CNV 4.13.0
container-native-virtualization-kubevirt-dpdk-checkup-rhel9:v4.13.0-37


How reproducible:
Most of the time (on specific nodes).


Steps to Reproduce:
1. Create a namespace for the job, and switch the context to the new namespace.
$ oc create ns dpdk-checkup-ns
$ oc project dpdk-checkup-ns

2. Label the worker nodes with the "worker-dpdk" label, for example:
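(A minimal sketch; <worker-node-name> is a placeholder and an empty label value is assumed.)
$ oc label node <worker-node-name> worker-dpdk=""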

3. Apply the resource manifests in the attached file in their numeric order:
$ oc apply -f 1-dpdk-checkup-resources.yaml
$ oc apply -f 2-dpdk-checkup-scc.yaml
...
Change the resources in the manifests according to your cluster.

Please note:
Due to https://bugzilla.redhat.com/show_bug.cgi?id=2193235, you cannot set which nodes will be used for scheduling the VM and the traffic generator.
Therefore, you must work around it by either cordoning 2 of the workers so that only one remains schedulable, or by removing the "worker-dpdk" label from 2 nodes and keeping it on only one node, for example:
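A sketch of either workaround (node names are placeholders; the label key is assumed to match the one from step 2):
$ oc adm cordon <worker-node-2>
$ oc adm cordon <worker-node-3>
or:
$ oc label node <worker-node-2> worker-dpdk-
$ oc label node <worker-node-3> worker-dpdk-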

4. After the job is completed, check the ConfigMap:
$ oc get cm dpdk-checkup-config -o yaml
...
  status.failureReason: 'not all generated packets had reached DPDK VM: Sent from
    traffic generator: 480000000; Received on DPDK VM: 110323573'
  status.result.DPDKRxPacketDrops: "0"
  status.result.DPDKRxTestPackets: "110323573"
  status.result.DPDKTxPacketDrops: "0"
  status.result.DPDKVMNode: cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com
  status.result.trafficGeneratorInErrorPackets: "0"
  status.result.trafficGeneratorNode: cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com
  status.result.trafficGeneratorOutputErrorPackets: "0"
  status.result.trafficGeneratorTxPackets: "480000000"
  status.startTimestamp: "2023-05-08T09:49:24Z"
  status.succeeded: "false"
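To pull just the failure reason out of the ConfigMap, something like the following should work (assuming the result keys sit under the ConfigMap's data field, as the indentation above suggests):
$ oc get cm dpdk-checkup-config -o jsonpath='{.data.status\.failureReason}'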


Actual results:
<BUG> Note these fields:
  status.failureReason: 'not all generated packets had reached DPDK VM: Sent from
    traffic generator: 480000000; Received on DPDK VM: 110323573'
  status.succeeded: "false"


Expected results:
Successful job, no packet loss.


Additional info:
1. The difference between Tx bytes and Rx bytes can be seen in the job log:
$ oc logs dpdk-checkup-8nhz9
...
2023/05/08 10:08:47 GetPortStats JSON: {
    "id": "a7mhi4qm",
    "jsonrpc": "2.0",
    "result": {
        "ibytes": 0,
        "ierrors": 0,
        "ipackets": 0,
        "m_cpu_util": 0.0,
        "m_total_rx_bps": 0.0,
        "m_total_rx_pps": 0.0,
        "m_total_tx_bps": 4063406080.0,
        "m_total_tx_pps": 7469495.5,
        "obytes": 32640000000,
        "oerrors": 0,
        "opackets": 480000000
    }
}
2023/05/08 10:08:48 GetPortStats JSON: {
    "id": "ntnu7u0h",
    "jsonrpc": "2.0",
    "result": {
        "ibytes": 30720000000,
        "ierrors": 844,
        "ipackets": 480000000,
        "m_cpu_util": 0.0,
        "m_total_rx_bps": 1902393984.0,
        "m_total_rx_pps": 3715611.0,
        "m_total_tx_bps": 0.0,
        "m_total_tx_pps": 0.0,
        "obytes": 0,
        "oerrors": 0,
        "opackets": 0
    }
}

(compare the obytes in the first summary with the ibytes in the second summary).
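For reference, the gap between the two is 32,640,000,000 - 30,720,000,000 = 1,920,000,000 bytes.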

2. The issue was found on 2 separate clusters, bm01-cnvqe2-rdu2 and bm02-cnvqe2-rdu2.
On bm01-cnvqe2 the problematic node is cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com.
On bm02-cnvqe2 the checkup cannot currently run, so I'm not sure which node(s) are problematic there.

Comment 2 Petr Horáček 2023-05-09 12:54:45 UTC
I think we should put this bug on hold until we resolve https://bugzilla.redhat.com/2196459. Perhaps some nodes have more workload running on them, so we are more likely to land on a shared CPU with other processes. Correct me if my assumption is wrong.
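One quick way to check that assumption could be to list everything scheduled on the suspect node (node name taken from the checkup results above) and compare it with the other workers:
$ oc get pods -A --field-selector spec.nodeName=cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com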