Description of problem:
When running the DPDK checkup, there are some nodes that, when the traffic generator and the VM are scheduled on them, cause the checkup to end with packet loss.

Version-Release number of selected component (if applicable):
CNV 4.13.0
container-native-virtualization-kubevirt-dpdk-checkup-rhel9:v4.13.0-37

How reproducible:
Most of the time (on specific nodes).

Steps to Reproduce:
1. Create a namespace for the job and switch the context to the new namespace:
$ oc create ns dpdk-checkup-ns
$ oc project dpdk-checkup-ns
2. Label the worker nodes with the "worker-dpdk" label.
3. Apply the resource manifests in the attached file in their numeric order:
$ oc apply -f 1-dpdk-checkup-resources.yaml
$ oc apply -f 2-dpdk-checkup-scc.yaml
...
Change the resources according to your cluster.
Please note: Due to https://bugzilla.redhat.com/show_bug.cgi?id=2193235, you cannot set which nodes will be used for scheduling the VM and the traffic generator. Therefore, you must work around it by either cordoning 2 workers and leaving only one as schedulable, or removing the "dpdk-workers" label from 2 nodes and keeping it on only one node (see the workaround command sketch after the Additional info section).
4. After the job is completed, check the ConfigMap:
$ oc get cm dpdk-checkup-config -o yaml
...
status.failureReason: 'not all generated packets had reached DPDK VM: Sent from traffic generator: 480000000; Received on DPDK VM: 110323573'
status.result.DPDKRxPacketDrops: "0"
status.result.DPDKRxTestPackets: "110323573"
status.result.DPDKTxPacketDrops: "0"
status.result.DPDKVMNode: cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com
status.result.trafficGeneratorInErrorPackets: "0"
status.result.trafficGeneratorNode: cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com
status.result.trafficGeneratorOutputErrorPackets: "0"
status.result.trafficGeneratorTxPackets: "480000000"
status.startTimestamp: "2023-05-08T09:49:24Z"
status.succeeded: "false"

Actual results:
The checkup fails with packet loss. Note these fields:
status.failureReason: 'not all generated packets had reached DPDK VM: Sent from traffic generator: 480000000; Received on DPDK VM: 110323573'
status.succeeded: "false"

Expected results:
Successful job, no packet loss.

Additional info:
1. The diff between Tx bytes and Rx bytes can be seen in the job log:
$ oc logs dpdk-checkup-8nhz9
...
2023/05/08 10:08:47 GetPortStats JSON: {
  "id": "a7mhi4qm",
  "jsonrpc": "2.0",
  "result": {
    "ibytes": 0,
    "ierrors": 0,
    "ipackets": 0,
    "m_cpu_util": 0.0,
    "m_total_rx_bps": 0.0,
    "m_total_rx_pps": 0.0,
    "m_total_tx_bps": 4063406080.0,
    "m_total_tx_pps": 7469495.5,
    "obytes": 32640000000,
    "oerrors": 0,
    "opackets": 480000000
  }
}
2023/05/08 10:08:48 GetPortStats JSON: {
  "id": "ntnu7u0h",
  "jsonrpc": "2.0",
  "result": {
    "ibytes": 30720000000,
    "ierrors": 844,
    "ipackets": 480000000,
    "m_cpu_util": 0.0,
    "m_total_rx_bps": 1902393984.0,
    "m_total_rx_pps": 3715611.0,
    "m_total_tx_bps": 0.0,
    "m_total_tx_pps": 0.0,
    "obytes": 0,
    "oerrors": 0,
    "opackets": 0
  }
}
(Compare the obytes in the first summary with the ibytes in the second summary.)
2. The issue was found on 2 separate clusters: bm01-cnvqe2-rdu2 and bm02-cnvqe2-rdu2.
On bm01-cnvqe2 the problematic node is cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com.
On bm02-cnvqe2 the checkup cannot currently run, so I'm not sure which node(s) are problematic.
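Workaround command sketch for step 3 (node names are placeholders; the label key should match whatever was actually applied in step 2 — the checkup config and this report use "worker-dpdk"/"dpdk-workers" inconsistently):
$ oc adm cordon <worker-node-2>
$ oc adm cordon <worker-node-3>
or, alternatively, remove the DPDK label from the nodes that should not be used:
$ oc label node <worker-node-2> worker-dpdk-
$ oc label node <worker-node-3> worker-dpdk-
After the run, the nodes can be restored with "oc adm uncordon <node>" or by re-applying the label.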
I think we should put this bug on hold until we resolve https://bugzilla.redhat.com/2196459. Perhaps some nodes have more workload running on them, so we are more likely to land on a shared CPU with other processes. Correct me if my assumption is wrong.
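If it helps to check the shared-CPU assumption, something like the following could be run on the suspect node and compared against a healthy one (a sketch only; the virt-launcher pod name is a placeholder):
$ oc get pods -A -o wide --field-selector spec.nodeName=cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com    # what else is scheduled on the node
$ oc debug node/cnv-qe-infra-06.cnvqe2.lab.eng.rdu2.redhat.com -- chroot /host cat /var/lib/kubelet/cpu_manager_state    # kubelet's exclusive CPU assignments
$ oc exec <virt-launcher-pod> -- cat /sys/fs/cgroup/cpuset.cpus.effective    # CPUs visible to the launcher pod (cgroup v2 path; cgroup v1 uses /sys/fs/cgroup/cpuset/cpuset.cpus)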