Bug 1808789 - [ovs-dpdk] [x710] toeplitz hashing not functioning when src_ip alone is changed for RSS
Summary: [ovs-dpdk] [x710] toeplitz hashing not functioning when src_ip alone is changed for RSS
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: FDP 20.C
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Timothy Redaelli
QA Contact: qding
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-01 09:15 UTC by Gowrishankar Muthukrishnan
Modified: 2023-07-13 07:25 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments: none


Links:
Red Hat Issue Tracker FD-1961 (Last Updated: 2022-05-09 15:11:01 UTC)

Description Gowrishankar Muthukrishnan 2020-03-01 09:15:29 UTC
Description of problem:

In my OSP 16 cluster, the computeovsdpdk node runs ovs-dpdk (v2.11.0-35). After OpenStack deployment and bring-up of the tenant instances (a trex VM and a testpmd VM in this case), the rxqs of the phy NICs are rebalanced across multiple PMD threads, along with the vhostuser rxqs of the VMs. While testing multiple flow streams (4 in each direction, to exactly match the n_rxq=4 configured for the phy NICs), I observed that a PMD which polls rxqs of two different ports of the X710 NIC receives traffic on only one of them, not both.

As seen in the test info below, pmd core 22 is assigned one rxq from each of the two phy ports (marked <<<), yet only one of them ends up carrying traffic.
   
pmd thread numa_id 0 core_id 1:
  isolated : false
  port: vhu3cb8fa43-5c    queue-id:  0  pmd usage:  0 %
  port: vhu3cb8fa43-5c    queue-id:  3  pmd usage:  0 %
  port: vhu5e59e258-c2    queue-id:  0  pmd usage:  0 %
  port: vhu5e59e258-c2    queue-id:  3  pmd usage:  0 %
pmd thread numa_id 0 core_id 2:
  isolated : false
  port: dpdk-link1-port   queue-id:  1  pmd usage:  0 %
  port: vhu3cb8fa43-5c    queue-id:  5  pmd usage:  0 %
  port: vhu5e59e258-c2    queue-id:  5  pmd usage:  0 %
  port: vhu8948a6eb-80    queue-id:  6  pmd usage:  0 %
pmd thread numa_id 0 core_id 10:
  isolated : false
  port: dpdk-link1-port   queue-id:  2  pmd usage:  0 %
  port: vhu3cb8fa43-5c    queue-id:  4  pmd usage:  0 %
  port: vhu5e59e258-c2    queue-id:  4  pmd usage:  0 %
  port: vhu8948a6eb-80    queue-id:  7  pmd usage:  0 %
pmd thread numa_id 0 core_id 11:
  isolated : false
  port: dpdk-link1-port   queue-id:  0  pmd usage:  0 %
  port: vhu3cb8fa43-5c    queue-id:  7  pmd usage:  0 %
  port: vhu5e59e258-c2    queue-id:  7  pmd usage:  0 %
  port: vhu8948a6eb-80    queue-id:  4  pmd usage:  0 %
pmd thread numa_id 0 core_id 13:
  isolated : false
  port: dpdk-link2-port   queue-id:  1  pmd usage:  0 %
  port: dpdk-link2-port   queue-id:  3  pmd usage:  0 %
  port: vhu8948a6eb-80    queue-id:  0  pmd usage:  0 %
  port: vhu8948a6eb-80    queue-id:  3  pmd usage:  0 %
pmd thread numa_id 0 core_id 14:
  isolated : false
  port: vhu3cb8fa43-5c    queue-id:  1  pmd usage:  0 %
  port: vhu3cb8fa43-5c    queue-id:  2  pmd usage:  0 %
  port: vhu5e59e258-c2    queue-id:  1  pmd usage:  0 %
  port: vhu5e59e258-c2    queue-id:  2  pmd usage:  0 %
pmd thread numa_id 0 core_id 22:
  isolated : false
  port: dpdk-link1-port   queue-id:  3  pmd usage:  0 % <<<
  port: dpdk-link2-port   queue-id:  2  pmd usage:  0 % <<<
  port: vhu8948a6eb-80    queue-id:  1  pmd usage:  0 %
  port: vhu8948a6eb-80    queue-id:  2  pmd usage:  0 %
pmd thread numa_id 0 core_id 23:
  isolated : false
  port: dpdk-link2-port   queue-id:  0  pmd usage:  0 %
  port: vhu3cb8fa43-5c    queue-id:  6  pmd usage:  0 %
  port: vhu5e59e258-c2    queue-id:  6  pmd usage:  0 %
  port: vhu8948a6eb-80    queue-id:  5  pmd usage:  0 %

For instance, with 1 Mpps of traffic in each direction of the PVP path, generated with the following trex command in the trex VM:

python -u /opt/trafficgen/trex-txrx.py --device-pairs=0:1 --active-device-pairs=0:1 --mirrored-log --measure-latency=0 --rate=1.0 --rate-unit=mpps --size=64 --runtime=120 --runtime-tolerance=5 --run-bidirec=1 --run-revunidirec=0 --num-flows=4 --dst-macs=fa:16:3e:7e:26:38,fa:16:3e:bc:43:c0 --vlan-ids=306,307 --use-src-ip-flows=1 --use-dst-ip-flows=0 --use-src-mac-flows=0 --use-dst-mac-flows=0 --use-src-port-flows=0 --use-dst-port-flows=0 --use-protocol-flows=0 --packet-protocol=UDP --stream-mode=continuous --max-loss-pct=0.0

This generates the following two sets of streams (four in each direction), captured with:

sudo ovs-tcpdump -nn -e  -i dpdk-link1-port -c 1024 | awk '/IPv4/ {print $2" "$3" "$4" "$16 $17 $18 $19}'| sort | uniq

in one direction:
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.46.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.47.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.48.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.49.32768>50.0.5.43.53:30840

in other direction:
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.43.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.44.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.45.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.46.32768>50.0.4.46.53:30840


Checking PMD usage for the phy NICs alone:

$ sudo ovs-appctl dpif-netdev/pmd-rxq-show | egrep '(pmd thread|link1|link2)'
pmd thread numa_id 0 core_id 1:
pmd thread numa_id 0 core_id 2:
  port: dpdk-link1-port   queue-id:  1  pmd usage:  7 %
pmd thread numa_id 0 core_id 10:
  port: dpdk-link1-port   queue-id:  2  pmd usage: 12 %
pmd thread numa_id 0 core_id 11:
  port: dpdk-link1-port   queue-id:  0  pmd usage: 12 %
pmd thread numa_id 0 core_id 13:
  port: dpdk-link2-port   queue-id:  1  pmd usage: 18 %
  port: dpdk-link2-port   queue-id:  3  pmd usage:  0 %
pmd thread numa_id 0 core_id 14:
pmd thread numa_id 0 core_id 22:
  port: dpdk-link1-port   queue-id:  3  pmd usage:  7 %
  port: dpdk-link2-port   queue-id:  2  pmd usage:  0 % <<< not polled >>>
pmd thread numa_id 0 core_id 23:
  port: dpdk-link2-port   queue-id:  0  pmd usage: 18 %

There was no loss, however, due to the low packet rate:

testpmd (mac forward):

  Throughput (since last show)
  RX-packets: 19999999       TX-packets: 19999999       TX-dropped: 0
  ------- Forward Stats for RX Port= 1/Queue= 0 -> TX Port= 0/Queue= 0 -------
  RX-packets: 59999998       TX-packets: 59999998       TX-dropped: 0
  ------- Forward Stats for RX Port= 1/Queue= 1 -> TX Port= 0/Queue= 1 -------
  RX-packets: 59999998       TX-packets: 59999998       TX-dropped: 0
  ---------------------- Forward statistics for port 0  ----------------------
  RX-packets: 119999996      RX-dropped: 0             RX-total: 119999996
  TX-packets: 119999996      TX-dropped: 0             TX-total: 119999996
  ----------------------------------------------------------------------------

  ---------------------- Forward statistics for port 1  ----------------------
  RX-packets: 119999996      RX-dropped: 0             RX-total: 119999996
  TX-packets: 119999996      TX-dropped: 0             TX-total: 119999996
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
  RX-packets: 239999992      RX-dropped: 0             RX-total: 239999992
  TX-packets: 239999992      TX-dropped: 0             TX-total: 239999992
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

trex pkt gen:

Transmitting at 1.0mpps from port 0 to port 1 for 120 seconds...
Transmitting at 1.0mpps from port 0 to port 1 for 120 seconds...
Transmitting at 1.0mpps from port 1 to port 0 for 120 seconds...
Transmitting at 1.0mpps from port 1 to port 0 for 120 seconds...
Waiting...
ore":18.597078323364258,"force_quit":false,"timeout":false,"rx_cpu_util":0.0,"rx_pps":1997657.125,"queue_full":0,"cpu_util":1.9478102922439575,"tx_pps":1997628.25,"tx_bps":1086707456.0,"rx_drop_bps":0.0,"early_exit":false,"runtime":120.038953},"total":{"tx_util":14.06328104,"rx_bps":1086725888.0,"obytes":16319999456,"rx_pps":1997657.125,"ipackets":240000480,"oerrors":0,"rx_util":14.06351028,"opackets":239999992,"tx_pps":1997628.25,"tx_bps":1086707584.0,"ierrors":0,"rx_bps_L1":1406351028.0,"tx_bps_L1":1406328104.0,"ibytes":16320035824},"flow_stats":{"128":{"rx_bps":{"0":0.0,"1":0.0,"total":0.0},"loss":{"cnt":{"1->0":0.0,"total":0.0,"0->1":"N/A"},"pct":{"1->0":0.0,"total":0.0,"0->1":"N/A"}},"rx_pps":{"0":0.0,"1":0.0,"total":0.0},"rx_pkts":{"0":119999996,"1":0,"total":119999996},"rx_bytes":{"0":"N/A","1":"N/A","total":"N/A"},"tx_pkts":{"0":0,"1":119999996,"total":119999996},"tx_pps":{"0":0.0,"1":0.0,"total":0.0},"tx_bps":{"0":0.0,"1":0.0,"total":0.0},"tx_bytes":{"0":0,"1":8159999728,"total":8159999728},"rx_bps_l1":{"0":0.0,"1":0.0,"total":0.0},"tx_bps_l1":{"0":0.0,"1":0.0,"total":0.0}},"1":{"rx_bps":{"0":0.0,"1":0.0,"total":0.0},"loss":{"cnt":{"1->0":"N/A","total":0.0,"0->1":0.0},"pct":{"1->0":"N/A","total":0.0,"0->1":0.0}},"rx_pps":{"0":0.0,"1":0.0,"total":0.0},"rx_pkts":{"0":0,"1":119999996,"total":119999996},"rx_bytes":{"0":"N/A","1":"N/A","total":"N/A"},"tx_pkts":{"0":119999996,"1":0,"total":119999996},"tx_pps":{"0":0.0,"1":0.0,"total":0.0},"tx_bps":{"0":0.0,"1":0.0,"total":0.0},"tx_bytes":{"0":8159999728,"1":0,"total":8159999728},"rx_bps_l1":{"0":0.0,"1":0.0,"total":0.0},"tx_bps_l1":{"0":0.0,"1":0.0,"total":0.0}},"global":{"rx_err":{"0":0,"1":0},"tx_err":{"0":0,"1":0}}}}

As noted, the loss is 0, so all packets were processed in this low tx-rate test; however, rxq 2 of dpdk-link2-port showed no traffic.

Does this mean RSS did not hash the streams uniformly when an rxq of the peer port is polled by the same PMD?

Version-Release number of selected component (if applicable):


Steps to Reproduce:
1. Deploy a phy NIC with multiqueue enabled (at least 4 queues) in an OpenStack cluster (RHOSP 16).
2. Deploy a trex VM on an SR-IOV compute node and generate traffic on its VFs.
3. Deploy a testpmd VM on the ovs-dpdk compute node to forward the trex traffic in mac mode.

Actual results:
The peer phy port's rxq receives no traffic on the PMD that also polls the other phy port's rxq.

Expected results:
Both phy port rxqs assigned to the same PMD should receive traffic.

Additional info:

Comment 1 Gowrishankar Muthukrishnan 2020-03-01 15:19:35 UTC
The problem is confirmed by the number of CPU cycles consumed by the affected rxq, which turns out to be zero, as shown below.

During the PVP test, trigger an rxq rebalance with ovs-appctl; while rebalancing, OVS reports the CPU cycles consumed by every rxq in ovs-vswitchd.log. Check the affected rxq there.

For instance, in the rxq assignment below:
	
	pmd thread numa_id 0 core_id 1:
	  port: dpdk-link1-port   queue-id:  2  pmd usage:  0 %
	pmd thread numa_id 0 core_id 2:
	pmd thread numa_id 0 core_id 10:
	pmd thread numa_id 0 core_id 11:
	  port: dpdk-link1-port   queue-id:  0  pmd usage:  0 %
	pmd thread numa_id 0 core_id 13:
	  port: dpdk-link2-port   queue-id:  1  pmd usage:  0 %
	  port: dpdk-link2-port   queue-id:  3  pmd usage:  0 %
	pmd thread numa_id 0 core_id 14:
	  port: dpdk-link2-port   queue-id:  0  pmd usage:  0 %
	pmd thread numa_id 0 core_id 22:
	  port: dpdk-link1-port   queue-id:  3  pmd usage:  0 %
	  port: dpdk-link2-port   queue-id:  2  pmd usage:  0 % <<< rxq going to be affected >>>
	pmd thread numa_id 0 core_id 23:
	  port: dpdk-link1-port   queue-id:  1  pmd usage:  0 %

Start the trex test, trigger an rxq rebalance, and check ovs-vswitchd.log:
	<trex tx>
        wait for mid of test
	sudo ovs-appctl dpif-netdev/pmd-rxq-rebalance

From the log:

	2020-03-01T15:13:42.262Z|03069|dpif_netdev|INFO|Core 22 on numa node 0 assigned port 'dpdk-link2-port' rx queue 1 (measured processing cycles 24104865120).
	2020-03-01T15:13:42.262Z|03070|dpif_netdev|INFO|Core 13 on numa node 0 assigned port 'dpdk-link2-port' rx queue 0 (measured processing cycles 23809755945).
	2020-03-01T15:13:42.262Z|03071|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 0 (measured processing cycles 21798052095).
	2020-03-01T15:13:42.262Z|03072|dpif_netdev|INFO|Core 23 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 1 (measured processing cycles 19537135227).
	2020-03-01T15:13:42.263Z|03073|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'dpdk-link1-port' rx queue 2 (measured processing cycles 16038933254).
	2020-03-01T15:13:42.263Z|03074|dpif_netdev|INFO|Core 10 on numa node 0 assigned port 'dpdk-link1-port' rx queue 0 (measured processing cycles 14247925302).
	2020-03-01T15:13:42.263Z|03075|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 1 (measured processing cycles 13906586804).
	2020-03-01T15:13:42.263Z|03076|dpif_netdev|INFO|Core 14 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 0 (measured processing cycles 13707458680).
	2020-03-01T15:13:42.263Z|03077|dpif_netdev|INFO|Core 14 on numa node 0 assigned port 'dpdk-link1-port' rx queue 1 (measured processing cycles 9930522720).
	2020-03-01T15:13:42.263Z|03078|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'dpdk-link1-port' rx queue 3 (measured processing cycles 8948654726).
	2020-03-01T15:13:42.263Z|03079|dpif_netdev|INFO|Core 10 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 3 (measured processing cycles 8878290897).
	2020-03-01T15:13:42.263Z|03080|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 2 (measured processing cycles 8399106648).
	2020-03-01T15:13:42.263Z|03081|dpif_netdev|INFO|Core 23 on numa node 0 assigned port 'dpdk-link2-port' rx queue 3 (measured processing cycles 1782198).
	2020-03-01T15:13:42.263Z|03082|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 7 (measured processing cycles 433610).
	2020-03-01T15:13:42.263Z|03083|dpif_netdev|INFO|Core 13 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 6 (measured processing cycles 353736).
	2020-03-01T15:13:42.263Z|03084|dpif_netdev|INFO|Core 22 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 0 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03085|dpif_netdev|INFO|Core 22 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 1 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03086|dpif_netdev|INFO|Core 13 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 2 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03087|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 3 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03088|dpif_netdev|INFO|Core 23 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 4 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03089|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 5 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03090|dpif_netdev|INFO|Core 10 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 2 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03091|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 3 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03092|dpif_netdev|INFO|Core 14 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 4 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03093|dpif_netdev|INFO|Core 14 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 5 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03094|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 6 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03095|dpif_netdev|INFO|Core 10 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 7 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03096|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 4 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03097|dpif_netdev|INFO|Core 23 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 5 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03098|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 6 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03099|dpif_netdev|INFO|Core 13 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 7 (measured processing cycles 0).
	2020-03-01T15:13:42.263Z|03100|dpif_netdev|INFO|Core 22 on numa node 0 assigned port 'dpdk-link2-port' rx queue 2 (measured processing cycles 0). <<< no poll >>>


As seen above, rxq 2 of dpdk-link2-port consumed no CPU cycles, confirming that it received no packets.

Comment 2 Ilya Maximets 2020-03-02 16:28:49 UTC
Hmm.  You have 4 traffic streams and 4 HW queues; it's very likely that the HW RSS hash function will not distribute traffic evenly across all the available queues, just because it is simple hashing.  You need to generate more flows or try to guess flow patterns that will be evenly distributed by your particular NIC.

Comment 3 Kevin Traynor 2020-03-02 16:33:14 UTC
Also, please note that 0% does not indicate that the queue was not polled. 0% indicates that no packets were received on that queue, which is most likely because of the explanation in comment 2.

Comment 4 Gowrishankar Muthukrishnan 2020-03-03 14:28:12 UTC
(In reply to Ilya Maximets from comment #2)
> Hmm.  You have 4 traffic streams and 4 HW queues, it's very likely that HW
> RSS hash function will not distribute traffic evenly across all the
> avaialable queues, just because it is simple hashing.  You need to generate
> more flows or try to guess flow patterns that will be evenly distributed by
> your particular NIC.

Since one port appeared to hash correctly while the other did not, I checked the RSS spec in the datasheet and how the DPDK driver enables it (via the HTOEP bit in the GLQF_CTL register). When I bring up the compute node, I see the "toeplitz" hashing algorithm set by default, not the simple hash we assumed. (There is no OVS command to inspect this, so I stopped ovs-dpdk for a while and used testpmd on the phy ports.) However, none of the packet header fields were enabled for the hashing either.


Configuring Port 0 (socket 0)
Port 0: E4:43:4B:5C:90:E2
Configuring Port 1 (socket 0)
Port 1: E4:43:4B:5C:90:E3
Checking link statuses...
Done

testpmd> get_hash_global_config 0
Hash function is Toeplitz
Symmetric hash is disabled globally for flow type ipv4-frag by port 0
Symmetric hash is disabled globally for flow type ipv4-tcp by port 0
Symmetric hash is disabled globally for flow type ipv4-udp by port 0
Symmetric hash is disabled globally for flow type ipv4-sctp by port 0
Symmetric hash is disabled globally for flow type ipv4-other by port 0
Symmetric hash is disabled globally for flow type ipv6-frag by port 0
Symmetric hash is disabled globally for flow type ipv6-tcp by port 0
Symmetric hash is disabled globally for flow type ipv6-udp by port 0
Symmetric hash is disabled globally for flow type ipv6-sctp by port 0
Symmetric hash is disabled globally for flow type ipv6-other by port 0
Symmetric hash is disabled globally for flow type l2_payload by port 0

<similar for port 1>

I am not sure why all of the possible L3/L4 header fields were disabled for this algorithm, since it hashes the tuple set together with a random key (refer to section 7.1.10.1, "Microsoft Toeplitz-based hash", in the X710 datasheet).
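
For reference, here is a minimal Python sketch of that Toeplitz calculation applied to the four flows of this test (only the src IP changes). It assumes the well-known default Microsoft RSS key and a plain hash % n_rxq queue selection; the NIC uses its own programmed key and an indirection table, so the absolute queue numbers are illustrative only, but it shows how the hash is built from the tuple and the key.

import socket, struct

# Default Microsoft RSS key, used here only as an example value; the key
# actually programmed into the NIC may differ.
RSS_KEY = bytes([
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
])

def toeplitz_hash(key, data):
    # For every set bit of the input, XOR in the 32-bit window of the key
    # starting at that bit position (Microsoft Toeplitz-based hash).
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    h = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                h ^= (key_int >> (key_bits - 32 - (i * 8 + b))) & 0xFFFFFFFF
    return h

def ipv4_udp_tuple(src_ip, dst_ip, src_port, dst_port):
    # Hash input for the ipv4-udp flow type: src addr, dst addr, src port, dst port.
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack(">HH", src_port, dst_port))

n_rxq = 4
for src in ("50.0.4.46", "50.0.4.47", "50.0.4.48", "50.0.4.49"):
    h = toeplitz_hash(RSS_KEY, ipv4_udp_tuple(src, "50.0.5.43", 32768, 53))
    # Rough model of RSS queue selection: only the low-order bits of the
    # hash matter when the indirection table is filled round-robin.
    print(src, hex(h), "-> rxq", h % n_rxq)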

# unable to enable all ipv4 L4 flow types at once ..
testpmd> set_hash_global_config 0 toeplitz ipv4 enable 
i40e_hash_global_config_check(): i40e unsupported flow type bit(s) configured
Cannot set global hash configurations by port 0


testpmd> set_hash_global_config 1 toeplitz ipv4 enable 
i40e_hash_global_config_check(): i40e unsupported flow type bit(s) configured
Cannot set global hash configurations by port 1

# but enabling them one by one, starting with udp, is enough for this test
testpmd> set_hash_global_config 0 toeplitz ipv4-udp enable 
i40e_write_global_rx_ctl(): i40e device 0000:18:00.2 changed global register [0x00269d7c]. original: 0x00000000, new: 0x00000001
Global hash configurations have been set successfully by port 0

testpmd> set_hash_global_config 1 toeplitz ipv4-udp enable 
Global hash configurations have been set successfully by port 1
testpmd> 

testpmd> get_hash_global_config 0
Hash function is Toeplitz
Symmetric hash is disabled globally for flow type ipv4-frag by port 0
Symmetric hash is disabled globally for flow type ipv4-tcp by port 0
Symmetric hash is enabled globally for flow type ipv4-udp by port 0 <<<< L3/L4 turned on >>>>
Symmetric hash is disabled globally for flow type ipv4-sctp by port 0
Symmetric hash is disabled globally for flow type ipv4-other by port 0
Symmetric hash is disabled globally for flow type ipv6-frag by port 0
Symmetric hash is disabled globally for flow type ipv6-tcp by port 0
Symmetric hash is disabled globally for flow type ipv6-udp by port 0
Symmetric hash is disabled globally for flow type ipv6-sctp by port 0
Symmetric hash is disabled globally for flow type ipv6-other by port 0
Symmetric hash is disabled globally for flow type l2_payload by port 0
testpmd> 

Similar config shown for port 1

Now I started ovs-dpdk again (the NIC hardware was not reset), and here are the results with respect to PMD/rxq load:

$ sudo ovs-appctl dpif-netdev/pmd-rxq-show | egrep '(pmd thread|link1|link2)'
pmd thread numa_id 0 core_id 1:
  port: dpdk-link2-port   queue-id:  0  pmd usage:  4 %
pmd thread numa_id 0 core_id 2:
  port: dpdk-link1-port   queue-id:  0  pmd usage:  0 % <<<<
  port: dpdk-link2-port   queue-id:  3  pmd usage: 15 %
pmd thread numa_id 0 core_id 10:
  port: dpdk-link2-port   queue-id:  1  pmd usage:  5 %
pmd thread numa_id 0 core_id 11:
  port: dpdk-link1-port   queue-id:  1  pmd usage:  0 % <<<<
  port: dpdk-link1-port   queue-id:  2  pmd usage: 17 %
pmd thread numa_id 0 core_id 13:
  port: dpdk-link2-port   queue-id:  2  pmd usage: 15 %
pmd thread numa_id 0 core_id 14:
pmd thread numa_id 0 core_id 22:
  port: dpdk-link1-port   queue-id:  3  pmd usage: 17 %
pmd thread numa_id 0 core_id 23:

RSS works, but only some of dpdk-link1-port's queues seem to process packets, so Toeplitz hashing does not give an even load balance in this test.

2020-03-03T13:55:23.733Z|00497|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'dpdk-link1-port' rx queue 0 (measured processing cycles 988124).
2020-03-03T13:55:23.733Z|00500|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'dpdk-link1-port' rx queue 1 (measured processing cycles 148506).

Then I changed the hashing algorithm to simple XOR, and here are the results (all queues well balanced!):

testpmd> get_hash_global_config 0
Hash function is Toeplitz
Symmetric hash is disabled globally for flow type ipv4-frag by port 0
Symmetric hash is disabled globally for flow type ipv4-tcp by port 0
Symmetric hash is enabled globally for flow type ipv4-udp by port 0
Symmetric hash is disabled globally for flow type ipv4-sctp by port 0
Symmetric hash is disabled globally for flow type ipv4-other by port 0
Symmetric hash is disabled globally for flow type ipv6-frag by port 0
Symmetric hash is disabled globally for flow type ipv6-tcp by port 0
Symmetric hash is disabled globally for flow type ipv6-udp by port 0
Symmetric hash is disabled globally for flow type ipv6-sctp by port 0
Symmetric hash is disabled globally for flow type ipv6-other by port 0
Symmetric hash is disabled globally for flow type l2_payload by port 0

# switch to simple-xor
testpmd> set_hash_global_config 0 simple_xor ipv4 enable 
i40e_hash_global_config_check(): i40e unsupported flow type bit(s) configured
Cannot set global hash configurations by port 0

testpmd> set_hash_global_config 0 simple_xor ipv4-udp enable  
i40e_write_global_rx_ctl(): i40e device 0000:18:00.2 changed global register [0x00269d7c]. original: 0x00000000, new: 0x00000001
i40e_write_global_rx_ctl(): i40e device 0000:18:00.2 changed global register [0x00269ba4]. original: 0x01fe0002, new: 0x01fe0000
Global hash configurations have been set successfully by port 0

testpmd> get_hash_global_config 0
Hash function is Simple XOR
Symmetric hash is disabled globally for flow type ipv4-frag by port 0
Symmetric hash is disabled globally for flow type ipv4-tcp by port 0
Symmetric hash is enabled globally for flow type ipv4-udp by port 0  <<<<
Symmetric hash is disabled globally for flow type ipv4-sctp by port 0
Symmetric hash is disabled globally for flow type ipv4-other by port 0
Symmetric hash is disabled globally for flow type ipv6-frag by port 0
Symmetric hash is disabled globally for flow type ipv6-tcp by port 0
Symmetric hash is disabled globally for flow type ipv6-udp by port 0
Symmetric hash is disabled globally for flow type ipv6-sctp by port 0
Symmetric hash is disabled globally for flow type ipv6-other by port 0
Symmetric hash is disabled globally for flow type l2_payload by port 0

testpmd> set_hash_global_config 1 simple_xor ipv4-udp enable
Global hash configurations have been set successfully by port 1

testpmd> get_hash_global_config 1
Hash function is Simple XOR
Symmetric hash is disabled globally for flow type ipv4-frag by port 1
Symmetric hash is disabled globally for flow type ipv4-tcp by port 1
Symmetric hash is enabled globally for flow type ipv4-udp by port 1 <<<
Symmetric hash is disabled globally for flow type ipv4-sctp by port 1
Symmetric hash is disabled globally for flow type ipv4-other by port 1
Symmetric hash is disabled globally for flow type ipv6-frag by port 1
Symmetric hash is disabled globally for flow type ipv6-tcp by port 1
Symmetric hash is disabled globally for flow type ipv6-udp by port 1
Symmetric hash is disabled globally for flow type ipv6-sctp by port 1
Symmetric hash is disabled globally for flow type ipv6-other by port 1
Symmetric hash is disabled globally for flow type l2_payload by port 1

$ sudo ovs-appctl dpif-netdev/pmd-rxq-show | egrep '(pmd thread|link1|link2)'
pmd thread numa_id 0 core_id 1:
  port: dpdk-link1-port   queue-id:  0  pmd usage: 11 %
pmd thread numa_id 0 core_id 2:
  port: dpdk-link2-port   queue-id:  2  pmd usage: 13 %
pmd thread numa_id 0 core_id 10:
  port: dpdk-link1-port   queue-id:  2  pmd usage: 13 %
  port: dpdk-link2-port   queue-id:  3  pmd usage: 11 %
pmd thread numa_id 0 core_id 11:
  port: dpdk-link1-port   queue-id:  1  pmd usage: 12 %
  port: dpdk-link2-port   queue-id:  1  pmd usage: 12 %
pmd thread numa_id 0 core_id 13:
  port: dpdk-link2-port   queue-id:  0  pmd usage: 11 %
pmd thread numa_id 0 core_id 14:
pmd thread numa_id 0 core_id 22:
pmd thread numa_id 0 core_id 23:
  port: dpdk-link1-port   queue-id:  3  pmd usage: 11 %


So it turns out to be an issue with the Toeplitz hashing that is enabled by default on this hardware. In summary:

(1) Toeplitz hashing is on by default, but none of the L3/L4 header fields were enabled for the calculation, as reported above, which is consistent with the observed imbalance.
(2) Simple XOR RSS equalizes the load distribution across the HW queues (see the sketch below). Should it be enabled instead as the default hashing scheme for the deployment?
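
As a rough illustration of why simple XOR happens to balance these particular streams (an assumed XOR fold, not the exact i40e formula): only the last octet of the src IP changes across the four flows, and the low 2 bits of 46, 47, 48 and 49 are all different, so any XOR-style fold that lands that octet in the low bits, reduced modulo 4, gives four distinct queues.

import socket

def xor_fold(src_ip, dst_ip, src_port, dst_port):
    # Hypothetical XOR fold: XOR the 32-bit addresses and the 16-bit ports
    # into one word.  Not necessarily the exact i40e "simple XOR"; only the
    # behaviour of the low bits matters for this argument.
    s = int.from_bytes(socket.inet_aton(src_ip), "big")
    d = int.from_bytes(socket.inet_aton(dst_ip), "big")
    return s ^ d ^ src_port ^ dst_port

n_rxq = 4
for src in ("50.0.4.46", "50.0.4.47", "50.0.4.48", "50.0.4.49"):
    h = xor_fold(src, "50.0.5.43", 32768, 53)
    print(src, hex(h), "-> rxq", h % n_rxq)   # yields 4 distinct queues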

Thanks.

Comment 5 Ilya Maximets 2020-03-03 14:53:40 UTC
(In reply to Gowrishankar Muthukrishnan from comment #4)
> (In reply to Ilya Maximets from comment #2)
> > Hmm.  You have 4 traffic streams and 4 HW queues, it's very likely that HW
> > RSS hash function will not distribute traffic evenly across all the
> > avaialable queues, just because it is simple hashing.  You need to generate
> > more flows or try to guess flow patterns that will be evenly distributed by
> > your particular NIC.
> 
> Doubting when one port hashing correctly, while other not

There is no "correct" or "incorrect".
Hashes are hashes.  You can't expect that 4 flows will be perfectly distributed
across 4 HW queues.  If one hash algorithm gives you that result, it doesn't mean
it will work better in a real-world scenario, or even with a different set of 4 flows.

> I did some check
> wrt RSS spec in datasheet and how dpdk driver enables (wrt HTOEP set in
> GLQF_CTL reg). When I bring up compute node, I see "toeplitz" hashing alg
> set by default and not simple hash it is as we thought. (No ovs command to
> help, so I used testpmd on phy ports by stopping ovs-dpdk a while) but none
> of packet headers enabled for the hashing as well.

Re-loading the DPDK driver (stopping OVS and starting testpmd) most likely
drops most of the device configuration.

> Now I started ovs-dpdk (as we did not reset nic hw) and here are the results
> wrt pmd/rxq load:

OVS explicitly sets the RSS configuration while initializing the port, so I doubt
that any of the above configuration survives an OVS restart.

Are you sure that you're generating the exact same traffic streams in all cases
(IPs, MACs, UDP/TCP ports)?

Comment 6 Gowrishankar Muthukrishnan 2020-03-04 12:45:22 UTC
(In reply to Ilya Maximets from comment #5)
> (In reply to Gowrishankar Muthukrishnan from comment #4)
> > (In reply to Ilya Maximets from comment #2)
> > > Hmm.  You have 4 traffic streams and 4 HW queues, it's very likely that HW
> > > RSS hash function will not distribute traffic evenly across all the
> > > avaialable queues, just because it is simple hashing.  You need to generate
> > > more flows or try to guess flow patterns that will be evenly distributed by
> > > your particular NIC.
> > 
> > Doubting when one port hashing correctly, while other not
> 
> There is no "correct" or "incorrect".
> Hashes are hashes.  You can't expect that 4 flows will be perfectly
> distributed
> across 4 HW queues.  If one hash algorithm gives you that result doesn't mean
> it will work better in a real-world scenario and even with different 4 flows.
> 

I agree, but only until a hash collision occurs, right? To validate RSS we have to control the input traffic streams. In the L2/L3/L4 tuple I change only the src IP (four times) to generate 4 different streams, keeping the rest of the headers constant, so the hashes should not collide, as per my understanding.

> > I did some check
> > wrt RSS spec in datasheet and how dpdk driver enables (wrt HTOEP set in
> > GLQF_CTL reg). When I bring up compute node, I see "toeplitz" hashing alg
> > set by default and not simple hash it is as we thought. (No ovs command to
> > help, so I used testpmd on phy ports by stopping ovs-dpdk a while) but none
> > of packet headers enabled for the hashing as well.
> 
> Re-loading of DPDK driver (stopping the OVS and starting testpmd) most likely
> drops most of device configurations.

Agreed, but would you expect the GLQF register to be reset too? I was able to retain the setting I updated even after starting and stopping ovs-dpdk (until the system itself is rebooted).
Here is what I did:

1. system boots and ovs-dpdk is started
2. stopped ovs-dpdk and launched testpmd, with vfio-pci still loaded
3. performed the hashing inspection/updates described above
4. stopped testpmd and started ovs-dpdk
5. ran the tests to observe the hashing behaviour
6. stopped ovs-dpdk and started testpmd
7. observed that my hash settings had not changed (same as set in step 3)
8. stopped testpmd and started ovs-dpdk

> 
> > Now I started ovs-dpdk (as we did not reset nic hw) and here are the results
> > wrt pmd/rxq load:
> 
> OVS explicitly sets rss configuration while initializing the port, so I doubt
> that any of above configurations survives OVS restart.
> 

I could not find any place in OVS that calls eth_dev_ops->filter_ctrl. Any pointer to where I should check for it?

> Are you sure that you're generating exact same traffic streams in all cases
> (ips, macs, udp/tcp ports)?

Yes, I am not changing the traffic pattern set for this comparison.

port 0 -> 1
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.46.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.47.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.48.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.49.32768>50.0.5.43.53:30840

port 1 -> 0
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.43.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.44.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.45.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.46.32768>50.0.4.46.53:30840

The same trex command is used to create these streams in both the simple-XOR and the Toeplitz hashing cases.

However, I just noticed one thing about the generated src IPs: one of the src IPs, "50.0.5.43", in the (port 1 -> 0) streams is always a dst IP in the (port 0 -> 1) streams. I am not sure whether that is a concern for Toeplitz (although one is on ingress and the other on egress). Meanwhile, I will repeat the test with entirely different IPs in the two directions and report back.

Comment 9 Ilya Maximets 2020-03-22 19:56:44 UTC
(In reply to Gowrishankar Muthukrishnan from comment #6)
> (In reply to Ilya Maximets from comment #5)
> > (In reply to Gowrishankar Muthukrishnan from comment #4)
> > > (In reply to Ilya Maximets from comment #2)
> > > > Hmm.  You have 4 traffic streams and 4 HW queues, it's very likely that HW
> > > > RSS hash function will not distribute traffic evenly across all the
> > > > avaialable queues, just because it is simple hashing.  You need to generate
> > > > more flows or try to guess flow patterns that will be evenly distributed by
> > > > your particular NIC.
> > > 
> > > Doubting when one port hashing correctly, while other not
> > 
> > There is no "correct" or "incorrect".
> > Hashes are hashes.  You can't expect that 4 flows will be perfectly
> > distributed
> > across 4 HW queues.  If one hash algorithm gives you that result doesn't mean
> > it will work better in a real-world scenario and even with different 4 flows.
> > 
> 
> I agree, but until the hash collision occurs right ?. For the validation of
> RSS we have to have control on the input traffic streams. In L2/L3/L4 tuple,
> I change only src IP for four times, to generate 4 different streams,
> keeping rest of the headers constant, so hashing will not collide on
> calculated keys as per my understanding.

Since you have only 4 queues, the queue is chosen using the last 2 bits of
the hash value.  So your hash function effectively converts the packet fields to
a number in the range [0, 3].  An ideal hash function should return a hash value
that looks like a random value.  In that case the probability that 4 random flows
are hashed to 4 different queues is 4/4 * 3/4 * 2/4 * 1/4 = 9.375%.
In all other ~90% of cases there will be at least one collision.
Of course, our hash function is not ideal, and if you're using a simple XOR, you
may just pre-calculate values and choose good flow patterns.  But, anyway.
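
(A quick sanity check of that 9.375% figure: simulate an ideal hash by letting each of the 4 flows pick one of the 4 queues uniformly at random and count how often all 4 queues end up distinct. Purely illustrative, not any particular NIC's behaviour.)

import random

trials = 100000
hits = sum(len({random.randrange(4) for _ in range(4)}) == 4
           for _ in range(trials))
print(hits / trials)   # ~0.094, matching 4/4 * 3/4 * 2/4 * 1/4 = 9.375%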

> 
> > > I did some check
> > > wrt RSS spec in datasheet and how dpdk driver enables (wrt HTOEP set in
> > > GLQF_CTL reg). When I bring up compute node, I see "toeplitz" hashing alg
> > > set by default and not simple hash it is as we thought. (No ovs command to
> > > help, so I used testpmd on phy ports by stopping ovs-dpdk a while) but none
> > > of packet headers enabled for the hashing as well.
> > 
> > Re-loading of DPDK driver (stopping the OVS and starting testpmd) most likely
> > drops most of device configurations.
> 
> Agreed, but would you think GLQF reg too be reset ?. I could able to retain
> the setting I updated, even after starting and stopping ovs-dpdk (but until
> system itself is reboot).

I looked through the DPDK code and you might be right here.  OVS doesn't
call filter_ctrl, so it's possible that GLQF is not reset between application
restarts.

> Here is what I did:
> 
> 1. system boots and ovs-dpdk started.
> 2. I stopped ovs-dpdk and launched testpmd, with vfio-pci still loaded
> 3. inspection/updates performed as I did for hashing
> 4. stopped testpmd and started ovs-dpdk
> 5. run the tests to observe hashing behaviour
> 6. stopped ovs-dpdk and started testpmd
> 7. observed my hash setting did not change as well (same as set in step 3)
> 8. stopped testpmd and started ovs-dpdk.
> 
> > 
> > > Now I started ovs-dpdk (as we did not reset nic hw) and here are the results
> > > wrt pmd/rxq load:
> > 
> > OVS explicitly sets rss configuration while initializing the port, so I doubt
> > that any of above configurations survives OVS restart.
> > 
> 
> I could not find anywhere calling eth_dev_ops->filter_ctrl in ovs. Any
> pointer where I can check for it ?.

OVS sets rss_conf, but this only changes I40E_PFQF_HENA.  So, yes, some
configuration on this NIC might be preserved.  But this might not work for
other NIC types.

Comment 10 Gowrishankar Muthukrishnan 2020-04-02 07:49:43 UTC
(In reply to Ilya Maximets from comment #9)
> Since you have only 4 queues, the queue will be chosen by using last 2 bits
> of
> the hash value.  So, your hash function effectively converts packet fields to
> the number from range [0, 3].  Ideal hash function should return hash value
> that
> should look like random value.  In this case probability of the case where 4
> random flows hashed to 4 different queues is 4/4 * 3/4 * 2/4 * 1/4 = 9.375%.
> In all other ~90% cases there will be at least one collision.
> Of course, our hash function is not ideal and if you're using a simple XOR,
> you
> may just pre-calculate values and choose good flow patterns.  But, anyway.
> 

Is this then a limitation of Toeplitz hashing (the default chosen by the NIC)?
If so, shouldn't this be documented to raise awareness of the issue?

..
<cut>
>
> > Agreed, but would you think GLQF reg too be reset ?. I could able to retain
> > the setting I updated, even after starting and stopping ovs-dpdk (but until
> > system itself is reboot).
> 
> I looked through the DPDK code and you might be right here.  OVS doesn't
> call filter_ctl, so it's possible that GLQF is not reset between application
> restarts.

Should this be documented somewhere, for someone troubleshooting this type of load-balancing issue?

> > 
> > I could not find anywhere calling eth_dev_ops->filter_ctrl in ovs. Any
> > pointer where I can check for it ?.
> 
> OVS sets rss_conf, but this only  changes the I40E_PFQF_HENA.  So, yes, some
> configuration on this NIC might be preserved.  But this might not work for
> other NIC types.

Agreed.

Comment 11 Ilya Maximets 2020-04-20 21:00:44 UTC
(In reply to Gowrishankar Muthukrishnan from comment #10)
> (In reply to Ilya Maximets from comment #9)
> > Since you have only 4 queues, the queue will be chosen by using last 2 bits
> > of
> > the hash value.  So, your hash function effectively converts packet fields to
> > the number from range [0, 3].  Ideal hash function should return hash value
> > that
> > should look like random value.  In this case probability of the case where 4
> > random flows hashed to 4 different queues is 4/4 * 3/4 * 2/4 * 1/4 = 9.375%.
> > In all other ~90% cases there will be at least one collision.
> > Of course, our hash function is not ideal and if you're using a simple XOR,
> > you
> > may just pre-calculate values and choose good flow patterns.  But, anyway.
> > 
> 
> Is this then limitation of toeplitz hashing (default one chosen by nic) ?

No, it's not a limitation.  It's the normal behaviour of any hash.  XOR just
happens to work in this particular case.

> If so, should this be not documented for the awareness of issue ?.
> 
> ..
> <cut>
> >
> > > Agreed, but would you think GLQF reg too be reset ?. I could able to retain
> > > the setting I updated, even after starting and stopping ovs-dpdk (but until
> > > system itself is reboot).
> > 
> > I looked through the DPDK code and you might be right here.  OVS doesn't
> > call filter_ctl, so it's possible that GLQF is not reset between application
> > restarts.
> 
> Should this be documented elsewhere, for someone troubleshooting this type
> of load balance issues ?

I think we could consider this a sort of driver issue in DPDK.  I'll ask the
DPDK maintainers what they think about it.

> 
> > > 
> > > I could not find anywhere calling eth_dev_ops->filter_ctrl in ovs. Any
> > > pointer where I can check for it ?.
> > 
> > OVS sets rss_conf, but this only  changes the I40E_PFQF_HENA.  So, yes, some
> > configuration on this NIC might be preserved.  But this might not work for
> > other NIC types.
> 
> Agreed.

Comment 16 Gowrishankar Muthukrishnan 2020-05-11 12:52:10 UTC
Firmware info on the tested NIC, in case it is required:

$ sudo ethtool -i eno1
driver: i40e
version: 2.8.20-k
firmware-version: 6.80 0x80003d74 18.8.9
expansion-rom-version: 
bus-info: 0000:18:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

