Description of problem:

In my OSP 16 cluster, the computeovsdpdk node runs OVS-DPDK (v2.11.0-35). After OpenStack deployment and bring-up of the tenant instances (a trex VM and a testpmd VM in this case), the rxqs of the phy NICs get rebalanced across multiple PMD threads, along with the vhostuser rxqs of the VMs. While testing multiple flow streams (4 in each direction, to map exactly onto the n_rxq=4 configured for the phy NICs), I observed that a PMD which polls rxqs of two different ports of the X710 NIC polls only one of those rxqs, not both. As seen in the test info below, PMD core 22 is assigned rxqs from both phy ports, but only one of them ends up carrying traffic.

pmd thread numa_id 0 core_id 1:
  isolated : false
  port: vhu3cb8fa43-5c    queue-id: 0    pmd usage: 0 %
  port: vhu3cb8fa43-5c    queue-id: 3    pmd usage: 0 %
  port: vhu5e59e258-c2    queue-id: 0    pmd usage: 0 %
  port: vhu5e59e258-c2    queue-id: 3    pmd usage: 0 %
pmd thread numa_id 0 core_id 2:
  isolated : false
  port: dpdk-link1-port   queue-id: 1    pmd usage: 0 %
  port: vhu3cb8fa43-5c    queue-id: 5    pmd usage: 0 %
  port: vhu5e59e258-c2    queue-id: 5    pmd usage: 0 %
  port: vhu8948a6eb-80    queue-id: 6    pmd usage: 0 %
pmd thread numa_id 0 core_id 10:
  isolated : false
  port: dpdk-link1-port   queue-id: 2    pmd usage: 0 %
  port: vhu3cb8fa43-5c    queue-id: 4    pmd usage: 0 %
  port: vhu5e59e258-c2    queue-id: 4    pmd usage: 0 %
  port: vhu8948a6eb-80    queue-id: 7    pmd usage: 0 %
pmd thread numa_id 0 core_id 11:
  isolated : false
  port: dpdk-link1-port   queue-id: 0    pmd usage: 0 %
  port: vhu3cb8fa43-5c    queue-id: 7    pmd usage: 0 %
  port: vhu5e59e258-c2    queue-id: 7    pmd usage: 0 %
  port: vhu8948a6eb-80    queue-id: 4    pmd usage: 0 %
pmd thread numa_id 0 core_id 13:
  isolated : false
  port: dpdk-link2-port   queue-id: 1    pmd usage: 0 %
  port: dpdk-link2-port   queue-id: 3    pmd usage: 0 %
  port: vhu8948a6eb-80    queue-id: 0    pmd usage: 0 %
  port: vhu8948a6eb-80    queue-id: 3    pmd usage: 0 %
pmd thread numa_id 0 core_id 14:
  isolated : false
  port: vhu3cb8fa43-5c    queue-id: 1    pmd usage: 0 %
  port: vhu3cb8fa43-5c    queue-id: 2    pmd usage: 0 %
  port: vhu5e59e258-c2    queue-id: 1    pmd usage: 0 %
  port: vhu5e59e258-c2    queue-id: 2    pmd usage: 0 %
pmd thread numa_id 0 core_id 22:
  isolated : false
  port: dpdk-link1-port   queue-id: 3    pmd usage: 0 %   <<<
  port: dpdk-link2-port   queue-id: 2    pmd usage: 0 %   <<<
  port: vhu8948a6eb-80    queue-id: 1    pmd usage: 0 %
  port: vhu8948a6eb-80    queue-id: 2    pmd usage: 0 %
pmd thread numa_id 0 core_id 23:
  isolated : false
  port: dpdk-link2-port   queue-id: 0    pmd usage: 0 %
  port: vhu3cb8fa43-5c    queue-id: 6    pmd usage: 0 %
  port: vhu5e59e258-c2    queue-id: 6    pmd usage: 0 %
  port: vhu8948a6eb-80    queue-id: 5    pmd usage: 0 %

For an instance of the test with 1 Mpps of traffic in both directions of the PVP path, using the below trex command in the trex VM:

python -u /opt/trafficgen/trex-txrx.py --device-pairs=0:1 --active-device-pairs=0:1 --mirrored-log --measure-latency=0 --rate=1.0 --rate-unit=mpps --size=64 --runtime=120 --runtime-tolerance=5 --run-bidirec=1 --run-revunidirec=0 --num-flows=4 --dst-macs=fa:16:3e:7e:26:38,fa:16:3e:bc:43:c0 --vlan-ids=306,307 --use-src-ip-flows=1 --use-dst-ip-flows=0 --use-src-mac-flows=0 --use-dst-mac-flows=0 --use-src-port-flows=0 --use-dst-port-flows=0 --use-protocol-flows=0 --packet-protocol=UDP --stream-mode=continuous --max-loss-pct=0.0

I could generate the following two sets of streams (four in each direction):

sudo ovs-tcpdump -nn -e -i dpdk-link1-port -c 1024 | awk '/IPv4/ {print $2" "$3" "$4" "$16 $17 $18 $19}'| sort | uniq

In one direction:
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.46.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.47.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.48.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.49.32768>50.0.5.43.53:30840

In the other direction:
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.43.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.44.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.45.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.46.32768>50.0.4.46.53:30840

Checking PMD usage for the phy NICs alone:

$ sudo ovs-appctl dpif-netdev/pmd-rxq-show | egrep '(pmd thread|link1|link2)'
pmd thread numa_id 0 core_id 1:
pmd thread numa_id 0 core_id 2:
  port: dpdk-link1-port   queue-id: 1    pmd usage: 7 %
pmd thread numa_id 0 core_id 10:
  port: dpdk-link1-port   queue-id: 2    pmd usage: 12 %
pmd thread numa_id 0 core_id 11:
  port: dpdk-link1-port   queue-id: 0    pmd usage: 12 %
pmd thread numa_id 0 core_id 13:
  port: dpdk-link2-port   queue-id: 1    pmd usage: 18 %
  port: dpdk-link2-port   queue-id: 3    pmd usage: 0 %
pmd thread numa_id 0 core_id 14:
pmd thread numa_id 0 core_id 22:
  port: dpdk-link1-port   queue-id: 3    pmd usage: 7 %
  port: dpdk-link2-port   queue-id: 2    pmd usage: 0 %   <<< not polled >>>
pmd thread numa_id 0 core_id 23:
  port: dpdk-link2-port   queue-id: 0    pmd usage: 18 %

There was no loss, however, due to the low packet rate:

testpmd (mac forward):

Throughput (since last show)
RX-packets: 19999999   TX-packets: 19999999   TX-dropped: 0

------- Forward Stats for RX Port= 1/Queue= 0 -> TX Port= 0/Queue= 0 -------
RX-packets: 59999998   TX-packets: 59999998   TX-dropped: 0
------- Forward Stats for RX Port= 1/Queue= 1 -> TX Port= 0/Queue= 1 -------
RX-packets: 59999998   TX-packets: 59999998   TX-dropped: 0
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 119999996   RX-dropped: 0   RX-total: 119999996
TX-packets: 119999996   TX-dropped: 0   TX-total: 119999996
----------------------------------------------------------------------------
---------------------- Forward statistics for port 1 ----------------------
RX-packets: 119999996   RX-dropped: 0   RX-total: 119999996
TX-packets: 119999996   TX-dropped: 0   TX-total: 119999996
----------------------------------------------------------------------------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 239999992   RX-dropped: 0   RX-total: 239999992
TX-packets: 239999992   TX-dropped: 0   TX-total: 239999992
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

trex pkt gen:

Transmitting at 1.0mpps from port 0 to port 1 for 120 seconds...
Transmitting at 1.0mpps from port 0 to port 1 for 120 seconds...
Transmitting at 1.0mpps from port 1 to port 0 for 120 seconds...
Transmitting at 1.0mpps from port 1 to port 0 for 120 seconds...
Waiting...
ore":18.597078323364258,"force_quit":false,"timeout":false,"rx_cpu_util":0.0,"rx_pps":1997657.125,"queue_full":0,"cpu_util":1.9478102922439575,"tx_pps":1997628.25,"tx_bps":1086707456.0,"rx_drop_bps":0.0,"early_exit":false,"runtime":120.038953},"total":{"tx_util":14.06328104,"rx_bps":1086725888.0,"obytes":16319999456,"rx_pps":1997657.125,"ipackets":240000480,"oerrors":0,"rx_util":14.06351028,"opackets":239999992,"tx_pps":1997628.25,"tx_bps":1086707584.0,"ierrors":0,"rx_bps_L1":1406351028.0,"tx_bps_L1":1406328104.0,"ibytes":16320035824},"flow_stats":{"128":{"rx_bps":{"0":0.0,"1":0.0,"total":0.0},"loss":{"cnt":{"1->0":0.0,"total":0.0,"0->1":"N/A"},"pct":{"1->0":0.0,"total":0.0,"0->1":"N/A"}},"rx_pps":{"0":0.0,"1":0.0,"total":0.0},"rx_pkts":{"0":119999996,"1":0,"total":119999996},"rx_bytes":{"0":"N/A","1":"N/A","total":"N/A"},"tx_pkts":{"0":0,"1":119999996,"total":119999996},"tx_pps":{"0":0.0,"1":0.0,"total":0.0},"tx_bps":{"0":0.0,"1":0.0,"total":0.0},"tx_bytes":{"0":0,"1":8159999728,"total":8159999728},"rx_bps_l1":{"0":0.0,"1":0.0,"total":0.0},"tx_bps_l1":{"0":0.0,"1":0.0,"total":0.0}},"1":{"rx_bps":{"0":0.0,"1":0.0,"total":0.0},"loss":{"cnt":{"1->0":"N/A","total":0.0,"0->1":0.0},"pct":{"1->0":"N/A","total":0.0,"0->1":0.0}},"rx_pps":{"0":0.0,"1":0.0,"total":0.0},"rx_pkts":{"0":0,"1":119999996,"total":119999996},"rx_bytes":{"0":"N/A","1":"N/A","total":"N/A"},"tx_pkts":{"0":119999996,"1":0,"total":119999996},"tx_pps":{"0":0.0,"1":0.0,"total":0.0},"tx_bps":{"0":0.0,"1":0.0,"total":0.0},"tx_bytes":{"0":8159999728,"1":0,"total":8159999728},"rx_bps_l1":{"0":0.0,"1":0.0,"total":0.0},"tx_bps_l1":{"0":0.0,"1":0.0,"total":0.0}},"global":{"rx_err":{"0":0,"1":0},"tx_err":{"0":0,"1":0}}}} As noticed, loss is 0. So, all the packets are polled in this low tx rate test, however rxq id 2 of dpdk-link2-port was not polled. Does it mean RSS did not hash the stream uniformly, when a rxq of peer port is polled by same pmd ?. Version-Release number of selected component (if applicable): Steps to Reproduce: Deploy multiqueue (atleast 4) enabled phy nic in openstack cluster (RHOSP16). Deploy trex VM on SRIOV compute and create traffic on VFs. Deploy testpmd VM in ovsdpdk compute to forward trex traffic in mac mode. Actual results: Peer port's rxq not polled by PMD polling one port's rxq Expected results: Peer port's rxq has to be polled by PMD polling one port's rxq Additional info:
The problem can be confirmed by the number of CPU cycles consumed by the affected rxq, which turns out to be zero, as shown below. During the PVP test, trigger an rxq rebalance with ovs-appctl; while rebalancing, ovs-vswitchd.log reports the processing cycles measured for every rxq. Check the affected rxq there. For instance, with the following rxq assignment:

pmd thread numa_id 0 core_id 1:
  port: dpdk-link1-port   queue-id: 2    pmd usage: 0 %
pmd thread numa_id 0 core_id 2:
pmd thread numa_id 0 core_id 10:
pmd thread numa_id 0 core_id 11:
  port: dpdk-link1-port   queue-id: 0    pmd usage: 0 %
pmd thread numa_id 0 core_id 13:
  port: dpdk-link2-port   queue-id: 1    pmd usage: 0 %
  port: dpdk-link2-port   queue-id: 3    pmd usage: 0 %
pmd thread numa_id 0 core_id 14:
  port: dpdk-link2-port   queue-id: 0    pmd usage: 0 %
pmd thread numa_id 0 core_id 22:
  port: dpdk-link1-port   queue-id: 3    pmd usage: 0 %
  port: dpdk-link2-port   queue-id: 2    pmd usage: 0 %   <<< rxq going to be affected >>>
pmd thread numa_id 0 core_id 23:
  port: dpdk-link1-port   queue-id: 1    pmd usage: 0 %

Start the trex test, trigger the rxq rebalance, and check ovs-vswitchd.log:

<trex tx>
(wait until the middle of the test)
sudo ovs-appctl dpif-netdev/pmd-rxq-rebalance

From the log:

2020-03-01T15:13:42.262Z|03069|dpif_netdev|INFO|Core 22 on numa node 0 assigned port 'dpdk-link2-port' rx queue 1 (measured processing cycles 24104865120).
2020-03-01T15:13:42.262Z|03070|dpif_netdev|INFO|Core 13 on numa node 0 assigned port 'dpdk-link2-port' rx queue 0 (measured processing cycles 23809755945).
2020-03-01T15:13:42.262Z|03071|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 0 (measured processing cycles 21798052095).
2020-03-01T15:13:42.262Z|03072|dpif_netdev|INFO|Core 23 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 1 (measured processing cycles 19537135227).
2020-03-01T15:13:42.263Z|03073|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'dpdk-link1-port' rx queue 2 (measured processing cycles 16038933254).
2020-03-01T15:13:42.263Z|03074|dpif_netdev|INFO|Core 10 on numa node 0 assigned port 'dpdk-link1-port' rx queue 0 (measured processing cycles 14247925302).
2020-03-01T15:13:42.263Z|03075|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 1 (measured processing cycles 13906586804).
2020-03-01T15:13:42.263Z|03076|dpif_netdev|INFO|Core 14 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 0 (measured processing cycles 13707458680).
2020-03-01T15:13:42.263Z|03077|dpif_netdev|INFO|Core 14 on numa node 0 assigned port 'dpdk-link1-port' rx queue 1 (measured processing cycles 9930522720).
2020-03-01T15:13:42.263Z|03078|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'dpdk-link1-port' rx queue 3 (measured processing cycles 8948654726).
2020-03-01T15:13:42.263Z|03079|dpif_netdev|INFO|Core 10 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 3 (measured processing cycles 8878290897).
2020-03-01T15:13:42.263Z|03080|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 2 (measured processing cycles 8399106648).
2020-03-01T15:13:42.263Z|03081|dpif_netdev|INFO|Core 23 on numa node 0 assigned port 'dpdk-link2-port' rx queue 3 (measured processing cycles 1782198).
2020-03-01T15:13:42.263Z|03082|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 7 (measured processing cycles 433610).
2020-03-01T15:13:42.263Z|03083|dpif_netdev|INFO|Core 13 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 6 (measured processing cycles 353736).
2020-03-01T15:13:42.263Z|03084|dpif_netdev|INFO|Core 22 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 0 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03085|dpif_netdev|INFO|Core 22 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 1 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03086|dpif_netdev|INFO|Core 13 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 2 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03087|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 3 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03088|dpif_netdev|INFO|Core 23 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 4 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03089|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu3cb8fa43-5c' rx queue 5 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03090|dpif_netdev|INFO|Core 10 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 2 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03091|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 3 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03092|dpif_netdev|INFO|Core 14 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 4 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03093|dpif_netdev|INFO|Core 14 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 5 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03094|dpif_netdev|INFO|Core 1 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 6 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03095|dpif_netdev|INFO|Core 10 on numa node 0 assigned port 'vhu8948a6eb-80' rx queue 7 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03096|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 4 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03097|dpif_netdev|INFO|Core 23 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 5 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03098|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 6 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03099|dpif_netdev|INFO|Core 13 on numa node 0 assigned port 'vhu5e59e258-c2' rx queue 7 (measured processing cycles 0).
2020-03-01T15:13:42.263Z|03100|dpif_netdev|INFO|Core 22 on numa node 0 assigned port 'dpdk-link2-port' rx queue 2 (measured processing cycles 0). <<< no poll >>>

As seen above, rxq 2 of dpdk-link2-port is confirmed to consume zero measured processing cycles, i.e. no polling happened for it.
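For reference, a minimal Python sketch along these lines can scan ovs-vswitchd.log for those rebalance messages and flag rxqs assigned with zero measured cycles. This is an illustration only; it assumes the log line format quoted above and the usual /var/log/openvswitch/ovs-vswitchd.log path, which may differ on a given deployment.

#!/usr/bin/env python3
# Sketch: flag rx queues that were assigned with zero measured processing cycles.
# Assumes log lines look exactly like the rebalance messages quoted above.
import re
import sys

PATTERN = re.compile(
    r"Core (\d+) on numa node (\d+) assigned port '(\S+)' "
    r"rx queue (\d+) \(measured processing cycles (\d+)\)"
)

def find_idle_rxqs(log_path):
    idle = []
    with open(log_path) as log:
        for line in log:
            m = PATTERN.search(line)
            if m:
                core, _numa, port, rxq, cycles = m.groups()
                if int(cycles) == 0:
                    idle.append((port, int(rxq), int(core)))
    return idle

if __name__ == "__main__":
    # Default path is an assumption; pass the real log path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/openvswitch/ovs-vswitchd.log"
    for port, rxq, core in find_idle_rxqs(path):
        print(f"{port} rxq {rxq} on core {core}: 0 measured cycles")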
Hmm. With 4 traffic streams and 4 HW queues, it is very likely that the HW RSS hash function will not distribute traffic evenly across all the available queues, simply because it is plain hashing. You need to generate more flows, or find flow patterns that happen to be distributed evenly by your particular NIC.
Also, please note that 0% does not indicate that the queue was not polled; it indicates that no packets were received on that queue, which is most likely for the reason explained in comment 2.
(In reply to Ilya Maximets from comment #2)
> Hmm. With 4 traffic streams and 4 HW queues, it is very likely that the HW
> RSS hash function will not distribute traffic evenly across all the
> available queues, simply because it is plain hashing. You need to generate
> more flows, or find flow patterns that happen to be distributed evenly by
> your particular NIC.

Doubting how one port could hash correctly while the other did not, I did some checking against the RSS spec in the datasheet and how the DPDK driver enables it (the HTOEP bit set in the GLQF_CTL register). When I bring up the compute node, I see the "toeplitz" hashing algorithm set by default, so it is not the simple hash we thought (there is no OVS command to inspect this, so I used testpmd on the phy ports after stopping ovs-dpdk for a while). However, none of the packet headers were enabled for the hashing either:

Configuring Port 0 (socket 0)
Port 0: E4:43:4B:5C:90:E2
Configuring Port 1 (socket 0)
Port 1: E4:43:4B:5C:90:E3
Checking link statuses...
Done
testpmd> get_hash_global_config 0
Hash function is Toeplitz
Symmetric hash is disabled globally for flow type ipv4-frag by port 0
Symmetric hash is disabled globally for flow type ipv4-tcp by port 0
Symmetric hash is disabled globally for flow type ipv4-udp by port 0
Symmetric hash is disabled globally for flow type ipv4-sctp by port 0
Symmetric hash is disabled globally for flow type ipv4-other by port 0
Symmetric hash is disabled globally for flow type ipv6-frag by port 0
Symmetric hash is disabled globally for flow type ipv6-tcp by port 0
Symmetric hash is disabled globally for flow type ipv6-udp by port 0
Symmetric hash is disabled globally for flow type ipv6-sctp by port 0
Symmetric hash is disabled globally for flow type ipv6-other by port 0
Symmetric hash is disabled globally for flow type l2_payload by port 0
<similar for port 1>

I am not sure why all of the possible L3/L4 headers were disabled for this algorithm, since it uses a random key together with the tuple set (refer to 7.1.10.1 "Microsoft Toeplitz-based hash" in the X710 doc).

# unable to enable all L4 flow types for ipv4 at once ..
testpmd> set_hash_global_config 0 toeplitz ipv4 enable
i40e_hash_global_config_check(): i40e unsupported flow type bit(s) configured
Cannot set global hash configurations by port 0
testpmd> set_hash_global_config 1 toeplitz ipv4 enable
i40e_hash_global_config_check(): i40e unsupported flow type bit(s) configured
Cannot set global hash configurations by port 1

# but enabling flow types one by one, starting with udp, is enough for the test
testpmd> set_hash_global_config 0 toeplitz ipv4-udp enable
i40e_write_global_rx_ctl(): i40e device 0000:18:00.2 changed global register [0x00269d7c].
original: 0x00000000, new: 0x00000001
Global hash configurations have been set successfully by port 0
testpmd> set_hash_global_config 1 toeplitz ipv4-udp enable
Global hash configurations have been set successfully by port 1
testpmd>
testpmd> get_hash_global_config 0
Hash function is Toeplitz
Symmetric hash is disabled globally for flow type ipv4-frag by port 0
Symmetric hash is disabled globally for flow type ipv4-tcp by port 0
Symmetric hash is enabled globally for flow type ipv4-udp by port 0   <<<< L3/L4 turned on >>>>
Symmetric hash is disabled globally for flow type ipv4-sctp by port 0
Symmetric hash is disabled globally for flow type ipv4-other by port 0
Symmetric hash is disabled globally for flow type ipv6-frag by port 0
Symmetric hash is disabled globally for flow type ipv6-tcp by port 0
Symmetric hash is disabled globally for flow type ipv6-udp by port 0
Symmetric hash is disabled globally for flow type ipv6-sctp by port 0
Symmetric hash is disabled globally for flow type ipv6-other by port 0
Symmetric hash is disabled globally for flow type l2_payload by port 0
testpmd>

A similar configuration is shown for port 1.

Now I started ovs-dpdk again (as we did not reset the NIC HW) and here are the results wrt PMD/rxq load:

$ sudo ovs-appctl dpif-netdev/pmd-rxq-show | egrep '(pmd thread|link1|link2)'
pmd thread numa_id 0 core_id 1:
  port: dpdk-link2-port   queue-id: 0    pmd usage: 4 %
pmd thread numa_id 0 core_id 2:
  port: dpdk-link1-port   queue-id: 0    pmd usage: 0 %   <<<<
  port: dpdk-link2-port   queue-id: 3    pmd usage: 15 %
pmd thread numa_id 0 core_id 10:
  port: dpdk-link2-port   queue-id: 1    pmd usage: 5 %
pmd thread numa_id 0 core_id 11:
  port: dpdk-link1-port   queue-id: 1    pmd usage: 0 %   <<<<
  port: dpdk-link1-port   queue-id: 2    pmd usage: 17 %
pmd thread numa_id 0 core_id 13:
  port: dpdk-link2-port   queue-id: 2    pmd usage: 15 %
pmd thread numa_id 0 core_id 14:
pmd thread numa_id 0 core_id 22:
  port: dpdk-link1-port   queue-id: 3    pmd usage: 17 %
pmd thread numa_id 0 core_id 23:

RSS works, but only some of the packets seem to be processed by dpdk-link1-port's queues, so Toeplitz hashing does not balance the load evenly in this test.

2020-03-03T13:55:23.733Z|00497|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'dpdk-link1-port' rx queue 0 (measured processing cycles 988124).
2020-03-03T13:55:23.733Z|00500|dpif_netdev|INFO|Core 11 on numa node 0 assigned port 'dpdk-link1-port' rx queue 1 (measured processing cycles 148506).

Then I changed the hashing algorithm to simple XOR, and here are the results (all queues well balanced!):
testpmd> get_hash_global_config 0
Hash function is Toeplitz
Symmetric hash is disabled globally for flow type ipv4-frag by port 0
Symmetric hash is disabled globally for flow type ipv4-tcp by port 0
Symmetric hash is enabled globally for flow type ipv4-udp by port 0
Symmetric hash is disabled globally for flow type ipv4-sctp by port 0
Symmetric hash is disabled globally for flow type ipv4-other by port 0
Symmetric hash is disabled globally for flow type ipv6-frag by port 0
Symmetric hash is disabled globally for flow type ipv6-tcp by port 0
Symmetric hash is disabled globally for flow type ipv6-udp by port 0
Symmetric hash is disabled globally for flow type ipv6-sctp by port 0
Symmetric hash is disabled globally for flow type ipv6-other by port 0
Symmetric hash is disabled globally for flow type l2_payload by port 0

# switch to simple-xor
testpmd> set_hash_global_config 0 simple_xor ipv4 enable
i40e_hash_global_config_check(): i40e unsupported flow type bit(s) configured
Cannot set global hash configurations by port 0
testpmd> set_hash_global_config 0 simple_xor ipv4-udp enable
i40e_write_global_rx_ctl(): i40e device 0000:18:00.2 changed global register [0x00269d7c].
original: 0x00000000, new: 0x00000001
i40e_write_global_rx_ctl(): i40e device 0000:18:00.2 changed global register [0x00269ba4].
original: 0x01fe0002, new: 0x01fe0000
Global hash configurations have been set successfully by port 0
testpmd> get_hash_global_config 0
Hash function is Simple XOR
Symmetric hash is disabled globally for flow type ipv4-frag by port 0
Symmetric hash is disabled globally for flow type ipv4-tcp by port 0
Symmetric hash is enabled globally for flow type ipv4-udp by port 0   <<<<
Symmetric hash is disabled globally for flow type ipv4-sctp by port 0
Symmetric hash is disabled globally for flow type ipv4-other by port 0
Symmetric hash is disabled globally for flow type ipv6-frag by port 0
Symmetric hash is disabled globally for flow type ipv6-tcp by port 0
Symmetric hash is disabled globally for flow type ipv6-udp by port 0
Symmetric hash is disabled globally for flow type ipv6-sctp by port 0
Symmetric hash is disabled globally for flow type ipv6-other by port 0
Symmetric hash is disabled globally for flow type l2_payload by port 0
testpmd> set_hash_global_config 1 simple_xor ipv4-udp enable
Global hash configurations have been set successfully by port 1
testpmd> get_hash_global_config 1
Hash function is Simple XOR
Symmetric hash is disabled globally for flow type ipv4-frag by port 1
Symmetric hash is disabled globally for flow type ipv4-tcp by port 1
Symmetric hash is enabled globally for flow type ipv4-udp by port 1   <<<
Symmetric hash is disabled globally for flow type ipv4-sctp by port 1
Symmetric hash is disabled globally for flow type ipv4-other by port 1
Symmetric hash is disabled globally for flow type ipv6-frag by port 1
Symmetric hash is disabled globally for flow type ipv6-tcp by port 1
Symmetric hash is disabled globally for flow type ipv6-udp by port 1
Symmetric hash is disabled globally for flow type ipv6-sctp by port 1
Symmetric hash is disabled globally for flow type ipv6-other by port 1
Symmetric hash is disabled globally for flow type l2_payload by port 1

$ sudo ovs-appctl dpif-netdev/pmd-rxq-show | egrep '(pmd thread|link1|link2)'
pmd thread numa_id 0 core_id 1:
  port: dpdk-link1-port   queue-id: 0    pmd usage: 11 %
pmd thread numa_id 0 core_id 2:
  port: dpdk-link2-port   queue-id: 2    pmd usage: 13 %
pmd thread numa_id 0 core_id 10:
  port: dpdk-link1-port   queue-id: 2    pmd usage: 13 %
  port: dpdk-link2-port   queue-id: 3    pmd usage: 11 %
pmd thread numa_id 0 core_id 11:
  port: dpdk-link1-port   queue-id: 1    pmd usage: 12 %
  port: dpdk-link2-port   queue-id: 1    pmd usage: 12 %
pmd thread numa_id 0 core_id 13:
  port: dpdk-link2-port   queue-id: 0    pmd usage: 11 %
pmd thread numa_id 0 core_id 14:
pmd thread numa_id 0 core_id 22:
pmd thread numa_id 0 core_id 23:
  port: dpdk-link1-port   queue-id: 3    pmd usage: 11 %

So it turned out to be an issue with the Toeplitz hashing that is enabled by default on this hardware. In summary:

(1) Toeplitz hashing is on by default and did not enable any of the L3/L4 headers for the calculation, resulting in the imbalance reported above.
(2) Simple XOR performs RSS with an even load distribution across the HW queues.

So, should simple XOR be turned on instead as the default hashing scheme for the deployment? Thanks.
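For illustration of how the two schemes can treat these four flows differently, below is a rough Python sketch of a Toeplitz hash and a simple address-XOR hash over the src-IP-varied UDP tuples from this test. It assumes the standard Microsoft default RSS key (not necessarily the key the i40e PMD programs into the X710), XORs only the source and destination addresses for the "simple XOR" case (field selection varies by NIC), maps the low 2 bits to one of 4 queues, and ignores the NIC's RSS indirection table, so it is only a model of the hashing step, not the X710's exact behaviour.

# Illustrative sketch only: the RSS key below is the Microsoft default from the
# RSS spec, NOT necessarily the key programmed into the X710 by the i40e PMD,
# and a real NIC selects the queue through an indirection table (RETA).
import ipaddress
import struct

MS_RSS_KEY = bytes([
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
])

def toeplitz(key: bytes, data: bytes) -> int:
    """Toeplitz hash: XOR in a sliding 32-bit window of the key for every set input bit."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result

def udp4_tuple(src_ip, dst_ip, sport, dport) -> bytes:
    # Hash input assumed as src addr, dst addr, src port, dst port.
    return (ipaddress.ip_address(src_ip).packed
            + ipaddress.ip_address(dst_ip).packed
            + struct.pack("!HH", sport, dport))

# The four streams from the trex test: only the last octet of the src IP changes.
flows = [("50.0.4.%d" % i, "50.0.5.43", 32768, 53) for i in (46, 47, 48, 49)]

for f in flows:
    t = toeplitz(MS_RSS_KEY, udp4_tuple(*f))
    # "Simple XOR" style hash over the addresses only.
    x = int.from_bytes(ipaddress.ip_address(f[0]).packed, "big") ^ \
        int.from_bytes(ipaddress.ip_address(f[1]).packed, "big")
    print(f"{f[0]:>10}  toeplitz queue={t & 0x3}  xor queue={x & 0x3}")

Because only the low bits of the src IP vary, the address XOR trivially spreads these particular four flows over four distinct low-bit values, while the Toeplitz result for the same inputs depends on the key and may collide; that matches the imbalance observed above without implying either algorithm is "wrong".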
(In reply to Gowrishankar Muthukrishnan from comment #4)
> (In reply to Ilya Maximets from comment #2)
> > Hmm. With 4 traffic streams and 4 HW queues, it is very likely that the
> > HW RSS hash function will not distribute traffic evenly across all the
> > available queues, simply because it is plain hashing. You need to
> > generate more flows, or find flow patterns that happen to be distributed
> > evenly by your particular NIC.
>
> Doubting how one port could hash correctly while the other did not

There is no "correct" or "incorrect" here. Hashes are hashes. You can't expect that 4 flows will be perfectly distributed across 4 HW queues. If one hash algorithm gives you that result, it doesn't mean it will work better in a real-world scenario, or even with a different set of 4 flows.

> I did some checking against the RSS spec in the datasheet and how the DPDK
> driver enables it (the HTOEP bit set in the GLQF_CTL register). When I
> bring up the compute node, I see the "toeplitz" hashing algorithm set by
> default, so it is not the simple hash we thought (there is no OVS command
> to inspect this, so I used testpmd on the phy ports after stopping
> ovs-dpdk for a while). However, none of the packet headers were enabled
> for the hashing either.

Re-loading the DPDK driver (stopping OVS and starting testpmd) most likely drops most of the device configuration.

> Now I started ovs-dpdk again (as we did not reset the NIC HW) and here are
> the results wrt PMD/rxq load:

OVS explicitly sets the RSS configuration while initializing the port, so I doubt that any of the above configuration survives an OVS restart.

Are you sure that you're generating exactly the same traffic streams in all cases (IPs, MACs, UDP/TCP ports)?
(In reply to Ilya Maximets from comment #5)
> (In reply to Gowrishankar Muthukrishnan from comment #4)
> > (In reply to Ilya Maximets from comment #2)
> > > Hmm. With 4 traffic streams and 4 HW queues, it is very likely that the
> > > HW RSS hash function will not distribute traffic evenly across all the
> > > available queues, simply because it is plain hashing. You need to
> > > generate more flows, or find flow patterns that happen to be
> > > distributed evenly by your particular NIC.
> >
> > Doubting how one port could hash correctly while the other did not
>
> There is no "correct" or "incorrect" here. Hashes are hashes. You can't
> expect that 4 flows will be perfectly distributed across 4 HW queues. If
> one hash algorithm gives you that result, it doesn't mean it will work
> better in a real-world scenario, or even with a different set of 4 flows.

I agree, but only until a hash collision occurs, right? For the validation of RSS we have to have control over the input traffic streams. In the L2/L3/L4 tuple, I change only the src IP, four times, to generate 4 different streams, keeping the rest of the headers constant, so to my understanding the hashes calculated over those tuples should not collide.

> > I did some checking against the RSS spec in the datasheet and how the
> > DPDK driver enables it (the HTOEP bit set in the GLQF_CTL register). When
> > I bring up the compute node, I see the "toeplitz" hashing algorithm set
> > by default, so it is not the simple hash we thought (there is no OVS
> > command to inspect this, so I used testpmd on the phy ports after
> > stopping ovs-dpdk for a while). However, none of the packet headers were
> > enabled for the hashing either.
>
> Re-loading the DPDK driver (stopping OVS and starting testpmd) most likely
> drops most of the device configuration.

Agreed, but would you expect the GLQF register to be reset too? I was able to retain the setting I updated, even across starting and stopping ovs-dpdk (until the system itself is rebooted). Here is what I did:

1. System boots and ovs-dpdk is started.
2. I stopped ovs-dpdk and launched testpmd, with vfio-pci still loaded.
3. Performed the hashing inspection/updates as described.
4. Stopped testpmd and started ovs-dpdk.
5. Ran the tests to observe the hashing behaviour.
6. Stopped ovs-dpdk and started testpmd.
7. Observed that my hash setting did not change (same as set in step 3).
8. Stopped testpmd and started ovs-dpdk.

> > Now I started ovs-dpdk again (as we did not reset the NIC HW) and here
> > are the results wrt PMD/rxq load:
>
> OVS explicitly sets the RSS configuration while initializing the port, so I
> doubt that any of the above configuration survives an OVS restart.

I could not find anywhere in OVS that calls eth_dev_ops->filter_ctrl. Any pointer where I can check for it?

> Are you sure that you're generating exactly the same traffic streams in all
> cases (IPs, MACs, UDP/TCP ports)?

Yes, I am not changing the traffic pattern I set for the comparison.

port 0 -> 1
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.46.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.47.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.48.32768>50.0.5.43.53:30840
e4:43:4b:5e:1c:22 > fa:16:3e:7e:26:38, 50.0.4.49.32768>50.0.5.43.53:30840

port 1 -> 0
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.43.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.44.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.45.32768>50.0.4.46.53:30840
fa:16:3e:7e:26:38 > e4:43:4b:5e:1c:22, 50.0.5.46.32768>50.0.4.46.53:30840

The same trex command was used to create these streams in both the simple-XOR and the Toeplitz hashing cases.
However, I just noticed one thing about the generated source IPs: one of the src IPs, "50.0.5.43", in the (port 1 -> 0) streams is always the dst IP in the (port 0 -> 1) streams. I am not sure whether that is a concern for Toeplitz (though one is on ingress and the other on egress). Meanwhile, I'll come back after keeping entirely different IPs in both directions of the streams for this test.
(In reply to Gowrishankar Muthukrishnan from comment #6)
> (In reply to Ilya Maximets from comment #5)
> > (In reply to Gowrishankar Muthukrishnan from comment #4)
> > > (In reply to Ilya Maximets from comment #2)
> > > > Hmm. With 4 traffic streams and 4 HW queues, it is very likely that
> > > > the HW RSS hash function will not distribute traffic evenly across
> > > > all the available queues, simply because it is plain hashing. You
> > > > need to generate more flows, or find flow patterns that happen to be
> > > > distributed evenly by your particular NIC.
> > >
> > > Doubting how one port could hash correctly while the other did not
> >
> > There is no "correct" or "incorrect" here. Hashes are hashes. You can't
> > expect that 4 flows will be perfectly distributed across 4 HW queues. If
> > one hash algorithm gives you that result, it doesn't mean it will work
> > better in a real-world scenario, or even with a different set of 4 flows.
>
> I agree, but only until a hash collision occurs, right? For the validation
> of RSS we have to have control over the input traffic streams. In the
> L2/L3/L4 tuple, I change only the src IP, four times, to generate 4
> different streams, keeping the rest of the headers constant, so to my
> understanding the hashes calculated over those tuples should not collide.

Since you have only 4 queues, the queue will be chosen using the last 2 bits of the hash value. So your hash function effectively converts packet fields to a number in the range [0, 3]. An ideal hash function should return a value that looks random. In that case the probability that 4 random flows are hashed to 4 different queues is 4/4 * 3/4 * 2/4 * 1/4 = 9.375%. In all other ~90% of cases there will be at least one collision. Of course, our hash function is not ideal, and if you're using a simple XOR you may just pre-calculate values and choose good flow patterns. But, anyway.

> > > I did some checking against the RSS spec in the datasheet and how the
> > > DPDK driver enables it (the HTOEP bit set in the GLQF_CTL register).
> > > When I bring up the compute node, I see the "toeplitz" hashing
> > > algorithm set by default, so it is not the simple hash we thought
> > > (there is no OVS command to inspect this, so I used testpmd on the phy
> > > ports after stopping ovs-dpdk for a while). However, none of the
> > > packet headers were enabled for the hashing either.
> >
> > Re-loading the DPDK driver (stopping OVS and starting testpmd) most
> > likely drops most of the device configuration.
>
> Agreed, but would you expect the GLQF register to be reset too? I was able
> to retain the setting I updated, even across starting and stopping
> ovs-dpdk (until the system itself is rebooted).

I looked through the DPDK code and you might be right here. OVS doesn't call filter_ctl, so it's possible that GLQF is not reset between application restarts.

> Here is what I did:
>
> 1. System boots and ovs-dpdk is started.
> 2. I stopped ovs-dpdk and launched testpmd, with vfio-pci still loaded.
> 3. Performed the hashing inspection/updates as described.
> 4. Stopped testpmd and started ovs-dpdk.
> 5. Ran the tests to observe the hashing behaviour.
> 6. Stopped ovs-dpdk and started testpmd.
> 7. Observed that my hash setting did not change (same as set in step 3).
> 8. Stopped testpmd and started ovs-dpdk.
>
> > > Now I started ovs-dpdk again (as we did not reset the NIC HW) and here
> > > are the results wrt PMD/rxq load:
> >
> > OVS explicitly sets the RSS configuration while initializing the port, so
> > I doubt that any of the above configuration survives an OVS restart.
>
> I could not find anywhere in OVS that calls eth_dev_ops->filter_ctrl. Any
> pointer where I can check for it?
OVS sets rss_conf, but this only changes I40E_PFQF_HENA. So, yes, some configuration on this NIC might be preserved, but this might not work for other NIC types.
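As an aside, the 9.375% figure from comment 9 can be reproduced with a tiny Python simulation. This is an illustration only, assuming an idealised uniform hash; it is not part of OVS or DPDK.

# Estimate the chance that 4 independently, uniformly hashed flows all land on
# different queues when there are 4 queues (ideal-hash assumption).
import random

TRIALS = 1_000_000
QUEUES = 4
FLOWS = 4

perfect = 0
for _ in range(TRIALS):
    queues = {random.randrange(QUEUES) for _ in range(FLOWS)}
    if len(queues) == FLOWS:          # every flow got its own queue
        perfect += 1

print(f"simulated: {perfect / TRIALS:.4%}")        # ~9.38%
print(f"analytic : {1 * 3/4 * 2/4 * 1/4:.4%}")     # 9.3750%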
(In reply to Ilya Maximets from comment #9)
> Since you have only 4 queues, the queue will be chosen using the last 2
> bits of the hash value. So your hash function effectively converts packet
> fields to a number in the range [0, 3]. An ideal hash function should
> return a value that looks random. In that case the probability that 4
> random flows are hashed to 4 different queues is
> 4/4 * 3/4 * 2/4 * 1/4 = 9.375%. In all other ~90% of cases there will be at
> least one collision.
> Of course, our hash function is not ideal, and if you're using a simple XOR
> you may just pre-calculate values and choose good flow patterns. But,
> anyway.

Is this then a limitation of Toeplitz hashing (the default chosen by the NIC)? If so, shouldn't this be documented for awareness of the issue?

<cut>

> > Agreed, but would you expect the GLQF register to be reset too? I was
> > able to retain the setting I updated, even across starting and stopping
> > ovs-dpdk (until the system itself is rebooted).
>
> I looked through the DPDK code and you might be right here. OVS doesn't
> call filter_ctl, so it's possible that GLQF is not reset between
> application restarts.

Should this be documented elsewhere, for someone troubleshooting this type of load-balance issue?

> > I could not find anywhere in OVS that calls eth_dev_ops->filter_ctrl. Any
> > pointer where I can check for it?
>
> OVS sets rss_conf, but this only changes I40E_PFQF_HENA. So, yes, some
> configuration on this NIC might be preserved, but this might not work for
> other NIC types.

Agreed.
(In reply to Gowrishankar Muthukrishnan from comment #10)
> (In reply to Ilya Maximets from comment #9)
> > Since you have only 4 queues, the queue will be chosen using the last 2
> > bits of the hash value. So your hash function effectively converts packet
> > fields to a number in the range [0, 3]. An ideal hash function should
> > return a value that looks random. In that case the probability that 4
> > random flows are hashed to 4 different queues is
> > 4/4 * 3/4 * 2/4 * 1/4 = 9.375%. In all other ~90% of cases there will be
> > at least one collision.
> > Of course, our hash function is not ideal, and if you're using a simple
> > XOR you may just pre-calculate values and choose good flow patterns.
> > But, anyway.
>
> Is this then a limitation of Toeplitz hashing (the default chosen by the
> NIC)?

No, it's not a limitation. It's the default behaviour of any hash. XOR just happens to work in this case.

> If so, shouldn't this be documented for awareness of the issue?
>
> <cut>
>
> > > Agreed, but would you expect the GLQF register to be reset too? I was
> > > able to retain the setting I updated, even across starting and
> > > stopping ovs-dpdk (until the system itself is rebooted).
> >
> > I looked through the DPDK code and you might be right here. OVS doesn't
> > call filter_ctl, so it's possible that GLQF is not reset between
> > application restarts.
>
> Should this be documented elsewhere, for someone troubleshooting this type
> of load-balance issue?

I think we could consider this a sort of driver issue in DPDK. I'll ask the DPDK maintainers what they think about it.

> > > I could not find anywhere in OVS that calls eth_dev_ops->filter_ctrl.
> > > Any pointer where I can check for it?
> >
> > OVS sets rss_conf, but this only changes I40E_PFQF_HENA. So, yes, some
> > configuration on this NIC might be preserved, but this might not work
> > for other NIC types.
>
> Agreed.
FW info for the NIC tested, in case required:

$ sudo ethtool -i eno1
driver: i40e
version: 2.8.20-k
firmware-version: 6.80 0x80003d74 18.8.9
expansion-rom-version:
bus-info: 0000:18:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes