Bug 1654824
| Summary: | [dpdk] Ramrod Failure errors when running testpmd with qede | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jean-Tsung Hsiao <jhsiao> |
| Component: | dpdk | Assignee: | David Marchand <dmarchan> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Jean-Tsung Hsiao <jhsiao> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | --- | CC: | arahman, ctrautma, dmarchan, jhsiao, kzhang, mrundle, ovs-qe, rasesh.mody, rkhan, shahed.shaikh, shshaikh, tredaelli |
| Target Milestone: | pre-dev-freeze | Keywords: | Triaged |
| Target Release: | 8.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 1738789 (view as bug list) | Environment: | |
| Last Closed: | 2020-12-15 11:45:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jean-Tsung Hsiao
2018-11-29 18:49:18 UTC
We could not recreate the issue in our lab setup, and the driver logs are not sufficient for further debugging. Please collect and provide additional debug data, including firmware traces, for further analysis.

(In reply to Rasesh Mody from comment #1)
I believe Tim already collected some info.

We are able to recreate the issue using an upstream kernel. We are in the process of bisecting to find the culprit patch, which we believe is outside of our drivers. This is taking time as we have trouble booting some of the bisected kernels.
Thanks
Ameen

Has anything further been found?

Not yet. We need to resume this work.

The same issue still exists with dpdk-18.11-4.

[root@netqe10 ~]# testpmd -w 0000:83:00.0 -w 0000:83:00.1 -- -i
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:83:00.0 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
EAL: using IOMMU type 1 (Type 1)
EAL: PCI device 0000:83:00.1 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=331456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=331456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 1)
Port 0: 00:0E:1E:D3:F6:56
Configuring Port 1 (socket 1)
[QEDE PMD: (83:00.1:dpdk-port-1-0)]ecore_spq_block:Ramrod is stuck [CID ff000000 cmd 01 proto 04 echo 0002]
[qede_hw_err_notify:296(83:00.1:dpdk-port-1-0)]HW error occurred [Ramrod Failure]
[qede_start_vport:405(83:00.1:dpdk-port-1)]Start V-PORT failed -2
Port1 dev_configure = -1
Fail to configure port 1
EAL: Error - exiting with code: 1
Cause: Start ports failed

[root@netqe10 ~]# rpm -q dpdk
dpdk-18.11-4.el8.x86_64
[root@netqe10 ~]# uname -a
Linux netqe10.knqe.lab.eng.bos.redhat.com 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@netqe10 ~]#

NIC QL41000:

[root@netqe30 ~]# ethtool -i ens1f0
driver: qede
version: 8.33.0.20
firmware-version: mfw 8.18.18.0 storm 8.37.2.0
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: yes
[root@netqe30 ~]#

NIC QL45000:

[root@netqe10 ~]# ethtool -i enp131s0f0
driver: qede
version: 8.33.0.20
firmware-version: mfw 8.34.8.0 storm 8.37.2.0
expansion-rom-version:
bus-info: 0000:83:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: yes
[root@netqe10 ~]#

I think this bug is related to https://bugzilla.redhat.com/show_bug.cgi?id=1551605. The mentioned bug complains about "No irq handler for vector" logs introduced after the 4.14.0 kernel and seen in 4.15+ kernel versions.

In the case of qede, the Ramrod failure is seen after "No irq handler for vector", and it also matches the kernel version history.

# echo quit | ./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -i
EAL: Detected 32 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL: probe driver: 1077:8070 net_qede
EAL: using IOMMU type 1 (Type 1)
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL: probe driver: 1077:8070 net_qede
EAL: PCI device 0000:21:00.0 on NUMA socket 1
EAL: probe driver: 1077:1644 net_qede
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 0)
Port 0: F4:E9:D4:ED:20:04
Configuring Port 1 (socket 0)
Port 1: F4:E9:D4:ED:20:05
Checking link statuses...
Done
testpmd> quit

Stopping port 0...
Stopping ports...
Done

Stopping port 1...
Stopping ports...
2019 Apr 12 21:44:07 dpdk-1 do_IRQ: 24.35 No irq handler for vector    >>>>>>>>>>>>>>>>>>>> CHECK THIS
[QEDE PMD: (04:00.1:dpdk-port-1-0)]ecore_spq_block:Ramrod is stuck [CID ff100010 cmd 05 proto 04 echo 000d]
[qede_hw_err_notify:296(04:00.1:dpdk-port-1-0)]HW error occurred [Ramrod Failure]
[qede_rx_queue_stop:376(04:00.1:dpdk-port-1)]RX queue 0 stop fails
Done

Shutting down port 0...
Closing ports...
Port 0: link state change event
Done

(In reply to Shahed Shaikh from comment #9)
This is from my test bed:

[root@netqe10 ~]# dmesg | grep -i irq
[158764.159833] do_IRQ: 9.34 No irq handler for vector
[213663.060199] do_IRQ: 11.34 No irq handler for vector
[214273.155858] do_IRQ: 23.35 No irq handler for vector
[214394.269106] do_IRQ: 13.36 No irq handler for vector
[215753.766134] do_IRQ: 23.36 No irq handler for vector
[217920.661218] do_IRQ: 1.36 No irq handler for vector
[217962.638033] do_IRQ: 13.34 No irq handler for vector
[218173.138996] do_IRQ: 21.34 No irq handler for vector
[218429.827628] do_IRQ: 15.35 No irq handler for vector
[218492.267655] do_IRQ: 13.35 No irq handler for vector
[218514.555786] do_IRQ: 15.35 No irq handler for vector
[218593.863876] do_IRQ: 11.36 No irq handler for vector
[218983.311982] do_IRQ: 17.34 No irq handler for vector

Thanks for the update!
Jean

NOTE: The issue is not 100% reproducible. But when the issue happens, the "No irq handler for vector" message comes with it.

[root@netqe10 ~]# testpmd -w 0000:83:00.0 -w 0000:83:00.1 -- -i
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:83:00.0 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
EAL: using IOMMU type 1 (Type 1)
[QEDE PMD: ()]ecore_fw_assertion:FW assertion!
[qede_hw_err_notify:296()]HW error occurred [FW Assertion]
[QEDE PMD: ()]ecore_int_deassertion_aeu_bit:`General Attention 32': Fatal attention
[qede_hw_err_notify:296()]HW error occurred [HW Attention]
[ecore_int_deassertion_aeu_bit:972()]`General Attention 32' - Disabled future attentions
EAL: PCI device 0000:83:00.1 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=331456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=331456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 1)
Port 0: 00:0E:1E:D3:F6:56
Configuring Port 1 (socket 1)
[QEDE PMD: (83:00.1:dpdk-port-1-0)]ecore_spq_block:Ramrod is stuck [CID ff000000 cmd 01 proto 04 echo 0002]
[qede_hw_err_notify:296(83:00.1:dpdk-port-1-0)]HW error occurred [Ramrod Failure]
[qede_start_vport:405(83:00.1:dpdk-port-1)]Start V-PORT failed -2
Port1 dev_configure = -1
Fail to configure port 1
EAL: Error - exiting with code: 1
Cause: Start ports failed
[root@netqe10 ~]#

[root@netqe10 ~]# dmesg | grep -i irq
[158764.159833] do_IRQ: 9.34 No irq handler for vector
[213663.060199] do_IRQ: 11.34 No irq handler for vector
[214273.155858] do_IRQ: 23.35 No irq handler for vector
[214394.269106] do_IRQ: 13.36 No irq handler for vector
[215753.766134] do_IRQ: 23.36 No irq handler for vector
[217920.661218] do_IRQ: 1.36 No irq handler for vector
[217962.638033] do_IRQ: 13.34 No irq handler for vector
[218173.138996] do_IRQ: 21.34 No irq handler for vector
[218429.827628] do_IRQ: 15.35 No irq handler for vector
[218492.267655] do_IRQ: 13.35 No irq handler for vector
[218514.555786] do_IRQ: 15.35 No irq handler for vector
[218593.863876] do_IRQ: 11.36 No irq handler for vector
[218983.311982] do_IRQ: 17.34 No irq handler for vector
[235604.424099] do_IRQ: 9.37 No irq handler for vector

[root@netqe10 ~]# testpmd -w 0000:83:00.0 -w 0000:83:00.1 -- -i
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:83:00.0 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
EAL: using IOMMU type 1 (Type 1)
EAL: PCI device 0000:83:00.1 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=331456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=331456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 1)
Port 0: 00:0E:1E:D3:F6:56
Configuring Port 1 (socket 1)
Port 1: 00:0E:1E:D3:F6:57
Checking link statuses...
Done
testpmd>

[root@netqe10 ~]# dmesg | grep -i irq
[158764.159833] do_IRQ: 9.34 No irq handler for vector
[213663.060199] do_IRQ: 11.34 No irq handler for vector
[214273.155858] do_IRQ: 23.35 No irq handler for vector
[214394.269106] do_IRQ: 13.36 No irq handler for vector
[215753.766134] do_IRQ: 23.36 No irq handler for vector
[217920.661218] do_IRQ: 1.36 No irq handler for vector
[217962.638033] do_IRQ: 13.34 No irq handler for vector
[218173.138996] do_IRQ: 21.34 No irq handler for vector
[218429.827628] do_IRQ: 15.35 No irq handler for vector
[218492.267655] do_IRQ: 13.35 No irq handler for vector
[218514.555786] do_IRQ: 15.35 No irq handler for vector
[218593.863876] do_IRQ: 11.36 No irq handler for vector
[218983.311982] do_IRQ: 17.34 No irq handler for vector
[235604.424099] do_IRQ: 9.37 No irq handler for vector

After further analysis, it seems to be something related to MSI/MSI-X.

If I launch testpmd without using MSI or MSI-X (--vfio-intr legacy), I cannot replicate the problem anymore.

Of course, this is NOT the solution, since legacy mode cannot be used on SR-IOV.

(In reply to Timothy Redaelli from comment #12)
The workaround looks solid. What would be the next step for this bug?

Hi Ameen,
We've been using qed_init_values-8.37.7.0.bin for both RHEL 7 and RHEL 8. Is this correct?
Thanks!
Jean

Yes. A given version of the DPDK driver works with a given version of qed_init_values-<version>.bin.

The same issue, Ramrod is stuck, still exists with openvswitch2.11-2.11.0-9.el8fdp. Our OVS-DPDK tunneling automation over qede failed 4 out of 9 tests.

The loop below can reproduce the issue easily:

while [ 1 ]; do date; systemctl stop openvswitch; sleep 3; systemctl start openvswitch; ovs-vsctl show; done > ovs_start_stop.log 2>&1 &

Running the loop for about an hour shows that the qede failure rate (Ramrod is stuck) is about 10%: 35 out of 351 tests. The NIC under test in this case is a QLogic FastLinQ QL45212H 25GbE Adapter.

With the QLogic FastLinQ QL41262H 25GbE Adapter the failure rate is much smaller: running for more than a day produced only 2 such failures out of 10975 tests, less than 0.02%.

This is being debugged in https://bugzilla.redhat.com/show_bug.cgi?id=1704202

Posted a fix upstream.

Rasesh Mody, can you have a look at http://patchwork.dpdk.org/patch/55310/ ?

(In reply to David Marchand from comment #20)
Hi David,
The change looks good, acked the fix. Thanks.

Ok, thanks.
I will take this bz and handle the downstream side of it once upstream merges it.
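The repeated-start checks described in this thread (the back-to-back testpmd runs on netqe10 and the OVS restart loop above) can be scripted. The sketch below is not taken from the bug itself: it assumes testpmd is in PATH, that both qede ports are already bound to vfio-pci, and it reuses the PCI addresses from the netqe10 logs; the run count, the grep pattern, and the script structure are arbitrary choices.

```bash
#!/bin/bash
# Hypothetical stress check (not from the bug report): start testpmd repeatedly
# on the two qede ports and count how many runs hit the Ramrod failure.
# Assumes: testpmd in PATH, 0000:83:00.0/0000:83:00.1 already bound to vfio-pci.
PORTS="-w 0000:83:00.0 -w 0000:83:00.1"
RUNS=100          # arbitrary run count
fails=0
for i in $(seq 1 "$RUNS"); do
    # "quit" on stdin makes interactive mode exit right after port setup,
    # matching the echo-quit reproducer used earlier in this thread.
    if echo quit | testpmd $PORTS -- -i 2>&1 | grep -q "Ramrod is stuck"; then
        fails=$((fails + 1))
        echo "run $i: Ramrod failure"
    fi
done
echo "$fails failures out of $RUNS runs"
```

Each iteration brings the ports up and tears them down immediately, which is exactly where the Ramrod errors appear in the logs above.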
Dropped the (incorrect) patch at the driver level and fixed the issue at the dpdk vfio infrastructure level.
http://patchwork.dpdk.org/patch/55867/

Started testpmd 100 times on the netqe10 server, which demonstrates this issue with the QL45000 NIC:
- without the patch, got the issue 14 times,
- with the patch, no issue.

Shahed, Rasesh, could you have a try at this patch in msix and legacy interrupt mode?

As mentioned in Comment #3, we suspected something outside the driver, but we couldn't prove it. Thank you, David, for helping us with this. I will have Rasesh or Shahed verify the fix. We should also have Jean-Tsung Hsiao (submitter of this bug) verify this.

(In reply to David Marchand from comment #23)
Hi David,
I have tested the patch with msix and legacy interrupt mode. Did not see any issue :)
Do you want me to add a Tested-by tag to your patch on the dpdk mailing list?

This is still an RFC, but I don't mind getting a Tested-by, yes. Thank you.

Posted a non-rfc patch, no change from the one you tested, thanks Shahed.

This bz has been reported against the dpdk 18.11 package. I will clone it and address the issue in openvswitch2.11, since Jean reported the issue as well.

I would expect the same issue to happen in dpdk-17.11, and so in openvswitch 2.9 as well. I don't have the hw which seems to trigger the issue that easily. Can any of you confirm the issue can be seen with those versions?

(In reply to David Marchand from comment #28)
I got the HW. Let me run my reproducer again.

Using dpdk-18.11-8.el8.x86_64, I was able to reproduce the issue 2 out of 5 times, under the RHEL 8.0.0 kernel, 4.18.0-80.el8.x86_64.

(In reply to Jean-Tsung Hsiao from comment #30)
Reproducer:

testpmd -w 0000:84:00.0 -w 0000:84:00.1 -- -i

[root@netqe10 ~]# driverctl -v list-overrides
0000:84:00.0 vfio-pci (FastLinQ QL45000 Series 25GbE Controller (FastLinQ QL45212H 25GbE Adapter))
0000:84:00.1 vfio-pci (FastLinQ QL45000 Series 25GbE Controller (FastLinQ QL45212H 25GbE Adapter))
[root@netqe10 ~]#

Using dpdk-17.11-15 under RHEL 7.6, I was unable to reproduce the issue in 10 tries.

Thanks Jean. Then I'll consider that only 18.11 is affected.

One more data point: I can't reproduce the issue in the following environment:

[root@netqe10 ~]# rpm -q dpdk
dpdk-18.11.2-1.el7.x86_64
[root@netqe10 ~]# uname -r
3.10.0-1061.el7.x86_64
[root@netqe10 ~]#

This problem has been fixed in DPDK 19.11, which is packaged in RHEL 8.2+. Marking as fixed in the next release.
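For reference, the temporary workaround Timothy Redaelli describes earlier in this thread (forcing legacy interrupts so the MSI/MSI-X path is not used) corresponds to an invocation along the lines of the sketch below. This is a diagnostic aid on affected builds only, not a fix, and, as noted above, legacy mode cannot be used with SR-IOV. The PCI addresses are taken from the netqe10 logs and are an assumption about the local setup.

```bash
# Diagnostic only (workaround noted by Timothy Redaelli, not a fix):
# run testpmd with the EAL option --vfio-intr set to legacy so the qede ports
# are probed without MSI/MSI-X interrupts. Cannot be used with SR-IOV.
testpmd --vfio-intr legacy -w 0000:83:00.0 -w 0000:83:00.1 -- -i
```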