Bug 1654824
| Summary: | [dpdk] Ramrod Failure errors when running testpmd with qede | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jean-Tsung Hsiao <jhsiao> |
| Component: | dpdk | Assignee: | David Marchand <dmarchan> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Jean-Tsung Hsiao <jhsiao> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | --- | CC: | arahman, ctrautma, dmarchan, jhsiao, kzhang, mrundle, ovs-qe, rasesh.mody, rkhan, shahed.shaikh, shshaikh, tredaelli |
| Target Milestone: | pre-dev-freeze | Keywords: | Triaged |
| Target Release: | 8.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 1738789 (view as bug list) | Environment: | |
| Last Closed: | 2020-12-15 11:45:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jean-Tsung Hsiao
2018-11-29 18:49:18 UTC
We could not recreate the issue in our lab setup, and the driver logs are not sufficient for further debugging. Please collect and provide additional debug data, including firmware traces, for further analysis.

(In reply to Rasesh Mody from comment #1)
I believe Tim already collected some info.

We are able to recreate the issue using an upstream kernel. We are in the process of bisecting to find the culprit patch, which we believe is outside of our drivers. This is taking time as we have trouble booting some of the bisected kernels.
Thanks
Ameen

Has anything further been found?

Not yet. We need to resume this work.

The same issue still exists with dpdk-18.11-4.

[root@netqe10 ~]# testpmd -w 0000:83:00.0 -w 0000:83:00.1 -- -i
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:83:00.0 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
EAL: using IOMMU type 1 (Type 1)
EAL: PCI device 0000:83:00.1 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=331456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=331456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 1)
Port 0: 00:0E:1E:D3:F6:56
Configuring Port 1 (socket 1)
[QEDE PMD: (83:00.1:dpdk-port-1-0)]ecore_spq_block:Ramrod is stuck [CID ff000000 cmd 01 proto 04 echo 0002]
[qede_hw_err_notify:296(83:00.1:dpdk-port-1-0)]HW error occurred [Ramrod Failure]
[qede_start_vport:405(83:00.1:dpdk-port-1)]Start V-PORT failed -2
Port1 dev_configure = -1
Fail to configure port 1
EAL: Error - exiting with code: 1
Cause: Start ports failed

[root@netqe10 ~]# rpm -q dpdk
dpdk-18.11-4.el8.x86_64
[root@netqe10 ~]# uname -a
Linux netqe10.knqe.lab.eng.bos.redhat.com 4.18.0-80.el8.x86_64 #1 SMP Wed Mar 13 12:02:46 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@netqe10 ~]#

NIC QL41000:

[root@netqe30 ~]# ethtool -i ens1f0
driver: qede
version: 8.33.0.20
firmware-version: mfw 8.18.18.0 storm 8.37.2.0
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: yes
[root@netqe30 ~]#

NIC QL45000:

[root@netqe10 ~]# ethtool -i enp131s0f0
driver: qede
version: 8.33.0.20
firmware-version: mfw 8.34.8.0 storm 8.37.2.0
expansion-rom-version:
bus-info: 0000:83:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: yes
[root@netqe10 ~]#

I think this bug is related to https://bugzilla.redhat.com/show_bug.cgi?id=1551605. The mentioned bug complains about "No irq handler for vector" logs introduced after the 4.14.0 kernel and seen in 4.15+ kernel versions.

In the case of qede, the Ramrod failure is seen after "No irq handler for vector", and it also matches the kernel version history.

# echo quit | ./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -i
EAL: Detected 32 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL: probe driver: 1077:8070 net_qede
EAL: using IOMMU type 1 (Type 1)
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL: probe driver: 1077:8070 net_qede
EAL: PCI device 0000:21:00.0 on NUMA socket 1
EAL: probe driver: 1077:1644 net_qede
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 0)
Port 0: F4:E9:D4:ED:20:04
Configuring Port 1 (socket 0)
Port 1: F4:E9:D4:ED:20:05
Checking link statuses...
Done
testpmd> quit

Stopping port 0...
Stopping ports...
Done

Stopping port 1...
Stopping ports...
2019 Apr 12 21:44:07 dpdk-1 do_IRQ: 24.35 No irq handler for vector    >>>>>>>>>>>>>>>>>>>> CHECK THIS
[QEDE PMD: (04:00.1:dpdk-port-1-0)]ecore_spq_block:Ramrod is stuck [CID ff100010 cmd 05 proto 04 echo 000d]
[qede_hw_err_notify:296(04:00.1:dpdk-port-1-0)]HW error occurred [Ramrod Failure]
[qede_rx_queue_stop:376(04:00.1:dpdk-port-1)]RX queue 0 stop fails
Done

Shutting down port 0...
Closing ports...
Port 0: link state change event
Done

(In reply to Shahed Shaikh from comment #9)
This is from my test bed:

[root@netqe10 ~]# dmesg | grep -i irq
[158764.159833] do_IRQ: 9.34 No irq handler for vector
[213663.060199] do_IRQ: 11.34 No irq handler for vector
[214273.155858] do_IRQ: 23.35 No irq handler for vector
[214394.269106] do_IRQ: 13.36 No irq handler for vector
[215753.766134] do_IRQ: 23.36 No irq handler for vector
[217920.661218] do_IRQ: 1.36 No irq handler for vector
[217962.638033] do_IRQ: 13.34 No irq handler for vector
[218173.138996] do_IRQ: 21.34 No irq handler for vector
[218429.827628] do_IRQ: 15.35 No irq handler for vector
[218492.267655] do_IRQ: 13.35 No irq handler for vector
[218514.555786] do_IRQ: 15.35 No irq handler for vector
[218593.863876] do_IRQ: 11.36 No irq handler for vector
[218983.311982] do_IRQ: 17.34 No irq handler for vector

Thanks for the update!
Jean

NOTE: The issue is not 100% reproducible. But when the issue happens, the "No irq handler for vector" message comes with it.

[root@netqe10 ~]# testpmd -w 0000:83:00.0 -w 0000:83:00.1 -- -i
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:83:00.0 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
EAL: using IOMMU type 1 (Type 1)
[QEDE PMD: ()]ecore_fw_assertion:FW assertion!
[qede_hw_err_notify:296()]HW error occurred [FW Assertion]
[QEDE PMD: ()]ecore_int_deassertion_aeu_bit:`General Attention 32': Fatal attention
[qede_hw_err_notify:296()]HW error occurred [HW Attention]
[ecore_int_deassertion_aeu_bit:972()]`General Attention 32' - Disabled future attentions
EAL: PCI device 0000:83:00.1 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=331456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=331456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 1)
Port 0: 00:0E:1E:D3:F6:56
Configuring Port 1 (socket 1)
[QEDE PMD: (83:00.1:dpdk-port-1-0)]ecore_spq_block:Ramrod is stuck [CID ff000000 cmd 01 proto 04 echo 0002]
[qede_hw_err_notify:296(83:00.1:dpdk-port-1-0)]HW error occurred [Ramrod Failure]
[qede_start_vport:405(83:00.1:dpdk-port-1)]Start V-PORT failed -2
Port1 dev_configure = -1
Fail to configure port 1
EAL: Error - exiting with code: 1
Cause: Start ports failed
[root@netqe10 ~]#

[root@netqe10 ~]# dmesg | grep -i irq
[158764.159833] do_IRQ: 9.34 No irq handler for vector
[213663.060199] do_IRQ: 11.34 No irq handler for vector
[214273.155858] do_IRQ: 23.35 No irq handler for vector
[214394.269106] do_IRQ: 13.36 No irq handler for vector
[215753.766134] do_IRQ: 23.36 No irq handler for vector
[217920.661218] do_IRQ: 1.36 No irq handler for vector
[217962.638033] do_IRQ: 13.34 No irq handler for vector
[218173.138996] do_IRQ: 21.34 No irq handler for vector
[218429.827628] do_IRQ: 15.35 No irq handler for vector
[218492.267655] do_IRQ: 13.35 No irq handler for vector
[218514.555786] do_IRQ: 15.35 No irq handler for vector
[218593.863876] do_IRQ: 11.36 No irq handler for vector
[218983.311982] do_IRQ: 17.34 No irq handler for vector
[235604.424099] do_IRQ: 9.37 No irq handler for vector

[root@netqe10 ~]# testpmd -w 0000:83:00.0 -w 0000:83:00.1 -- -i
EAL: Detected 24 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:83:00.0 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
EAL: using IOMMU type 1 (Type 1)
EAL: PCI device 0000:83:00.1 on NUMA socket 1
EAL: probe driver: 1077:1656 net_qede
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=331456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=331456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 1)
Port 0: 00:0E:1E:D3:F6:56
Configuring Port 1 (socket 1)
Port 1: 00:0E:1E:D3:F6:57
Checking link statuses...
Done
testpmd>

[root@netqe10 ~]# dmesg | grep -i irq
[158764.159833] do_IRQ: 9.34 No irq handler for vector
[213663.060199] do_IRQ: 11.34 No irq handler for vector
[214273.155858] do_IRQ: 23.35 No irq handler for vector
[214394.269106] do_IRQ: 13.36 No irq handler for vector
[215753.766134] do_IRQ: 23.36 No irq handler for vector
[217920.661218] do_IRQ: 1.36 No irq handler for vector
[217962.638033] do_IRQ: 13.34 No irq handler for vector
[218173.138996] do_IRQ: 21.34 No irq handler for vector
[218429.827628] do_IRQ: 15.35 No irq handler for vector
[218492.267655] do_IRQ: 13.35 No irq handler for vector
[218514.555786] do_IRQ: 15.35 No irq handler for vector
[218593.863876] do_IRQ: 11.36 No irq handler for vector
[218983.311982] do_IRQ: 17.34 No irq handler for vector
[235604.424099] do_IRQ: 9.37 No irq handler for vector

After further analysis, it seems to be something related to MSI/MSI-X.

If I launch testpmd without using MSI or MSI-X (--vfio-intr legacy), I cannot replicate the problem anymore.

Of course, this is NOT the solution, since legacy mode cannot be used on SR-IOV.

(In reply to Timothy Redaelli from comment #12)
The workaround looks solid. What would be the next step for this bug?

Hi Ameen,
We've been using qed_init_values-8.37.7.0.bin for both RHEL 7 and RHEL 8. Is this correct?
Thanks!
Jean

Yes. A given version of the DPDK driver works with a given version of qed_init_values-<version>.bin.

The same issue, Ramrod is stuck, still exists with openvswitch2.11-2.11.0-9.el8fdp. Our OVS-DPDK tunneling automation over qede failed 4 out of 9 tests.

The loop below can reproduce the issue easily:

while [ 1 ]; do date; systemctl stop openvswitch; sleep 3; systemctl start openvswitch; ovs-vsctl show; done > ovs_start_stop.log 2>&1 &

Running the loop for about an hour shows that the qede failure rate (Ramrod is stuck) is about 10%: 35 out of 351 tests. The NIC under test in this case is a QLogic FastLinQ QL45212H 25GbE Adapter.

With the QLogic FastLinQ QL41262H 25GbE Adapter the failure rate is much smaller: running for more than a day produced only 2 such failures out of 10975 tests, less than 0.02%.

This is being debugged in https://bugzilla.redhat.com/show_bug.cgi?id=1704202

Posted a fix upstream.

Rasesh Mody, can you have a look at http://patchwork.dpdk.org/patch/55310/ ?

(In reply to David Marchand from comment #20)
Hi David,
The change looks good, acked the fix. Thanks.

Ok, thanks.
I will take this bz and handle the downstream side of it once upstream merges it.
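The repeated-start checks described in this thread (the back-to-back testpmd runs on netqe10 and the OVS restart loop above) can be scripted. The sketch below is not taken from the bug itself: it assumes testpmd is in PATH, that both qede ports are already bound to vfio-pci, and it reuses the PCI addresses from the netqe10 logs; the run count, the grep pattern, and the script structure are arbitrary choices.

```bash
#!/bin/bash
# Hypothetical stress check (not from the bug report): start testpmd repeatedly
# on the two qede ports and count how many runs hit the Ramrod failure.
# Assumes: testpmd in PATH, 0000:83:00.0/0000:83:00.1 already bound to vfio-pci.
PORTS="-w 0000:83:00.0 -w 0000:83:00.1"
RUNS=100          # arbitrary run count
fails=0
for i in $(seq 1 "$RUNS"); do
    # "quit" on stdin makes interactive mode exit right after port setup,
    # matching the echo-quit reproducer used earlier in this thread.
    if echo quit | testpmd $PORTS -- -i 2>&1 | grep -q "Ramrod is stuck"; then
        fails=$((fails + 1))
        echo "run $i: Ramrod failure"
    fi
done
echo "$fails failures out of $RUNS runs"
```

Each iteration brings the ports up and tears them down immediately, which is exactly where the Ramrod errors appear in the logs above.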
Dropped the (incorrect) patch at the driver level and fixed the issue at the dpdk vfio infrastructure level.
http://patchwork.dpdk.org/patch/55867/

Started testpmd 100 times on the netqe10 server, which demonstrates this issue with the QL45000 NIC:
- without the patch, got the issue 14 times,
- with the patch, no issue.

Shahed, Rasesh, could you have a try at this patch in msix and legacy interrupt mode?

As mentioned in Comment #3, we suspected something outside the driver, but we couldn't prove it. Thank you, David, for helping us with this. I will have Rasesh or Shahed verify the fix. We should also have Jean-Tsung Hsiao (submitter of this bug) verify this.

(In reply to David Marchand from comment #23)
Hi David,
I have tested the patch with msix and legacy interrupt mode. Did not see any issue :)
Do you want me to add a Tested-by tag to your patch on the dpdk mailing list?

This is still an RFC, but I don't mind getting a Tested-by, yes. Thank you.

Posted a non-rfc patch, no change from the one you tested, thanks Shahed.

This bz has been reported against the dpdk 18.11 package. I will clone it and address the issue in openvswitch2.11, since Jean reported the issue as well.

I would expect the same issue to happen in dpdk-17.11, and so in openvswitch 2.9 as well. I don't have the hw which seems to trigger the issue that easily. Can any of you confirm the issue can be seen with those versions?

(In reply to David Marchand from comment #28)
I got the HW. Let me run my reproducer again.

Using dpdk-18.11-8.el8.x86_64, I was able to reproduce the issue 2 out of 5 times, under the RHEL 8.0.0 kernel, 4.18.0-80.el8.x86_64.

(In reply to Jean-Tsung Hsiao from comment #30)
Reproducer:

testpmd -w 0000:84:00.0 -w 0000:84:00.1 -- -i

[root@netqe10 ~]# driverctl -v list-overrides
0000:84:00.0 vfio-pci (FastLinQ QL45000 Series 25GbE Controller (FastLinQ QL45212H 25GbE Adapter))
0000:84:00.1 vfio-pci (FastLinQ QL45000 Series 25GbE Controller (FastLinQ QL45212H 25GbE Adapter))
[root@netqe10 ~]#

Using dpdk-17.11-15 under RHEL 7.6, I was unable to reproduce the issue in 10 tries.

Thanks Jean. Then I'll consider that only 18.11 is affected.

One more data point: I can't reproduce the issue in the following environment:

[root@netqe10 ~]# rpm -q dpdk
dpdk-18.11.2-1.el7.x86_64
[root@netqe10 ~]# uname -r
3.10.0-1061.el7.x86_64
[root@netqe10 ~]#

This problem has been fixed in DPDK 19.11, which is packaged in RHEL 8.2+. Marking as fixed in the next release.
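For reference, the temporary workaround Timothy Redaelli describes earlier in this thread (forcing legacy interrupts so the MSI/MSI-X path is not used) corresponds to an invocation along the lines of the sketch below. This is a diagnostic aid on affected builds only, not a fix, and, as noted above, legacy mode cannot be used with SR-IOV. The PCI addresses are taken from the netqe10 logs and are an assumption about the local setup.

```bash
# Diagnostic only (workaround noted by Timothy Redaelli, not a fix):
# run testpmd with the EAL option --vfio-intr set to legacy so the qede ports
# are probed without MSI/MSI-X interrupts. Cannot be used with SR-IOV.
testpmd --vfio-intr legacy -w 0000:83:00.0 -w 0000:83:00.1 -- -i
```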