Bug 1767678 - [mlx5_core] OVS offload: mlx5_core 0000:04:00.0: cmd_work_handler:855:(pid 87647): failed to allocate command entry
Summary: [mlx5_core] OVS offload: mlx5_core 0000:04:00.0: cmd_work_handler:855:(pid 87...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: FDP 19.G
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: ---
Assignee: Alaa Hleihel (NVIDIA Mellanox)
QA Contact: Amit Supugade
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-01 02:06 UTC by qding
Modified: 2021-04-30 07:51 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-20 10:05:29 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
testout for log (256.56 KB, application/zip)
2019-11-01 02:06 UTC, qding
no flags Details
console log (322.34 KB, application/zip)
2019-11-01 02:06 UTC, qding
no flags Details
console2.log (47.45 KB, text/plain)
2019-11-21 02:06 UTC, qding
no flags Details
taskout2.log (22.19 KB, text/plain)
2019-11-21 02:07 UTC, qding
no flags Details

Description qding 2019-11-01 02:06:04 UTC
Created attachment 1631299 [details]
testout for log

Description of problem:



[17144.914469] mlx5_core 0000:04:00.0: wait_func:976:(pid 55347): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[17144.928744] mlx5_core 0000:04:00.0: del_hw_fte:497:(pid 55347): flow steering can't delete fte in index 0 of flow group id 49 
[17204.936184] mlx5_core 0000:04:00.0: wait_func:976:(pid 55347): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[17204.950456] mlx5_core 0000:04:00.0: del_hw_fte:497:(pid 55347): flow steering can't delete fte in index 2 of flow group id 49 
[17204.963557] mlx5_core 0000:04:00.0: mlx5_cmd_check:745:(pid 55347): DEALLOC_PACKET_REFORMAT_CONTEXT(0x93e) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x179e84) 
[17264.979916] mlx5_core 0000:04:00.0: wait_func:976:(pid 55347): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[17264.994189] mlx5_core 0000:04:00.0: del_hw_fte:497:(pid 55347): flow steering can't delete fte in index 3 of flow group id 49 
[17325.001587] mlx5_core 0000:04:00.0: wait_func:976:(pid 55347): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[17325.015862] mlx5_core 0000:04:00.0: del_hw_fte:497:(pid 55347): flow steering can't delete fte in index 1 of flow group id 49 
[17325.028629] mlx5_core 0000:04:00.0: mlx5_cmd_check:745:(pid 55347): DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x563e2f) 
[17325.045131] mlx5_core 0000:04:00.0: del_hw_flow_group:535:(pid 55347): flow steering can't destroy fg 49 of ft 3145728 
[17325.061441] mlx5_core 0000:04:00.0: mlx5_cmd_check:745:(pid 55347): DESTROY_FLOW_TABLE(0x931) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x66d6c3) 
[17325.077457] mlx5_core 0000:04:00.0: del_hw_flow_table:406:(pid 55347): flow steering can't destroy ft 
[17325.087939] mlx5_core 0000:04:00.0: mlx5_cmd_check:745:(pid 87647): DEALLOC_FLOW_COUNTER(0x93a) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x4bc68e) 


beaker jobs:
https://beaker.engineering.redhat.com/jobs/3869652
https://beaker.engineering.redhat.com/jobs/3867011

Version-Release number of selected component (if applicable):

[root@dell-per730-04 tc_offload]# uname -r
3.10.0-1099.el7.x86_64
[root@dell-per730-04 tc_offload]# rpm -q openvswitch2.11
openvswitch2.11-2.11.0-26.el7fdp.x86_64
[root@dell-per730-04 tc_offload]# 

How reproducible: two beaker jobs, but cannot reproduce it manually


Steps to Reproduce:
1.
2.
3.

Actual results:
system crashes

Expected results:


Additional info:

[root@dell-per730-04 tc_offload]# ethtool -i p6p1
driver: mlx5_core
version: 5.0-0
firmware-version: 16.26.1040 (MT_0000000009)
expansion-rom-version: 
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[root@dell-per730-04 tc_offload]# lspci -m -s 04:00.0
04:00.0 "Ethernet controller" "Mellanox Technologies" "MT28800 Family [ConnectX-5 Ex]" "Mellanox Technologies" "Device 0002"
[root@dell-per730-04 tc_offload]#

Comment 1 qding 2019-11-01 02:06:40 UTC
Created attachment 1631311 [details]
console log

Comment 2 Marcelo Ricardo Leitner 2019-11-01 02:42:56 UTC
Sounds like a FW issue.

Comment 3 Marcelo Ricardo Leitner 2019-11-14 14:15:25 UTC
Ping Alaa.

Comment 4 Alaa Hleihel (NVIDIA Mellanox) 2019-11-18 16:41:33 UTC
Hi qding

does this still reproduce ?
if so, were you able to find the steps for reproducing it ?

Thanks,
Alaa

Comment 5 qding 2019-11-21 01:27:16 UTC
(In reply to Alaa Hleihel (Mellanox) from comment #4)
> Hi qding
> 
> does this still reproduce ?

Still reproduced with the latest rhel7.8 distro RHEL-7.8-20191119.0 in which the kernel is  kernel-3.10.0-1111.el7. Please see the beaker job https://beaker.engineering.redhat.com/jobs/3906372.

> if so, were you able to find the steps for reproducing it ?
> 

Unfortunately I haven't found steps to reproduce it but beaker jobs.

Thanks
Qijun

Comment 6 qding 2019-11-21 02:06:53 UTC
Created attachment 1638270 [details]
console2.log

Comment 7 qding 2019-11-21 02:07:26 UTC
Created attachment 1638271 [details]
taskout2.log

Comment 8 Alaa Hleihel (NVIDIA Mellanox) 2019-11-25 14:22:26 UTC
(In reply to qding from comment #5)
> (In reply to Alaa Hleihel (Mellanox) from comment #4)
> > Hi qding
> > 
> > does this still reproduce ?
> 
> Still reproduced with the latest rhel7.8 distro RHEL-7.8-20191119.0 in which
> the kernel is  kernel-3.10.0-1111.el7. Please see the beaker job
> https://beaker.engineering.redhat.com/jobs/3906372.
> 
> > if so, were you able to find the steps for reproducing it ?
> > 
> 
> Unfortunately I haven't found steps to reproduce it but beaker jobs.
> 
> Thanks
> Qijun

I see, thanks for the update.

Any chance to add the name of the test to dmesg output? 
maybe this will help us find the test that is triggering the issue,
something like this at the start and end of each test:
# echo "START: $0" > /dev/kmsg
# echo "END: $0" > /dev/kmsg

Comment 9 qding 2020-02-04 01:34:43 UTC
(In reply to Alaa Hleihel (Mellanox) from comment #8)
> (In reply to qding from comment #5)
> > (In reply to Alaa Hleihel (Mellanox) from comment #4)
> > > Hi qding
> > > 
> > > does this still reproduce ?
> > 
> > Still reproduced with the latest rhel7.8 distro RHEL-7.8-20191119.0 in which
> > the kernel is  kernel-3.10.0-1111.el7. Please see the beaker job
> > https://beaker.engineering.redhat.com/jobs/3906372.
> > 
> > > if so, were you able to find the steps for reproducing it ?
> > > 
> > 
> > Unfortunately I haven't found steps to reproduce it but beaker jobs.
> > 
> > Thanks
> > Qijun
> 
> I see, thanks for the update.
> 
> Any chance to add the name of the test to dmesg output? 
> maybe this will help us find the test that is triggering the issue,
> something like this at the start and end of each test:
> # echo "START: $0" > /dev/kmsg
> # echo "END: $0" > /dev/kmsg

The log file is as https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2020/02/40487/4048746/7860482/console.log
and the corresponding beaker task is https://beaker.engineering.redhat.com/jobs/4048746,
and this time is found with rhel8 latest distro and the latest openvswitch2.12-2.12.0-21.el8fdp.

the log when the problem happens as below. the test is for mirror to IPv4 GRE.

[23001.674455] START ovs_test_tc_offload_mirror_VF1_VF2_to_gre4()... 
[23037.002941] restraintd[4906]: *** Current Time: Mon Feb 03 10:54:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23036.624970] device ens1f0_0 entered promiscuous mode 
[23036.645669] device ens1f0_1 entered promiscuous mode 
[23036.665516] device gre_sys entered promiscuous mode 
[-- MARK -- Mon Feb  3 15:55:00 2020] 
[23097.010345] restraintd[4906]: *** Current Time: Mon Feb 03 10:55:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23116.163175] device gre_sys left promiscuous mode 
[23116.205363] device gre_sys entered promiscuous mode 
[23157.031139] restraintd[4906]: *** Current Time: Mon Feb 03 10:56:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23186.368626] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47863): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[23186.381404] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47863): flow steering can't delete fte in index 838860 of flow group id 119 
[23217.030082] restraintd[4906]: *** Current Time: Mon Feb 03 10:57:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23247.802778] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47872): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[23247.815557] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47872): flow steering can't delete fte in index 1 of flow group id 118 
[23277.029276] restraintd[4906]: *** Current Time: Mon Feb 03 10:58:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23309.237334] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47872): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[23309.250114] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47872): flow steering can't delete fte in index 3 of flow group id 118 
[23309.261870] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47872): DEALLOC_PACKET_REFORMAT_CONTEXT(0x93e) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x179e84) 
[23337.027635] restraintd[4906]: *** Current Time: Mon Feb 03 10:59:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[-- MARK -- Mon Feb  3 16:00:00 2020] 
[23370.671726] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[23370.684508] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow steering can't delete fte in index 0 of flow group id 118 
[23397.062166] restraintd[4906]: *** Current Time: Mon Feb 03 11:00:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23432.106215] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[23432.119003] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow steering can't delete fte in index 2 of flow group id 118 
[23432.130555] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x563e2f) 
[23432.145319] mlx5_core 0000:3b:00.0: del_hw_flow_group:537:(pid 47873): flow steering can't destroy fg 118 of ft 3145728 
[23457.025603] restraintd[4906]: *** Current Time: Mon Feb 03 11:01:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23493.540964] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 
[23493.553745] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow steering can't delete fte in index 838861 of flow group id 119 
[23493.565669] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x563e2f) 
[23493.580432] mlx5_core 0000:3b:00.0: del_hw_flow_group:537:(pid 47873): flow steering can't destroy fg 119 of ft 3145728 
[23493.596152] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): DESTROY_FLOW_TABLE(0x931) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x66d6c3) 
[23493.610497] mlx5_core 0000:3b:00.0: del_hw_flow_table:408:(pid 47873): flow steering can't destroy ft 
[23493.619945] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 23948): DEALLOC_FLOW_COUNTER(0x93a) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x4bc68e) 
[23517.003766] restraintd[4906]: *** Current Time: Mon Feb 03 11:02:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23577.014488] restraintd[4906]: *** Current Time: Mon Feb 03 11:03:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23597.785217] device gre_sys left promiscuous mode 
[23597.820456] device gre_sys entered promiscuous mode 
[23637.009296] restraintd[4906]: *** Current Time: Mon Feb 03 11:04:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[-- MARK -- Mon Feb  3 16:05:00 2020] 
[23697.028572] restraintd[4906]: *** Current Time: Mon Feb 03 11:05:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23757.036548] restraintd[4906]: *** Current Time: Mon Feb 03 11:06:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23817.027052] restraintd[4906]: *** Current Time: Mon Feb 03 11:07:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23877.043372] restraintd[4906]: *** Current Time: Mon Feb 03 11:08:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[23937.038729] restraintd[4906]: *** Current Time: Mon Feb 03 11:09:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[-- MARK -- Mon Feb  3 16:10:00 2020] 
[23997.040096] restraintd[4906]: *** Current Time: Mon Feb 03 11:10:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[24057.038179] restraintd[4906]: *** Current Time: Mon Feb 03 11:11:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[24117.037448] restraintd[4906]: *** Current Time: Mon Feb 03 11:12:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[24177.035836] restraintd[4906]: *** Current Time: Mon Feb 03 11:13:49 2020  Localwatchdog at: Tue Feb 04 04:43:49 2020 
[24208.253102] device ens1f0_0 left promiscuous mode 
[24208.271561] device ens1f0_1 left promiscuous mode 
[24208.285492] device gre_sys left promiscuous mode 
[24208.319878] END ovs_test_tc_offload_mirror_VF1_VF2_to_gre4() 
[24214.096662] START ovs_test_tc_offload_mirror_VF1_VF2_to_vxlan4()...

Comment 10 Alaa Hleihel (NVIDIA Mellanox) 2020-02-04 12:37:31 UTC
(In reply to qding from comment #9)
> (In reply to Alaa Hleihel (Mellanox) from comment #8)
> > (In reply to qding from comment #5)
> > > (In reply to Alaa Hleihel (Mellanox) from comment #4)
> > > > Hi qding
> > > > 
> > > > does this still reproduce ?
> > > 
> > > Still reproduced with the latest rhel7.8 distro RHEL-7.8-20191119.0 in which
> > > the kernel is  kernel-3.10.0-1111.el7. Please see the beaker job
> > > https://beaker.engineering.redhat.com/jobs/3906372.
> > > 
> > > > if so, were you able to find the steps for reproducing it ?
> > > > 
> > > 
> > > Unfortunately I haven't found steps to reproduce it but beaker jobs.
> > > 
> > > Thanks
> > > Qijun
> > 
> > I see, thanks for the update.
> > 
> > Any chance to add the name of the test to dmesg output? 
> > maybe this will help us find the test that is triggering the issue,
> > something like this at the start and end of each test:
> > # echo "START: $0" > /dev/kmsg
> > # echo "END: $0" > /dev/kmsg
> 
> The log file is as
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2020/02/
> 40487/4048746/7860482/console.log
> and the corresponding beaker task is
> https://beaker.engineering.redhat.com/jobs/4048746,
> and this time is found with rhel8 latest distro and the latest
> openvswitch2.12-2.12.0-21.el8fdp.
> 
> the log when the problem happens as below. the test is for mirror to IPv4
> GRE.
> 
> [23001.674455] START ovs_test_tc_offload_mirror_VF1_VF2_to_gre4()... 
> [23037.002941] restraintd[4906]: *** Current Time: Mon Feb 03 10:54:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23036.624970] device ens1f0_0 entered promiscuous mode 
> [23036.645669] device ens1f0_1 entered promiscuous mode 
> [23036.665516] device gre_sys entered promiscuous mode 
> [-- MARK -- Mon Feb  3 15:55:00 2020] 
> [23097.010345] restraintd[4906]: *** Current Time: Mon Feb 03 10:55:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23116.163175] device gre_sys left promiscuous mode 
> [23116.205363] device gre_sys entered promiscuous mode 
> [23157.031139] restraintd[4906]: *** Current Time: Mon Feb 03 10:56:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23186.368626] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47863):
> DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command
> resource 
> [23186.381404] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47863): flow
> steering can't delete fte in index 838860 of flow group id 119 
> [23217.030082] restraintd[4906]: *** Current Time: Mon Feb 03 10:57:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23247.802778] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47872):
> DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command
> resource 
> [23247.815557] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47872): flow
> steering can't delete fte in index 1 of flow group id 118 
> [23277.029276] restraintd[4906]: *** Current Time: Mon Feb 03 10:58:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23309.237334] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47872):
> DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command
> resource 
> [23309.250114] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47872): flow
> steering can't delete fte in index 3 of flow group id 118 
> [23309.261870] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47872):
> DEALLOC_PACKET_REFORMAT_CONTEXT(0x93e) op_mod(0x0) failed, status bad
> resource state(0x9), syndrome (0x179e84) 
> [23337.027635] restraintd[4906]: *** Current Time: Mon Feb 03 10:59:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [-- MARK -- Mon Feb  3 16:00:00 2020] 
> [23370.671726] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873):
> DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command
> resource 
> [23370.684508] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow
> steering can't delete fte in index 0 of flow group id 118 
> [23397.062166] restraintd[4906]: *** Current Time: Mon Feb 03 11:00:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23432.106215] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873):
> DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command
> resource 
> [23432.119003] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow
> steering can't delete fte in index 2 of flow group id 118 
> [23432.130555] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873):
> DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource
> state(0x9), syndrome (0x563e2f) 
> [23432.145319] mlx5_core 0000:3b:00.0: del_hw_flow_group:537:(pid 47873):
> flow steering can't destroy fg 118 of ft 3145728 
> [23457.025603] restraintd[4906]: *** Current Time: Mon Feb 03 11:01:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23493.540964] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873):
> DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command
> resource 
> [23493.553745] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow
> steering can't delete fte in index 838861 of flow group id 119 
> [23493.565669] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873):
> DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource
> state(0x9), syndrome (0x563e2f) 
> [23493.580432] mlx5_core 0000:3b:00.0: del_hw_flow_group:537:(pid 47873):
> flow steering can't destroy fg 119 of ft 3145728 
> [23493.596152] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873):
> DESTROY_FLOW_TABLE(0x931) op_mod(0x0) failed, status bad parameter(0x3),
> syndrome (0x66d6c3) 
> [23493.610497] mlx5_core 0000:3b:00.0: del_hw_flow_table:408:(pid 47873):
> flow steering can't destroy ft 
> [23493.619945] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 23948):
> DEALLOC_FLOW_COUNTER(0x93a) op_mod(0x0) failed, status bad resource
> state(0x9), syndrome (0x4bc68e) 
> [23517.003766] restraintd[4906]: *** Current Time: Mon Feb 03 11:02:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23577.014488] restraintd[4906]: *** Current Time: Mon Feb 03 11:03:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23597.785217] device gre_sys left promiscuous mode 
> [23597.820456] device gre_sys entered promiscuous mode 
> [23637.009296] restraintd[4906]: *** Current Time: Mon Feb 03 11:04:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [-- MARK -- Mon Feb  3 16:05:00 2020] 
> [23697.028572] restraintd[4906]: *** Current Time: Mon Feb 03 11:05:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23757.036548] restraintd[4906]: *** Current Time: Mon Feb 03 11:06:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23817.027052] restraintd[4906]: *** Current Time: Mon Feb 03 11:07:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23877.043372] restraintd[4906]: *** Current Time: Mon Feb 03 11:08:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [23937.038729] restraintd[4906]: *** Current Time: Mon Feb 03 11:09:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [-- MARK -- Mon Feb  3 16:10:00 2020] 
> [23997.040096] restraintd[4906]: *** Current Time: Mon Feb 03 11:10:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [24057.038179] restraintd[4906]: *** Current Time: Mon Feb 03 11:11:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [24117.037448] restraintd[4906]: *** Current Time: Mon Feb 03 11:12:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [24177.035836] restraintd[4906]: *** Current Time: Mon Feb 03 11:13:49 2020 
> Localwatchdog at: Tue Feb 04 04:43:49 2020 
> [24208.253102] device ens1f0_0 left promiscuous mode 
> [24208.271561] device ens1f0_1 left promiscuous mode 
> [24208.285492] device gre_sys left promiscuous mode 
> [24208.319878] END ovs_test_tc_offload_mirror_VF1_VF2_to_gre4() 
> [24214.096662] START ovs_test_tc_offload_mirror_VF1_VF2_to_vxlan4()...


Great, thanks for the update!

Comment 12 Alaa Hleihel (NVIDIA Mellanox) 2021-02-18 09:17:00 UTC
Hi,

> [17144.914469] mlx5_core 0000:04:00.0: wait_func:976:(pid 55347): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource 

Firmware commands are hitting timeout, so likely it's a FW bug.
We fixed a similar issue just recently in BZ #1916563.

Can you retest using the latest FW version (xx.29.2002)?

Thanks,
Alaa

Comment 13 qding 2021-02-20 10:05:29 UTC
No such issue with the lastest firmware-version: 16.29.2002 (MT_0000000012). So to close the bug.
Beaker job:
CX-5: https://beaker.engineering.redhat.com/jobs/5111774
CX-6 Dx: https://beaker.engineering.redhat.com/jobs/5109347

Comment 14 Alaa Hleihel (NVIDIA Mellanox) 2021-02-22 10:05:23 UTC
(In reply to qding from comment #13)
> No such issue with the lastest firmware-version: 16.29.2002 (MT_0000000012).
> So to close the bug.
> Beaker job:
> CX-5: https://beaker.engineering.redhat.com/jobs/5111774
> CX-6 Dx: https://beaker.engineering.redhat.com/jobs/5109347

Great, thanks a lot !


Note You need to log in before you can comment on or make changes to this bug.