Bug 1767678
| Summary: | [mlx5_core] OVS offload: mlx5_core 0000:04:00.0: cmd_work_handler:855:(pid 87647): failed to allocate command entry | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | qding | ||||||||||
| Component: | openvswitch | Assignee: | Alaa Hleihel (NVIDIA Mellanox) <ahleihel> | ||||||||||
| openvswitch sub component: | ovs-hw-offload | QA Contact: | Amit Supugade <asupugad> | ||||||||||
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |||||||||||
| Severity: | high | ||||||||||||
| Priority: | high | CC: | ahleihel, atragler, ctrautma, jfargen, knweiss, mleitner, qding | ||||||||||
| Version: | FDP 19.G | ||||||||||||
| Target Milestone: | --- | ||||||||||||
| Target Release: | --- | ||||||||||||
| Hardware: | x86_64 | ||||||||||||
| OS: | Linux | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | Environment: | ||||||||||||
| Last Closed: | 2021-02-20 10:05:29 UTC | Type: | Bug | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Embargoed: | |||||||||||||
| Attachments: |
|
||||||||||||
|
Description
qding
2019-11-01 02:06:04 UTC
Created attachment 1631311 [details]
console log
Sounds like a FW issue. Ping Alaa. Hi qding does this still reproduce ? if so, were you able to find the steps for reproducing it ? Thanks, Alaa (In reply to Alaa Hleihel (Mellanox) from comment #4) > Hi qding > > does this still reproduce ? Still reproduced with the latest rhel7.8 distro RHEL-7.8-20191119.0 in which the kernel is kernel-3.10.0-1111.el7. Please see the beaker job https://beaker.engineering.redhat.com/jobs/3906372. > if so, were you able to find the steps for reproducing it ? > Unfortunately I haven't found steps to reproduce it but beaker jobs. Thanks Qijun Created attachment 1638270 [details]
console2.log
Created attachment 1638271 [details]
taskout2.log
(In reply to qding from comment #5) > (In reply to Alaa Hleihel (Mellanox) from comment #4) > > Hi qding > > > > does this still reproduce ? > > Still reproduced with the latest rhel7.8 distro RHEL-7.8-20191119.0 in which > the kernel is kernel-3.10.0-1111.el7. Please see the beaker job > https://beaker.engineering.redhat.com/jobs/3906372. > > > if so, were you able to find the steps for reproducing it ? > > > > Unfortunately I haven't found steps to reproduce it but beaker jobs. > > Thanks > Qijun I see, thanks for the update. Any chance to add the name of the test to dmesg output? maybe this will help us find the test that is triggering the issue, something like this at the start and end of each test: # echo "START: $0" > /dev/kmsg # echo "END: $0" > /dev/kmsg (In reply to Alaa Hleihel (Mellanox) from comment #8) > (In reply to qding from comment #5) > > (In reply to Alaa Hleihel (Mellanox) from comment #4) > > > Hi qding > > > > > > does this still reproduce ? > > > > Still reproduced with the latest rhel7.8 distro RHEL-7.8-20191119.0 in which > > the kernel is kernel-3.10.0-1111.el7. Please see the beaker job > > https://beaker.engineering.redhat.com/jobs/3906372. > > > > > if so, were you able to find the steps for reproducing it ? > > > > > > > Unfortunately I haven't found steps to reproduce it but beaker jobs. > > > > Thanks > > Qijun > > I see, thanks for the update. > > Any chance to add the name of the test to dmesg output? > maybe this will help us find the test that is triggering the issue, > something like this at the start and end of each test: > # echo "START: $0" > /dev/kmsg > # echo "END: $0" > /dev/kmsg The log file is as https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2020/02/40487/4048746/7860482/console.log and the corresponding beaker task is https://beaker.engineering.redhat.com/jobs/4048746, and this time is found with rhel8 latest distro and the latest openvswitch2.12-2.12.0-21.el8fdp. the log when the problem happens as below. the test is for mirror to IPv4 GRE. [23001.674455] START ovs_test_tc_offload_mirror_VF1_VF2_to_gre4()... [23037.002941] restraintd[4906]: *** Current Time: Mon Feb 03 10:54:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23036.624970] device ens1f0_0 entered promiscuous mode [23036.645669] device ens1f0_1 entered promiscuous mode [23036.665516] device gre_sys entered promiscuous mode [-- MARK -- Mon Feb 3 15:55:00 2020] [23097.010345] restraintd[4906]: *** Current Time: Mon Feb 03 10:55:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23116.163175] device gre_sys left promiscuous mode [23116.205363] device gre_sys entered promiscuous mode [23157.031139] restraintd[4906]: *** Current Time: Mon Feb 03 10:56:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23186.368626] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47863): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource [23186.381404] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47863): flow steering can't delete fte in index 838860 of flow group id 119 [23217.030082] restraintd[4906]: *** Current Time: Mon Feb 03 10:57:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23247.802778] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47872): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource [23247.815557] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47872): flow steering can't delete fte in index 1 of flow group id 118 [23277.029276] restraintd[4906]: *** Current Time: Mon Feb 03 10:58:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23309.237334] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47872): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource [23309.250114] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47872): flow steering can't delete fte in index 3 of flow group id 118 [23309.261870] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47872): DEALLOC_PACKET_REFORMAT_CONTEXT(0x93e) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x179e84) [23337.027635] restraintd[4906]: *** Current Time: Mon Feb 03 10:59:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [-- MARK -- Mon Feb 3 16:00:00 2020] [23370.671726] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource [23370.684508] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow steering can't delete fte in index 0 of flow group id 118 [23397.062166] restraintd[4906]: *** Current Time: Mon Feb 03 11:00:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23432.106215] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource [23432.119003] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow steering can't delete fte in index 2 of flow group id 118 [23432.130555] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x563e2f) [23432.145319] mlx5_core 0000:3b:00.0: del_hw_flow_group:537:(pid 47873): flow steering can't destroy fg 118 of ft 3145728 [23457.025603] restraintd[4906]: *** Current Time: Mon Feb 03 11:01:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23493.540964] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource [23493.553745] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow steering can't delete fte in index 838861 of flow group id 119 [23493.565669] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x563e2f) [23493.580432] mlx5_core 0000:3b:00.0: del_hw_flow_group:537:(pid 47873): flow steering can't destroy fg 119 of ft 3145728 [23493.596152] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): DESTROY_FLOW_TABLE(0x931) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x66d6c3) [23493.610497] mlx5_core 0000:3b:00.0: del_hw_flow_table:408:(pid 47873): flow steering can't destroy ft [23493.619945] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 23948): DEALLOC_FLOW_COUNTER(0x93a) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x4bc68e) [23517.003766] restraintd[4906]: *** Current Time: Mon Feb 03 11:02:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23577.014488] restraintd[4906]: *** Current Time: Mon Feb 03 11:03:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23597.785217] device gre_sys left promiscuous mode [23597.820456] device gre_sys entered promiscuous mode [23637.009296] restraintd[4906]: *** Current Time: Mon Feb 03 11:04:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [-- MARK -- Mon Feb 3 16:05:00 2020] [23697.028572] restraintd[4906]: *** Current Time: Mon Feb 03 11:05:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23757.036548] restraintd[4906]: *** Current Time: Mon Feb 03 11:06:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23817.027052] restraintd[4906]: *** Current Time: Mon Feb 03 11:07:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23877.043372] restraintd[4906]: *** Current Time: Mon Feb 03 11:08:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [23937.038729] restraintd[4906]: *** Current Time: Mon Feb 03 11:09:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [-- MARK -- Mon Feb 3 16:10:00 2020] [23997.040096] restraintd[4906]: *** Current Time: Mon Feb 03 11:10:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [24057.038179] restraintd[4906]: *** Current Time: Mon Feb 03 11:11:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [24117.037448] restraintd[4906]: *** Current Time: Mon Feb 03 11:12:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [24177.035836] restraintd[4906]: *** Current Time: Mon Feb 03 11:13:49 2020 Localwatchdog at: Tue Feb 04 04:43:49 2020 [24208.253102] device ens1f0_0 left promiscuous mode [24208.271561] device ens1f0_1 left promiscuous mode [24208.285492] device gre_sys left promiscuous mode [24208.319878] END ovs_test_tc_offload_mirror_VF1_VF2_to_gre4() [24214.096662] START ovs_test_tc_offload_mirror_VF1_VF2_to_vxlan4()... (In reply to qding from comment #9) > (In reply to Alaa Hleihel (Mellanox) from comment #8) > > (In reply to qding from comment #5) > > > (In reply to Alaa Hleihel (Mellanox) from comment #4) > > > > Hi qding > > > > > > > > does this still reproduce ? > > > > > > Still reproduced with the latest rhel7.8 distro RHEL-7.8-20191119.0 in which > > > the kernel is kernel-3.10.0-1111.el7. Please see the beaker job > > > https://beaker.engineering.redhat.com/jobs/3906372. > > > > > > > if so, were you able to find the steps for reproducing it ? > > > > > > > > > > Unfortunately I haven't found steps to reproduce it but beaker jobs. > > > > > > Thanks > > > Qijun > > > > I see, thanks for the update. > > > > Any chance to add the name of the test to dmesg output? > > maybe this will help us find the test that is triggering the issue, > > something like this at the start and end of each test: > > # echo "START: $0" > /dev/kmsg > > # echo "END: $0" > /dev/kmsg > > The log file is as > https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2020/02/ > 40487/4048746/7860482/console.log > and the corresponding beaker task is > https://beaker.engineering.redhat.com/jobs/4048746, > and this time is found with rhel8 latest distro and the latest > openvswitch2.12-2.12.0-21.el8fdp. > > the log when the problem happens as below. the test is for mirror to IPv4 > GRE. > > [23001.674455] START ovs_test_tc_offload_mirror_VF1_VF2_to_gre4()... > [23037.002941] restraintd[4906]: *** Current Time: Mon Feb 03 10:54:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23036.624970] device ens1f0_0 entered promiscuous mode > [23036.645669] device ens1f0_1 entered promiscuous mode > [23036.665516] device gre_sys entered promiscuous mode > [-- MARK -- Mon Feb 3 15:55:00 2020] > [23097.010345] restraintd[4906]: *** Current Time: Mon Feb 03 10:55:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23116.163175] device gre_sys left promiscuous mode > [23116.205363] device gre_sys entered promiscuous mode > [23157.031139] restraintd[4906]: *** Current Time: Mon Feb 03 10:56:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23186.368626] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47863): > DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command > resource > [23186.381404] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47863): flow > steering can't delete fte in index 838860 of flow group id 119 > [23217.030082] restraintd[4906]: *** Current Time: Mon Feb 03 10:57:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23247.802778] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47872): > DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command > resource > [23247.815557] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47872): flow > steering can't delete fte in index 1 of flow group id 118 > [23277.029276] restraintd[4906]: *** Current Time: Mon Feb 03 10:58:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23309.237334] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47872): > DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command > resource > [23309.250114] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47872): flow > steering can't delete fte in index 3 of flow group id 118 > [23309.261870] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47872): > DEALLOC_PACKET_REFORMAT_CONTEXT(0x93e) op_mod(0x0) failed, status bad > resource state(0x9), syndrome (0x179e84) > [23337.027635] restraintd[4906]: *** Current Time: Mon Feb 03 10:59:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [-- MARK -- Mon Feb 3 16:00:00 2020] > [23370.671726] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): > DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command > resource > [23370.684508] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow > steering can't delete fte in index 0 of flow group id 118 > [23397.062166] restraintd[4906]: *** Current Time: Mon Feb 03 11:00:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23432.106215] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): > DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command > resource > [23432.119003] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow > steering can't delete fte in index 2 of flow group id 118 > [23432.130555] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): > DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource > state(0x9), syndrome (0x563e2f) > [23432.145319] mlx5_core 0000:3b:00.0: del_hw_flow_group:537:(pid 47873): > flow steering can't destroy fg 118 of ft 3145728 > [23457.025603] restraintd[4906]: *** Current Time: Mon Feb 03 11:01:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23493.540964] mlx5_core 0000:3b:00.0: wait_func:985:(pid 47873): > DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command > resource > [23493.553745] mlx5_core 0000:3b:00.0: del_hw_fte:499:(pid 47873): flow > steering can't delete fte in index 838861 of flow group id 119 > [23493.565669] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): > DESTROY_FLOW_GROUP(0x934) op_mod(0x0) failed, status bad resource > state(0x9), syndrome (0x563e2f) > [23493.580432] mlx5_core 0000:3b:00.0: del_hw_flow_group:537:(pid 47873): > flow steering can't destroy fg 119 of ft 3145728 > [23493.596152] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 47873): > DESTROY_FLOW_TABLE(0x931) op_mod(0x0) failed, status bad parameter(0x3), > syndrome (0x66d6c3) > [23493.610497] mlx5_core 0000:3b:00.0: del_hw_flow_table:408:(pid 47873): > flow steering can't destroy ft > [23493.619945] mlx5_core 0000:3b:00.0: mlx5_cmd_check:752:(pid 23948): > DEALLOC_FLOW_COUNTER(0x93a) op_mod(0x0) failed, status bad resource > state(0x9), syndrome (0x4bc68e) > [23517.003766] restraintd[4906]: *** Current Time: Mon Feb 03 11:02:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23577.014488] restraintd[4906]: *** Current Time: Mon Feb 03 11:03:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23597.785217] device gre_sys left promiscuous mode > [23597.820456] device gre_sys entered promiscuous mode > [23637.009296] restraintd[4906]: *** Current Time: Mon Feb 03 11:04:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [-- MARK -- Mon Feb 3 16:05:00 2020] > [23697.028572] restraintd[4906]: *** Current Time: Mon Feb 03 11:05:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23757.036548] restraintd[4906]: *** Current Time: Mon Feb 03 11:06:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23817.027052] restraintd[4906]: *** Current Time: Mon Feb 03 11:07:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23877.043372] restraintd[4906]: *** Current Time: Mon Feb 03 11:08:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [23937.038729] restraintd[4906]: *** Current Time: Mon Feb 03 11:09:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [-- MARK -- Mon Feb 3 16:10:00 2020] > [23997.040096] restraintd[4906]: *** Current Time: Mon Feb 03 11:10:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [24057.038179] restraintd[4906]: *** Current Time: Mon Feb 03 11:11:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [24117.037448] restraintd[4906]: *** Current Time: Mon Feb 03 11:12:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [24177.035836] restraintd[4906]: *** Current Time: Mon Feb 03 11:13:49 2020 > Localwatchdog at: Tue Feb 04 04:43:49 2020 > [24208.253102] device ens1f0_0 left promiscuous mode > [24208.271561] device ens1f0_1 left promiscuous mode > [24208.285492] device gre_sys left promiscuous mode > [24208.319878] END ovs_test_tc_offload_mirror_VF1_VF2_to_gre4() > [24214.096662] START ovs_test_tc_offload_mirror_VF1_VF2_to_vxlan4()... Great, thanks for the update! Hi, > [17144.914469] mlx5_core 0000:04:00.0: wait_func:976:(pid 55347): DELETE_FLOW_TABLE_ENTRY(0x938) timeout. Will cause a leak of a command resource Firmware commands are hitting timeout, so likely it's a FW bug. We fixed a similar issue just recently in BZ #1916563. Can you retest using the latest FW version (xx.29.2002)? Thanks, Alaa No such issue with the lastest firmware-version: 16.29.2002 (MT_0000000012). So to close the bug. Beaker job: CX-5: https://beaker.engineering.redhat.com/jobs/5111774 CX-6 Dx: https://beaker.engineering.redhat.com/jobs/5109347 (In reply to qding from comment #13) > No such issue with the lastest firmware-version: 16.29.2002 (MT_0000000012). > So to close the bug. > Beaker job: > CX-5: https://beaker.engineering.redhat.com/jobs/5111774 > CX-6 Dx: https://beaker.engineering.redhat.com/jobs/5109347 Great, thanks a lot ! |