Bug 2018360 - [OVS offload] task hung with RHEL9
Summary: [OVS offload] task hung with RHEL9
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.15
Version: FDP 21.I
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Amir Tzin (Mellanox)
QA Contact: qding
URL:
Whiteboard:
Depends On:
Blocks: 1896414
Reported: 2021-10-29 02:25 UTC by qding
Modified: 2023-07-18 02:08 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-18 02:08:44 UTC
Target Upstream Version:
Embargoed:


Attachments
console.log (641.60 KB, text/plain)
2021-10-29 02:33 UTC, qding


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | FD-1631 | 0 | None | None | None | 2021-10-29 02:27:38 UTC

Description qding 2021-10-29 02:25:03 UTC
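Description of problem:

Two tasks hang in D state for more than 122 seconds while the e-switch on mlx5_core 0000:3b:00.0 is being switched from legacy to offloads mode. kworker/u96:4 (workqueue netns, cleanup_net) blocks on a mutex in devlink_pernet_pre_exit, and devlink blocks in register_netdevice_notifier, called under mlx5_devlink_eswitch_mode_set. Console output: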

[ 3618.220187] mlx5_core 0000:3b:00.2: enabling device (0000 -> 0002) 
[ 3618.226534] mlx5_core 0000:3b:00.2: firmware version: 16.31.1014 
[ 3618.419782] mlx5_core 0000:3b:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3618.439904] mlx5_core 0000:3b:00.2: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3618.560072] mlx5_core 0000:3b:00.2: Supported tc offload range - chains: 1, prios: 16 
[ 3618.567921] mlx5_core 0000:3b:00.2: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3618.588194] mlx5_core 0000:3b:00.2 enp59s0f0v0: renamed from eth2 
[ 3618.712584] mlx5_core 0000:3b:00.2 enp59s0f0v0: Link up 
[ 3618.832076] device eth0 left promiscuous mode 
[ 3618.836677] device enp59s0f0np0 left promiscuous mode 
[ 3618.841801] device ovsbr0 left promiscuous mode 
[ 3618.848475] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v0: link becomes ready 
[ 3618.877200] device ovs-system left promiscuous mode 
[ 3620.491395] pci 0000:3b:00.2: Removing from iommu group 150 
[ 3620.497130] pci 0000:3b:00.3: Removing from iommu group 151 
[ 3621.545238] mlx5_core 0000:3b:00.0: E-Switch: Disable: mode(OFFLOADS), nvfs(2), active vports(3) 
[ 3622.165863] mlx5_core 0000:3b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3622.370685] mlx5_core 0000:3b:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295 
[ 3622.837668] mlx5_core 0000:3b:00.0 enp59s0f0np0: Link up 
[ 3638.612107] mlx5_core 0000:3b:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3) 
[ 3638.727549] pci 0000:3b:00.2: [15b3:1018] type 00 class 0x020000 
[ 3638.733641] pci 0000:3b:00.2: enabling Extended Tags 
[ 3638.739776] pci 0000:3b:00.2: Adding to iommu group 150 
[ 3638.746033] mlx5_core 0000:3b:00.2: enabling device (0000 -> 0002) 
[ 3638.752362] mlx5_core 0000:3b:00.2: firmware version: 16.31.1014 
[ 3638.946724] mlx5_core 0000:3b:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3638.966916] mlx5_core 0000:3b:00.2: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3639.129545] mlx5_core 0000:3b:00.2: Supported tc offload range - chains: 1, prios: 16 
[ 3639.137389] mlx5_core 0000:3b:00.2: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3639.157274] mlx5_core 0000:3b:00.2 enp59s0f0v0: renamed from eth0 
[ 3639.190393] pci 0000:3b:00.3: [15b3:1018] type 00 class 0x020000 
[ 3639.196500] pci 0000:3b:00.3: enabling Extended Tags 
[ 3639.202654] pci 0000:3b:00.3: Adding to iommu group 151 
[ 3639.208455] mlx5_core 0000:3b:00.3: enabling device (0000 -> 0002) 
[ 3639.214787] mlx5_core 0000:3b:00.3: firmware version: 16.31.1014 
[ 3639.293321] mlx5_core 0000:3b:00.2 enp59s0f0v0: Link up 
[ 3639.300644] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v0: link becomes ready 
[ 3639.420981] mlx5_core 0000:3b:00.3: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3639.441578] mlx5_core 0000:3b:00.3: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3639.605954] mlx5_core 0000:3b:00.3: Supported tc offload range - chains: 1, prios: 16 
[ 3639.613799] mlx5_core 0000:3b:00.3: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3639.631930] mlx5_core 0000:3b:00.3 enp59s0f0v1: renamed from eth0 
[ 3639.764667] mlx5_core 0000:3b:00.3 enp59s0f0v1: Link up 
[ 3640.359259] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v1: link becomes ready 
[ 3641.786646] mlx5_core 0000:3b:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3) 
[ 3643.328980] mlx5_core 0000:3b:00.0: E-Switch: Supported tc chains and prios offload 
[ 3643.336660] mlx5_core 0000:3b:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295 
[ 3643.751797] mlx5_core 0000:3b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3652.014707] restraintd[3965]: *** Current Time: Thu Oct 28 10:09:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[-- MARK -- Thu Oct 28 14:10:00 2021] 
[-- MARK -- Thu Oct 28 14:10:01 2021] 
[ 3712.014248] restraintd[3965]: *** Current Time: Thu Oct 28 10:10:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3772.014427] restraintd[3965]: *** Current Time: Thu Oct 28 10:11:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3811.928855] INFO: task kworker/u96:4:38872 blocked for more than 122 seconds. 
[ 3811.936000]       Tainted: G          I      --------- ---  5.14.0-1.6.1.el9.x86_64 #1 
[ 3811.943918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 3811.951746] task:kworker/u96:4   state:D stack:    0 pid:38872 ppid:     2 flags:0x00004000 
[ 3811.960098] Workqueue: netns cleanup_net 
[ 3811.964030] Call Trace: 
[ 3811.966485]  __schedule+0x206/0x550 
[ 3811.969986]  schedule+0x3c/0xa0 
[ 3811.973139]  schedule_preempt_disabled+0xa/0x10 
[ 3811.977681]  __mutex_lock.constprop.0+0x295/0x450 
[ 3811.982394]  ? idr_for_each+0x95/0xd0 
[ 3811.986069]  devlink_pernet_pre_exit+0x2a/0xc0 
[ 3811.990525]  cleanup_net+0x1d2/0x370 
[ 3811.994111]  process_one_work+0x1e3/0x380 
[ 3811.998131]  worker_thread+0x53/0x3d0 
[ 3812.001796]  ? process_one_work+0x380/0x380 
[ 3812.005999]  kthread+0x10c/0x130 
[ 3812.009233]  ? set_kthread_struct+0x40/0x40 
[ 3812.013417]  ret_from_fork+0x1f/0x30 
[ 3812.017014] INFO: task devlink:90062 blocked for more than 122 seconds. 
[ 3812.023626]       Tainted: G          I      --------- ---  5.14.0-1.6.1.el9.x86_64 #1 
[ 3812.031536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 3812.039363] task:devlink         state:D stack:    0 pid:90062 ppid: 16460 flags:0x00004000 
[ 3812.047707] Call Trace: 
[ 3812.050160]  __schedule+0x206/0x550 
[ 3812.053654]  schedule+0x3c/0xa0 
[ 3812.056809]  rwsem_down_write_slowpath+0x224/0x470 
[ 3812.061610]  register_netdevice_notifier+0x1c/0x110 
[ 3812.066505]  mlx5e_rep_bridge_init+0x111/0x130 [mlx5_core] 
[ 3812.072052]  mlx5e_uplink_rep_enable+0xd4/0x140 [mlx5_core] 
[ 3812.077668]  mlx5e_attach_netdev+0x9e/0x140 [mlx5_core] 
[ 3812.082927]  ? mlx5e_init_ul_rep+0x3e/0x50 [mlx5_core] 
[ 3812.088100]  mlx5e_netdev_attach_profile+0x93/0xb0 [mlx5_core] 
[ 3812.093967]  mlx5e_netdev_change_profile+0xa0/0xc0 [mlx5_core] 
[ 3812.099835]  mlx5e_vport_rep_load+0xa0/0xf0 [mlx5_core] 
[ 3812.105095]  mlx5_esw_offloads_rep_load+0x86/0xe0 [mlx5_core] 
[ 3812.110884]  esw_offloads_enable+0x266/0x370 [mlx5_core] 
[ 3812.116229]  mlx5_eswitch_enable_locked.part.0+0x100/0x310 [mlx5_core] 
[ 3812.122792]  esw_offloads_start+0x44/0x1f0 [mlx5_core] 
[ 3812.127972]  ? __nla_validate_parse+0x136/0x180 
[ 3812.132504]  mlx5_devlink_eswitch_mode_set+0x102/0x180 [mlx5_core] 
[ 3812.138718]  devlink_nl_cmd_eswitch_set_doit+0xc1/0x150 
[ 3812.143952]  genl_family_rcv_msg_doit+0xe7/0x150 
[ 3812.148574]  genl_rcv_msg+0xdc/0x1e0 
[ 3812.152160]  ? __devlink_port_phys_port_name_get+0x1e0/0x1e0 
[ 3812.157817]  ? genl_get_cmd+0xd0/0xd0 
[ 3812.161483]  netlink_rcv_skb+0x4e/0xf0 
[ 3812.165236]  genl_rcv+0x24/0x40 
[ 3812.168381]  netlink_unicast+0x1f6/0x2c0 
[ 3812.172307]  netlink_sendmsg+0x23b/0x480 
[ 3812.176231]  sock_sendmsg+0x5b/0x60 
[ 3812.179726]  __sys_sendto+0xf0/0x160 
[ 3812.183305]  ? handle_mm_fault+0xba/0x280 
[ 3812.187324]  ? do_user_addr_fault+0x1c7/0x660 
[ 3812.191683]  __x64_sys_sendto+0x20/0x30 
[ 3812.195524]  do_syscall_64+0x38/0x90 
[ 3812.199101]  entry_SYSCALL_64_after_hwframe+0x44/0xae 
[ 3812.204153] RIP: 0033:0x7f718733059a 
[ 3812.207734] RSP: 002b:00007ffdef8570b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c 
[ 3812.215297] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f718733059a 
[ 3812.222431] RDX: 0000000000000038 RSI: 000055eedd7ff440 RDI: 0000000000000003 
[ 3812.229563] RBP: 0000000000000000 R08: 00007f7187435200 R09: 000000000000000c 
[ 3812.236694] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 
[ 3812.243828] R13: 000055eedd7ff2a0 R14: 000055eedc986d5c R15: 000055eedd7ff440 
[ 3832.014278] restraintd[3965]: *** Current Time: Thu Oct 28 10:12:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3892.014629] restraintd[3965]: *** Current Time: Thu Oct 28 10:13:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
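
For triage, the hung-task reports in a console log like the one above can be pulled out mechanically. The following is a minimal sketch, not part of the original report: the function names `hung_tasks` and `blocking_frame` are hypothetical, and the regular expressions assume the line layout seen in this log (timestamp in brackets, `INFO: task <comm>:<pid> blocked for more than N seconds.`, then `Call Trace:` followed by `symbol+0xoff/0xlen` frames). Frames prefixed with `? ` are skipped because the kernel marks them as unreliable.

```python
import re

# Hung-task detector header, e.g.
# "[ 3811.928855] INFO: task kworker/u96:4:38872 blocked for more than 122 seconds."
HUNG_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)
# Stack frame, e.g. "[ 3811.982394]  ? idr_for_each+0x95/0xd0"
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def hung_tasks(lines):
    """Return one dict per hung-task report with its reliable stack frames."""
    tasks = []
    current = None    # report currently being filled in, if any
    in_trace = False  # True between "Call Trace:" and the first non-frame line
    for line in lines:
        header = HUNG_RE.match(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "secs": int(header.group("secs")),
                "frames": [],
            }
            tasks.append(current)
            in_trace = False
        elif current is not None and "Call Trace:" in line:
            in_trace = True
        elif in_trace:
            frame = FRAME_RE.match(line)
            if frame is None:
                # Register dumps or unrelated output end the trace.
                in_trace = False
                current = None
            elif not frame.group("guess"):
                current["frames"].append(frame.group("sym"))
    return tasks


def blocking_frame(task):
    """First frame that is not generic scheduler code (a heuristic only)."""
    generic = ("__schedule", "schedule", "schedule_preempt_disabled")
    for sym in task["frames"]:
        if sym not in generic:
            return sym
    return None
```

Passing an open file handle for the attached console.log to `hung_tasks` and calling `blocking_frame` on each result would point at the lock primitive each task is sleeping in: a mutex slow path for the kworker and `rwsem_down_write_slowpath` for devlink in the traces above. The heuristic only names where a task sleeps; it does not identify which task holds the lock.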


beaker job: https://beaker.engineering.redhat.com/jobs/5950116

distro: RHEL-9.0.0-20211020.4
kernel-5.14.0-1.6.1.el9.x86_64
openvswitch2.15-2.15.0-20.el9fdp.x86_64


Additional info:

Comment 1 qding 2021-10-29 02:33:31 UTC
Created attachment 1838183
console.log

Comment 4 Mohammad Kabat 2023-05-30 08:30:40 UTC
This should be fixed in the RHEL 9.2 GA kernel.
Please retest with the new kernel, 5.14.0-284.11.1.el9.
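
Before retesting, it can help to confirm that the host is actually running a kernel at or beyond the build named in comment 4. The sketch below is an illustration only: `release_key` and `has_fix` are hypothetical helper names, the default `fixed` value is the build from comment 4 written in the usual version-release form, and the comparison is a simplification that looks only at the numeric fields before the `.el` dist tag. It is not RPM's full version-comparison algorithm.

```python
import re


def release_key(release):
    """Turn a kernel release string into a tuple of ints for comparison.

    Simplified: keeps only the numeric fields before the ".el" dist tag,
    so dots and hyphens between fields are treated alike.
    """
    base = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", base))


def has_fix(running, fixed="5.14.0-284.11.1.el9"):
    """True if `running` is at or beyond the `fixed` build (assumed value)."""
    return release_key(running) >= release_key(fixed)
```

On a test host, `has_fix(platform.release())` (after `import platform`) would apply the check to the running kernel; the kernel from the original report, 5.14.0-1.6.1.el9.x86_64, compares as older than the suggested build.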

