Bug 2018360

Summary: [OVS offload] task hung with RHEL9
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: qding
Component: openvswitch2.15
Assignee: Amir Tzin (Mellanox) <atzin>
Status: CLOSED CURRENTRELEASE
QA Contact: qding
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: FDP 21.I
CC: ctrautma, jhsiao, lariel, mhou, mkabat, mleitner, ralongi, trinh.dao
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-18 02:08:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1896414    
Attachments:
Description Flags
console.log none

Description qding 2021-10-29 02:25:03 UTC
Description of problem:

[ 3618.220187] mlx5_core 0000:3b:00.2: enabling device (0000 -> 0002) 
[ 3618.226534] mlx5_core 0000:3b:00.2: firmware version: 16.31.1014 
[ 3618.419782] mlx5_core 0000:3b:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3618.439904] mlx5_core 0000:3b:00.2: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3618.560072] mlx5_core 0000:3b:00.2: Supported tc offload range - chains: 1, prios: 16 
[ 3618.567921] mlx5_core 0000:3b:00.2: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3618.588194] mlx5_core 0000:3b:00.2 enp59s0f0v0: renamed from eth2 
[ 3618.712584] mlx5_core 0000:3b:00.2 enp59s0f0v0: Link up 
[ 3618.832076] device eth0 left promiscuous mode 
[ 3618.836677] device enp59s0f0np0 left promiscuous mode 
[ 3618.841801] device ovsbr0 left promiscuous mode 
[ 3618.848475] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v0: link becomes ready 
[ 3618.877200] device ovs-system left promiscuous mode 
[ 3620.491395] pci 0000:3b:00.2: Removing from iommu group 150 
[ 3620.497130] pci 0000:3b:00.3: Removing from iommu group 151 
[ 3621.545238] mlx5_core 0000:3b:00.0: E-Switch: Disable: mode(OFFLOADS), nvfs(2), active vports(3) 
[ 3622.165863] mlx5_core 0000:3b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3622.370685] mlx5_core 0000:3b:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295 
[ 3622.837668] mlx5_core 0000:3b:00.0 enp59s0f0np0: Link up 
[ 3638.612107] mlx5_core 0000:3b:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3) 
[ 3638.727549] pci 0000:3b:00.2: [15b3:1018] type 00 class 0x020000 
[ 3638.733641] pci 0000:3b:00.2: enabling Extended Tags 
[ 3638.739776] pci 0000:3b:00.2: Adding to iommu group 150 
[ 3638.746033] mlx5_core 0000:3b:00.2: enabling device (0000 -> 0002) 
[ 3638.752362] mlx5_core 0000:3b:00.2: firmware version: 16.31.1014 
[ 3638.946724] mlx5_core 0000:3b:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3638.966916] mlx5_core 0000:3b:00.2: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3639.129545] mlx5_core 0000:3b:00.2: Supported tc offload range - chains: 1, prios: 16 
[ 3639.137389] mlx5_core 0000:3b:00.2: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3639.157274] mlx5_core 0000:3b:00.2 enp59s0f0v0: renamed from eth0 
[ 3639.190393] pci 0000:3b:00.3: [15b3:1018] type 00 class 0x020000 
[ 3639.196500] pci 0000:3b:00.3: enabling Extended Tags 
[ 3639.202654] pci 0000:3b:00.3: Adding to iommu group 151 
[ 3639.208455] mlx5_core 0000:3b:00.3: enabling device (0000 -> 0002) 
[ 3639.214787] mlx5_core 0000:3b:00.3: firmware version: 16.31.1014 
[ 3639.293321] mlx5_core 0000:3b:00.2 enp59s0f0v0: Link up 
[ 3639.300644] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v0: link becomes ready 
[ 3639.420981] mlx5_core 0000:3b:00.3: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3639.441578] mlx5_core 0000:3b:00.3: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3639.605954] mlx5_core 0000:3b:00.3: Supported tc offload range - chains: 1, prios: 16 
[ 3639.613799] mlx5_core 0000:3b:00.3: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3639.631930] mlx5_core 0000:3b:00.3 enp59s0f0v1: renamed from eth0 
[ 3639.764667] mlx5_core 0000:3b:00.3 enp59s0f0v1: Link up 
[ 3640.359259] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v1: link becomes ready 
[ 3641.786646] mlx5_core 0000:3b:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3) 
[ 3643.328980] mlx5_core 0000:3b:00.0: E-Switch: Supported tc chains and prios offload 
[ 3643.336660] mlx5_core 0000:3b:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295 
[ 3643.751797] mlx5_core 0000:3b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3652.014707] restraintd[3965]: *** Current Time: Thu Oct 28 10:09:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[-- MARK -- Thu Oct 28 14:10:00 2021] 
[-- MARK -- Thu Oct 28 14:10:01 2021] 
[ 3712.014248] restraintd[3965]: *** Current Time: Thu Oct 28 10:10:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3772.014427] restraintd[3965]: *** Current Time: Thu Oct 28 10:11:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3811.928855] INFO: task kworker/u96:4:38872 blocked for more than 122 seconds. 
[ 3811.936000]       Tainted: G          I      --------- ---  5.14.0-1.6.1.el9.x86_64 #1 
[ 3811.943918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 3811.951746] task:kworker/u96:4   state:D stack:    0 pid:38872 ppid:     2 flags:0x00004000 
[ 3811.960098] Workqueue: netns cleanup_net 
[ 3811.964030] Call Trace: 
[ 3811.966485]  __schedule+0x206/0x550 
[ 3811.969986]  schedule+0x3c/0xa0 
[ 3811.973139]  schedule_preempt_disabled+0xa/0x10 
[ 3811.977681]  __mutex_lock.constprop.0+0x295/0x450 
[ 3811.982394]  ? idr_for_each+0x95/0xd0 
[ 3811.986069]  devlink_pernet_pre_exit+0x2a/0xc0 
[ 3811.990525]  cleanup_net+0x1d2/0x370 
[ 3811.994111]  process_one_work+0x1e3/0x380 
[ 3811.998131]  worker_thread+0x53/0x3d0 
[ 3812.001796]  ? process_one_work+0x380/0x380 
[ 3812.005999]  kthread+0x10c/0x130 
[ 3812.009233]  ? set_kthread_struct+0x40/0x40 
[ 3812.013417]  ret_from_fork+0x1f/0x30 
[ 3812.017014] INFO: task devlink:90062 blocked for more than 122 seconds. 
[ 3812.023626]       Tainted: G          I      --------- ---  5.14.0-1.6.1.el9.x86_64 #1 
[ 3812.031536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 3812.039363] task:devlink         state:D stack:    0 pid:90062 ppid: 16460 flags:0x00004000 
[ 3812.047707] Call Trace: 
[ 3812.050160]  __schedule+0x206/0x550 
[ 3812.053654]  schedule+0x3c/0xa0 
[ 3812.056809]  rwsem_down_write_slowpath+0x224/0x470 
[ 3812.061610]  register_netdevice_notifier+0x1c/0x110 
[ 3812.066505]  mlx5e_rep_bridge_init+0x111/0x130 [mlx5_core] 
[ 3812.072052]  mlx5e_uplink_rep_enable+0xd4/0x140 [mlx5_core] 
[ 3812.077668]  mlx5e_attach_netdev+0x9e/0x140 [mlx5_core] 
[ 3812.082927]  ? mlx5e_init_ul_rep+0x3e/0x50 [mlx5_core] 
[ 3812.088100]  mlx5e_netdev_attach_profile+0x93/0xb0 [mlx5_core] 
[ 3812.093967]  mlx5e_netdev_change_profile+0xa0/0xc0 [mlx5_core] 
[ 3812.099835]  mlx5e_vport_rep_load+0xa0/0xf0 [mlx5_core] 
[ 3812.105095]  mlx5_esw_offloads_rep_load+0x86/0xe0 [mlx5_core] 
[ 3812.110884]  esw_offloads_enable+0x266/0x370 [mlx5_core] 
[ 3812.116229]  mlx5_eswitch_enable_locked.part.0+0x100/0x310 [mlx5_core] 
[ 3812.122792]  esw_offloads_start+0x44/0x1f0 [mlx5_core] 
[ 3812.127972]  ? __nla_validate_parse+0x136/0x180 
[ 3812.132504]  mlx5_devlink_eswitch_mode_set+0x102/0x180 [mlx5_core] 
[ 3812.138718]  devlink_nl_cmd_eswitch_set_doit+0xc1/0x150 
[ 3812.143952]  genl_family_rcv_msg_doit+0xe7/0x150 
[ 3812.148574]  genl_rcv_msg+0xdc/0x1e0 
[ 3812.152160]  ? __devlink_port_phys_port_name_get+0x1e0/0x1e0 
[ 3812.157817]  ? genl_get_cmd+0xd0/0xd0 
[ 3812.161483]  netlink_rcv_skb+0x4e/0xf0 
[ 3812.165236]  genl_rcv+0x24/0x40 
[ 3812.168381]  netlink_unicast+0x1f6/0x2c0 
[ 3812.172307]  netlink_sendmsg+0x23b/0x480 
[ 3812.176231]  sock_sendmsg+0x5b/0x60 
[ 3812.179726]  __sys_sendto+0xf0/0x160 
[ 3812.183305]  ? handle_mm_fault+0xba/0x280 
[ 3812.187324]  ? do_user_addr_fault+0x1c7/0x660 
[ 3812.191683]  __x64_sys_sendto+0x20/0x30 
[ 3812.195524]  do_syscall_64+0x38/0x90 
[ 3812.199101]  entry_SYSCALL_64_after_hwframe+0x44/0xae 
[ 3812.204153] RIP: 0033:0x7f718733059a 
[ 3812.207734] RSP: 002b:00007ffdef8570b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c 
[ 3812.215297] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f718733059a 
[ 3812.222431] RDX: 0000000000000038 RSI: 000055eedd7ff440 RDI: 0000000000000003 
[ 3812.229563] RBP: 0000000000000000 R08: 00007f7187435200 R09: 000000000000000c 
[ 3812.236694] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 
[ 3812.243828] R13: 000055eedd7ff2a0 R14: 000055eedc986d5c R15: 000055eedd7ff440 
[ 3832.014278] restraintd[3965]: *** Current Time: Thu Oct 28 10:12:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3892.014629] restraintd[3965]: *** Current Time: Thu Oct 28 10:13:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
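
For triage, the two hung-task reports in the log above can be pulled out mechanically. The following is a minimal sketch, not part of the original report: it assumes the console log uses the line format shown above, and the names `parse_hung_tasks`, `HUNG_RE` and `FRAME_RE` are hypothetical helpers introduced for illustration. Frames prefixed with `?` (unreliable stack entries) are skipped.

```python
import re

# Header printed by the kernel hung-task detector, e.g.
# "[ 3811.928855] INFO: task kworker/u96:4:38872 blocked for more than 122 seconds."
HUNG_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)

# Stack frame line, e.g. "[ 3811.986069]  devlink_pernet_pre_exit+0x2a/0xc0".
# A leading "? " marks a frame the unwinder is not sure about.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_tasks(lines):
    """Return one dict per hung-task report: comm, pid, secs, frames."""
    reports = []
    current = None
    in_trace = False
    for line in lines:
        header = HUNG_RE.match(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "secs": int(header.group("secs")),
                "frames": [],
            }
            reports.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.match(line)
            if frame is None:
                # First non-frame line (e.g. "RIP: ...") ends this trace.
                current = None
                in_trace = False
            elif not frame.group("guess"):
                current["frames"].append(frame.group("sym"))
    return reports


# A few lines copied from the log above, used as a small demo input.
SAMPLE = [
    "[ 3811.928855] INFO: task kworker/u96:4:38872 blocked for more than 122 seconds. ",
    "[ 3811.964030] Call Trace: ",
    "[ 3811.977681]  __mutex_lock.constprop.0+0x295/0x450 ",
    "[ 3811.982394]  ? idr_for_each+0x95/0xd0 ",
    "[ 3811.986069]  devlink_pernet_pre_exit+0x2a/0xc0 ",
    "[ 3812.017014] INFO: task devlink:90062 blocked for more than 122 seconds. ",
    "[ 3812.047707] Call Trace: ",
    "[ 3812.056809]  rwsem_down_write_slowpath+0x224/0x470 ",
    "[ 3812.061610]  register_netdevice_notifier+0x1c/0x110 ",
    "[ 3812.204153] RIP: 0033:0x7f718733059a ",
]

for report in parse_hung_tasks(SAMPLE):
    print(report["comm"], report["pid"], " <- ".join(report["frames"]))
```

To run it against the attached console.log, pass the open file object to `parse_hung_tasks` instead of `SAMPLE`.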


beaker job: https://beaker.engineering.redhat.com/jobs/5950116

distro: RHEL-9.0.0-20211020.4
kernel-5.14.0-1.6.1.el9.x86_64
openvswitch2.15-2.15.0-20.el9fdp.x86_64


Additional info:

Comment 1 qding 2021-10-29 02:33:31 UTC
Created attachment 1838183
console.log

Comment 4 Mohammad Kabat 2023-05-30 08:30:40 UTC
This should be fixed in the RHEL 9.2 GA kernel.
Please retest with the new kernel, 5.14.0-284.11.1.el9.
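
Before retesting, it may help to confirm that the host is actually running a kernel at or above the suggested one. The following is a rough sketch, not something from this bug: `kernel_key` and `has_fix` are hypothetical helpers, the fixed version is assumed to be written in the usual `version-release` form (`5.14.0-284.11.1.el9`), and the comparison only looks at the numeric fields before `.el`, which is a simplification of real RPM version ordering.

```python
import platform
import re

# Kernel suggested in the comment above, written in version-release form (assumption).
FIXED_KERNEL = "5.14.0-284.11.1.el9"


def kernel_key(release):
    """Turn '5.14.0-1.6.1.el9.x86_64' into ((5, 14, 0), (1, 6, 1))."""
    match = re.match(r"^(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)\.el", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    build = tuple(int(part) for part in match.group(2).split("."))
    return version, build


def has_fix(release, fixed=FIXED_KERNEL):
    """True if `release` sorts at or after `fixed` (numeric fields only)."""
    return kernel_key(release) >= kernel_key(fixed)


if __name__ == "__main__":
    running = platform.release()
    try:
        print(running, "has fix:", has_fix(running))
    except ValueError as err:
        # Non-RHEL kernels do not follow the '-<release>.el' naming.
        print(err)
```

With the kernel from the original report, `has_fix("5.14.0-1.6.1.el9.x86_64")` evaluates to `False`.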