
Bug 2018360

Summary: [OVS offload] task hung with RHEL9
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: qding
Component: openvswitch2.15
Assignee: Amir Tzin (Mellanox) <atzin>
Status: CLOSED CURRENTRELEASE
QA Contact: qding
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: FDP 21.I
CC: ctrautma, jhsiao, lariel, mhou, mkabat, mleitner, ralongi, trinh.dao
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-18 02:08:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1896414
Attachments:
  Description: console.log
  Flags: none

Description qding 2021-10-29 02:25:03 UTC
Description of problem:

[ 3618.220187] mlx5_core 0000:3b:00.2: enabling device (0000 -> 0002) 
[ 3618.226534] mlx5_core 0000:3b:00.2: firmware version: 16.31.1014 
[ 3618.419782] mlx5_core 0000:3b:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3618.439904] mlx5_core 0000:3b:00.2: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3618.560072] mlx5_core 0000:3b:00.2: Supported tc offload range - chains: 1, prios: 16 
[ 3618.567921] mlx5_core 0000:3b:00.2: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3618.588194] mlx5_core 0000:3b:00.2 enp59s0f0v0: renamed from eth2 
[ 3618.712584] mlx5_core 0000:3b:00.2 enp59s0f0v0: Link up 
[ 3618.832076] device eth0 left promiscuous mode 
[ 3618.836677] device enp59s0f0np0 left promiscuous mode 
[ 3618.841801] device ovsbr0 left promiscuous mode 
[ 3618.848475] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v0: link becomes ready 
[ 3618.877200] device ovs-system left promiscuous mode 
[ 3620.491395] pci 0000:3b:00.2: Removing from iommu group 150 
[ 3620.497130] pci 0000:3b:00.3: Removing from iommu group 151 
[ 3621.545238] mlx5_core 0000:3b:00.0: E-Switch: Disable: mode(OFFLOADS), nvfs(2), active vports(3) 
[ 3622.165863] mlx5_core 0000:3b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3622.370685] mlx5_core 0000:3b:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295 
[ 3622.837668] mlx5_core 0000:3b:00.0 enp59s0f0np0: Link up 
[ 3638.612107] mlx5_core 0000:3b:00.0: E-Switch: Enable: mode(LEGACY), nvfs(2), active vports(3) 
[ 3638.727549] pci 0000:3b:00.2: [15b3:1018] type 00 class 0x020000 
[ 3638.733641] pci 0000:3b:00.2: enabling Extended Tags 
[ 3638.739776] pci 0000:3b:00.2: Adding to iommu group 150 
[ 3638.746033] mlx5_core 0000:3b:00.2: enabling device (0000 -> 0002) 
[ 3638.752362] mlx5_core 0000:3b:00.2: firmware version: 16.31.1014 
[ 3638.946724] mlx5_core 0000:3b:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3638.966916] mlx5_core 0000:3b:00.2: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3639.129545] mlx5_core 0000:3b:00.2: Supported tc offload range - chains: 1, prios: 16 
[ 3639.137389] mlx5_core 0000:3b:00.2: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3639.157274] mlx5_core 0000:3b:00.2 enp59s0f0v0: renamed from eth0 
[ 3639.190393] pci 0000:3b:00.3: [15b3:1018] type 00 class 0x020000 
[ 3639.196500] pci 0000:3b:00.3: enabling Extended Tags 
[ 3639.202654] pci 0000:3b:00.3: Adding to iommu group 151 
[ 3639.208455] mlx5_core 0000:3b:00.3: enabling device (0000 -> 0002) 
[ 3639.214787] mlx5_core 0000:3b:00.3: firmware version: 16.31.1014 
[ 3639.293321] mlx5_core 0000:3b:00.2 enp59s0f0v0: Link up 
[ 3639.300644] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v0: link becomes ready 
[ 3639.420981] mlx5_core 0000:3b:00.3: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps 
[ 3639.441578] mlx5_core 0000:3b:00.3: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3639.605954] mlx5_core 0000:3b:00.3: Supported tc offload range - chains: 1, prios: 16 
[ 3639.613799] mlx5_core 0000:3b:00.3: mlx5_tc_ct_init:2146:(pid 78892): tc ct offload not supported, firmware level support is missing 
[ 3639.631930] mlx5_core 0000:3b:00.3 enp59s0f0v1: renamed from eth0 
[ 3639.764667] mlx5_core 0000:3b:00.3 enp59s0f0v1: Link up 
[ 3640.359259] IPv6: ADDRCONF(NETDEV_CHANGE): enp59s0f0v1: link becomes ready 
[ 3641.786646] mlx5_core 0000:3b:00.0: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3) 
[ 3643.328980] mlx5_core 0000:3b:00.0: E-Switch: Supported tc chains and prios offload 
[ 3643.336660] mlx5_core 0000:3b:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295 
[ 3643.751797] mlx5_core 0000:3b:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) 
[ 3652.014707] restraintd[3965]: *** Current Time: Thu Oct 28 10:09:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[-- MARK -- Thu Oct 28 14:10:00 2021] 
[-- MARK -- Thu Oct 28 14:10:01 2021] 
[ 3712.014248] restraintd[3965]: *** Current Time: Thu Oct 28 10:10:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3772.014427] restraintd[3965]: *** Current Time: Thu Oct 28 10:11:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3811.928855] INFO: task kworker/u96:4:38872 blocked for more than 122 seconds. 
[ 3811.936000]       Tainted: G          I      --------- ---  5.14.0-1.6.1.el9.x86_64 #1 
[ 3811.943918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 3811.951746] task:kworker/u96:4   state:D stack:    0 pid:38872 ppid:     2 flags:0x00004000 
[ 3811.960098] Workqueue: netns cleanup_net 
[ 3811.964030] Call Trace: 
[ 3811.966485]  __schedule+0x206/0x550 
[ 3811.969986]  schedule+0x3c/0xa0 
[ 3811.973139]  schedule_preempt_disabled+0xa/0x10 
[ 3811.977681]  __mutex_lock.constprop.0+0x295/0x450 
[ 3811.982394]  ? idr_for_each+0x95/0xd0 
[ 3811.986069]  devlink_pernet_pre_exit+0x2a/0xc0 
[ 3811.990525]  cleanup_net+0x1d2/0x370 
[ 3811.994111]  process_one_work+0x1e3/0x380 
[ 3811.998131]  worker_thread+0x53/0x3d0 
[ 3812.001796]  ? process_one_work+0x380/0x380 
[ 3812.005999]  kthread+0x10c/0x130 
[ 3812.009233]  ? set_kthread_struct+0x40/0x40 
[ 3812.013417]  ret_from_fork+0x1f/0x30 
[ 3812.017014] INFO: task devlink:90062 blocked for more than 122 seconds. 
[ 3812.023626]       Tainted: G          I      --------- ---  5.14.0-1.6.1.el9.x86_64 #1 
[ 3812.031536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 3812.039363] task:devlink         state:D stack:    0 pid:90062 ppid: 16460 flags:0x00004000 
[ 3812.047707] Call Trace: 
[ 3812.050160]  __schedule+0x206/0x550 
[ 3812.053654]  schedule+0x3c/0xa0 
[ 3812.056809]  rwsem_down_write_slowpath+0x224/0x470 
[ 3812.061610]  register_netdevice_notifier+0x1c/0x110 
[ 3812.066505]  mlx5e_rep_bridge_init+0x111/0x130 [mlx5_core] 
[ 3812.072052]  mlx5e_uplink_rep_enable+0xd4/0x140 [mlx5_core] 
[ 3812.077668]  mlx5e_attach_netdev+0x9e/0x140 [mlx5_core] 
[ 3812.082927]  ? mlx5e_init_ul_rep+0x3e/0x50 [mlx5_core] 
[ 3812.088100]  mlx5e_netdev_attach_profile+0x93/0xb0 [mlx5_core] 
[ 3812.093967]  mlx5e_netdev_change_profile+0xa0/0xc0 [mlx5_core] 
[ 3812.099835]  mlx5e_vport_rep_load+0xa0/0xf0 [mlx5_core] 
[ 3812.105095]  mlx5_esw_offloads_rep_load+0x86/0xe0 [mlx5_core] 
[ 3812.110884]  esw_offloads_enable+0x266/0x370 [mlx5_core] 
[ 3812.116229]  mlx5_eswitch_enable_locked.part.0+0x100/0x310 [mlx5_core] 
[ 3812.122792]  esw_offloads_start+0x44/0x1f0 [mlx5_core] 
[ 3812.127972]  ? __nla_validate_parse+0x136/0x180 
[ 3812.132504]  mlx5_devlink_eswitch_mode_set+0x102/0x180 [mlx5_core] 
[ 3812.138718]  devlink_nl_cmd_eswitch_set_doit+0xc1/0x150 
[ 3812.143952]  genl_family_rcv_msg_doit+0xe7/0x150 
[ 3812.148574]  genl_rcv_msg+0xdc/0x1e0 
[ 3812.152160]  ? __devlink_port_phys_port_name_get+0x1e0/0x1e0 
[ 3812.157817]  ? genl_get_cmd+0xd0/0xd0 
[ 3812.161483]  netlink_rcv_skb+0x4e/0xf0 
[ 3812.165236]  genl_rcv+0x24/0x40 
[ 3812.168381]  netlink_unicast+0x1f6/0x2c0 
[ 3812.172307]  netlink_sendmsg+0x23b/0x480 
[ 3812.176231]  sock_sendmsg+0x5b/0x60 
[ 3812.179726]  __sys_sendto+0xf0/0x160 
[ 3812.183305]  ? handle_mm_fault+0xba/0x280 
[ 3812.187324]  ? do_user_addr_fault+0x1c7/0x660 
[ 3812.191683]  __x64_sys_sendto+0x20/0x30 
[ 3812.195524]  do_syscall_64+0x38/0x90 
[ 3812.199101]  entry_SYSCALL_64_after_hwframe+0x44/0xae 
[ 3812.204153] RIP: 0033:0x7f718733059a 
[ 3812.207734] RSP: 002b:00007ffdef8570b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c 
[ 3812.215297] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f718733059a 
[ 3812.222431] RDX: 0000000000000038 RSI: 000055eedd7ff440 RDI: 0000000000000003 
[ 3812.229563] RBP: 0000000000000000 R08: 00007f7187435200 R09: 000000000000000c 
[ 3812.236694] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 
[ 3812.243828] R13: 000055eedd7ff2a0 R14: 000055eedc986d5c R15: 000055eedd7ff440 
[ 3832.014278] restraintd[3965]: *** Current Time: Thu Oct 28 10:12:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
[ 3892.014629] restraintd[3965]: *** Current Time: Thu Oct 28 10:13:52 2021  Localwatchdog at: Fri Oct 29 09:10:51 2021 
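
The console output above contains two hung-task reports. The first task is waiting in __mutex_lock called from devlink_pernet_pre_exit, and the second is waiting in rwsem_down_write_slowpath called from register_netdevice_notifier. When the full console.log holds many such reports, a small parser can help summarize them. The following is a minimal sketch, not part of the original report: the function name parse_hung_tasks and the SAMPLE excerpt are illustrative, and the regular expressions assume the line layout shown in this log.

```python
import re

# Header that the hung-task detector prints for each blocked task.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One stack frame, e.g. "[ 3811.969986]  schedule+0x3c/0xa0".
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<unreliable>\? )?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_tasks(lines):
    """Return one dict per hung-task report with its reliable stack frames."""
    tasks = []
    current = None
    in_trace = False
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "secs": int(header.group("secs")),
                "frames": [],
            }
            tasks.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.match(line)
            if frame is None:
                # First non-frame line (e.g. "RIP: ...") ends the trace.
                in_trace = False
            elif not frame.group("unreliable"):
                current["frames"].append(frame.group("sym"))
    return tasks


# Shortened excerpt of the log above, used only to exercise the parser.
SAMPLE = """\
[ 3811.928855] INFO: task kworker/u96:4:38872 blocked for more than 122 seconds.
[ 3811.936000]       Tainted: G          I      --------- ---  5.14.0-1.6.1.el9.x86_64 #1
[ 3811.960098] Workqueue: netns cleanup_net
[ 3811.964030] Call Trace:
[ 3811.966485]  __schedule+0x206/0x550
[ 3811.969986]  schedule+0x3c/0xa0
[ 3811.973139]  schedule_preempt_disabled+0xa/0x10
[ 3811.977681]  __mutex_lock.constprop.0+0x295/0x450
[ 3811.982394]  ? idr_for_each+0x95/0xd0
[ 3811.986069]  devlink_pernet_pre_exit+0x2a/0xc0
[ 3812.017014] INFO: task devlink:90062 blocked for more than 122 seconds.
[ 3812.047707] Call Trace:
[ 3812.050160]  __schedule+0x206/0x550
[ 3812.053654]  schedule+0x3c/0xa0
[ 3812.056809]  rwsem_down_write_slowpath+0x224/0x470
[ 3812.061610]  register_netdevice_notifier+0x1c/0x110
[ 3812.204153] RIP: 0033:0x7f718733059a
"""

tasks = parse_hung_tasks(SAMPLE.splitlines())
for task in tasks:
    print(task["comm"], task["pid"], " <- ".join(task["frames"]))
```

To run it against the attached console.log, pass the open file object to parse_hung_tasks instead of SAMPLE.splitlines().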


beaker job: https://beaker.engineering.redhat.com/jobs/5950116

distro: RHEL-9.0.0-20211020.4
kernel-5.14.0-1.6.1.el9.x86_64
openvswitch2.15-2.15.0-20.el9fdp.x86_64


Additional info:

Comment 1 qding 2021-10-29 02:33:31 UTC
Created attachment 1838183
console.log

Comment 4 Mohammad Kabat 2023-05-30 08:30:40 UTC
This should be fixed in the RHEL 9.2 GA kernel.
Please retest with the new kernel, 5.14.0-284.11.1.el9.
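
Before retesting, it may help to confirm that the test machine is actually running a kernel at least as new as the one named in Comment 4. The following is a minimal sketch under stated assumptions: the helper names kernel_key and is_at_least are illustrative, and the comparison is a simplified numeric one (it is not RPM's version comparison and only handles releases whose parts before the ".el" dist tag are all numeric).

```python
import re


def kernel_key(release):
    """Turn a kernel release string into a tuple of integers for comparison."""
    # Keep only the part before the dist tag, e.g. "5.14.0-1.6.1".
    head = release.split(".el", 1)[0]
    # Accept both "." and "-" as separators between the numeric parts.
    return tuple(int(part) for part in re.split(r"[.-]", head))


def is_at_least(running, fixed):
    """Return True if `running` is the same as or newer than `fixed`."""
    return kernel_key(running) >= kernel_key(fixed)


# Kernel from the original report and the kernel suggested for retesting.
REPORTED = "5.14.0-1.6.1.el9.x86_64"
SUGGESTED = "5.14.0-284.11.1.el9"

print(is_at_least(REPORTED, SUGGESTED))
```

On a live system, platform.release() from the standard library can supply the running kernel string in place of REPORTED.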