Bug 2231195 - ice driver: kernel panic when running ovs dpdk pvp performance case of ovs3.2
Summary: ice driver: kernel panic when running ovs dpdk pvp performance case of ovs3.2
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch
Version: FDP 23.F
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: ---
Assignee: Timothy Redaelli
QA Contact: qding
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-08-11 01:48 UTC by liting
Modified: 2023-08-16 01:07 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-3097 0 None None None 2023-08-11 01:49:53 UTC

Description liting 2023-08-11 01:48:03 UTC
Description of problem:
ice driver: there is kernel call trace when running ovs dpdk pvp performance case of ovs3.2

Version-Release number of selected component (if applicable):
[root@dell-per740-57 crash]# rpm -qa|grep dpdk
dpdk-22.11-1.el9.x86_64
dpdk-tools-22.11-1.el9.x86_64
[root@dell-per740-57 crash]# rpm -qa|grep driverctl
driverctl-0.111-2.el9.noarch
[root@dell-per740-57 crash]# rpm -qa|grep openvswitch
kernel-kernel-networking-openvswitch-common-4.0-4.noarch
kernel-kernel-networking-openvswitch-perf-1.0-438.noarch
openvswitch-selinux-extra-policy-1.0-34.el9fdp.noarch
openvswitch3.2-3.2.0-0.2.el9fdp.x86_64
[root@dell-per740-57 crash]# uname -r
5.14.0-284.27.1.el9_2.x86_64


How reproducible:


Steps to Reproduce:
Run ovs dpdk vhostuser pvp performance case.

Actual results:
There are many kernel call trace and the job is hung.
Access to the system and check the current process. and found the system is running commands "driverctl set-override 0000:3b:00.0 vfio-pci"
[root@dell-per740-57 ~]# ps aux|grep driverctl
root     1601035  0.0  0.0   7312  3896 ?        D    12:05   0:00 /usr/bin/bash /usr/sbin/driverctl set-override 0000:3b:00.0 vfio-pci
root     1602051  0.0  0.0   7252   436 pts/0    D    20:58   0:00 /usr/bin/bash /usr/sbin/driverctl -v list-overrides
root     1602509  0.0  0.0   6408  2324 pts/2    S+   21:40   0:00 grep --color=auto driverctl

call trace log:
[ 6703.382640] NMI backtrace for cpu 7 
[ 6703.386132] CPU: 7 PID: 11 Comm: kworker/u96:0 Kdump: loaded Tainted: G          I      --------  ---  5.14.0-284.27.1.el9_2.x86_64 #1 
[ 6703.398203] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.12.2 07/09/2021 
[ 6703.405771] Workqueue: netns cleanup_net 
[ 6703.409696] Call Trace: 
[ 6703.412150]  <IRQ> 
[ 6703.414169]  dump_stack_lvl+0x34/0x48 
[ 6703.417834]  nmi_cpu_backtrace.cold+0x30/0x6f 
[ 6703.422192]  ? lapic_can_unplug_cpu+0x80/0x80 
[ 6703.426553]  nmi_trigger_cpumask_backtrace+0xef/0x110 
[ 6703.431608]  trigger_single_cpu_backtrace+0x2a/0x31 
[ 6703.436487]  rcu_dump_cpu_stacks+0xa7/0xe4 
[ 6703.440585]  print_cpu_stall.cold+0x4f/0x17d 
[ 6703.444857]  check_cpu_stall+0xe9/0x240 
[ 6703.448699]  rcu_pending+0x26/0x190 
[ 6703.452191]  rcu_sched_clock_irq+0x3d/0x180 
[ 6703.456376]  update_process_times+0x8c/0xc0 
[ 6703.460562]  tick_sched_handle+0x22/0x60 
[ 6703.464489]  tick_sched_timer+0x65/0x80 
[ 6703.468329]  ? tick_sched_do_timer+0xa0/0xa0 
[ 6703.472601]  __hrtimer_run_queues+0x127/0x2c0 
[ 6703.476959]  hrtimer_interrupt+0xfc/0x210 
[ 6703.480975]  __sysvec_apic_timer_interrupt+0x5c/0x110 
[ 6703.486025]  sysvec_apic_timer_interrupt+0x6d/0x90 
[ 6703.490817]  </IRQ> 
[ 6703.492922]  <TASK> 
[ 6703.495022]  asm_sysvec_apic_timer_interrupt+0x16/0x20 
[ 6703.500162] RIP: 0010:xas_start+0x1b/0xd0 
[ 6703.504173] Code: 4c 89 4f 08 c3 cc cc cc cc 4c 89 c0 eb a5 90 48 8b 57 18 48 89 d0 83 e0 03 74 6b 48 83 f8 02 75 09 48 81 fa 05 c0 ff ff 77 31 <48> 8b 07 48 8b 57 08 48 8b 40 08 48 89 c1 83 e1 03 48 83 f9 02 75 
[ 6703.522919] RSP: 0018:ffffb4f0003c7d80 EFLAGS: 00000202 
[ 6703.528146] RAX: 0000000000000003 RBX: 0000000000000002 RCX: ffffffffac704180 
[ 6703.535279] RDX: 0000000000000003 RSI: 0000000000000004 RDI: ffffb4f0003c7d88 
[ 6703.542411] RBP: ffff8daa07ab0000 R08: 0000000000000001 R09: 0000000000000230 
[ 6703.549543] R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffffffffaadbf6d0 
[ 6703.556679] R13: ffff8db1c5cac100 R14: ffffffffade17da0 R15: ffffffffade17d54 
[ 6703.563811]  ? xas_find+0x1c0/0x1c0 
[ 6703.567306]  xa_get_mark+0x59/0x100 
[ 6703.570796]  devlinks_xa_find_get.constprop.0+0x51/0x90 
[ 6703.576023]  devlink_pernet_pre_exit+0x3c/0xf0 
[ 6703.580467]  ? mutex_lock+0xe/0x30 
[ 6703.583874]  cleanup_net+0x1d2/0x380 
[ 6703.587454]  process_one_work+0x1e2/0x3b0 
[ 6703.591468]  worker_thread+0x50/0x3a0 
[ 6703.595132]  ? rescuer_thread+0x390/0x390 
[ 6703.599144]  kthread+0xd6/0x100 
[ 6703.602291]  ? kthread_complete_and_exit+0x20/0x20 
[ 6703.607083]  ret_from_fork+0x1f/0x30 
[ 6703.610666]  </TASK> 
[ 6746.003796] restraintd[1740]: *** Current Time: Thu Aug 10 12:07:22 2023  Localwatchdog at: Thu Aug 17 10:13:22 2023 
[ 6806.003856] restraintd[1740]: *** Current Time: Thu Aug 10 12:08:22 2023  Localwatchdog at: Thu Aug 17 10:13:22 2023 
[ 6866.003735] restraintd[1740]: *** Current Time: Thu Aug 10 12:09:22 2023  Localwatchdog at: Thu Aug 17 10:13:22 2023 
[ 6883.618814] rcu: INFO: rcu_preempt self-detected stall on CPU 
[ 6883.624562] rcu: 	7-....: (240003 ticks this GP) idle=349/1/0x4000000000000000 softirq=485496/485496 fqs=60006  
[ 6883.634644] 	(t=240269 jiffies g=5329081 q=2818 ncpus=48) 
[ 6883.640043] NMI backtrace for cpu 7 
[ 6883.643534] CPU: 7 PID: 11 Comm: kworker/u96:0 Kdump: loaded Tainted: G          I      --------  ---  5.14.0-284.27.1.el9_2.x86_64 #1 
[ 6883.655607] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.12.2 07/09/2021 
[ 6883.663175] Workqueue: netns cleanup_net 
[ 6883.667099] Call Trace: 
[ 6883.669553]  <IRQ> 
[ 6883.671574]  dump_stack_lvl+0x34/0x48 
[ 6883.675239]  nmi_cpu_backtrace.cold+0x30/0x6f 
[ 6883.679597]  ? lapic_can_unplug_cpu+0x80/0x80 
[ 6883.683956]  nmi_trigger_cpumask_backtrace+0xef/0x110 
[ 6883.689010]  trigger_single_cpu_backtrace+0x2a/0x31 
[ 6883.693888]  rcu_dump_cpu_stacks+0xa7/0xe4 
[ 6883.697990]  print_cpu_stall.cold+0x4f/0x17d 
[ 6883.702262]  check_cpu_stall+0xe9/0x240 
[ 6883.706102]  rcu_pending+0x26/0x190 
[ 6883.709595]  rcu_sched_clock_irq+0x3d/0x180 
[ 6883.713779]  update_process_times+0x8c/0xc0 
[ 6883.717967]  tick_sched_handle+0x22/0x60 
[ 6883.721891]  tick_sched_timer+0x65/0x80 
[ 6883.725731]  ? tick_sched_do_timer+0xa0/0xa0 
[ 6883.730006]  __hrtimer_run_queues+0x127/0x2c0 
[ 6883.734363]  hrtimer_interrupt+0xfc/0x210 
[ 6883.738376]  __sysvec_apic_timer_interrupt+0x5c/0x110 
[ 6883.743429]  sysvec_apic_timer_interrupt+0x6d/0x90 
[ 6883.748223]  </IRQ> 
[ 6883.750328]  <TASK> 
[ 6883.752435]  asm_sysvec_apic_timer_interrupt+0x16/0x20 
[ 6883.757573] RIP: 0010:xas_start+0x29/0xd0 
[ 6883.761587] Code: 90 48 8b 57 18 48 89 d0 83 e0 03 74 6b 48 83 f8 02 75 09 48 81 fa 05 c0 ff ff 77 31 48 8b 07 48 8b 57 08 48 8b 40 08 48 89 c1 <83> e1 03 48 83 f9 02 75 08 48 3d 00 10 00 00 77 21 48 85 d2 75 29 
[ 6883.780331] RSP: 0018:ffffb4f0003c7d80 EFLAGS: 00000202 
[ 6883.785558] RAX: ffff8dab4dcea912 RBX: 0000000000000002 RCX: ffff8dab4dcea912 
[ 6883.792693] RDX: 0000000000000004 RSI: 0000000000000004 RDI: ffffb4f0003c7d88 
[ 6883.799825] RBP: ffff8daa07ab0000 R08: 0000000000000001 R09: 0000000000000230 
[ 6883.806957] R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffffffffaadbf6d0 
[ 6883.814090] R13: ffff8db1c5cac100 R14: ffffffffade17da0 R15: ffffffffade17d54 
[ 6883.821224]  ? xas_find+0x1c0/0x1c0 
[ 6883.824718]  xa_get_mark+0x59/0x100 
[ 6883.828210]  devlinks_xa_find_get.constprop.0+0x51/0x90 
[ 6883.833435]  devlink_pernet_pre_exit+0x3c/0xf0 
[ 6883.837881]  ? mutex_lock+0xe/0x30 
[ 6883.841286]  cleanup_net+0x1d2/0x380 
[ 6883.844866]  process_one_work+0x1e2/0x3b0 
[ 6883.848879]  worker_thread+0x50/0x3a0 
[ 6883.852547]  ? rescuer_thread+0x390/0x390 
[ 6883.856557]  kthread+0xd6/0x100 
[ 6883.859704]  ? kthread_complete_and_exit+0x20/0x20 
[ 6883.864497]  ret_from_fork+0x1f/0x30 
[ 6883.868078]  </TASK> 

Expected results:
No kernel call trace.

Additional info:
job:
https://beaker.engineering.redhat.com/jobs/8174634
console log:
https://beaker-archive.hosts.prod.psi.bos.redhat.com/beaker-logs/2023/08/81746/8174634/14402292/console.log

Comment 1 liting 2023-08-11 02:29:27 UTC
Access to the 740-57 system and create sosreport, but it is hung when setting up plugins.
[root@dell-per740-57 ~]# sos report --all-logs --batch

sosreport (version 4.5.5)

This command will collect diagnostic and configuration information from
this Red Hat Enterprise Linux system and installed applications.

An archive containing the collected information will be generated in
/var/tmp/sos.iroatpia and may be provided to a Red Hat support
representative.

Any information provided to Red Hat will be treated in accordance with
the published support policies at:

        Distribution Website : https://www.redhat.com/
        Commercial Support   : https://access.redhat.com/

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.


 Setting up archive ...
 Setting up plugins ...

Comment 2 liting 2023-08-11 02:31:09 UTC
Reboot the system, it cannot reboot successfully, following is the console log.
[    **] A stop job is running for D-Bus Sys… Message Bus (14min 4s / no limit)
[43872.215389] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 7-... } 5329644 jiffies s: 1917 root: 0x1/.
[43872.226351] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x80/.
[43872.234009] Task dump for CPU 7:
[43872.237240] task:kworker/u96:0   state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00004008
[43872.247147] Workqueue: netns cleanup_net
[43872.251075] Call Trace:
[43872.253527]  <TASK>
[43872.255633]  ? cleanup_net+0x1d2/0x380
[43872.259385]  ? process_one_work+0x1e2/0x3b0
[43872.263570]  ? worker_thread+0x50/0x3a0
[43872.267409]  ? rescuer_thread+0x390/0x390
[43872.271423]  ? kthread+0xd6/0x100
[43872.274743]  ? kthread_complete_and_exit+0x20/0x20
[43872.279536]  ? ret_from_fork+0x1f/0x30
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
[43937.752574] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 7-... } 5395180 jiffies s: 1917 root: 0x1/.
[43937.763535] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x80/.
[43937.771191] Task dump for CPU 7:
[43937.774425] task:kworker/u96:0   state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00004008
[43937.784332] Workqueue: netns cleanup_net
[43937.788257] Call Trace:
[43937.790708]  <TASK>
[43937.792817]  ? cleanup_net+0x1d2/0x380
[43937.796570]  ? process_one_work+0x1e2/0x3b0
[43937.800757]  ? worker_thread+0x50/0x3a0
[43937.804596]  ? rescuer_thread+0x390/0x390
[43937.808607]  ? kthread+0xd6/0x100
[43937.811925]  ? kthread_complete_and_exit+0x20/0x20
[43937.816721]  ? ret_from_fork+0x1f/0x30
[43937.820474]  </TASK>
[44003.289758] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 7-... } 5460716 jiffies s: 1917 root: 0x1/.
[44003.300717] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x80/.
[44003.308378] Task dump for CPU 7:
[44003.311610] task:kworker/u96:0   state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00004008
[44003.321514] Workqueue: netns cleanup_net
[44003.325440] Call Trace:
[44003.327894]  <TASK>
[44003.330002]  ? cleanup_net+0x1d2/0x380
[44003.333752]  ? process_one_work+0x1e2/0x3b0
[44003.337942]  ? worker_thread+0x50/0x3a0
[44003.341778]  ? rescuer_thread+0x390/0x390
[44003.345790]  ? kthread+0xd6/0x100
[44003.349109]  ? kthread_complete_and_exit+0x20/0x20
[44003.353905]  ? ret_from_fork+0x1f/0x30
[44003.357658]  </TASK>
[44016.982936] rcu: INFO: rcu_preempt self-detected stall on CPU
[44016.988680] rcu: 	7-....: (37320832 ticks this GP) idle=349/1/0x4000000000000000 softirq=485496/485496 fqs=9331035 
[44016.999109] 	(t=37372962 jiffies g=5329081 q=1208124 ncpus=48)
[44017.004937] NMI backtrace for cpu 7
[44017.008430] CPU: 7 PID: 11 Comm: kworker/u96:0 Kdump: loaded Tainted: G          I      --------  ---  5.14.0-284.27.1.el9_2.x86_64 #1
[44017.020504] Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.12.2 07/09/2021
[44017.028069] Workqueue: netns cleanup_net
[44017.031997] Call Trace:
[44017.034450]  <IRQ>
[44017.036470]  dump_stack_lvl+0x34/0x48
[44017.040134]  nmi_cpu_backtrace.cold+0x30/0x6f
[44017.044495]  ? lapic_can_unplug_cpu+0x80/0x80
[44017.048855]  nmi_trigger_cpumask_backtrace+0xef/0x110
[44017.053907]  trigger_single_cpu_backtrace+0x2a/0x31
[44017.058784]  rcu_dump_cpu_stacks+0xa7/0xe4
[44017.062887]  print_cpu_stall.cold+0x4f/0x17d
[44017.067158]  check_cpu_stall+0xe9/0x240
[44017.070997]  rcu_pending+0x26/0x190
[44017.074489]  rcu_sched_clock_irq+0x3d/0x180
[44017.078677]  update_process_times+0x8c/0xc0
[44017.082862]  tick_sched_handle+0x22/0x60
[44017.086787]  tick_sched_timer+0x65/0x80
[44017.090627]  ? tick_sched_do_timer+0xa0/0xa0
[44017.094900]  __hrtimer_run_queues+0x127/0x2c0
[44017.099261]  hrtimer_interrupt+0xfc/0x210
[44017.103274]  __sysvec_apic_timer_interrupt+0x5c/0x110
[44017.108327]  sysvec_apic_timer_interrupt+0x6d/0x90
[44017.113119]  </IRQ>
[44017.115225]  <TASK>
[44017.117329]  asm_sysvec_apic_timer_interrupt+0x16/0x20
[44017.122471] RIP: 0010:__rcu_read_lock+0x1d/0x30
[44017.127003] Code: 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 65 48 8b 14 25 40 8f 01 00 8b 82 f4 08 00 00 83 c0 01 89 82 f4 08 00 00 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44
[44017.145748] RSP: 0018:ffffb4f0003c7d68 EFLAGS: 00000202
[44017.150975] RAX: 0000000000000002 RBX: 0000000000000001 RCX: 0000000000000001
[44017.158107] RDX: ffff8daa04780000 RSI: ffffb4f0003c7e08 RDI: ffffffffac704180
[44017.165240] RBP: ffffb4f0003c7e08 R08: 0000000000000238 R09: 0000000020000000
[44017.172372] R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffffffffaadbf6d0
[44017.179506] R13: ffffffffffffffff R14: ffffffffade17da0 R15: ffffffffade17d54
[44017.186638]  ? xas_find+0x1c0/0x1c0
[44017.190133]  xa_find+0x5f/0xe0
[44017.193193]  ? xas_find+0x1c0/0x1c0
[44017.196684]  devlinks_xa_find_get.constprop.0+0x32/0x90
[44017.201910]  devlink_pernet_pre_exit+0x3c/0xf0
[44017.206357]  ? mutex_lock+0xe/0x30
[44017.209763]  cleanup_net+0x1d2/0x380
[44017.213344]  process_one_work+0x1e2/0x3b0
[44017.217354]  worker_thread+0x50/0x3a0
[44017.221022]  ? rescuer_thread+0x390/0x390
[44017.225032]  kthread+0xd6/0x100
[44017.228180]  ? kthread_complete_and_exit+0x20/0x20
[44017.232972]  ret_from_fork+0x1f/0x30
[44017.236552]  </TASK>
[44068.826942] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 7-... } 5526252 jiffies s: 1917 root: 0x1/.
[44068.837904] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x80/.
[44068.845562] Task dump for CPU 7:
[44068.848794] task:kworker/u96:0   state:R  running task     stack:    0 pid:   11 ppid:     2 flags:0x00004008
[44068.858700] Workqueue: netns cleanup_net
[44068.862626] Call Trace:
[44068.865079]  <TASK>
[44068.867184]  ? cleanup_net+0x1d2/0x380
[44068.870938]  ? process_one_work+0x1e2/0x3b0
[44068.875124]  ? worker_thread+0x50/0x3a0
[44068.878962]  ? rescuer_thread+0x390/0x390
[44068.882974]  ? kthread+0xd6/0x100
[44068.886294]  ? kthread_complete_and_exit+0x20/0x20
[44068.891088]  ? ret_from_fork+0x1f/0x30
[44068.894841]  </TASK>

Comment 3 liting 2023-08-11 03:05:47 UTC
I clicked "reboot" in the beaker, it can reboot successfully. And I create sosreport and attach it.

Comment 5 liting 2023-08-14 06:05:16 UTC
I rerun 3 jobs and has no panic, so this issue does not always reproduce.
https://beaker.engineering.redhat.com/jobs/8177222
https://beaker.engineering.redhat.com/jobs/8181689
https://beaker.engineering.redhat.com/jobs/8178632

Comment 6 liting 2023-08-16 01:07:54 UTC
Run one more job, still has no panic.
https://beaker.engineering.redhat.com/jobs/8189074


Note You need to log in before you can comment on or make changes to this bug.