Bug 2282969

Summary: oops: Toggling turbo boost twice crashes ELN kernel
Product: [Fedora] Fedora
Reporter: Samuel Dobroň <sdobron>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: rawhide
CC: acaringi, adscvr, airlied, alciregi, bskeggs, hdegoede, hpa, josef, kernel-maint, linville, masami256, mchehab, ptalbert, steved
Target Milestone: ---
Keywords: Regression, TestBlocker
Target Release: ---
Hardware: x86_64
OS: Linux

Description Samuel Dobroň 2024-05-23 12:34:01 UTC
1. Please describe the problem:
Disabling turbo boost, re-enabling it, and then disabling it again crashes the kernel. The problem appears to depend on runtime state: rebooting between the toggles avoids the crash. The crash always happens on the second disable after booting the machine.

When the kernel doesn't panic, we measure a significantly lower processing rate in our XDP Drop test scenario. This might be related; if not, we'll file a new ticket. The test is sensitive to IRQ, BPF and XDP changes, and turbo boost might affect it as well.

Panic log:
[  550.256511] Oops: divide error: 0000 [#1] PREEMPT SMP NOPTI 
[  550.262093] CPU: 5 PID: 8093 Comm: sh Kdump: loaded Not tainted 6.10.0-0.rc0.20240521git8f6a15f095a6.10.eln136.x86_64 #1 
[  550.272954] Hardware name: Dell Inc. PowerEdge R740/0DY2X0, BIOS 2.11.2 004/21/2021 
[  550.280604] RIP: 0010:store_no_turbo+0x12f/0x160 
[  550.285232] Code: 3b 05 f5 2c 93 01 48 89 c3 89 c7 72 d6 41 0f b6 fc e8 25 21 60 ff e9 71 ff ff ff 48 8b 05 e1 26 55 02 48 8b 08 6b 41 1c 64 99 <f7> 79 2c 39 05 c8 26 55 02 7e 9e 89 05 c0 26 55 02 eb 96 48 c7 c5 
[  550.303980] RSP: 0018:ffffa78d46fa7b40 EFLAGS: 00010206 
[  550.309204] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff9a74c3455e00 
[  550.316339] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffaab01600 
[  550.323471] RBP: 0000000000000002 R08: 0000000000000001 R09: 000000000000000a 
[  550.330605] R10: 000000000000000a R11: 0fffffffffffffff R12: ffff9a74dd1f9601 
[  550.337735] R13: fffffffffffffff2 R14: ffffa78d46fa7bd0 R15: ffff9a74dd1f96a0 
[  550.344871] FS:  00007f67dd5c2740(0000) GS:ffff9a83ffc80000(0000) knlGS:0000000000000000 
[  550.352957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[  550.358702] CR2: 0000562a05b47448 CR3: 0000000154a42001 CR4: 00000000007706f0 
[  550.365833] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[  550.372967] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 
[  550.380098] PKRU: 55555554 
[  550.382814] Call Trace: 
[  550.385267]  <TASK> 
[  550.387373]  ? show_trace_log_lvl+0x1b0/0x2f0 
[  550.391740]  ? show_trace_log_lvl+0x1b0/0x2f0 
[  550.396101]  ? kernfs_fop_write_iter+0x13e/0x1f0 
[  550.400726]  ? store_no_turbo+0x12f/0x160 
[  550.404741]  ? __die_body.cold+0x8/0x12 
[  550.408579]  ? die+0x2e/0x50 
[  550.411473]  ? do_trap+0xca/0x110 
[  550.414795]  ? do_error_trap+0x65/0x80 
[  550.418555]  ? store_no_turbo+0x12f/0x160 
[  550.422566]  ? exc_divide_error+0x38/0x50 
[  550.426579]  ? store_no_turbo+0x12f/0x160 
[  550.430593]  ? asm_exc_divide_error+0x1a/0x20 
[  550.434962]  ? store_no_turbo+0x12f/0x160 
[  550.438983]  kernfs_fop_write_iter+0x13e/0x1f0 
[  550.443439]  vfs_write+0x291/0x460 
[  550.446854]  ksys_write+0x6d/0xf0 
[  550.450172]  do_syscall_64+0x7e/0x160 
[  550.453846]  ? __count_memcg_events+0x58/0xf0 
[  550.458214]  ? mem_cgroup_commit_charge+0x7d/0xb0 
[  550.462920]  ? __mod_memcg_lruvec_state+0xa6/0x150 
[  550.467712]  ? __lruvec_stat_mod_folio+0x6b/0xb0 
[  550.472339]  ? do_anonymous_page+0x3ca/0x510 
[  550.476612]  ? __handle_mm_fault+0x2e2/0x710 
[  550.480888]  ? __count_memcg_events+0x58/0xf0 
[  550.485253]  ? handle_mm_fault+0x1f7/0x310 
[  550.489354]  ? do_user_addr_fault+0x347/0x640 
[  550.493721]  ? clear_bhb_loop+0x25/0x80 
[  550.497568]  ? clear_bhb_loop+0x25/0x80 
[  550.501407]  ? clear_bhb_loop+0x25/0x80 
[  550.505249]  entry_SYSCALL_64_after_hwframe+0x76/0x7e 
[  550.510309] RIP: 0033:0x7f67dd2fda57 
[  550.513904] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 
[  550.532651] RSP: 002b:00007fff8731ad38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 
[  550.540219] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f67dd2fda57 
[  550.547350] RDX: 0000000000000002 RSI: 0000562a05b46440 RDI: 0000000000000001 
[  550.554482] RBP: 0000562a05b46440 R08: 0000000000000003 R09: 0000000000000077 
[  550.561616] R10: 0000000000000063 R11: 0000000000000246 R12: 0000000000000002 
[  550.568747] R13: 00007f67dd3fb780 R14: 0000000000000002 R15: 00007f67dd3f69e0 
[  550.575883]  </TASK> 
[  550.578075] Modules linked in: rfkill sunrpc ipmi_ssif intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm mlx5_ib x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ib_uverbs iTCO_wdt iTCO_vendor_support rapl mei_me macsec intel_cstate dcdbas i2c_i801 dell_smbios ib_core mgag200 intel_uncore i2c_smbus pcspkr dell_wmi_descriptor mei lpc_ich wmi_bmof intel_pch_thermal acpi_power_meter ipmi_si acpi_ipmi ipmi_devintf ipmi_msghandler pktgen fuse xfs sd_mod t10_pi sg mlx5_core ahci crct10dif_pclmul mlxfw crc32_pclmul libahci crc32c_intel igb megaraid_sas tls i40e ghash_clmulni_intel psample libata dimlib pci_hyperv_intf i2c_algo_bit libie dca wmi dm_mirror dm_region_hash dm_log dm_mod 
[    0.000000] Linux version 6.10.0-0.rc0.20240521git8f6a15f095a6.10.eln136.x86_64 (mockbuild@a1e858daf99142ca9b8337bb37125af2) (gcc (GCC) 14.1.1 20240507 (Red Hat 14.1.1-1), GNU ld version 2.42.50.20240513) #1 SMP PREEMPT_DYNAMIC Tue May 21 13:20:30 UTC 2024 
[    0.000000] Command line: elfcorehdr=0x58000000 BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.10.0-0.rc0.20240521git8f6a15f095a6.10.eln136.x86_64 ro isolcpus=0,2,4,6,8,10,12,14 resume=/dev/mapper/rhel_wsfd--advnetlab65-swap console=ttyS1,115200n81 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 acpi_no_memhotplug transparent_hugepage=never nokaslr hest_disable novmcoredd cma=0 hugetlb_cma=0 disable_cpu_apicid=0 iTCO_wdt.pretimeout=0 
[    0.000000] BIOS-provided physical RAM map: 
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved 
[    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009bfff] usable 
[    0.000000] BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved 
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved 
[    0.000000] BIOS-e820: [mem 0x000000004fbb0000-0x000000004ff34fff] reserved 
[    0.000000] BIOS-e820: [mem 0x00000000580e00b0-0x0000000067ffffff] usable 
[    0.000000] BIOS-e820: [mem 0x0000000068bff000-0x000000006ebfefff] reserved 
[    0.000000] BIOS-e820: [mem 0x000000006ebff000-0x000000006f9fefff] ACPI NVS 
[    0.000000] BIOS-e820: [mem 0x000000006f9ff000-0x000000006fffefff] ACPI data 
[    0.000000] BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved 
[    0.000000] BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved 
[    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fed00fff] reserved 
[    0.000000] BIOS-e820: [mem 0x00000000fed40000-0x00000000fed44fff] reserved 
[    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved 



2. What is the Version-Release number of the kernel:

kernel-6.10.0-0.rc0.20240517gitea5f6ad9ad96.6.eln136
https://koji.fedoraproject.org/koji/taskinfo?taskID=117807110


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Kernel doesn't panic on kernel-6.9.0-64.eln136, everything works as expected.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
Reproducible always.
Disable, enable, then disable turbo boost again:

[root@wsfd-advnetlab61 ~]# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
[root@wsfd-advnetlab61 ~]# cat /sys/devices/system/cpu/intel_pstate/no_turbo
1
[root@wsfd-advnetlab61 ~]# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
[root@wsfd-advnetlab61 ~]# cat /sys/devices/system/cpu/intel_pstate/no_turbo
0
[root@wsfd-advnetlab61 ~]# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
-- crash
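
The sequence above can also be scripted. The following is only a sketch: `toggle_sequence` is a hypothetical helper name, and the `NO_TURBO` override is an assumption added so the loop can be tried against a scratch file before pointing it at the real sysfs attribute. On an affected kernel the third write is expected to crash the machine, so it should only be run on a test system.

```shell
#!/bin/sh
# Sketch only: replays the disable -> enable -> disable sequence from this report.
# NO_TURBO defaults to the sysfs attribute used above; overriding it with a
# scratch file (an assumption for dry runs, not part of the report) is harmless.
NO_TURBO="${NO_TURBO:-/sys/devices/system/cpu/intel_pstate/no_turbo}"

# toggle_sequence is a hypothetical helper, not an existing tool.
toggle_sequence() {
    for value in 1 0 1; do
        echo "writing $value to $NO_TURBO"
        # Stop at the first failed write instead of continuing blindly.
        echo "$value" > "$NO_TURBO" || return 1
        echo "read back: $(cat "$NO_TURBO")"
    done
}
```

Running `toggle_sequence` as root with the default path performs the three writes shown in the transcript; with `NO_TURBO=/tmp/no_turbo` it only exercises the loop.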


Toggle, reboot, toggle works fine:
[root@wsfd-advnetlab62 ~]# cat /sys/devices/system/cpu/intel_pstate/no_turbo
0
[root@wsfd-advnetlab62 ~]# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
[root@wsfd-advnetlab62 ~]# cat /sys/devices/system/cpu/intel_pstate/no_turbo
1
[root@wsfd-advnetlab62 ~]# reboot
[root@wsfd-advnetlab62 ~]# Connection to xxx closed by remote host.
Connection to xxx closed.
$ ssh root@xxx
[root@wsfd-advnetlab62 ~]# cat /sys/devices/system/cpu/intel_pstate/no_turbo
0
[root@wsfd-advnetlab62 ~]# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
[root@wsfd-advnetlab62 ~]# cat /sys/devices/system/cpu/intel_pstate/no_turbo
1
[root@wsfd-advnetlab62 ~]#



5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:


6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Reproducible: Always

Comment 3 Samuel Dobroň 2024-05-29 12:17:33 UTC
Thanks Jan, I missed this.
The same traceback is present in my reproducer jobs as well. I just double-checked it manually, and your guess is right.

After disabling turbo boost for the first time, the following traceback appears (same as Jan's):
$ echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
[  754.950459] unchecked MSR access error: RDMSR from 0x771 at rIP: 0xffffffffa6ae6bf5 )
[  754.960452] Call Trace:
[  754.962905]  <TASK>
[  754.965012]  ? show_trace_log_lvl+0x1b0/0x2f0
[  754.969370]  ? show_trace_log_lvl+0x1b0/0x2f0
[  754.973730]  ? __flush_smp_call_function_queue+0x95/0x400
[  754.979130]  ? ex_handler_msr.isra.0.cold+0x28/0x60
[  754.984008]  ? fixup_exception+0x157/0x380
[  754.988109]  ? gp_try_fixup_and_notify+0x1e/0xb0
[  754.992727]  ? exc_general_protection+0xff/0x410
[  754.997347]  ? asm_exc_general_protection+0x26/0x30
[  755.002226]  ? __pfx___rdmsr_on_cpu+0x10/0x10
[  755.006587]  ? __rdmsr_on_cpu+0x25/0x60
[  755.010424]  __flush_smp_call_function_queue+0x95/0x400
[  755.015652]  flush_smp_call_function_queue+0x2b/0x60
[  755.020617]  do_idle+0x9c/0xd0
[  755.023678]  cpu_startup_entry+0x29/0x30
[  755.027604]  rest_init+0xcc/0xd0
[  755.030837]  start_kernel+0x41f/0x420
[  755.034503]  x86_64_start_reservations+0x24/0x30
[  755.039121]  x86_64_start_kernel+0x97/0xa0
[  755.043221]  common_startup_64+0x13e/0x141
[  755.047319]  </TASK>


The kernel doesn't panic at this point, though, and enabling turbo boost again doesn't log anything:
$ echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo

Disabling it a second time then crashes the kernel with the traceback from the issue description.
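
To check other machines for the same pattern, a saved kernel log can be scanned for the two signatures seen here: the RDMSR failure on 0x771 after the first disable and the divide error in store_no_turbo after the second. This is a rough sketch; `check_pstate_log` is a hypothetical helper name, and the log file (for example a `dmesg.txt` saved with journalctl) is an assumption.

```shell
#!/bin/sh
# Sketch only: report which of the two signatures from this bug appear in a
# saved kernel log. check_pstate_log is a hypothetical helper name.
check_pstate_log() {
    log="$1"
    # Signature logged after the first "echo 1 > no_turbo".
    if grep -q 'unchecked MSR access error: RDMSR from 0x771' "$log"; then
        echo "MSR 0x771 read failure present"
    fi
    # Signature of the crash on the second disable.
    if grep -q 'divide error' "$log" && grep -q 'RIP: .*store_no_turbo' "$log"; then
        echo "divide error in store_no_turbo present"
    fi
}
```

For example, `check_pstate_log dmesg.txt` prints one line per signature found and nothing if neither is present.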

Comment 4 Samuel Dobroň 2024-06-21 09:54:38 UTC
Seems to be fixed since kernel-6.10.0-0.rc2.20240606git2df0193e62cf.27.eln137 (https://koji.fedoraproject.org/koji/taskinfo?taskID=118651345)