1. Please describe the problem:
Full system lockup: the machine stops responding (Num Lock no longer toggles) and a reboot is the only thing that resolves it. I'm able to capture some logs over netconsole. I am not sure, but the issue might have started when updating the Nvidia driver from 580.82 to 580.95. I vaguely remember being on 6.16.8.

2. What is the Version-Release number of the kernel:
6.16.8-200.fc42
6.16.10-200.fc42
6.16.11-200.fc42
6.17.3-300.fc42

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
It currently works on 6.16.7.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
This seems to happen after starting a modded Minecraft server, once it has fully loaded, but other intensive tasks such as playing the game Hades II might also cause it.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:
N/A (Tested with 6.17.3)

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
Nvidia proprietary driver.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
Will be attached.

This is from a kernel run with nowatchdog set:
```
Oct 15 18:52:15 pink-unicorn kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Oct 15 18:52:15 pink-unicorn kernel: rcu: #0118-...0: (1 GPs behind) idle=7d94/1/0x4000000000000000 softirq=21427/21428 fqs=11815
Oct 15 18:52:15 pink-unicorn kernel: rcu: #011(detected by 14, t=60004 jiffies, g=44341, q=6922 ncpus=20)
Oct 15 18:52:15 pink-unicorn kernel: Sending NMI from CPU 14 to CPUs 8:
Oct 15 18:52:15 pink-unicorn kernel: NMI backtrace for cpu 8
Oct 15 18:52:15 pink-unicorn kernel: CPU: 8 UID: 1000 PID: 3957 Comm: Server thread Tainted: G U O 6.16.11-200.fc42.x86_64 #1 PREEMPT(lazy)
Oct 15 18:52:15 pink-unicorn kernel: Tainted: [U]=USER, [O]=OOT_MODULE
Oct 15 18:52:15 pink-unicorn kernel: Hardware name: Gigabyte Technology Co., Ltd. Z690 AORUS ELITE AX DDR4/Z690 AORUS ELITE AX DDR4, BIOS F30 09/27/2024
Oct 15 18:52:15 pink-unicorn kernel: RIP: 0010:hrtimer_active+0x15/0x50
Oct 15 18:52:15 pink-unicorn kernel: Code: 83 c0 01 eb 91 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 47 30 8b 50 10 f6 c2 01 75 23 <80> 7f 38 00 75 29 48 39 78 18 74 23 8b 48 10 39 ca 75 e1 48 8b 57
Oct 15 18:52:15 pink-unicorn kernel: RSP: 0000:ffffd2848a66fc20 EFLAGS: 00000046
Oct 15 18:52:15 pink-unicorn kernel: RAX: ffff897fbf621440 RBX: ffff8961baac1668 RCX: ffff897fbf621400
Oct 15 18:52:15 pink-unicorn kernel: RDX: 000000000003f342 RSI: 0000000000000087 RDI: ffff8961baac1668
Oct 15 18:52:15 pink-unicorn kernel: RBP: ffffd2848a66ff58 R08: 0000000000000087 R09: ffff89802425d000
Oct 15 18:52:15 pink-unicorn kernel: R10: 0000000000000008 R11: 0000000000000000 R12: ffffd2848a66fd00
Oct 15 18:52:15 pink-unicorn kernel: R13: 0000000000000001 R14: ffff8961baac1500 R15: ffff897fbf621440
Oct 15 18:52:15 pink-unicorn kernel: FS: 00007f303ccff6c0(0000) GS:ffff89802425d000(0000) knlGS:0000000000000000
Oct 15 18:52:15 pink-unicorn kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 15 18:52:15 pink-unicorn kernel: CR2: 00007f2e90b52000 CR3: 00000001c7afd004 CR4: 0000000000f72ef0
Oct 15 18:52:15 pink-unicorn kernel: PKRU: 55555554
Oct 15 18:52:15 pink-unicorn kernel: Call Trace:
Oct 15 18:52:15 pink-unicorn kernel: <TASK>
Oct 15 18:52:15 pink-unicorn kernel: hrtimer_cancel+0x30/0x40
Oct 15 18:52:15 pink-unicorn kernel: cpu_clock_event_stop+0x5c/0xa0
Oct 15 18:52:15 pink-unicorn kernel: __perf_event_overflow+0x1eb/0x3b0
Oct 15 18:52:15 pink-unicorn kernel: ? __alloc_frozen_pages_noprof+0x18b/0x350
Oct 15 18:52:15 pink-unicorn kernel: perf_swevent_hrtimer+0xe3/0x150
Oct 15 18:52:15 pink-unicorn kernel: ? mod_memcg_lruvec_state+0x1bf/0x2f0
Oct 15 18:52:15 pink-unicorn kernel: ? timerqueue_del+0x2e/0x60
Oct 15 18:52:15 pink-unicorn kernel: ? __pfx_perf_swevent_hrtimer+0x10/0x10
Oct 15 18:52:15 pink-unicorn kernel: __hrtimer_run_queues+0x110/0x2a0
Oct 15 18:52:15 pink-unicorn kernel: hrtimer_interrupt+0xfc/0x230
Oct 15 18:52:15 pink-unicorn kernel: __sysvec_apic_timer_interrupt+0x55/0x100
Oct 15 18:52:15 pink-unicorn kernel: sysvec_apic_timer_interrupt+0x38/0x90
Oct 15 18:52:15 pink-unicorn kernel: asm_sysvec_apic_timer_interrupt+0x1a/0x20
Oct 15 18:52:15 pink-unicorn kernel: RIP: 0033:0x7f30dc88f369
Oct 15 18:52:15 pink-unicorn kernel: Code: 8b 5d f8 c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 55 48 89 e5 41 55 41 54 53 48 83 ec 08 44 0f b6 af 49 01 00 00 <44> 89 ea 83 e2 01 75 45 48 8d 05 e8 e9 84 00 44 0f b6 20 44 89 e8
Oct 15 18:52:15 pink-unicorn kernel: RSP: 002b:00007f303ccfccc0 EFLAGS: 00000206
Oct 15 18:52:15 pink-unicorn kernel: RAX: 00007f30dd0d1a08 RBX: 00007f30d4041470 RCX: 0000000000000007
Oct 15 18:52:15 pink-unicorn kernel: RDX: 00007f30c5336a10 RSI: 00007f30c5336a10 RDI: 00007f30c5336a10
Oct 15 18:52:15 pink-unicorn kernel: RBP: 00007f303ccfcce0 R08: 00007f30c43a0000 R09: 00007f30dba40000
Oct 15 18:52:15 pink-unicorn kernel: R10: 00007f30bd09cb40 R11: 0000000000000004 R12: 0000000000000000
Oct 15 18:52:15 pink-unicorn kernel: R13: 0000000000000006 R14: 0000000000000000 R15: 000055b4fa9df2a0
Oct 15 18:52:15 pink-unicorn kernel: </TASK>
Oct 15 18:52:15 pink-unicorn kernel: </TASK>
```

Hardware info:
Motherboard: Z690 AORUS ELITE AX DDR4
CPU: 12th Gen Intel(R) Core(TM) i7-12700K
GPU: NVIDIA GeForce RTX 5070 Ti

Reproducible: Always
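
For anyone comparing traces across the affected kernels, here is a minimal sketch of pulling the kernel-side call chain out of a saved log such as the one above. The function name `call_trace_frames` and its `include_unreliable` parameter are hypothetical helpers, not part of any existing tool, and the sketch only looks at the first "Call Trace:" section it finds. Frames prefixed with "?" are ones the unwinder could not confirm, so they are skipped by default.

```python
import re

# Matches "symbol+0xoffset/0xsize", optionally preceded by "? ",
# which the kernel prints for frames it could not reliably verify.
FRAME = re.compile(r"(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def call_trace_frames(log_text, include_unreliable=False):
    """Return function names from the first 'Call Trace:' section.

    Works on the raw text, so it does not matter whether the log
    has one entry per line or was flattened into a single line.
    """
    start = log_text.find("Call Trace:")
    if start == -1:
        return []
    end = log_text.find("</TASK>", start)
    if end == -1:
        end = len(log_text)

    frames = []
    for match in FRAME.finditer(log_text, start, end):
        if match.group(1) and not include_unreliable:
            continue  # skip "? symbol" entries
        frames.append(match.group(2))
    return frames


if __name__ == "__main__":
    # Assumes the log was saved as dmesg.txt, as the bug template suggests.
    with open("dmesg.txt", encoding="utf-8", errors="replace") as fh:
        for name in call_trace_frames(fh.read()):
            print(name)
```

Run against the log above, this would list the reliable frames starting from `hrtimer_cancel`, which makes it easier to check whether two reports show the same perf/hrtimer path.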
Created attachment 2109975
Kernel log
The crash was also observed with the nouveau driver.
I have just tested with a clean KDE 42 install with no Nvidia drivers, with only Java installed to trigger it. I wasn't able to grab the logs, but the behavior seems to be the same.
Observed the same all-core lockup on an AMD system with integrated graphics, using OpenJDK 21 from the official Fedora repo and running a Minecraft server.

From journalctl, it appears that one core entered a hard lockup first and the other cores entered soft lockups soon after. Everything looks fine as long as the Minecraft server is not started, but the lockup is triggered every time once the server has been started and has run for 1-2 minutes.

No lockup was observed on 6.15.8, but on 6.16.8 and 6.16.12 the problem is consistent. Not tested on other kernels. No additional module is inserted manually, but according to one of my friends who tried to find out why, there might be some BPF module registered by the JDK (or audit) to monitor the performance of the server.

System info:
CPU: AMD Ryzen 7 7700X
GPU: Integrated
Motherboard: TUF B650-PLUS WiFi
(In reply to Yan Pan from comment #4)
> Observed same all-core lockup on an AMD system with integrated graphics
> card. OpenJDK 21 from official Fedora repo. Running Minecraft server.
> From journalctl it appears that a core trapped into hard lockup first and
> other cores soon trapped into soft lockup after the first core. Everything
> looks fine without starting Minecraft server, but the lockup is triggered
> every time after the server being started and running for 1-2 minutes.
> On 6.15.8 no lockup observed, but on 6.16.8 and 6.16.12 the problem is
> consistent. Not tested on other kernels. No additional module is inserted
> manually, but according to one of my friends who tried to find out why there
> might be some BPF module registered by the JDK(or audit) to monitor
> performance of the server.

This seems to match up exactly with what I'm experiencing. I'll try testing in a VM tomorrow.
So it's possible that this is related to the spark mod; see https://github.com/lucko/spark/issues/530. I was able to reproduce it in a VM as well.

It has also been reported upstream:
https://lore.kernel.org/stable/CAHPNGSQpXEopYreir+uDDEbtXTBvBvi8c6fYXJvceqtgTPao3Q@mail.gmail.com/T/#u