Created attachment 1269473 [details]
system information

Description of problem:
I updated to F26 and the update went well, but the system wouldn't boot afterwards. Unfortunately, the journal etc. have no information on the failed boots at all. Since this was an upgrade, I used the F25 kernel and that boots just fine.

Version-Release number of selected component (if applicable):
kernel-4.11.0-0.rc4.git0.1.fc26.x86_64
kernel-4.11.0-0.rc5.git0.1.fc26.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Update F25 to F26 using dnf - check that update goes correctly
2. Reboot after upgrade

Actual results:
System does not boot - monitor doesn't show any output, and keyboard/mouse seem unresponsive too.

Expected results:
Should boot normally as expected.

Additional info:
I also used the F26 workstation live image hoping to run a fresh install to replicate the issue, but the live image doesn't run well either. I get CPU lockups. Images attached - I can't quite say if the live image and the installed+upgraded system are experiencing the same issue, so this may not be helpful.

fpaste sysinfo from the functioning kernel also attached.
Created attachment 1269474 [details] image showing cpu lockup on F26 LIVE media
Created attachment 1269475 [details] second image showing cpu lockup on F26 LIVE media
I realise there isn't much debug information here - please do let me know if there are any steps you'd like me to take to supply other bits that may be required.
Created attachment 1272508 [details] Screen photo showing output The output says something about nouveau - could be an issue there? Would a dev please take a look and reassign etc?
Reporting in to state that I have the same and/or similar issues related to the nouveau driver. Currently using a Dell XPS 15 9550.
Can someone please confirm what component this bug is in? On bodhi, it was said that this is a kernel bug, but the kernel maint team has recently assigned it to nouveau? Now I don't know who to look for, and what package to give karma to in bodhi either. https://bodhi.fedoraproject.org/updates/xorg-x11-drv-nouveau-1.0.14-2.fc26#comment-597074
(In reply to Ankur Sinha (FranciscoD) from comment #6)
> Can someone please confirm what component this bug is in? On bodhi, it was
> said that this is a kernel bug, but the kernel maint team has recently
> assigned it to nouveau? Now I don't know who to look for, and what package
> to give karma to in bodhi either.
>
> https://bodhi.fedoraproject.org/updates/xorg-x11-drv-nouveau-1.0.14-2.fc26#comment-597074

It gets assigned to this package because we have no other way to say "this bug is in this specific graphics driver in the kernel".

I don't currently have any good ideas about the bug you're seeing here however. If you could manage to boot with "nomodeset 3", and login via ssh and run "modprobe -r nouveau; modprobe nouveau modeset=1", you might be able to get more complete logs which could help.
I'll do that and get back to you. Thank you for confirming.

As a last check, I installed the proprietary nvidia driver off rpmfusion, and my system boots with the f26 kernels:

[asinha@ankur ~]$ uname -a
Linux ankur.pc 4.11.0-0.rc8.git0.1.fc26.x86_64 #1 SMP Mon Apr 24 15:42:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I'll go uninstall these drivers and try to get you more info.

Cheers!
I booted the system with "nomodeset 3" and that got me to a login prompt. Then I used a different machine to ssh in and ran:

sudo modprobe -r nouveau
sudo modprobe nouveau modeset=1

The ssh login hung - I couldn't do anything there. On the main machine, I got some output before it hung. Images attached.
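For anyone repeating this test, the reload sequence can be kept in a small helper so it is typed the same way each time. This is only a sketch: `nouveau_reload_cmds` is a hypothetical name, and the leading `sync` is an added assumption (to flush disks before a possible hang), not something requested in this report. The function only prints the commands; it does not run them.

```shell
#!/bin/sh
# Sketch only: prints the reload sequence from this comment instead of running it,
# so it can be reviewed first. nouveau_reload_cmds is a hypothetical helper name.
nouveau_reload_cmds() {
    # $1: value for nouveau's modeset parameter (1 was used in this report)
    modeset="${1:-1}"
    printf '%s\n' \
        "sync" \
        "modprobe -r nouveau" \
        "modprobe nouveau modeset=${modeset}"
}
```

After booting with "nomodeset 3" and logging in over ssh, the output could be piped to a root shell, for example `nouveau_reload_cmds 1 | sudo sh -x`, so each command is echoed before the session hangs.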
Created attachment 1275407 [details] image 1 showing output after "modprobe nouveau modeset=1"
Created attachment 1275408 [details] image 2 showing output after "modprobe nouveau modeset=1"
Created attachment 1275409 [details]
image 3 showing output after "modprobe nouveau modeset=1"

Then I had to force the system off. If there's any more debug info I can collect, please let me know.

This is the hardware (info from working f25 kernel):

[asinha@ankur ~]$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 18)
        Subsystem: Dell Device 044d
        Kernel driver in use: i915
--
01:00.0 VGA compatible controller: NVIDIA Corporation GT218M [GeForce 310M] (rev ff)
        Kernel driver in use: nouveau
        Kernel modules: nouveau
[asinha@ankur ~]$

In the list of codenames[1], it's "NVA8 (GT218)"

[1] https://nouveau.freedesktop.org/wiki/CodeNames/
Hmm, I think there's been a few other mentions of issues on GT21x hardware recently. I wonder if this is related somehow. Unfortunately, no changes come to mind as obvious culprits.

Ok, I don't suppose journalctl output has any more information on the lockup, full backtrace etc?

netconsole[1] is another option for getting a better log in these situations.

Thanks,
Ben.

[1] https://www.kernel.org/doc/Documentation/networking/netconsole.txt
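As a rough sketch of the netconsole setup, the module parameter described in the linked document has the form `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`. The helper below just builds that string; `netconsole_param` is a hypothetical name, and any addresses and interface name passed to it are placeholders for your own network. The ports are the defaults given in the kernel documentation.

```shell
#!/bin/sh
# Sketch only: builds a netconsole= module parameter string.
# netconsole_param is a hypothetical helper; addresses are supplied by the caller.
netconsole_param() {
    # $1: source IP (failing machine)   $2: source interface
    # $3: target IP (log receiver)      $4: target MAC (optional, empty = broadcast)
    src_port=6665   # default source port per the kernel documentation
    tgt_port=6666   # default target port per the kernel documentation
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$1" "$2" "$tgt_port" "$3" "${4:-}"
}
```

On the failing machine this could be used as `sudo modprobe netconsole "$(netconsole_param <src-ip> <dev> <tgt-ip>)"`, with a UDP listener such as `nc -u -l 6666` on the receiving machine (some netcat variants need `nc -u -l -p 6666` instead).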
Also, what kernel were you using before the update to F26?
(In reply to Ben Skeggs from comment #13)
> Hmm, I think there's been a few other mentions of issues on GT21x hardware
> recently. I wonder if this is related somehow. Unfortunately, no changes
> come to mind as obvious culprits.
>
> Ok, I don't suppose journalctl output has any more information on the
> lockup, full backtrace etc?

No - it seems to have cut off before the cpu lock up began.

> netconsole[1] is another option for getting a better log in these situations.
>
> Thanks,
> Ben.
>
> [1] https://www.kernel.org/doc/Documentation/networking/netconsole.txt

I'll go look into netconsole and try to get more info. (I haven't used it before.)

(In reply to Ben Skeggs from comment #14)
> Also, what kernel were you using before the update to F26?

This is the one I was on, and have kept around to use my system:

[asinha@ankur ~]$ uname -a
Linux ankur.pc 4.10.6-200.fc25.x86_64 #1 SMP Mon Mar 27 14:06:23 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Created attachment 1275727 [details]
netconsole output

I got netconsole working and then did the ssh bit again:

nomodeset 3
modprobe -r nouveau
modprobe nouveau nomodeset=1

The log is attached. I didn't know when to stop so I let it run for a bit and stopped it when I felt it was repeating itself.

I hope that helps some. Please let me know if there's any more debug info I can collect.
(In reply to Ankur Sinha (FranciscoD) from comment #16)
> Created attachment 1275727 [details]
> netconsole output
>
> I got netconsole working and then did the ssh bit again:
>
> nomodeset 3
> modprobe -r nouveau
> modprobe nouveau nomodeset=1
>
> The log is attached. I didn't know when to stop so I let it run for a bit
> and stopped it when I felt it was repeating itself.
>
> I hope that helps some. Please let me know if there's any more debug info I
> can collect.

Ah, thank you! I have an idea of what's going on here now. I should be able to reproduce a similar situation and debug the problem.
I've posted a scratch kernel build[1] that has a test implementation of a fix for your issue. It also contains some extra debug output, as there are a number of confusing aspects to the bug you're seeing (the main one being: 4.10 should be affected too, but it's clearly not somehow).

Regardless of whether or not the test kernel works for you, can you please attach the kernel log from the test kernel? I would also be very interested to see a kernel log from the working 4.10 kernel.

Thanks,
Ben.

[1] https://koji.fedoraproject.org/koji/taskinfo?taskID=19392593
Created attachment 1276885 [details] netconsole output for 4.10.6-200.fc25 (which works) I've used the same method - booted with "nomodeset 3" and then "rmmod nouveau; modprobe nouveau modeset=1" via ssh.
Created attachment 1276886 [details] netconsole output for 4.11.0-1 test build Basically hung after "modprobe nouveau modeset=1" and rebooted, so I tried it again and it did the same thing. (The log will show two boots as a result.)
Hi, Just dropping a note. It's still happening with kernel-core-4.11.4-300.fc26.x86_64 Cheers, Ankur
(In reply to Ankur Sinha (FranciscoD) from comment #21)
> Hi,
>
> Just dropping a note.
> It's still happening with kernel-core-4.11.4-300.fc26.x86_64
>
> Cheers,
> Ankur

Well, the particular softlockup that was first reported is gone with the test build at least.. Now I'm partially clueless as to what else could possibly be going on. The most interesting thing that stands out is that in the failing cases, we're only detecting 16MiB of VRAM instead of 512MiB...

I've seen some reports that other similar looking issues magically disappeared in 4.12, could you quickly test one of the 4.12/4.13 builds from f27 and see if they help you?
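To compare the detected VRAM between a working and a failing boot, the relevant lines can be filtered out of a saved kernel log. This is a sketch under assumptions: `vram_lines` is a hypothetical helper, and it simply matches nouveau lines that mention "VRAM" or "MiB", since the exact wording of nouveau's messages may differ between kernel versions.

```shell
#!/bin/sh
# Sketch only: filters a kernel log on stdin for nouveau lines that mention
# memory sizes. vram_lines is a hypothetical helper; the pattern is a guess
# at what is useful, not a documented nouveau log format.
vram_lines() {
    grep -i -E 'nouveau.*(vram|mib)'
}
```

For example, `vram_lines < netconsole-4.10.log` and `vram_lines < netconsole-4.11.log` (file names are placeholders), or `journalctl -k -b | vram_lines` on a running system.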
I tried out kernel-core-4.13.0-0.rc1.git0.1.fc27.x86_64 from the kernel-nodebug repo. I got as far as gdm here, but on attempting to log in, it would hang. Here's what journalctl caught:

Jul 20 11:37:04 ankur.pc kernel: BUG: unable to handle kernel paging request at ffffffffc03c6f47
Jul 20 11:37:04 ankur.pc kernel: IP: report_bug+0x94/0x120
Jul 20 11:37:04 ankur.pc kernel: PGD 1e8e0c067
Jul 20 11:37:04 ankur.pc kernel: P4D 1e8e0c067
Jul 20 11:37:04 ankur.pc kernel: PUD 1e8e0e067
Jul 20 11:37:04 ankur.pc kernel: PMD 23d944067
Jul 20 11:37:04 ankur.pc kernel: PTE 800000023d087161
Jul 20 11:37:04 ankur.pc kernel:
Jul 20 11:37:04 ankur.pc kernel: Oops: 0003 [#1] SMP
Jul 20 11:37:04 ankur.pc kernel: Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilt
Jul 20 11:37:04 ankur.pc kernel: mac80211 wmi_bmof intel_ips i2c_i801 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_hda_codec_idt videobuf2_core snd_hda
Jul 20 11:37:04 ankur.pc kernel: CPU: 3 PID: 43 Comm: kworker/3:1 Not tainted 4.13.0-0.rc1.git0.1.fc27.x86_64 #1
Jul 20 11:37:04 ankur.pc kernel: Hardware name: Dell Inc. Vostro 3400/08YN7X, BIOS A10 10/25/2010
Jul 20 11:37:04 ankur.pc kernel: Workqueue: pm pm_runtime_work
Jul 20 11:37:04 ankur.pc kernel: task: ffffa0c581dc0000 task.stack: ffffc0e740ddc000
Jul 20 11:37:04 ankur.pc kernel: RIP: 0010:report_bug+0x94/0x120
Jul 20 11:37:04 ankur.pc kernel: RSP: 0018:ffffc0e740ddf870 EFLAGS: 00010002
Jul 20 11:37:04 ankur.pc kernel: RAX: 0000000000000907 RBX: ffffc0e740ddf9d8 RCX: ffffffffc03c6f3d
Jul 20 11:37:04 ankur.pc kernel: RDX: 0000000000000001 RSI: 0000000000000260 RDI: 0000000000000001
Jul 20 11:37:04 ankur.pc kernel: RBP: ffffc0e740ddf890 R08: ffffc0e740de0000 R09: 00000000000002bc
Jul 20 11:37:04 ankur.pc kernel: R10: ffffffff99e06a80 R11: ffffffffc071f650 R12: ffffffffc03a9be9
Jul 20 11:37:04 ankur.pc kernel: R13: ffffffffc03c0a7c R14: 0000000000000004 R15: ffffc0e740ddf9d8
Jul 20 11:37:04 ankur.pc kernel: FS: 0000000000000000(0000) GS:ffffa0c58bd80000(0000) knlGS:0000000000000000
Jul 20 11:37:04 ankur.pc kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 20 11:37:04 ankur.pc kernel: CR2: ffffffffc03c6f47 CR3: 00000002404d9000 CR4: 00000000000006e0
Jul 20 11:37:04 ankur.pc kernel: Call Trace:
Jul 20 11:37:04 ankur.pc kernel: ? drm_calc_vbltimestamp_from_scanoutpos+0x299/0x330 [drm]
Jul 20 11:37:04 ankur.pc kernel: fixup_bug+0x2e/0x50
Jul 20 11:37:04 ankur.pc kernel: do_trap+0x119/0x150
Jul 20 11:37:04 ankur.pc kernel: do_error_trap+0x89/0x110
Jul 20 11:37:04 ankur.pc kernel: ? drm_calc_vbltimestamp_from_scanoutpos+0x299/0x330 [drm]
Jul 20 11:37:04 ankur.pc kernel: ? vsnprintf+0xea/0x4d0
Jul 20 11:37:04 ankur.pc kernel: do_invalid_op+0x20/0x30
Jul 20 11:37:04 ankur.pc kernel: invalid_op+0x1e/0x30
Jul 20 11:37:04 ankur.pc kernel: RIP: 0010:drm_calc_vbltimestamp_from_scanoutpos+0x299/0x330 [drm]
Jul 20 11:37:04 ankur.pc kernel: RSP: 0018:ffffc0e740ddfa80 EFLAGS: 00010086
Jul 20 11:37:04 ankur.pc kernel: RAX: ffffffffc0792b60 RBX: ffffa0c57d202000 RCX: 0000000000000000
Jul 20 11:37:04 ankur.pc kernel: RDX: ffffffffc03c6078 RSI: 0000000000000001 RDI: ffffffffc03c0a9c
Jul 20 11:37:04 ankur.pc kernel: RBP: ffffc0e740ddfae8 R08: 0000000000000000 R09: ffffffffc03a9950
Jul 20 11:37:04 ankur.pc kernel: R10: ffffa0c57d9de888 R11: ffffffffc071f650 R12: 0000000000000000
Jul 20 11:37:04 ankur.pc kernel: R13: ffffa0c57d9de800 R14: ffffc0e740ddfafc R15: ffffc0e740ddfb40
Jul 20 11:37:04 ankur.pc kernel: ? nouveau_display_vblank_disable+0x30/0x30 [nouveau]
Jul 20 11:37:04 ankur.pc kernel: ? drm_get_last_vbltimestamp+0x90/0x90 [drm]
Jul 20 11:37:04 ankur.pc kernel: ? vprintk_emit+0x328/0x390
Jul 20 11:37:04 ankur.pc kernel: drm_get_last_vbltimestamp+0x56/0x90 [drm]
Jul 20 11:37:04 ankur.pc kernel: drm_update_vblank_count+0x76/0x270 [drm]
Jul 20 11:37:04 ankur.pc kernel: drm_vblank_disable_and_save+0x5d/0xd0 [drm]
Jul 20 11:37:04 ankur.pc kernel: drm_crtc_vblank_off+0xb7/0x210 [drm]
Jul 20 11:37:04 ankur.pc kernel: ? insert_work+0x52/0xc0
Jul 20 11:37:04 ankur.pc kernel: nouveau_display_fini+0x5d/0xd0 [nouveau]
Jul 20 11:37:04 ankur.pc kernel: ? vga_switcheroo_runtime_resume+0x50/0x50
Jul 20 11:37:04 ankur.pc kernel: nouveau_display_suspend+0x57/0x120 [nouveau]
Jul 20 11:37:04 ankur.pc kernel: nouveau_do_suspend+0x7d/0x1d0 [nouveau]
Jul 20 11:37:04 ankur.pc kernel: nouveau_pmops_runtime_suspend+0x59/0xc0 [nouveau]
Jul 20 11:37:04 ankur.pc kernel: pci_pm_runtime_suspend+0x5f/0x170
Jul 20 11:37:04 ankur.pc kernel: ? vga_switcheroo_runtime_resume+0x50/0x50
Jul 20 11:37:04 ankur.pc kernel: vga_switcheroo_runtime_suspend+0x23/0xa0
Jul 20 11:37:04 ankur.pc kernel: __rpm_callback+0xc2/0x200
Jul 20 11:37:04 ankur.pc kernel: ? vga_switcheroo_runtime_resume+0x50/0x50
Jul 20 11:37:04 ankur.pc kernel: rpm_callback+0x24/0x80
Jul 20 11:37:04 ankur.pc kernel: ? vga_switcheroo_runtime_resume+0x50/0x50
Jul 20 11:37:04 ankur.pc kernel: rpm_suspend+0x138/0x630
Jul 20 11:37:04 ankur.pc kernel: pm_runtime_work+0x68/0x90
Jul 20 11:37:04 ankur.pc kernel: process_one_work+0x193/0x3c0
Jul 20 11:37:04 ankur.pc kernel: worker_thread+0x4a/0x3a0
Jul 20 11:37:04 ankur.pc kernel: kthread+0x125/0x140
Jul 20 11:37:04 ankur.pc kernel: ? process_one_work+0x3c0/0x3c0
Jul 20 11:37:04 ankur.pc kernel: ? kthread_park+0x60/0x60
Jul 20 11:37:04 ankur.pc kernel: ret_from_fork+0x25/0x30
Jul 20 11:37:04 ankur.pc kernel: Code: 74 59 0f b7 41 0a 4c 63 69 04 0f b7 71 08 89 c7 49 01 cd 83 e7 01 a8 02 74 15 66 85 ff 74 10 a8 04 ba 01 00 00 00 75 26 83 c8 04 <6
Jul 20 11:37:04 ankur.pc kernel: RIP: report_bug+0x94/0x120 RSP: ffffc0e740ddf870
Jul 20 11:37:04 ankur.pc kernel: CR2: ffffffffc03c6f47
Jul 20 11:37:04 ankur.pc kernel: ---[ end trace 2b46307d28802a2b ]---
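A trace like the one above can be cut out of a longer journal with a small filter. This is a sketch only: `oops_block` is a hypothetical helper that prints everything from the first "BUG: " line through the matching "end trace" line, and it assumes the oops made it to disk before the hang.

```shell
#!/bin/sh
# Sketch only: prints the span of a kernel log (read from stdin) that starts
# at a "BUG: " line and ends at the "end trace" marker.
# oops_block is a hypothetical helper name.
oops_block() {
    sed -n '/BUG: /,/end trace/p'
}
```

After rebooting into a working kernel, the previous boot's kernel messages could be filtered with `journalctl -k -b -1 --no-pager | oops_block`.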
Created attachment 1301625 [details] Full journalctl log for 4.13.0-0.rc1.git0.1
Created attachment 1301626 [details] Full journalctl log for 4.13.0-0.rc1.git0.1 with selinux in permissive mode Saw a few selinux messages in there so switched it to permissive to confirm that the issue still persisted - it did. Full log attached.
Ah, that's a separate issue that I'm currently working on fixing. Some kernel-related stuff changed from underneath us.

It actually looks like your original issue has indeed gone away though, which is encouraging, but confusing, as I can't explain why.

In your case, you should actually be able to boot now with "nouveau.runpm=0" to disable powering down the NVIDIA GPU when it's not in use. However, you won't be able to use suspend/resume until this is fixed.
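One way to confirm that the option actually reached the kernel is to look for it on the running kernel's command line. A minimal sketch, assuming a POSIX shell; `has_kernel_arg` is a hypothetical helper, and its second argument exists only so the check can be tried against an arbitrary string instead of /proc/cmdline.

```shell
#!/bin/sh
# Sketch only: succeeds if a given argument appears as a whole word on the
# kernel command line. has_kernel_arg is a hypothetical helper name.
has_kernel_arg() {
    # $1: argument to look for, e.g. nouveau.runpm=0
    # $2: command line to search (defaults to the running kernel's)
    cmdline="${2:-$(cat /proc/cmdline)}"
    case " ${cmdline} " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On Fedora the option can be made persistent with `sudo grubby --update-kernel=ALL --args="nouveau.runpm=0"`, or added once by editing the kernel line at the GRUB menu. After rebooting, `has_kernel_arg nouveau.runpm=0 && echo "runpm disabled on cmdline"` reports whether it is in effect.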
I've attempted to guess what I need to backport from 4.12 for you, but as I can't reproduce, I can't confirm. Can you test this scratch build for me please? https://koji.fedoraproject.org/koji/taskinfo?taskID=20708110 Thanks, Ben.
This scratch build with 4.11.11-300.fc26.x86_64 seems to work!! :D
Great! It's good to have confirmation of the cause and the solution. Finally.

I originally planned on backporting the fixes to 4.11. However, I've been informed that 4.11 is end-of-life. It looks like 4.12 builds (which already have the fix) are available in koji for F26 already though, so they should be in updates-testing/updates at some point soon.
Hi Ben, I think this is fixed - I haven't had the issue in a while on F27 now. Closing. Please reopen if needed. Thanks again for fixing it.