Bug 1439890

Summary: unable to boot on freshly updated F26 system
Product: [Fedora] Fedora
Reporter: Ankur Sinha (FranciscoD) <sanjay.ankur>
Component: xorg-x11-drv-nouveau
Assignee: Ben Skeggs <bskeggs>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 26
CC: airlied, ajax, bskeggs, gansalmon, ichavero, itamar, jonathan, kernel-maint, labbott, madhu.chinakonda, mchehab, roeffen.gijs, sanjay.ankur
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Last Closed: 2018-01-07 21:39:40 UTC
Type: Bug
Attachments:
system information
image showing cpu lockup on F26 LIVE media
second image showing cpu lockup on F26 LIVE media
Screen photo showing output
image 1 showing output after "modprobe nouveau modeset=1"
image 2 showing output after "modprobe nouveau modeset=1"
image 3 showing output after "modprobe nouveau modeset=1"
netconsole output
netconsole output for 4.10.6-200.fc25 (which works)
netconsole output for 4.11.0-1 test build
Full journalctl log for 4.13.0-0.rc1.git0.1
Full journalctl log for 4.13.0-0.rc1.git0.1 with selinux in permissive mode

Description Ankur Sinha (FranciscoD) 2017-04-06 19:03:11 UTC
Created attachment 1269473 [details]
system information

Description of problem:
I updated to F26 and the update went well, but the system wouldn't boot afterwards. Unfortunately, the journal etc. have no information on the failed boots at all. Since this was an upgrade, I used the F25 kernel and that boots just fine.


Version-Release number of selected component (if applicable):
kernel-4.11.0-0.rc4.git0.1.fc26.x86_64
kernel-4.11.0-0.rc5.git0.1.fc26.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Update F25 to F26 using dnf - check that update goes correctly
2. Reboot after upgrade
3.

Actual results:
System does not boot - monitor doesn't show any output, and keyboard/mouse seem unresponsive too.

Expected results:
Should boot normally as expected.

Additional info:
I also used the F26 workstation live image hoping to run a fresh install to replicate the issue, but the live image doesn't run well either. I get CPU lockups. Images attached - I can't quite say if the live image and the installed+upgraded system are experiencing the same issue, so this may not be helpful.


fpaste sysinfo from the functioning kernel also attached.

Comment 1 Ankur Sinha (FranciscoD) 2017-04-06 19:04:10 UTC
Created attachment 1269474 [details]
image showing cpu lockup on F26 LIVE media

Comment 2 Ankur Sinha (FranciscoD) 2017-04-06 19:05:02 UTC
Created attachment 1269475 [details]
second image showing cpu lockup on F26 LIVE media

Comment 3 Ankur Sinha (FranciscoD) 2017-04-06 19:07:23 UTC
I realise there isn't much debug information here - please do let me know if there are any steps you'd like me to take to supply other bits that may be required.

Comment 4 Ankur Sinha (FranciscoD) 2017-04-19 07:16:38 UTC
Created attachment 1272508 [details]
Screen photo showing output

The output says something about nouveau - could be an issue there? Would a dev please take a look and reassign etc?

Comment 5 Gijs Roeffen 2017-04-19 11:14:55 UTC
Reporting in to state that I have the same and/or similar issues related to the nouveau driver. Currently using a Dell XPS 15 9550.

Comment 6 Ankur Sinha (FranciscoD) 2017-05-01 11:54:18 UTC
Can someone please confirm what component this bug is in? On bodhi, it was said that this is a kernel bug, but the kernel maint team has recently assigned it to nouveau? Now I don't know who to look for, and what package to give karma to in bodhi either.

https://bodhi.fedoraproject.org/updates/xorg-x11-drv-nouveau-1.0.14-2.fc26#comment-597074

Comment 7 Ben Skeggs 2017-05-01 12:10:56 UTC
(In reply to Ankur Sinha (FranciscoD) from comment #6)
> Can someone please confirm what component this bug is in? On bodhi, it was
> said that this is a kernel bug, but the kernel maint team has recently
> assigned it to nouveau? Now I don't know who to look for, and what package
> to give karma to in bodhi either.
> 
> https://bodhi.fedoraproject.org/updates/xorg-x11-drv-nouveau-1.0.14-2.
> fc26#comment-597074

It gets assigned to this package because we have no other way to say "this bug is in this specific graphics driver in the kernel".

I don't currently have any good ideas about the bug you're seeing here however.  If you could manage to boot with "nomodeset 3", and login via ssh and run "modprobe -r nouveau; modprobe nouveau modeset=1", you might be able to get more complete logs which could help.
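
For reference, a minimal sketch of the reload step suggested above. It assumes the machine was booted with "nomodeset 3" appended to the kernel command line and that the commands run as root over ssh. The function name reload_nouveau and the DRY_RUN variable are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Sketch only: reload nouveau with kernel modesetting enabled.
# Assumption: booted with "nomodeset 3" on the kernel command line, run as root.
# reload_nouveau and DRY_RUN are hypothetical names used for this example.
reload_nouveau() {
    # DRY_RUN defaults to 1, which only prints the commands.
    # Set DRY_RUN=0 to actually run them (this may hang the machine).
    for cmd in "modprobe -r nouveau" "modprobe nouveau modeset=1"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        else
            $cmd || return 1
        fi
    done
}
```

Calling reload_nouveau with no arguments prints the two modprobe commands. Running it with DRY_RUN=0 executes them, so it is best done from a second machine over ssh with a log capture already in place.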

Comment 8 Ankur Sinha (FranciscoD) 2017-05-01 12:18:17 UTC
I'll do that and get back to you. Thank you for confirming. 

As a last check, I installed the proprietary nvidia driver off rpmfusion, and my system boots with the f26 kernels:

[asinha@ankur  ~]$ uname -a
Linux ankur.pc 4.11.0-0.rc8.git0.1.fc26.x86_64 #1 SMP Mon Apr 24 15:42:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I'll go uninstall these drivers and try to get you more info.

Cheers!

Comment 9 Ankur Sinha (FranciscoD) 2017-05-01 13:26:48 UTC
I booted the system with "nomodeset 3" and that got me to a login prompt. Then I used a different machine to ssh in and ran:

sudo modprobe -r nouveau
sudo modprobe nouveau modeset=1

The ssh login hung - I couldn't do anything there. On the main machine, I got some output before it hung. Images attached.

Comment 10 Ankur Sinha (FranciscoD) 2017-05-01 13:27:42 UTC
Created attachment 1275407 [details]
image 1 showing output after "modprobe nouveau modeset=1"

Comment 11 Ankur Sinha (FranciscoD) 2017-05-01 13:28:13 UTC
Created attachment 1275408 [details]
image 2 showing output after "modprobe nouveau modeset=1"

Comment 12 Ankur Sinha (FranciscoD) 2017-05-01 13:31:13 UTC
Created attachment 1275409 [details]
image 3 showing output after "modprobe nouveau modeset=1"

Then I had to force the system off. If there's any more debug info I can collect, please let me know. 

This is the hardware (info from working f25 kernel):

[asinha@ankur  ~]$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 18)
        Subsystem: Dell Device 044d
        Kernel driver in use: i915
--
01:00.0 VGA compatible controller: NVIDIA Corporation GT218M [GeForce 310M] (rev ff)
        Kernel driver in use: nouveau
        Kernel modules: nouveau
[asinha@ankur  ~]$ 

In the list of codenames[1], it's "NVA8 (GT218)"

[1] https://nouveau.freedesktop.org/wiki/CodeNames/

Comment 13 Ben Skeggs 2017-05-01 14:20:45 UTC
Hmm, I think there's been a few other mentions of issues on GT21x hardware recently.  I wonder if this is related somehow.  Unfortunately, no changes come to mind as obvious culprits.

Ok, I don't suppose journalctl output has any more information on the lockup, full backtrace etc?

netconsole[1] is another option for getting a better log in these situations.

Thanks,
Ben.

[1] https://www.kernel.org/doc/Documentation/networking/netconsole.txt
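
A rough sketch of how the netconsole module parameter could be assembled, following the format described in the document linked above: netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The helper name netconsole_param, the interface name and the IP address are placeholders and do not come from this report.

```shell
#!/bin/sh
# Sketch only: build a netconsole= parameter string.
# netconsole_param is a hypothetical helper; eth0 and 192.168.1.20 below are placeholders.
netconsole_param() {
    dev=$1                 # network interface on the machine that hangs
    tgt_ip=$2              # IP address of the machine receiving the log
    tgt_port=${3:-6666}    # receiver UDP port (6666 is the documented default)
    tgt_mac=${4:-}         # receiver MAC address; left empty means broadcast
    # Source port and source IP are left empty so the kernel picks its defaults.
    printf 'netconsole=@/%s,%s@%s/%s\n' "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}
```

For example, netconsole_param eth0 192.168.1.20 prints netconsole=@/eth0,6666@192.168.1.20/, which could be passed to "modprobe netconsole" on the failing machine. The receiving machine needs a UDP listener on that port; the linked document suggests netcat, whose exact flags depend on the installed variant.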

Comment 14 Ben Skeggs 2017-05-01 14:21:20 UTC
Also, what kernel were you using before the update to F26?

Comment 15 Ankur Sinha (FranciscoD) 2017-05-01 20:22:13 UTC
(In reply to Ben Skeggs from comment #13)
> Hmm, I think there's been a few other mentions of issues on GT21x hardware
> recently.  I wonder if this is related somehow.  Unfortunately, no changes
> come to mind as obvious culprits.
> 
> Ok, I don't suppose journalctl output has any more information on the
> lockup, full backtrace etc?

No - it seems to have cut off before the cpu lock up began.

> 
> netconsole[1] is another option for getting a better log in these situations.
> 
> Thanks,
> Ben.
> 
> [1] https://www.kernel.org/doc/Documentation/networking/netconsole.txt

I'll go look into netconsole and try to get more info. (I haven't used it before.)

(In reply to Ben Skeggs from comment #14)
> Also, what kernel were you using before the update to F26?

This is the one I was on, and have kept around to use my system:
[asinha@ankur  ~]$ uname -a
Linux ankur.pc 4.10.6-200.fc25.x86_64 #1 SMP Mon Mar 27 14:06:23 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Comment 16 Ankur Sinha (FranciscoD) 2017-05-02 18:27:30 UTC
Created attachment 1275727 [details]
netconsole output

I got netconsole working and then did the ssh bit again:

nomodeset 3
modprobe -r nouveau
modprobe nouveau nomodeset=1

The log is attached. I didn't know when to stop so I let it run for a bit and stopped it when I felt it was repeating itself.

I hope that helps some. Please let me know if there's any more debug info I can collect.

Comment 17 Ben Skeggs 2017-05-02 20:43:59 UTC
(In reply to Ankur Sinha (FranciscoD) from comment #16)
> Created attachment 1275727 [details]
> netconsole output
> 
> I got netconsole working and then did the ssh bit again:
> 
> nomodeset 3
> modprobe -r nouveau
> modprobe nouveau nomodeset=1
> 
> The log is attached. I didn't know when to stop so I let it run for a bit
> and stopped it when I felt it was repeating itself.
> 
> I hope that helps some. Please let me know if there's any more debug info I
> can collect.

Ah, thank you!  I have an idea of what's going on here now.  I should be able to reproduce a similar situation and debug the problem.

Comment 18 Ben Skeggs 2017-05-04 08:56:34 UTC
I've posted a scratch kernel build[1] that has a test implementation of a fix for your issue.  It also contains some extra debug output, as there are a number of confusing aspects to the bug you're seeing (the main one being: 4.10 should be affected too, but it's clearly not somehow).

Regardless of whether or not the test kernel works for you, can you please attach the kernel log from the test kernel.  I would also be very interested to see a kernel log from the working 4.10 kernel too.

Thanks,
Ben.

[1] https://koji.fedoraproject.org/koji/taskinfo?taskID=19392593

Comment 19 Ankur Sinha (FranciscoD) 2017-05-07 18:30:19 UTC
Created attachment 1276885 [details]
netconsole output for 4.10.6-200.fc25 (which works)

I've used the same method - booted with "nomodeset 3" and then "rmmod nouveau; modprobe nouveau modeset=1" via ssh.

Comment 20 Ankur Sinha (FranciscoD) 2017-05-07 18:31:50 UTC
Created attachment 1276886 [details]
netconsole output for 4.11.0-1 test build

Basically hung after "modprobe nouveau modeset=1" and rebooted, so I tried it again and it did the same thing. (The log will show two boots as a result.)

Comment 21 Ankur Sinha (FranciscoD) 2017-06-11 21:34:34 UTC
Hi,

Just dropping a note. 
It's still happening with kernel-core-4.11.4-300.fc26.x86_64

Cheers,
Ankur

Comment 22 Ben Skeggs 2017-07-20 02:12:10 UTC
(In reply to Ankur Sinha (FranciscoD) from comment #21)
> Hi,
> 
> Just dropping a note. 
> It's still happening with kernel-core-4.11.4-300.fc26.x86_64
> 
> Cheers,
> Ankur

Well, the particular softlockup that was first reported is gone with the test build at least.  Now I'm partially clueless as to what else could possibly be going on.  The most interesting thing that stands out is that in the failing cases, we're only detecting 16MiB of VRAM instead of 512MiB...

I've seen some reports that other similar looking issues magically disappeared in 4.12.  Could you quickly test one of the 4.12/4.13 builds from f27 and see if they help you?

Comment 23 Ankur Sinha (FranciscoD) 2017-07-20 09:47:20 UTC
I tried out kernel-core-4.13.0-0.rc1.git0.1.fc27.x86_64 from the kernel-nodebug repo. I got as far as gdm here, but on attempting to log in, it would hang. Here's what journalctl caught:

Jul 20 11:37:04 ankur.pc kernel: BUG: unable to handle kernel paging request at ffffffffc03c6f47
Jul 20 11:37:04 ankur.pc kernel: IP: report_bug+0x94/0x120
Jul 20 11:37:04 ankur.pc kernel: PGD 1e8e0c067
Jul 20 11:37:04 ankur.pc kernel: P4D 1e8e0c067
Jul 20 11:37:04 ankur.pc kernel: PUD 1e8e0e067
Jul 20 11:37:04 ankur.pc kernel: PMD 23d944067
Jul 20 11:37:04 ankur.pc kernel: PTE 800000023d087161
Jul 20 11:37:04 ankur.pc kernel:
Jul 20 11:37:04 ankur.pc kernel: Oops: 0003 [#1] SMP
Jul 20 11:37:04 ankur.pc kernel: Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilt
Jul 20 11:37:04 ankur.pc kernel:  mac80211 wmi_bmof intel_ips i2c_i801 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 snd_hda_codec_idt videobuf2_core snd_hda
Jul 20 11:37:04 ankur.pc kernel: CPU: 3 PID: 43 Comm: kworker/3:1 Not tainted 4.13.0-0.rc1.git0.1.fc27.x86_64 #1
Jul 20 11:37:04 ankur.pc kernel: Hardware name: Dell Inc. Vostro 3400/08YN7X, BIOS A10 10/25/2010
Jul 20 11:37:04 ankur.pc kernel: Workqueue: pm pm_runtime_work
Jul 20 11:37:04 ankur.pc kernel: task: ffffa0c581dc0000 task.stack: ffffc0e740ddc000
Jul 20 11:37:04 ankur.pc kernel: RIP: 0010:report_bug+0x94/0x120
Jul 20 11:37:04 ankur.pc kernel: RSP: 0018:ffffc0e740ddf870 EFLAGS: 00010002
Jul 20 11:37:04 ankur.pc kernel: RAX: 0000000000000907 RBX: ffffc0e740ddf9d8 RCX: ffffffffc03c6f3d
Jul 20 11:37:04 ankur.pc kernel: RDX: 0000000000000001 RSI: 0000000000000260 RDI: 0000000000000001
Jul 20 11:37:04 ankur.pc kernel: RBP: ffffc0e740ddf890 R08: ffffc0e740de0000 R09: 00000000000002bc
Jul 20 11:37:04 ankur.pc kernel: R10: ffffffff99e06a80 R11: ffffffffc071f650 R12: ffffffffc03a9be9
Jul 20 11:37:04 ankur.pc kernel: R13: ffffffffc03c0a7c R14: 0000000000000004 R15: ffffc0e740ddf9d8
Jul 20 11:37:04 ankur.pc kernel: FS:  0000000000000000(0000) GS:ffffa0c58bd80000(0000) knlGS:0000000000000000
Jul 20 11:37:04 ankur.pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 20 11:37:04 ankur.pc kernel: CR2: ffffffffc03c6f47 CR3: 00000002404d9000 CR4: 00000000000006e0
Jul 20 11:37:04 ankur.pc kernel: Call Trace:
Jul 20 11:37:04 ankur.pc kernel:  ? drm_calc_vbltimestamp_from_scanoutpos+0x299/0x330 [drm]
Jul 20 11:37:04 ankur.pc kernel:  fixup_bug+0x2e/0x50
Jul 20 11:37:04 ankur.pc kernel:  do_trap+0x119/0x150
Jul 20 11:37:04 ankur.pc kernel:  do_error_trap+0x89/0x110
Jul 20 11:37:04 ankur.pc kernel:  ? drm_calc_vbltimestamp_from_scanoutpos+0x299/0x330 [drm]
Jul 20 11:37:04 ankur.pc kernel:  ? vsnprintf+0xea/0x4d0
Jul 20 11:37:04 ankur.pc kernel:  do_invalid_op+0x20/0x30
Jul 20 11:37:04 ankur.pc kernel:  invalid_op+0x1e/0x30
Jul 20 11:37:04 ankur.pc kernel: RIP: 0010:drm_calc_vbltimestamp_from_scanoutpos+0x299/0x330 [drm]
Jul 20 11:37:04 ankur.pc kernel: RSP: 0018:ffffc0e740ddfa80 EFLAGS: 00010086
Jul 20 11:37:04 ankur.pc kernel: RAX: ffffffffc0792b60 RBX: ffffa0c57d202000 RCX: 0000000000000000
Jul 20 11:37:04 ankur.pc kernel: RDX: ffffffffc03c6078 RSI: 0000000000000001 RDI: ffffffffc03c0a9c
Jul 20 11:37:04 ankur.pc kernel: RBP: ffffc0e740ddfae8 R08: 0000000000000000 R09: ffffffffc03a9950
Jul 20 11:37:04 ankur.pc kernel: R10: ffffa0c57d9de888 R11: ffffffffc071f650 R12: 0000000000000000
Jul 20 11:37:04 ankur.pc kernel: R13: ffffa0c57d9de800 R14: ffffc0e740ddfafc R15: ffffc0e740ddfb40
Jul 20 11:37:04 ankur.pc kernel:  ? nouveau_display_vblank_disable+0x30/0x30 [nouveau]
Jul 20 11:37:04 ankur.pc kernel:  ? drm_get_last_vbltimestamp+0x90/0x90 [drm]
Jul 20 11:37:04 ankur.pc kernel:  ? vprintk_emit+0x328/0x390
Jul 20 11:37:04 ankur.pc kernel:  drm_get_last_vbltimestamp+0x56/0x90 [drm]
Jul 20 11:37:04 ankur.pc kernel:  drm_update_vblank_count+0x76/0x270 [drm]
Jul 20 11:37:04 ankur.pc kernel:  drm_vblank_disable_and_save+0x5d/0xd0 [drm]
Jul 20 11:37:04 ankur.pc kernel:  drm_crtc_vblank_off+0xb7/0x210 [drm]
Jul 20 11:37:04 ankur.pc kernel:  ? insert_work+0x52/0xc0
Jul 20 11:37:04 ankur.pc kernel:  nouveau_display_fini+0x5d/0xd0 [nouveau]
Jul 20 11:37:04 ankur.pc kernel:  ? vga_switcheroo_runtime_resume+0x50/0x50
Jul 20 11:37:04 ankur.pc kernel:  nouveau_display_suspend+0x57/0x120 [nouveau]
Jul 20 11:37:04 ankur.pc kernel:  nouveau_do_suspend+0x7d/0x1d0 [nouveau]
Jul 20 11:37:04 ankur.pc kernel:  nouveau_pmops_runtime_suspend+0x59/0xc0 [nouveau]
Jul 20 11:37:04 ankur.pc kernel:  pci_pm_runtime_suspend+0x5f/0x170
Jul 20 11:37:04 ankur.pc kernel:  ? vga_switcheroo_runtime_resume+0x50/0x50
Jul 20 11:37:04 ankur.pc kernel:  vga_switcheroo_runtime_suspend+0x23/0xa0
Jul 20 11:37:04 ankur.pc kernel:  __rpm_callback+0xc2/0x200
Jul 20 11:37:04 ankur.pc kernel:  ? vga_switcheroo_runtime_resume+0x50/0x50
Jul 20 11:37:04 ankur.pc kernel:  rpm_callback+0x24/0x80
Jul 20 11:37:04 ankur.pc kernel:  ? vga_switcheroo_runtime_resume+0x50/0x50
Jul 20 11:37:04 ankur.pc kernel:  rpm_suspend+0x138/0x630
Jul 20 11:37:04 ankur.pc kernel:  pm_runtime_work+0x68/0x90
Jul 20 11:37:04 ankur.pc kernel:  process_one_work+0x193/0x3c0
Jul 20 11:37:04 ankur.pc kernel:  worker_thread+0x4a/0x3a0
Jul 20 11:37:04 ankur.pc kernel:  kthread+0x125/0x140
Jul 20 11:37:04 ankur.pc kernel:  ? process_one_work+0x3c0/0x3c0
Jul 20 11:37:04 ankur.pc kernel:  ? kthread_park+0x60/0x60
Jul 20 11:37:04 ankur.pc kernel:  ret_from_fork+0x25/0x30
Jul 20 11:37:04 ankur.pc kernel: Code: 74 59 0f b7 41 0a 4c 63 69 04 0f b7 71 08 89 c7 49 01 cd 83 e7 01 a8 02 74 15 66 85 ff 74 10 a8 04 ba 01 00 00 00 75 26 83 c8 04 <6
Jul 20 11:37:04 ankur.pc kernel: RIP: report_bug+0x94/0x120 RSP: ffffc0e740ddf870
Jul 20 11:37:04 ankur.pc kernel: CR2: ffffffffc03c6f47
Jul 20 11:37:04 ankur.pc kernel: ---[ end trace 2b46307d28802a2b ]---
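
A small sketch for pulling the nouveau frames out of a journal excerpt like the one above. It assumes the journalctl line layout shown here ("... kernel:  function+0xoff/0xlen [module]") and skips entries marked with "?", which the kernel prints for stack entries it could not verify. The function name nouveau_frames is made up for this example.

```shell
#!/bin/sh
# Sketch only: list the nouveau functions in a kernel call trace read from stdin.
# nouveau_frames is a hypothetical helper; it assumes the journalctl layout shown above.
nouveau_frames() {
    grep -F '[nouveau]' |            # keep frames that belong to the nouveau module
        grep -v 'kernel: *[?] ' |    # drop '?' entries, which are unreliable
        sed -e 's/.*kernel:  *//' -e 's/+0x.*//'   # strip the log prefix and the offsets
}
```

Piping the saved journal text through nouveau_frames prints one function name per line, innermost frame first, in the order the frames appear in the log.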

Comment 24 Ankur Sinha (FranciscoD) 2017-07-20 09:50:29 UTC
Created attachment 1301625 [details]
Full journalctl log for 4.13.0-0.rc1.git0.1

Comment 25 Ankur Sinha (FranciscoD) 2017-07-20 09:52:31 UTC
Created attachment 1301626 [details]
Full journalctl log for 4.13.0-0.rc1.git0.1 with selinux in permissive mode

Saw a few selinux messages in there so switched it to permissive to confirm that the issue still persisted - it did. Full log attached.

Comment 26 Ben Skeggs 2017-07-20 22:39:16 UTC
Ah, that's a separate issue that I'm currently working on fixing.  Some kernel-related stuff changed from underneath us.

It actually looks like your original issue has indeed gone away though, which is encouraging, but confusing, as I can't explain why.  In your case, you should actually be able to boot now with "nouveau.runpm=0" to disable powering down the NVIDIA GPU when it's not in use.  However, you won't be able to use suspend/resume until this is fixed.
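
A minimal sketch for checking whether the running kernel was actually booted with that option. The helper name has_karg is made up for this example; the grubby line in the comment is one way to make the option persistent on Fedora and is an assumption about the setup, not something taken from this report.

```shell
#!/bin/sh
# Sketch only: check whether a given argument is on the kernel command line.
# has_karg is a hypothetical helper. To add the option persistently on Fedora,
# something like this should work (assumption, run as root):
#   grubby --update-kernel=ALL --args="nouveau.runpm=0"
has_karg() {
    arg=$1
    # Use the second argument as the command line if given, else read /proc/cmdline.
    cmdline=${2-$(cat /proc/cmdline)}
    # Pad with spaces so only whole, space-separated arguments match.
    case " $cmdline " in
        *" $arg "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

After rebooting, "has_karg nouveau.runpm=0 && echo present" reports whether the option took effect for the current boot.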

Comment 27 Ben Skeggs 2017-07-24 04:22:59 UTC
I've attempted to guess what I need to backport from 4.12 for you, but as I can't reproduce, I can't confirm.  Can you test this scratch build for me please?

https://koji.fedoraproject.org/koji/taskinfo?taskID=20708110

Thanks,
Ben.

Comment 28 Ankur Sinha (FranciscoD) 2017-07-29 14:09:45 UTC
This scratch build with 4.11.11-300.fc26.x86_64 seems to work!! :D

Comment 29 Ben Skeggs 2017-07-31 00:30:48 UTC
Great!  It's good to have confirmation of the cause, and solution.  Finally.  I originally planned on backporting the fixes to 4.11, however, I've been informed that 4.11 is end-of-life.  It looks like 4.12 builds (which already have the fix) are available in koji for F26 already though, so they should be in updates-testing/updates at some point soon.

Comment 30 Ankur Sinha (FranciscoD) 2018-01-07 21:39:40 UTC
Hi Ben,

I think this is fixed - I haven't had the issue in a while on F27 now. Closing. Please reopen if needed.

Thanks again for fixing it.