Bug 1643302

Summary: locked out: switching ttys and login broken with NVIDIA driver
Product: [Fedora] Fedora
Reporter: Fabio Valentini <decathorpe>
Component: gdm
Assignee: Ray Strode [halfline] <rstrode>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 29
CC: alain.vigne.14, armin, atigro, bugzilla.redhat.com, jadit2, john.j5live, komusubi, marijn, mclasen, rhughes, rstrode, simon.galton
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2018-11-08 03:16:13 UTC
Type: Bug

Description Fabio Valentini 2018-10-25 21:08:54 UTC
After upgrading a working Fedora Workstation 28 system to Fedora 29, I can no longer log into a GNOME session from gdm. Switching ttys doesn't work either. (Maybe the two are related?)

- pressing Ctrl-Alt-FX key combinations has no effect
- logging in as my user results in a black screen; it looks like gdm is killed and nothing else happens

The problem is that I can't really get debug logs from the system, because I'm completely locked out as soon as I try to log in and trigger the error. After hard-resetting the system, there isn't necessarily anything useful in the journal, and I don't want to torture my PC needlessly by repeatedly force-resetting it.


The only useful log messages I could salvage from the lock-up are these, logged when the X server for the user session tried to start:

(EE) NVIDIA(GPU-0): Failed to acquire modesetting permission.
(EE) NVIDIA(0): Failing initialization of X screen 0
(II) UnloadModule: "nvidia"
(II) UnloadSubModule: "glxserver_nvidia"
(II) Unloading glxserver_nvidia
(II) UnloadSubModule: "wfb"
(II) UnloadSubModule: "fb"
(EE) Screen(s) found, but none have a usable configuration.
(EE)
Fatal server error:
(EE) no screens found(EE)

(note the conflicting error messages: Screens found / no screens found)


Version-Release number of selected component (if applicable):
gdm-3.30.1-2.fc29
gnome-shell-3.30.1-1.fc29
mutter-3.30.1-1.fc29
nvidia-driver-410.66-2.fc29.x86_64
xorg-x11-xserver-Xorg-1.20.1-4.fc29

↑ These package versions are from my first attempt to upgrade to Fedora 29. I tried the upgrade again with the packages currently available in f29/updates-testing, and the same error still occurs.


How reproducible:
Always.


Steps to Reproduce:
1. Install (or upgrade to) Fedora 29.
2. Install the NVIDIA drivers (from the negativo17 repo).
3. Reboot.
4. Find yourself locked out of the system by a broken gdm / GNOME session, unable to switch to a tty for recovery and debugging.


Actual results:
I can't use Fedora 29.


Expected results:
After upgrading from Fedora 28 to 29, the system should continue to work.


Additional info:
The NVIDIA drivers aren't to blame here. I have the exact same driver version installed on Fedora 28, and it works flawlessly.

- Enabling / disabling modesetting in the nvidia kernel module on the kernel command line changed nothing.

- Explicitly disabling Wayland for gdm in /etc/gdm/custom.conf changed nothing (it started in X mode anyway, as I could see from the recovered system logs).
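
For reference, the Wayland setting mentioned above can be checked with a small shell helper. This is a minimal sketch: `wayland_disabled` is a hypothetical helper name, and it assumes the usual `WaylandEnable=false` key that gdm reads from the `[daemon]` section of /etc/gdm/custom.conf. It only looks for the key and does not verify which section it sits in.

```shell
# wayland_disabled [FILE]
# Succeeds if FILE (default: /etc/gdm/custom.conf) contains an
# uncommented "WaylandEnable=false" line.
# Hypothetical helper; it does not check that the key is under [daemon].
wayland_disabled() {
    conf=${1:-/etc/gdm/custom.conf}
    grep -Eq '^[[:space:]]*WaylandEnable[[:space:]]*=[[:space:]]*false[[:space:]]*$' "$conf" 2>/dev/null
}

if wayland_disabled; then
    echo "gdm is configured to skip Wayland"
else
    echo "WaylandEnable=false is not set"
fi
```

A commented-out line such as `#WaylandEnable=false` does not count as set, which matches how the shipped default file usually looks.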

Comment 1 Mateusz Mikuła 2018-10-28 12:12:43 UTC
I cannot reproduce it on Rawhide with RPM Fusion Nvidia drivers but on Fedora 29 (both RPM Fusion and Negativo17) I can consistently reproduce it with `rhgb` present on the kernel command line and kernels 4.18.11 or newer.

I can successfully log in with kernel 4.18.10 and `rhgb` enabled, or with kernel >= 4.18.11 and `rhgb` disabled.

Comment 2 Mateusz Mikuła 2018-10-28 12:21:25 UTC
(In reply to Mateusz Mikuła from comment #1)
> I cannot reproduce it on Rawhide with RPM Fusion Nvidia drivers but on
> Fedora 29 (both RPM Fusion and Negativo17) I can consistently reproduce it
> with `rhgb` present on the kernel command line and kernels 4.18.11 or newer.
> 
> I can successfully log in with kernel 4.18.10 and `rhgb` enabled or kernel
> >= 4.18.11 and `rhgb` disabled.

My bad, Rawhide doesn't use `rhgb` by default; it's the same as Fedora 29.

Comment 3 Fabio Valentini 2018-10-28 16:45:14 UTC
I can confirm that dropping rhgb from the kernel command line fixes the tty switching and login issue.
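
One way to drop `rhgb` on Fedora is through `grubby`. The sketch below assumes `grubby` is installed on the system and uses its `--update-kernel` and `--remove-args` options. `remove_karg` is a hypothetical wrapper, and the `APPLY` variable is an invention of this example: unless `APPLY=1` is set, the function only prints the command, so nothing changes by accident.

```shell
# remove_karg ARG
# Print the grubby command that removes ARG from every installed
# kernel's boot entry; run it for real only when APPLY=1 is set.
# (remove_karg and APPLY are hypothetical names for this sketch.)
remove_karg() {
    if [ "${APPLY:-0}" = 1 ]; then
        sudo grubby --update-kernel=ALL --remove-args="$1"
    else
        echo "would run: sudo grubby --update-kernel=ALL --remove-args=$1"
    fi
}

remove_karg rhgb
```

Calling it as `APPLY=1 remove_karg rhgb` would perform the change. The running kernel's /proc/cmdline only reflects it after the next reboot.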

Comment 4 Michael Cronenworth 2018-10-31 14:47:41 UTC
FYI: I'm not seeing this problem on my setup. The 'rhgb' option is present.

Card: 750 Ti
Kernel: 4.18.16-300.fc29.x86_64
Driver: 410.73

Comment 5 Fabio Valentini 2018-10-31 15:23:46 UTC
I checked with the latest packages from today (Oct 31).

With this cmdline, the system works fine:

BOOT_IMAGE=/vmlinuz-4.18.16-300.fc29.x86_64 root=UUID=(...) ro rootflags=subvol=fedora-29 rd.driver.blacklist=nouveau nvidia-drm.modeset=0 resume=UUID=(...) rcu_nocbs=1-11 quiet

(/ is a btrfs subvolume; I disabled modesetting because wayland doesn't quite work right yet; rcu_nocbs=1-11 works around a ryzen 1st gen hardware bug)

Simply adding "rhgb" at the end still breaks tty switching and logging in.

Card:   GTX 1070
Kernel: 4.18.16-300.fc29.x86_64
Driver: 410.73 (nvidia-driver-410.73-4.fc29.x86_64)

Comment 6 Marijn Oosterveld 2018-11-01 12:27:15 UTC
(In reply to Fabio Valentini from comment #5)
> I checked with the latest packages from today (Oct 31).
> 
> With this cmdline, the system works fine:
> 
> BOOT_IMAGE=/vmlinuz-4.18.16-300.fc29.x86_64 root=UUID=(...) ro
> rootflags=subvol=fedora-29 rd.driver.blacklist=nouveau nvidia-drm.modeset=0
> resume=UUID=(...) rcu_nocbs=1-11 quiet
> 
> (/ is a btrfs subvolume; I disabled modesetting because wayland doesn't
> quite work right yet; rcu_nocbs=1-11 works around a ryzen 1st gen hardware
> bug)
> 
> Simply adding "rhgb" at the end still breaks tty switching and logging in.
> 
> Card:   GTX 1070
> Kernel: 4.18.16-300.fc29.x86_64
> Driver: 410.73 (nvidia-driver-410.73-4.fc29.x86_64)

I have the same hardware and kernel as you, but removing rhgb did not work for me.

Comment 7 Mateusz Mikuła 2018-11-03 16:29:31 UTC
(In reply to Marijn Oosterveld from comment #6)
> I have the same hardware and kernel as you, but removing rhgb did not work
> for me.

Multiple Nvidia users on Reddit reported that this workaround works. Could you make sure `rhgb` was actually removed by running `cat /proc/cmdline`?
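
A plain `grep rhgb` on /proc/cmdline would also match longer words that merely contain "rhgb", so a whole-word check is a little safer. The helper below is a minimal sketch; `has_karg` is a hypothetical name, and the second parameter exists only so the function can be tried on a sample string instead of the running kernel's command line.

```shell
# has_karg ARG [CMDLINE]
# Succeeds if ARG appears as a whole word in the kernel command line.
# CMDLINE defaults to the contents of /proc/cmdline.
# (has_karg is a hypothetical helper, not part of any package.)
has_karg() {
    arg=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $arg "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample string, not a real boot entry:
has_karg rhgb "ro quiet rhgb" && echo "rhgb present"
```

On an affected machine, `has_karg rhgb || echo "rhgb is gone"` run after a reboot confirms that the workaround is really in effect.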

Comment 8 simon.galton 2018-11-04 00:51:33 UTC
(In reply to Mateusz Mikuła from comment #7)
> Multiple users with Nvidia reported this workaround as working on Reddit,
> could you make sure `rhgb` was removed by running `cat /proc/cmdline`?

I also had this problem, even after removing 'rhgb'.  I was able to get it to work by setting the nvidia-drm.modeset value to 1:

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.18.16-300.fc29.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap quiet rd.driver.blacklist=nouveau nvidia-drm.modeset=1
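
Since reports in this thread differ on `nvidia-drm.modeset`, it can help to print the value the running kernel was actually booted with. The following is a sketch under stated assumptions: `karg_value` is a hypothetical helper, it reports the last occurrence of the key on the command line, and it says nothing about how the driver itself interprets the setting.

```shell
# karg_value KEY [CMDLINE]
# Print the value of the last KEY=value word on the kernel command line
# and succeed; fail without output if KEY=... is absent.
# CMDLINE defaults to the contents of /proc/cmdline.
# The body runs in a subshell so that "set -f" does not leak out.
karg_value() (
    key=$1
    cmdline=${2-$(cat /proc/cmdline)}
    set -f                      # no filename globbing while splitting words
    found=1
    for word in $cmdline; do
        case $word in
            "$key"=*) value=${word#"$key"=}; found=0 ;;
        esac
    done
    if [ "$found" -eq 0 ]; then
        printf '%s\n' "$value"
    fi
    [ "$found" -eq 0 ]
)

# Sample string modelled on the command line above; prints 1
karg_value nvidia-drm.modeset "quiet rd.driver.blacklist=nouveau nvidia-drm.modeset=1"
```

Without a second argument, `karg_value nvidia-drm.modeset` reads /proc/cmdline and fails quietly when the option is not set at all.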

Comment 9 Alain V. 2018-11-04 08:12:32 UTC
I have the same issue.

Card: GeForce GTX 960
Kernel: 4.18.16-300.fc29.x86_64
Driver: 410.73

Removing 'rhgb' is the workaround for me. No need for nvidia-drm.modeset=1.

Comment 10 Ray Strode [halfline] 2018-11-04 12:26:45 UTC
my guess is we need this fix in plymouth https://gitlab.freedesktop.org/plymouth/plymouth/commit/89283f38b04a6543484b35576af296651bc3c0ba 

will look tomorrow

Comment 11 Armin B. 2018-11-05 10:49:22 UTC
+1 on removing rhgb. System details:

# 03:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 980] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: eVga.com. Corp. Device 2983
	Flags: bus master, fast devsel, latency 0, IRQ 133
	Memory at de000000 (32-bit, non-prefetchable) [size=16M]
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express <?>
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

# uname -a
Linux bluelion 4.18.16-300.fc29.x86_64 #1 SMP Sat Oct 20 23:24:08 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 12 Ray Strode [halfline] 2018-11-05 21:00:29 UTC
can you guys try this update:

https://bodhi.fedoraproject.org/updates/FEDORA-2018-89d998abe5

make sure to run 

# dracut -f

after installing it to rebuild the initramfs

Comment 13 Mateusz Mikuła 2018-11-06 10:00:29 UTC
FEDORA-2018-89d998abe5 works fine with Negativo17 Nvidia drivers (modeset enabled).

Comment 14 Fedora Update System 2018-11-06 22:01:39 UTC
plymouth-0.9.4-1.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-89d998abe5

Comment 15 Fabio Valentini 2018-11-07 10:08:43 UTC
Yep, this update seems to fix the issue.

Comment 16 Fedora Update System 2018-11-08 03:16:13 UTC
plymouth-0.9.4-1.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 17 Alain V. 2018-11-08 07:48:02 UTC
I confirm my issue is solved by this change, which arrived along with the other updates pulled in by "dnf update" today.

Thank you.
Alain