Bug 1292305 - blacklisting nouveau module cause kernel to hang
Status: CLOSED EOL
Product: Fedora
Classification: Fedora
Component: kernel
Version: 22
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance

Reported: 2015-12-16 19:08 EST by Roy A. Gilmore
Modified: 2016-07-19 14:56 EDT
Doc Type: Bug Fix
Last Closed: 2016-07-19 14:56:38 EDT
Type: Bug

Attachments
File from akmod-nvidia (185 bytes, text/plain)
2015-12-17 15:19 EST, Roy A. Gilmore
File from nvidia installer (76 bytes, text/plain)
2015-12-17 15:22 EST, Roy A. Gilmore
File to blacklist BOTH nouveau AND i915 (83 bytes, text/plain)
2016-01-01 22:43 EST, Roy A. Gilmore

Description Roy A. Gilmore 2015-12-16 19:08:10 EST
Description of problem:
The nouveau driver is defective, so I need to use the commercial nVidia driver. When the nouveau driver is blacklisted, the kernel hangs shortly after loading the drm module, even in single-user mode. It hangs before reaching the point where any logs are saved.


Version-Release number of selected component (if applicable):
All versions

How reproducible:
Every time

Steps to Reproduce:
1. Place a .conf file in /etc/modprobe.d or /usr/lib/modprobe.d that blacklists the nouveau module.
2. Run mkinitrd -f or dracut -f.
3. Reboot.
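
The steps above can be sketched as a small shell helper. This is only an illustration: the function name `write_blacklist` and the file name `blacklist-custom.conf` are hypothetical, and the output is not meant to reproduce any of the attached files. The target directory is a parameter so the helper can be tried in a scratch directory before touching /etc/modprobe.d.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): write a modprobe.d file that
# blacklists each module named on the command line.
# Usage: write_blacklist DIR MODULE...
write_blacklist() {
    dir=$1
    shift
    conf="$dir/blacklist-custom.conf"   # file name is an assumption
    : > "$conf"                         # create or truncate the file
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod" >> "$conf"
    done
    printf '%s\n' "$conf"               # report the path that was written
}

# Example against a scratch directory. On a real system DIR would be
# /etc/modprobe.d (as root), followed by "dracut -f" and a reboot.
scratch=$(mktemp -d)
write_blacklist "$scratch" nouveau
cat "$scratch/blacklist-custom.conf"
```

Passing more than one module name (for example `nouveau i915`) writes one `blacklist` line per module into the same file.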

Actual results:
The system hangs shortly after loading the drm module. A few USB modules sometimes load before the hang, but they aren't loaded consistently and probably have nothing to do with the problem, since it only happens when nouveau is blacklisted.

Expected results:
The system boots correctly.

Additional info:
I cannot find any configuration files that bring in nouveau. Is it hardcoded? Is this just a misguided way to prevent the commercial nVidia driver from being used? This makes no sense to me. Why should it matter if the nouveau driver is blacklisted in single-user mode? Any generic VGA driver should be usable in single-user mode.
Comment 1 Justin M. Forbes 2015-12-17 11:17:54 EST
This hang is without the proprietary driver installed and trying to load?

No, driver loading is not some method to prevent you from doing what you want to do; it is all based on PCI ID. This is what saves you from having to write config files for every piece of hardware on your system in order to make things work: the kernel probes the PCI tree and loads the appropriate driver for every device, if such a driver exists. If you have nouveau blacklisted, it should not be loaded. If the hang is without the nvidia driver installed, we need to look into that. If the hang is with the nvidia driver installed, you will need to close this bug and ask them to look into that.

Also of note, if nouveau is defective in some way, a bug should be opened for that so the issue can be fixed.
Comment 2 Roy A. Gilmore 2015-12-17 15:14:28 EST
This is without the proprietary driver installed.

I have tried the akmod-nvidia rpm from rpmfusion-nonfree-updates. When I couldn't get the akmod-nvidia rpm to work, I copied the /usr/lib/modprobe.d/blacklist-nouveau.conf file, uninstalled all the nvidia rpms, put the copy of blacklist-nouveau.conf back in /usr/lib/modprobe.d, ran depmod and mkinitrd, and rebooted. The system hangs with NO nvidia drivers installed.

I also tried the driver directly from nVidia. When installing nVidia's driver, it detects that nouveau is loaded and offers to create the following file(s): /{etc,usr/lib}/modprobe.d/nvidia-installer-disable-nouveau.conf (both files are identical). It then prompts you to run whatever commands are required on your distribution to generate a new initrd and to reboot before continuing installation. At this point, nothing has been installed other than the 2 files. I ran depmod and mkinitrd and rebooted; again, the system hangs with NO nvidia drivers installed.

I removed the 2 nvidia files from /{etc,usr/lib}/modprobe.d and ran "modprobe -c | grep nouveau" and only found the following two lines:

alias pci:v000010DEd*sv*sd*bc03sc*i* nouveau
alias pci:v000012D2d*sv*sd*bc03sc*i* nouveau

so, I temporarily moved the /lib/modules/$(uname -r)/kernel/drivers/gpu/drm/nouveau directory to my home directory, and ran depmod and "modprobe -c | grep nouveau" again, and got no entries for nouveau. Since there were no entries for nouveau, I went ahead and ran mkinitrd and rebooted, system hangs with NO nvidia OR nouveau drivers installed.
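
The alias check described above can be scripted. The following is a minimal sketch, assuming a hypothetical helper name `pci_vendors_for`; it reads `modprobe -c` style output on stdin and prints the PCI vendor ID field of each alias that points at a given module. The sample input is the pair of alias lines quoted above (10DE is NVIDIA's PCI vendor ID).

```shell
#!/bin/sh
# Hypothetical helper: read "modprobe -c" style output on stdin and print the
# PCI vendor ID of every alias that maps to the given module.
# Usage: modprobe -c | pci_vendors_for nouveau
pci_vendors_for() {
    awk -v mod="$1" '
        $1 == "alias" && $3 == mod && $2 ~ /^pci:v/ {
            # The alias pattern looks like pci:v000010DEd*...; the vendor ID
            # is the 8 hex digits that follow "pci:v".
            print substr($2, 6, 8)
        }'
}

# Example using the two alias lines quoted in this comment.
printf '%s\n' \
    'alias pci:v000010DEd*sv*sd*bc03sc*i* nouveau' \
    'alias pci:v000012D2d*sv*sd*bc03sc*i* nouveau' |
    pci_vendors_for nouveau
```

If the helper prints nothing for `nouveau` after running depmod, modprobe no longer has a PCI alias that would auto-load that module.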

With no drivers installed, shouldn't it fall back to generic VGA? I don't think there should be ANY dependencies on video drivers in single user mode.

I will attach the files that akmod-nvidia and the "official" nVidia installer install so you can look at them. But I don't think they're at fault.

The nouveau driver has long-standing known bugs that cause random lockups (see: https://fedoraproject.org/wiki/Common_kernel_problems#Systems_with_nVidia_adapters_using_the_nouveau_driver_lock_up_randomly). So far the only workaround is to pass noaccel=1 to the nouveau driver, which does stop the lockups but causes other issues. Also, nouveau 3D acceleration has never worked properly. These bugs have existed for years and have been reported, and I have had to use the proprietary nVidia driver in the past because fixing the nouveau driver is apparently not a high priority. But about six weeks ago I started getting the kernel hangs with the nouveau driver blacklisted, and I have been forced to use the crappy, bug-ridden nouveau driver.
Comment 3 Roy A. Gilmore 2015-12-17 15:19 EST
Created attachment 1106843
File from akmod-nvidia

File installed by akmod-nvidia from rpmfusion-nonfree-updates
Comment 4 Roy A. Gilmore 2015-12-17 15:22 EST
Created attachment 1106845
File from nvidia installer

File from nvidia "official" installer
Comment 5 Roy A. Gilmore 2016-01-01 22:43 EST
Created attachment 1110965
File to blacklist BOTH nouveau AND i915

Had to blacklist BOTH nouveau AND i915 to get kernel to stop hanging.
Comment 6 Roy A. Gilmore 2016-01-01 23:33:07 EST
(In reply to Roy A. Gilmore from comment #5)
> Created attachment 1110965
> File to blacklist BOTH nouveau AND i915
> 
> Had to blacklist BOTH nouveau AND i915 to get kernel to stop hanging.

After some more poking around, I discovered that by also blacklisting the built-in graphics adapter, I could disable the nouveau module.

Motherboard: ASUS SABERTOOTH Z77
Graphics Card: GTX660 TI-DC2O-2GD5

The motherboard has an "Integrated Graphics Processor". The BIOS provides no way to disable it, but I can select which adapter will be the primary. I have the external card selected as primary.

[root@thor ~]# lspci -s 00:02.0 -v
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09) (prog-if 00 [VGA controller])
	DeviceName:  Onboard IGD
	Subsystem: ASUSTeK Computer Inc. P8 series motherboard
	Flags: fast devsel, IRQ 11
	Memory at f7400000 (64-bit, non-prefetchable) [disabled] [size=4M]
	Memory at d0000000 (64-bit, prefetchable) [disabled] [size=256M]
	I/O ports at f000 [disabled] [size=64]
	Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 2
	Capabilities: [a4] PCI Advanced Features
	Kernel modules: i915

[root@thor ~]# lspci -s 01:00.0 -v
01:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 660 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Device 841f
	Flags: bus master, fast devsel, latency 0, IRQ 11
	Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
	Memory at e0000000 (64-bit, prefetchable) [size=128M]
	Memory at e8000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at f7000000 [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Kernel modules: nouveau

If I blacklist the nouveau module and leave the i915 module enabled, the system hangs shortly after loading the drm module. If I disable both the nouveau module and the i915 module, the system boots correctly. Why would the i915 module care if the nouveau module was blacklisted? And, this is relatively recent behavior, I didn't use to have to blacklist the i915 module. Fortunately, I don't have a second monitor connected to the internal graphics adapter. So I can get away with blacklisting the i915 module, but, these are two physically separate graphics adapters, there should be NO dependencies between them.
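
To see at a glance which kernel module lspci associates with each graphics adapter, the output above can be filtered. This is a minimal sketch with a hypothetical helper name, `vga_modules`; it only parses the text format shown in the two lspci listings above and does not query any hardware itself.

```shell
#!/bin/sh
# Hypothetical helper: read "lspci -v" output on stdin and print, for each
# VGA controller, its PCI slot and the kernel module(s) lspci lists for it.
# Usage: lspci -v | vga_modules
vga_modules() {
    awk '
        /^[0-9a-f]/ {                      # an unindented line starts a device
            slot = $1
            is_vga = ($0 ~ /VGA compatible controller/)
        }
        is_vga && /^[ \t]+Kernel modules:/ {
            sub(/^[ \t]+Kernel modules:[ \t]*/, "")
            print slot, $0
        }'
}

# Example with shortened excerpts of the two listings above.
printf '%s\n\t%s\n%s\n\t%s\n' \
    '00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)' \
    'Kernel modules: i915' \
    '01:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 660 Ti] (rev a1)' \
    'Kernel modules: nouveau' |
    vga_modules
```

For the excerpt above this prints `00:02.0 i915` and `01:00.0 nouveau`, one per line, matching the "Kernel modules" entries in the full listings.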
Comment 7 Justin M. Forbes 2016-01-07 00:03:07 EST
BIOS update recently? Usually, when running an external graphics card, you disable the i915 in the BIOS unless you are in a laptop situation where you have power concerns. It sounds like the i915 is enabled in the BIOS somehow. When the kernel sees it can't load support for the nvidia card, it assumes the i915 is the primary. If your BIOS is set correctly, it will not even see the i915 device.
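
One way to check which display devices the kernel actually sees, and which one the firmware treated as the boot VGA device, is to read sysfs. The sketch below is an assumption-laden illustration, not something from this thread: the helper name `list_display_devices` is hypothetical, and it relies on the `class`, `driver`, and `boot_vga` entries that sysfs exposes for PCI devices. The sysfs root is a parameter so the logic can be exercised against a scratch tree.

```shell
#!/bin/sh
# Hypothetical helper: for every PCI display device under a sysfs root, print
# its address, the bound driver (or "none"), and the boot_vga flag.
# Usage: list_display_devices [SYSFS_ROOT]   (default: /sys)
list_display_devices() {
    root=${1:-/sys}
    for dev in "$root"/bus/pci/devices/*; do
        [ -r "$dev/class" ] || continue
        case $(cat "$dev/class") in
            0x03*) ;;                 # PCI base class 0x03 = display controller
            *) continue ;;
        esac
        if [ -L "$dev/driver" ]; then
            # "driver" is a symlink to the bound driver's directory
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv=none
        fi
        boot=$(cat "$dev/boot_vga" 2>/dev/null || echo '?')
        printf '%s driver=%s boot_vga=%s\n' "${dev##*/}" "$drv" "$boot"
    done
}

list_display_devices
```

A device that shows `driver=none` has no kernel driver bound to it, which is what one would expect for a blacklisted module's adapter.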
Comment 8 Roy A. Gilmore 2016-01-07 01:34:06 EST
The latest bios is version 2104 which was released on 2013/09/16, I have been running this bios since it was released. There is no way to completely disable the i915 in bios. I can select which is the primary graphics adapter, and I have selected the PCIe card as the primary graphics adapter. So, yes the i915 is enabled in bios, but, it is set as the secondary adapter. Setting the external card as the primary adapter is as "set correctly" as can be done on this bios. But, I don't think this is a bios issue. This setting worked fine until a little over a month before I filed this bug report (around the time f23 was released).

Why should the kernel care about having multiple graphics adapters? Lots of systems have multiple graphics adapters, lots of systems have multiple monitors. The kernel should not be making a policy decision that is more appropriately handled by configuration. It should NOT assume the i915 is primary. If the kernel can't load the nouveau driver it should fall back to plain vanilla vga on the external card.

As I stated in a previous message, I temporarily moved the nouveau driver completely out of the tree and re-ran depmod and mkinitrd. By checking modprobe -c and lsinitrd, as far as I could tell, the kernel had no idea that the nouveau had ever existed at that point. But, it still hung, soon after loading the drm module. When I also blacklisted the i915 module, then the nVidia adapter did fall back to plain vanilla vga.

Unless there is some hard-coded dependency that I'm missing, the i915 graphics adapter and the nVidia graphics adapter are physically two separate graphics adapters and should have ZERO dependencies on each other. If either adapter fails, the other adapter should still work. If both adapters are enabled in bios, the kernel should honor the bios settings, and load the available drivers for BOTH adapters, EVEN if it has to fall back to plain vanilla vga for one OR both of the adapters. KVM is fundamental functionality. The kernel should be tolerant of KVM issues, and find a way to work with anything short of a catastrophic hardware failure. Especially in single-user mode.
Comment 9 Justin M. Forbes 2016-01-07 11:26:47 EST
Okay, so there wasn't a BIOS change. I was asking because my non-laptop systems here with both radeon and nvidia cards don't even show the i915 card physically on the bus unless I enable it in the BIOS.
Silly question, are you sure that the system is actually hung? Can you ssh into it?  Or, if you were to plug a monitor into the i915 does it actually appear to boot correctly?  It sounds like the system starts DRM initialization, sees no appropriate driver for the "primary" but finds one for the secondary and switches the console to that. While this might not be what you want, it makes the most sense from a system standpoint as i915 should be much faster than software and default vga.
Also, on your blacklisting of nouveau, you left out the step of adding the following to the kernel command line in your grub.conf: "rd.driver.blacklist=nouveau".
While we cannot actually support problems with the nvidia proprietary driver, their use is well documented, and googling fedora nvidia driver install might get you some current guides that would walk you through it all.
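
Whether a given boot actually carried that argument can be checked from the kernel command line. The following is a minimal sketch under stated assumptions: the helper name `cmdline_blacklists` is hypothetical, and it assumes `rd.driver.blacklist=` takes a comma-separated list of module names. On Fedora, one possible way to add the argument persistently is `grubby --update-kernel=ALL --args="rd.driver.blacklist=nouveau"`; that command is a suggestion here and was not given in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel command line blacklists MODULE via
# rd.driver.blacklist= (assumed to accept a comma-separated module list).
# Usage: cmdline_blacklists MODULE [CMDLINE]   (default: /proc/cmdline)
cmdline_blacklists() {
    mod=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    for word in $cmdline; do              # split the command line on whitespace
        case $word in
            rd.driver.blacklist=*)
                list=${word#rd.driver.blacklist=}
                case ",$list," in         # wrap in commas to match whole names
                    *",$mod,"*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Example with an explicit command line string.
if cmdline_blacklists nouveau 'ro quiet rd.driver.blacklist=nouveau'; then
    echo "nouveau is blacklisted on the command line"
else
    echo "nouveau is not blacklisted on the command line"
fi
```

Calling `cmdline_blacklists nouveau` with no second argument checks the running system's /proc/cmdline instead of a supplied string.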
Comment 10 Roy A. Gilmore 2016-01-07 16:09:15 EST
DISCLAIMER: This message contains my opinions.

I'll have to get back to you on whether the kernel has actually switched the console to the i915 adapter. That's something I hadn't considered. I run VMware Workstation on this system, so I don't usually need a second physical computer at this location, so I'll have to bring another system and monitor here to check that out. But, regardless, the kernel should NOT override what the sysadmin has told it to do. There is a reason I set the primary adapter to the external adapter in bios, and the kernel should do what the sysadmin tells it to do. The VGA driver IS an appropriate driver, even though it may not be the BEST driver. And while the i915 driver may be faster than the VGA driver, I'm not aware of any benchmark test results contained in the kernel to allow the kernel to make that decision intelligently. Regardless, it's trivial to tell whether a monitor is connected to an adapter or not. And, in this particular case, the kernel should be aware that there isn't a monitor connected to the i915 adapter, and definitely shouldn't switch the console to an adapter without a monitor connected. In my opinion, ALL policy decisions are the responsibility of the sysadmin. Also, in my opinion, the kernel is trying to be TOO smart, making too many policy decisions, and it's failing.

While I may be wrong, I don't feel that I should HAVE to add "rd.driver.blacklist=nouveau" to the grub.conf file. It makes absolutely no logical sense to me to have to blacklist the nouveau module in two places. In my opinion, the kernel command line should ONLY be used for critical things like configuring compiled-in modules that need to be configured before mounting the initrd/initramfs, finding the initrd/initramfs, finding the root, and quick and dirty troubleshooting. By the time the kernel is trying to load the nouveau module (or ANY loadable module, for that matter), the initrd/initramfs is by necessity already mounted, and the kernel has access to all of the files in the initrd/initramfs's /etc/modprobe.d and /usr/lib/modprobe.d. The command line should only be necessary to override the configuration files contained in /etc/modprobe.d and /usr/lib/modprobe.d for troubleshooting purposes. Especially, since I temporarily completely removed the module from the tree. Once the module is removed from the tree, and depmod/mkinitrd are run, the kernel shouldn't know anything about the nouveau module at all, and shouldn't care less whether it's blacklisted or not.

I agree that you shouldn't have to support the nVidia driver. That's why I'm not asking for any direct support. All I want from Fedora and/or the upstream kernel maintainers is a way to disable the nouveau driver, and for the kernel to do what it's told to do.

This is off topic, but whatever happened to the kernel-doc package? As close as I can tell, F20 was the last release to carry the kernel-doc package. There doesn't appear to be a way to install the kernel documentation anymore. Has there been a replacement package issued that I can't locate for some reason? While I can, and have manually downloaded the kernel SRPM's and can view the documentation, it might save the maintainers some time if it was easier for the end-users to look up the information they need to troubleshoot their own problems. It's kind of hard to tell the end-users to RTFM, if the M(anual) isn't readily available. I'm not saying that you told me to RTFM, it's just that's the standard answer to most questions in the Linux/UNIX world.
Comment 11 Josh Boyer 2016-01-07 16:18:11 EST
We don't build kernel-doc any longer because the documentation can be found on kernel.org and building it required special hacks in the build system. Also, the content of kernel-doc would not tell you how to blacklist drivers, as that is done in userspace.
Comment 12 Roy A. Gilmore 2016-01-07 17:15:00 EST
The problem with online documentation is that when you need it the most, it may not be available. It takes a lot of subsystems to work together correctly to bring up a web browser, connect to the internet, and view the documentation on kernel.org. When things go horribly wrong, local plain ASCII text documentation still "works".

I prefaced the question about kernel-doc by saying it was off-topic; it was a generic question and had nothing to do with blacklisting modules.
Comment 13 Fedora End Of Life 2016-07-19 14:56:38 EDT
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
