Bug 1657453 - 4.20.0-0.rc5 failure to boot due to amdgpu and iwlwifi issues
Summary: 4.20.0-0.rc5 failure to boot due to amdgpu and iwlwifi issues
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 29
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-12-08 11:13 UTC by James A. Robinson
Modified: 2018-12-17 22:14 UTC

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-17 22:14:41 UTC
Type: Bug
Embargoed:



Description James A. Robinson 2018-12-08 11:13:23 UTC
Description of problem:

When I updated my kernel packages to 4.20.0-rc5, I found my system would no longer boot.

Under 4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64, from these packages:

kernel-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-core-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-devel-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-extra-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64

the system fails to boot.  With the 'rhgb' and 'quiet' boot flags removed, I can see that the boot gets as far as the following point:

[    4.098459] [drm] amdgpu kernel modesetting enabled
[    4.099721] Parsing CRAT table with 4 nodes
[    4.100871] Ignoring ACPI CRAT on non-APU system
[    4.102015] Virtual CRAT table with 4 nodes
[    4.103145] Parsing CRAT table with 4 nodes
[    4.104294] Creating topology SYSFS entries
[    4.105539] Topology: Add CPU node
[    4.106705] Finished initializing topology
[    4.108203] fb0: switching to amdgpudrmfb from EFI VGA

and then everything just stops.  To get around this I added the 'nomodeset' boot flag and booted into runlevel 3.

However, once logged in at the console, I found the wifi network was no longer working either.  The dmesg log showed:

[   16.196553] iwlwifi 0000:05:00.0: enabling device (0000 -> 0002)
[   16.198190] iwlwifi 0000:05:00.0: No suitable DMA available
[   16.218359] iwlwifi: probe of 0000:05:00.0 failed with error -5

Under the previous 4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64 from:

kernel-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64
kernel-core-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64
kernel-devel-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-extra-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64

What I see is:

[    4.110338] [drm] amdgpu kernel modesetting enabled.
[    4.111714] Parsing CRAT table with 4 nodes
[    4.113032] Ignoring ACPI CRAT on non-APU system
[    4.114348] Virtual CRAT table created for CPU
[    4.115671] Parsing CRAT table with 4 nodes
[    4.116982] Creating topology SYSFS entries
[    4.118326] Topology: Add CPU node
[    4.119627] Finished initializing topology
[    4.121154] checking generic (90000000 7e9000) vs hw (90000000 10000000)
[    4.122260] fb0: switching to amdgpudrmfb from EFI VGA
[    4.124829] Console: switching to colour dummy device 80x25
[    4.125104] [drm] initializing kernel modesetting (POLARIS11 0x1002:0x67E3 0x1002:0x0B0D 0x00).

and

[   66.558457] iwlwifi 0000:05:00.0: enabling device (0000 -> 0002)
[   66.563804] iwlwifi 0000:05:00.0: loaded firmware version 29.1044073957.0 op_mode iwlmvm
[   66.657580] iwlwifi 0000:05:00.0: Detected Intel(R) Dual Band Wireless AC 3168, REV=0x220
[   66.678364] iwlwifi 0000:05:00.0: base HW address: 3c:6a:a7:a0:de:8f
[   66.731816] iwlwifi 0000:05:00.0 wlp5s0: renamed from wlan0

With both components operating as expected.

Version-Release number of selected component (if applicable):

kernel-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-core-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-devel-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-extra-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64

How reproducible:

For my setup it was consistent for 5 reboots as I gathered information.

Steps to Reproduce:

1. Use a motherboard without built-in VGA, with an AMD Radeon Pro WX 4100 and an Intel 802.11ac WiFi module.
2. Install the 4.20.0-rc5 kernel packages and try to boot.

Actual results:

The kernel freezes while amdgpu is switching to kernel modesetting.  With nomodeset set it boots, but then iwlwifi fails to probe the wifi device.

Expected results:

Given that 4.20.0-rc4 works without a problem on this hardware, I expected 4.20.0-rc5 to continue to work.

Additional info:

Video card is an AMD Radeon Pro WX 4100, motherboard is an ASRock X399M Taichi with an Intel 802.11ac WiFi Module.

# lsmod | grep amdgpu
amdgpu               3756032  5
chash                  16384  1 amdgpu
amd_iommu_v2           20480  1 amdgpu
gpu_sched              36864  1 amdgpu
drm_kms_helper        204800  1 amdgpu
ttm                   110592  1 amdgpu
drm                   487424  6 gpu_sched,drm_kms_helper,amdgpu,ttm
i2c_algo_bit           16384  2 igb,amdgpu

# lsmod | grep wifi
iwlwifi               286720  1 iwlmvm
cfg80211              770048  3 iwlmvm,iwlwifi,mac80211

Comment 1 Laura Abbott 2018-12-10 16:24:05 UTC
If you are using the vanilla packages your best bet is to run git bisect between the working and non-working kernel version to find which commit broke boot on your machine.
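
A bisect between the two release candidates could be started roughly as follows. This is a minimal sketch: the helper name start_bisect and the checkout path are illustrative assumptions, and v4.20-rc4 and v4.20-rc5 are the upstream tags for the two versions. The same .config should be used for every build, otherwise a configuration change can look like a code regression.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): begin a bisect in a
# kernel git checkout between a known-good and a known-bad revision.
start_bisect() {
    repo=$1   # path to the kernel git checkout (assumed to exist)
    good=$2   # last revision known to boot, e.g. v4.20-rc4
    bad=$3    # first revision known to fail, e.g. v4.20-rc5
    # "git bisect start <bad> <good>" marks both ends in one step.
    git -C "$repo" bisect start "$bad" "$good"
}

# Typical loop, run by hand after start_bisect:
#   start_bisect ~/src/linux v4.20-rc4 v4.20-rc5
#   ...build and boot the checked-out revision with a fixed .config...
#   git bisect good     # if it booted
#   git bisect bad      # if it hung
#   git bisect reset    # when finished, to return to the original HEAD
```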

Comment 2 James A. Robinson 2018-12-10 22:32:43 UTC
Hi,

I went through the process of a git bisect between v4.20-rc4 and v4.20-rc5 and didn't find anything.  I ought to have tried a v4.20-rc4 build right off the bat using the rc5 config, because that would have clued me in on the issue.

Turns out I can build v4.20-rc5 w/o any problem as long as CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not on.

Here are the differences in the configs distributed with the packages:

$ diff /boot/config-4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64 /boot/config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 
3c3
< # Linux/x86_64 4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64 Kernel Configuration
---
> # Linux/x86_64 4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 Kernel Configuration
23c23
< CONFIG_BUILD_SALT="4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64"
---
> CONFIG_BUILD_SALT="4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64"
104a105
> # CONFIG_PSI_DEFAULT_DISABLED is not set
381c382
< # CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
---
> CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y
740a742
> CONFIG_KVM_AMD_SEV=y
5725d5726
< CONFIG_SND_SOC_INTEL_SKYLAKE_SSP_CLK=m
5726a5728,5730
> CONFIG_SND_SOC_INTEL_SKYLAKE_SSP_CLK=m
> CONFIG_SND_SOC_INTEL_SKYLAKE_HDAUDIO_CODEC=y
> CONFIG_SND_SOC_INTEL_SKYLAKE_COMMON=m
5750d5753
< CONFIG_SND_SOC_INTEL_SKL_HDA_DSP_GENERIC_MACH=m
5751a5755
> CONFIG_SND_SOC_INTEL_SKL_HDA_DSP_GENERIC_MACH=m
8114c8118
< # CONFIG_CRYPTO_DEV_SP_PSP is not set
---
> CONFIG_CRYPTO_DEV_SP_PSP=y
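
Several options in the diff above only changed position, so an order-insensitive comparison can make the real changes easier to spot. A minimal sketch, where the function name config_diff is an illustrative assumption:

```shell
#!/bin/sh
# Hypothetical helper: print option lines that differ between two kernel
# config files, ignoring line order and the header comments.
config_diff() {
    old_sorted=$(mktemp) || return 1
    new_sorted=$(mktemp) || return 1
    # Keep only "CONFIG_FOO=..." and "# CONFIG_FOO is not set" lines.
    pattern='^(# )?CONFIG_[A-Za-z0-9_]+(=| is not set)'
    grep -E "$pattern" "$1" | sort > "$old_sorted"
    grep -E "$pattern" "$2" | sort > "$new_sorted"
    # diff exits non-zero when the files differ; that is expected here.
    # Only the "<" (old) and ">" (new) lines are of interest.
    diff "$old_sorted" "$new_sorted" | grep '^[<>]'
    rm -f "$old_sorted" "$new_sorted"
}

# Usage (paths taken from the report):
#   config_diff /boot/config-4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64 \
#               /boot/config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
```

Options that merely moved within the file should drop out of this output, leaving those whose value changed. The kernel tree's scripts/diffconfig serves a similar purpose.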

Copying config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 to .config in my v4.20-rc5 kernel tree and changing CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT to match the rc4 release allows me to compile a kernel that boots properly on my system:

$ diff /boot/config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 .config
3c3
< # Linux/x86_64 4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 Kernel Configuration
---
> # Linux/x86 4.20.0-rc5 Kernel Configuration
382c382
< CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y
---
> # CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
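
The same change can be scripted rather than made by hand. A minimal sketch, assuming a plain-text config file; the function name disable_kconfig_option is hypothetical. Inside a kernel tree, the bundled scripts/config tool with --disable does the same job.

```shell
#!/bin/sh
# Hypothetical helper: turn "CONFIG_<OPT>=..." into "# CONFIG_<OPT> is not set"
# in the given config file. Option names are passed without the CONFIG_ prefix.
disable_kconfig_option() {
    cfg=$1
    opt=$2
    tmp=$(mktemp) || return 1
    # The "=" after the name keeps longer options with the same prefix untouched.
    sed "s/^CONFIG_${opt}=.*/# CONFIG_${opt} is not set/" "$cfg" > "$tmp" \
        && cat "$tmp" > "$cfg"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage, from the top of the kernel tree:
#   cp /boot/config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 .config
#   disable_kconfig_option .config AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
#   make olddefconfig   # let Kconfig settle any dependent options
```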

My CPU is an AMD 2990WX, my BIOS is running AGESA 1.0.0.2.

Comment 3 James A. Robinson 2018-12-10 23:54:17 UTC
I'm thinking this may be a long-standing bug.  In this old thread:

[SOLVED] Problem with AMDGPU: blank screen at boot
https://forums.gentoo.org/viewtopic-t-1074902-postdays-0-postorder-asc-start-0.html

a comment:

Posted: Sat Jan 06, 2018 3:49 pm
"I had a similar error with my Vega 64 and the latest 4.15-rc kernel which went away when I disabled AMD Secure Memory Encryption (SME) support under Processor Type and Features, although could be completely unrelated to your issue."

There's also quite a bit of discussion in that thread about firmware modules, but honestly it reads a lot like people who don't really know what they are talking about suggesting blind changes because "it works for them."

Comment 4 James A. Robinson 2018-12-14 18:35:32 UTC
The easiest workaround for this so far has been to adjust

/etc/sysconfig/grub

to include mem_encrypt=off in GRUB_CMDLINE_LINUX, for example:

GRUB_CMDLINE_LINUX="resume=/dev/mapper/fedora_filum-swap rd.lvm.lv=fedora_filum/root rd.luks.uuid=luks-e348dab7-017b-4b74-a77c-0162668047e7 rd.lvm.lv=fedora_filum/swap rhgb quiet mem_encrypt=off"

This allows me to install and boot the Fedora-packaged kernels.
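
On Fedora 29 an edit to this file normally only takes effect once the GRUB configuration is regenerated, and the result can be checked after a reboot. A minimal sketch, in which the helper name has_kernel_arg is hypothetical and the grub.cfg path assumes a UEFI install (BIOS installs use /boot/grub2/grub.cfg):

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact argument $1 appears as a
# whole word in the kernel command line string $2.
has_kernel_arg() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Regenerate the GRUB configuration after editing GRUB_CMDLINE_LINUX
# (UEFI path assumed; adjust for BIOS systems):
#   sudo grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
#
# After rebooting, confirm the running kernel received the flag:
#   if has_kernel_arg mem_encrypt=off "$(cat /proc/cmdline)"; then
#       echo "mem_encrypt=off is active"
#   fi
```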

Comment 5 Jeremy Cline 2018-12-17 22:14:41 UTC
CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is off again in the Rawhide kernels, so I'll go ahead and close this.

