Bug 1657453

Summary: 4.20.0-0.rc5 failure to boot due to amdgpu and iwlwifi issues
Product: [Fedora] Fedora
Reporter: James A. Robinson <jim.robinson>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 29
CC: airlied, bskeggs, ewk, hdegoede, ichavero, itamar, jarodwilson, jcline, jglisse, john.j5live, jonathan, josef, kernel-maint, labbott, linville, mchehab, mjg59, phea.duch, steved
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Last Closed: 2018-12-17 22:14:41 UTC
Type: Bug

Description James A. Robinson 2018-12-08 11:13:23 UTC
Description of problem:

When I updated my kernel packages to 4.20.0-rc5 I found my system would no longer boot.

This is with kernel 4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64, installed from these packages:

kernel-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-core-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-devel-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-extra-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64

With the 'rhgb' and 'quiet' boot flags removed, I can see that the boot gets as far as the following point:

[    4.098459] [drm] amdgpu kernel modesetting enabled
[    4.099721] Parsing CRAT table with 4 nodes
[    4.100871] Ignoring ACPI CRAT on non-APU system
[    4.102015] Virtual CRAT table with 4 nodes
[    4.103145] Parsing CRAT table with 4 nodes
[    4.104294] Creating topology SYSFS entries
[    4.105539] Topology: Add CPU node
[    4.106705] Finished initializing topology
[    4.108203] fb0: switching to amdgpudrmfb from EFI VGA

and then everything just stops.  To get around this I added the 'nomodeset' boot flag and booted into runlevel 3.

However, once logged in at the console, I found that the WiFi network was no longer working either.  The dmesg log showed:

[   16.196553] iwlwifi 0000:05:00.0: enabling device (0000 -> 0002)
[   16.198190] iwlwifi 0000:05:00.0: No suitable DMA available
[   16.218359] iwlwifi: probe of 0000:05:00.0 failed with error -5

Under the previous 4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64 from:

kernel-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64
kernel-core-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64
kernel-devel-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-extra-0:4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64

What I see is:

[    4.110338] [drm] amdgpu kernel modesetting enabled.
[    4.111714] Parsing CRAT table with 4 nodes
[    4.113032] Ignoring ACPI CRAT on non-APU system
[    4.114348] Virtual CRAT table created for CPU
[    4.115671] Parsing CRAT table with 4 nodes
[    4.116982] Creating topology SYSFS entries
[    4.118326] Topology: Add CPU node
[    4.119627] Finished initializing topology
[    4.121154] checking generic (90000000 7e9000) vs hw (90000000 10000000)
[    4.122260] fb0: switching to amdgpudrmfb from EFI VGA
[    4.124829] Console: switching to colour dummy device 80x25
[    4.125104] [drm] initializing kernel modesetting (POLARIS11 0x1002:0x67E3 0x1002:0x0B0D 0x00).

and

[   66.558457] iwlwifi 0000:05:00.0: enabling device (0000 -> 0002)
[   66.563804] iwlwifi 0000:05:00.0: loaded firmware version 29.1044073957.0 op_mode iwlmvm
[   66.657580] iwlwifi 0000:05:00.0: Detected Intel(R) Dual Band Wireless AC 3168, REV=0x220
[   66.678364] iwlwifi 0000:05:00.0: base HW address: 3c:6a:a7:a0:de:8f
[   66.731816] iwlwifi 0000:05:00.0 wlp5s0: renamed from wlan0

Both components operate as expected.
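
The iwlwifi difference between the two boots can also be checked mechanically against a saved log. The following is a minimal sketch, assuming the dmesg output has been written to a file; iwlwifi_status is a hypothetical helper name, and the match strings are taken from the log excerpts quoted in this report.

```shell
# Hypothetical helper: classify the iwlwifi state recorded in a saved dmesg log.
# Prints "failed" if the driver probe failed (as in the rc5 log above),
# "ok" if the firmware was loaded (as in the rc4 log above),
# and "unknown" if neither line is present.
iwlwifi_status() {
    log=$1
    if grep -Eq 'iwlwifi: probe of [0-9a-f:.]+ failed with error' "$log"; then
        echo failed
    elif grep -q 'iwlwifi .*loaded firmware version' "$log"; then
        echo ok
    else
        echo unknown
    fi
}

# Example:
#   dmesg > boot.log
#   iwlwifi_status boot.log
```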

Version-Release number of selected component (if applicable):

kernel-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-core-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-devel-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64
kernel-modules-extra-0:4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64

How reproducible:

For my setup it was consistent for 5 reboots as I gathered information.

Steps to Reproduce:

1. Use a motherboard without built-in VGA, with an AMD Radeon Pro WX 4100 and an Intel 802.11ac WiFi Module.
2. Install the 4.20.0-rc5 kernel package and try to boot.

Actual results:

The kernel freezes while amdgpu switches to kernel modesetting.  With nomodeset on the kernel command line it boots, but iwlwifi then fails to probe the WiFi device.

Expected results:

Given that 4.20.0-rc4 works without a problem on this hardware, I expected 4.20.0-rc5 to continue to work.

Additional info:

Video card is an AMD Radeon Pro WX 4100, motherboard is an ASRock X399M Taichi with an Intel 802.11ac WiFi Module.

# lsmod | grep amdgpu
amdgpu               3756032  5
chash                  16384  1 amdgpu
amd_iommu_v2           20480  1 amdgpu
gpu_sched              36864  1 amdgpu
drm_kms_helper        204800  1 amdgpu
ttm                   110592  1 amdgpu
drm                   487424  6 gpu_sched,drm_kms_helper,amdgpu,ttm
i2c_algo_bit           16384  2 igb,amdgpu

# lsmod | grep wifi
iwlwifi               286720  1 iwlmvm
cfg80211              770048  3 iwlmvm,iwlwifi,mac80211

Comment 1 Laura Abbott 2018-12-10 16:24:05 UTC
If you are using the vanilla packages your best bet is to run git bisect between the working and non-working kernel version to find which commit broke boot on your machine.
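
A bisect between the two release candidates might look like the following. This is a minimal sketch, not the procedure used in this report: find_first_bad is a hypothetical helper, and it relies on `git bisect run`, which only helps when the good/bad check can be scripted. For a boot failure each step normally has to be built, booted, and then marked by hand with `git bisect good` or `git bisect bad`, using the same .config for every step.

```shell
# Hypothetical helper: print the first bad commit between a good and a bad revision.
# Usage: find_first_bad GOOD BAD CHECK_CMD [ARGS...]
# CHECK_CMD must exit 0 on a good revision and 1..127 (except 125) on a bad one;
# exit code 125 tells git bisect to skip a revision that cannot be tested.
# Must be run from the top level of the git work tree.
find_first_bad() {
    good=$1
    bad=$2
    shift 2
    git bisect start "$bad" "$good" >/dev/null 2>&1 || return 1
    first_bad=
    if git bisect run "$@" >/dev/null 2>&1; then
        # After a successful run, refs/bisect/bad points at the first bad commit.
        first_bad=$(git rev-parse --verify refs/bisect/bad 2>/dev/null)
    fi
    git bisect reset >/dev/null 2>&1
    [ -n "$first_bad" ] && printf '%s\n' "$first_bad"
}

# Example (check.sh is a hypothetical script that builds and tests one revision):
#   find_first_bad v4.20-rc4 v4.20-rc5 ./check.sh
```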

Comment 2 James A. Robinson 2018-12-10 22:32:43 UTC
Hi,

I went through the process of a git bisect between v4.20-rc4 and v4.20-rc5 and didn't find anything.  I ought to have tried a v4.20-rc4 build right off the bat using the rc5 config, because that would have clued me in on the issue.

It turns out that a v4.20-rc5 build boots without any problem as long as CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not enabled.

Here are the differences in the configs distributed with the packages:

$ diff /boot/config-4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64 /boot/config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 
3c3
< # Linux/x86_64 4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64 Kernel Configuration
---
> # Linux/x86_64 4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 Kernel Configuration
23c23
< CONFIG_BUILD_SALT="4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64"
---
> CONFIG_BUILD_SALT="4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64"
104a105
> # CONFIG_PSI_DEFAULT_DISABLED is not set
381c382
< # CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
---
> CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y
740a742
> CONFIG_KVM_AMD_SEV=y
5725d5726
< CONFIG_SND_SOC_INTEL_SKYLAKE_SSP_CLK=m
5726a5728,5730
> CONFIG_SND_SOC_INTEL_SKYLAKE_SSP_CLK=m
> CONFIG_SND_SOC_INTEL_SKYLAKE_HDAUDIO_CODEC=y
> CONFIG_SND_SOC_INTEL_SKYLAKE_COMMON=m
5750d5753
< CONFIG_SND_SOC_INTEL_SKL_HDA_DSP_GENERIC_MACH=m
5751a5755
> CONFIG_SND_SOC_INTEL_SKL_HDA_DSP_GENERIC_MACH=m
8114c8118
< # CONFIG_CRYPTO_DEV_SP_PSP is not set
---
> CONFIG_CRYPTO_DEV_SP_PSP=y

Copying config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 to the .config of my v4.20-rc5 kernel tree and changing CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT to match the rc4 release allows me to compile a kernel that boots properly on my system:

$ diff /boot/config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 .config
3c3
< # Linux/x86_64 4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64 Kernel Configuration
---
> # Linux/x86 4.20.0-rc5 Kernel Configuration
382c382
< CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y
---
> # CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is not set
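
To check a single option without reading a full diff, the state of that option can be pulled out of each config file. The following is a minimal sketch; config_state is a hypothetical helper name, and it assumes the usual kernel config layout of `CONFIG_FOO=value` lines and `# CONFIG_FOO is not set` comments, as seen in the diffs above.

```shell
# Hypothetical helper: print the state of one option in a kernel config file.
# Prints the value after "=" (for example y or m) if the option is set,
# "n" if the file says "# CONFIG_FOO is not set", and "absent" otherwise.
config_state() {
    file=$1
    opt=$2
    if line=$(grep -m1 -E "^${opt}=" "$file"); then
        printf '%s\n' "${line#*=}"
    elif grep -qx "# ${opt} is not set" "$file"; then
        echo n
    else
        echo absent
    fi
}

# Example, using the two config files compared above:
#   for f in /boot/config-4.20.0-0.rc4.git1.1.vanilla.knurd.1.fc29.x86_64 \
#            /boot/config-4.20.0-0.rc5.git3.1.vanilla.knurd.1.fc29.x86_64; do
#       echo "$f: $(config_state "$f" CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT)"
#   done
```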

My CPU is an AMD 2990WX, and my BIOS is running AGESA 1.0.0.2.

Comment 3 James A. Robinson 2018-12-10 23:54:17 UTC
I'm thinking this may be a long-standing bug...  I see in this old thread:

[SOLVED] Problem with AMDGPU: blank screen at boot
https://forums.gentoo.org/viewtopic-t-1074902-postdays-0-postorder-asc-start-0.html

a comment:

Posted: Sat Jan 06, 2018 3:49 pm
"I had a similar error with my Vega 64 and the latest 4.15-rc kernel which went away when I disabled AMD Secure Memory Encryption (SME) support under Processor Type and Features, although could be completely unrelated to your issue."

There's also quite a bit of discussion in that thread about firmware modules, but honestly it reads a lot like people who don't really know what they are talking about suggesting blind changes because "it works for them."

Comment 4 James A. Robinson 2018-12-14 18:35:32 UTC
The easiest workaround for this so far has been to adjust

/etc/sysconfig/grub

to include mem_encrypt=off in GRUB_CMDLINE_LINUX, for example:


GRUB_CMDLINE_LINUX="resume=/dev/mapper/fedora_filum-swap rd.lvm.lv=fedora_filum/root rd.luks.uuid=luks-e348dab7-017b-4b74-a77c-0162668047e7 rd.lvm.lv=fedora_filum/swap rhgb quiet mem_encrypt=off"

This allows me to install and boot the Fedora-packaged kernels.
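
The edit described above can be scripted so that it is safe to repeat. The following is a minimal sketch under stated assumptions: add_cmdline_arg is a hypothetical helper, it expects GRUB_CMDLINE_LINUX to be a single double-quoted line as in the example above, and it only handles simple arguments that contain no `|`, `&`, or backslash characters.

```shell
# Hypothetical helper: append ARG to GRUB_CMDLINE_LINUX in FILE unless it is
# already there. If FILE has no GRUB_CMDLINE_LINUX line, it is left unchanged.
# Usage: add_cmdline_arg FILE ARG
add_cmdline_arg() {
    file=$1
    arg=$2
    # Extract the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $arg "*) return 0 ;;   # already present, nothing to do
    esac
    tmp=$(mktemp) || return 1
    # Rewrite through a temporary file, then copy back so FILE keeps its
    # ownership and permissions.
    sed "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX=\"\1 $arg\"|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (run as root, after taking a backup of the file):
#   add_cmdline_arg /etc/sysconfig/grub mem_encrypt=off
```

On Fedora the change normally takes effect only after the GRUB configuration has been regenerated, for example with grub2-mkconfig; the output path to use depends on whether the system boots via BIOS or UEFI.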

Comment 5 Jeremy Cline 2018-12-17 22:14:41 UTC
CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT is off again in the Rawhide kernels, so I'll go ahead and close this.