Bug 1297120 - Kernel 4.3.3.300 fail boot on Intel i7 6th Gen (Skylake)
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 23
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2016-01-09 08:33 EST by Earl Ramirez
Modified: 2016-10-03 10:53 EDT
Doc Type: Bug Fix
Last Closed: 2016-10-03 10:53:11 EDT
Type: Bug


Attachments
boot log (51.19 KB, text/plain)
2016-01-28 13:44 EST, Jeff Bastian
dmesg from kernel 4.2.3-300.fc23.x86_64 (79.44 KB, text/plain)
2016-02-05 14:29 EST, Jeff Bastian
dmesg from kernel 4.3.4-300.fc23.x86_64 with acpi=off (41.46 KB, text/plain)
2016-02-05 15:09 EST, Jeff Bastian

Description Earl Ramirez 2016-01-09 08:33:48 EST
Description of problem:
When I try to boot my laptop with kernel 4.3.3.300, which is currently in the fedora-updates-testing repo, I get the following message and the laptop locks up:
[       0.977414] usb 1-4: new high-speed USB device number 2 using xhci_hcd

Version-Release number of selected component (if applicable):
Kernel 4.3.3.300

How reproducible:
Always, if you have a laptop with a 6th-generation Intel i7 CPU


Steps to Reproduce:
1. Enable the fedora-updates-testing repo and update the kernel: dnf update kernel
2. Reboot and select the updated kernel
3. Remove rhgb and quiet from the kernel arguments and boot

Actual results:
[	0.687681] loaded using pool lzo/zbud
[	0.686509]  Magic number: 0:798:124
[	0.684587] graphics fb0: hash matches
[	0.685118] pi 0000:00:1f.0: hash matches
[	0.690209] rtc_cmos 02:02: setting system clock to 2016-01-08 11:06:39 UTC (142251199)
[	0.924377] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) 
[	0.925410] ata3: SATA link down (SStatus 4 SControl 300)
[	0.926527] ata1.00: supports DRM functions and may not be fully accessible
[	0.924377] ata2: SATA link up 1.5 Gbps (SStatus 133 SControl 300) 
[	0.928701] ata1.00: READ LOG DMA EXT failed, trying unqueued
[	0.929798] ata1.00: ATA-9: Samsung SSD 850 EVO 500GB, EMT01B6Q, max UDMA/133
[	0.930815] ata1.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[	0.932090] ata2.00: ATAPI: SlimtypeDVD A DA8A6SH, GAA2, max UDMA/133
[       0.977414] usb 1-4: new high-speed USB device number 2 using xhci_hcd

Expected results:
The GUI loads so that I can log in

Additional info:
Comment 1 Jeff Bastian 2016-01-28 13:44 EST
Created attachment 1119227 [details]
boot log

I'm seeing the same thing (hang on boot) with 4.3.3-303.fc23 kernel on an HP z240 system with a Skylake CPU, but it makes it a bit further into boot.  See the attached boot log.

The F23 GA kernel -- 4.2.3-300.fc23 -- boots ok.
Comment 2 Jeff Bastian 2016-01-28 13:53:24 EST
FWIW, kernel 4.4.0-1.fc24 also hangs on this system.
Comment 3 Earl Ramirez 2016-02-04 08:54:54 EST
I have checked with the rawhide kernel 4.5.0 rc2 and it fails at exactly the same location for me. The F23 GA kernel works with the following kernel arguments:
nouveau.modeset=0 rd.driver.blacklist=nouveau i915.preliminary_hw_support=1
Comment 4 Josh Boyer 2016-02-04 09:25:37 EST
Can you both attach the full output of dmesg from a boot of a working kernel?

Also, what happens if you specify nomodeset on the kernel command line with the nonworking kernel?
Comment 5 Earl Ramirez 2016-02-04 09:32:06 EST
It hangs at the same location. I was just checking the bugs filed at kernel.org and saw that this is related to the Intel C-states. I tried acpi=off on the kernel command line and was able to boot into kernel 4.3.5-300, which is currently in the testing repo.

More information is in the upstream kernel bug [0]. In the meantime I will continue to poke around and post my updates.



[0] https://bugzilla.kernel.org/show_bug.cgi?id=109081
Comment 6 Earl Ramirez 2016-02-04 09:53:54 EST
I added intel_idle.max_cstate=7 and removed acpi=off, and it worked better: for example, the touchpad worked, and so did power-off and reboot.
Comment 7 Josh Boyer 2016-02-04 09:56:22 EST
(In reply to Earl Ramirez from comment #6)
> I added intel_idle.max_cstate=7 and removed acpi=off, and it worked
> better: for example, the touchpad worked, and so did power-off and reboot.

That's good to know.  The bug you referenced is interesting, but kind of a mess.

Out of curiosity, what microcode does your system have for the CPU?
Comment 8 Earl Ramirez 2016-02-04 10:08:07 EST
This is the microcode for the current system:

microcode	: 0x33
Comment 9 Josh Boyer 2016-02-04 10:17:53 EST
(In reply to Earl Ramirez from comment #8)
> This is the microcode for the current system:
> 
> microcode	: 0x33

So per the referenced bug, that's kind of not sufficient.  Is there a firmware update for your machine available from the vendor?
Comment 10 Earl Ramirez 2016-02-04 10:21:53 EST
The last time I checked, there wasn't one. I will check again and report back.
Comment 11 Earl Ramirez 2016-02-05 04:43:07 EST
I updated the firmware and I am still seeing the same microcode version.
Comment 12 Jeff Bastian 2016-02-05 14:29 EST
Created attachment 1121502 [details]
dmesg from kernel 4.2.3-300.fc23.x86_64

(In reply to Josh Boyer from comment #4)
> Can you both attach the full output of dmesg from a boot of a working kernel?

Attached is the journalctl output (*) of 4.2.3-300.fc23.x86_64 on the HP z240


(*) this is a beaker system and beaker flushes dmesg
Comment 13 Jeff Bastian 2016-02-05 14:35:37 EST
I installed the microcode_ctl rpm, but there were no updates for my CPU.


[root@hp-z240-01 ~]# cat /sys/devices/system/cpu/cpu0/microcode/version
0x50

[root@hp-z240-01 ~]# yum install microcode_ctl
...
Installed:
  microcode_ctl.x86_64 2:2.1-9.1.fc23                                           

Complete!

[root@hp-z240-01 ~]# echo 1 > /sys/devices/system/cpu/microcode/reload

[root@hp-z240-01 ~]# cat /sys/devices/system/cpu/cpu0/microcode/version
0x50

[root@hp-z240-01 ~]# lscpu | egrep 'family|Model|Stepping'
CPU family:            6
Model:                 94
Model name:            Intel(R) Core(TM) i5-6400 CPU @ 2.70GHz
Stepping:              3

[root@hp-z240-01 ~]# printf '%x\n' 94
5e

[root@hp-z240-01 ~]# ls /lib/firmware/intel-ucode/06-5e-03
ls: cannot access /lib/firmware/intel-ucode/06-5e-03: No such file or directory
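
The file name checked above is built from the CPU family, model and stepping, each written as two hex digits. A minimal bash sketch of that conversion, using the values from the lscpu output; ucode_name is a made-up helper name, not something shipped with microcode_ctl:

```shell
#!/bin/bash
# ucode_name is a hypothetical helper, not part of microcode_ctl.
# Arguments: CPU family, model and stepping as decimal numbers (as lscpu shows them).
ucode_name() {
    printf '%02x-%02x-%02x\n' "$1" "$2" "$3"
}

# Values from the lscpu output above: family 6, model 94, stepping 3.
name=$(ucode_name 6 94 3)
echo "/lib/firmware/intel-ucode/$name"
```

This prints the same path that the ls command above reported as missing.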
Comment 14 Jeff Bastian 2016-02-05 14:52:47 EST
Neither nomodeset nor intel_idle.max_cstate=7 helped on the HP z240 I'm using.  The 4.3.4-300.fc23.x86_64 kernel still hangs.
Comment 15 Jeff Bastian 2016-02-05 15:09 EST
Created attachment 1121516 [details]
dmesg from kernel 4.3.4-300.fc23.x86_64 with acpi=off

Booting with acpi=off worked at least.
Comment 16 Josh Boyer 2016-02-06 03:41:10 EST
(In reply to Jeff Bastian from comment #13)
> I installed the microcode_ctl rpm, but there were no updates for my CPU.

Intel hasn't released any stand-alone yet.  They are apparently only available from the board manufacturers in system firmware updates.
Comment 18 Josh Boyer 2016-02-06 03:48:09 EST
(In reply to Jeff Bastian from comment #15)
> Created attachment 1121516 [details]
> dmesg from kernel 4.3.4-300.fc23.x86_64 with acpi=off
> 
> Booting with acpi=off worked at least.

What happens if you blacklist the hp_wmi driver and/or the snd_hda_intel driver?  Looking at the boot logs for your machine, it seems those would be the next to load.
Comment 19 Earl Ramirez 2016-02-06 07:25:10 EST
(In reply to Josh Boyer from comment #16)
> (In reply to Jeff Bastian from comment #13)
> > I installed the microcode_ctl rpm, but there were no updates for my CPU.
> 
> Intel hasn't released any stand-alone yet.  They are apparently only
> available from the board manufacturers in system firmware updates.

I upgraded the firmware for my board and the microcode for the CPU didn't budge; so I don't know how true it is that the firmware will resolve this issue.
Comment 20 Earl Ramirez 2016-02-06 07:40:14 EST
(In reply to Josh Boyer from comment #18)
> (In reply to Jeff Bastian from comment #15)
> > Created attachment 1121516 [details]
> > dmesg from kernel 4.3.4-300.fc23.x86_64 with acpi=off
> > 
> > Booting with acpi=off worked at least.
> 
> What happens if you blacklist the hp_wmi driver and/or the snd_hda_intel
> driver?  Looking at the boot logs for your machine, it seems those would be
> the next to load.

I saw in the upstream bug, at least one user mentioned that using intel_idle.max_cstate=0 worked; maybe you can give that a try.
Comment 21 Jeff Bastian 2016-02-08 16:20:53 EST
(In reply to Josh Boyer from comment #18) 
> What happens if you blacklist the hp_wmi driver and/or the snd_hda_intel
> driver?  Looking at the boot logs for your machine, it seems those would be
> the next to load.

Blacklisting the hp_wmi driver worked!

[root@hp-z240-01 ~]# uname -r
4.3.4-300.fc23.x86_64
[root@hp-z240-01 ~]# grep -o 'rdblacklist[^[:space:]]*' /proc/cmdline
rdblacklist=hp_wmi
[root@hp-z240-01 ~]# cat /etc/modprobe.d/blacklist-hp_wmi.conf
blacklist hp_wmi

But blacklisting the snd_hda_intel driver didn't work: the 4.3.4-300.fc23
kernel still hung about 11 seconds into boot.


(In reply to Earl Ramirez from comment #20)
> I saw in the upstream bug, at least one user mentioned that using
> intel_idle.max_cstate=0 worked; maybe you can give that a try.

That worked too!

[root@hp-z240-01 ~]# uname -r
4.3.4-300.fc23.x86_64
[root@hp-z240-01 ~]# grep -o 'intel_idle[^[:space:]]*' /proc/cmdline
intel_idle.max_cstate=0
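
The grep checks above can be wrapped in a small function that returns the value of a single NAME=VALUE parameter. This is only a sketch: cmdline_value is a made-up name, and it does not handle bare flags such as nomodeset that carry no value.

```shell
#!/bin/bash
# cmdline_value is a hypothetical helper name.
# Usage: cmdline_value NAME [CMDLINE]
# Prints the value of NAME=VALUE from CMDLINE (default: the contents of
# /proc/cmdline) and returns 1 if the parameter is absent.
cmdline_value() {
    local name="$1" line="${2-$(cat /proc/cmdline)}"
    local -a words
    local word
    read -r -a words <<< "$line"
    for word in "${words[@]}"; do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Example against a fixed string, so it does not depend on the running system.
sample='ro console=ttyS0,115200N81 LANG=en_US.UTF-8 intel_idle.max_cstate=0'
cmdline_value intel_idle.max_cstate "$sample"
```

Called with only a parameter name, it reads /proc/cmdline of the running kernel instead of the sample string.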



So, what do WMI hotkeys have to do with power-saving C-states?
Comment 22 Jeff Bastian 2016-02-08 16:57:58 EST
Out of curiosity, I tried max C-states from 0-7, and it's all good up until 7:

intel_idle.max_cstate=0  ok
intel_idle.max_cstate=1  ok
intel_idle.max_cstate=2  ok
intel_idle.max_cstate=3  ok
intel_idle.max_cstate=4  ok
intel_idle.max_cstate=5  ok
intel_idle.max_cstate=6  ok
intel_idle.max_cstate=7  fail
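
For anyone repeating this sweep, the sketch below prints one grubby command per value instead of running anything, since each value needs its own reboot. It assumes grubby is how the boot entry gets edited; the kernel path is a guess based on the kernel version in this report, and cstate_cmd is a made-up helper name.

```shell
#!/bin/bash
# Dry run only: print, rather than execute, a grubby command for each value.
# cstate_cmd is a hypothetical helper; the kernel path is an assumption based
# on the kernel version mentioned in this report.
cstate_cmd() {
    local kernel="$1" value="$2"
    printf 'grubby --update-kernel=%s --args=intel_idle.max_cstate=%s\n' \
        "$kernel" "$value"
}

kernel=/boot/vmlinuz-4.3.4-300.fc23.x86_64
for n in 0 1 2 3 4 5 6 7; do
    cstate_cmd "$kernel" "$n"
done
```

Each printed line would be run by hand as root, followed by a reboot, with the result noted before moving to the next value.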


Some extra info from the last good boot (max_cstate=6):

[root@hp-z240-01 ~]# dmesg | grep intel_idle
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.3.4-300.fc23.x86_64 root=/dev/mapper/fedora_hp--z240--01-root ro rd.lvm.lv=fedora_hp-z240-01/root rd.lvm.lv=fedora_hp-z240-01/swap console=ttyS0,115200N81 LANG=en_US.UTF-8 intel_idle.max_cstate=6
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.3.4-300.fc23.x86_64 root=/dev/mapper/fedora_hp--z240--01-root ro rd.lvm.lv=fedora_hp-z240-01/root rd.lvm.lv=fedora_hp-z240-01/swap console=ttyS0,115200N81 LANG=en_US.UTF-8 intel_idle.max_cstate=6
[    1.786161] intel_idle: MWAIT substates: 0x142120
[    1.786162] intel_idle: v0.4 model 0x5E
[    1.786163] intel_idle: lapic_timer_reliable_states 0xffffffff
[    1.786164] intel_idle: max_cstate 6 reached

[root@hp-z240-01 ~]# cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu

Analyzing CPU 0:
Number of idle states: 7
Available idle states: POLL C1-SKL C1E-SKL C3-SKL C6-SKL C7s-SKL C8-SKL
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 239
Duration: 167650
C1-SKL:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 2006
Duration: 885110
C1E-SKL:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 3526
Duration: 2326532
C3-SKL:
Flags/Description: MWAIT 0x10
Latency: 70
Usage: 674
Duration: 667517
C6-SKL:
Flags/Description: MWAIT 0x20
Latency: 85
Usage: 3751
Duration: 4350605
C7s-SKL:
Flags/Description: MWAIT 0x33
Latency: 124
Usage: 544
Duration: 1190425
C8-SKL:
Flags/Description: MWAIT 0x40
Latency: 200
Usage: 9370
Duration: 81364045
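
To compare idle states at a glance, the listing above can be reduced to one line per state. The sketch below assumes the flat layout exactly as pasted here (live cpupower output may be indented differently); summarize_idle is a made-up helper name.

```shell
#!/bin/bash
# summarize_idle is a hypothetical helper. It reads text in the layout shown
# above on stdin and prints one line per idle state.
summarize_idle() {
    awk '
        # A state header is a single word ending in ":" (for example "C1-SKL:").
        /^[A-Za-z0-9-]+:$/ { state = substr($1, 1, length($1) - 1) }
        /^Usage:/          { usage = $2 }
        /^Duration:/       { printf "%s usage=%s duration=%s\n", state, usage, $2 }
    '
}

# Example on two states copied from the output above.
summarize_idle <<'EOF'
C6-SKL:
Flags/Description: MWAIT 0x20
Latency: 85
Usage: 3751
Duration: 4350605
C8-SKL:
Flags/Description: MWAIT 0x40
Latency: 200
Usage: 9370
Duration: 81364045
EOF
```

On a live system the same function could be fed from a pipe (cpupower idle-info | summarize_idle), provided the output matches the layout assumed here.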
Comment 23 Earl Ramirez 2016-02-09 17:04:42 EST
(In reply to Jeff Bastian from comment #21)
> (In reply to Josh Boyer from comment #18) 
> > What happens if you blacklist the hp_wmi driver and/or the snd_hda_intel
> > driver?  Looking at the boot logs for your machine, it seems those would be
> > the next to load.
> 
> Blacklisting the hp_wmi driver worked!
> 
> [root@hp-z240-01 ~]# uname -r
> 4.3.4-300.fc23.x86_64
> [root@hp-z240-01 ~]# grep -o 'rdblacklist[^[:space:]]*' /proc/cmdline
> rdblacklist=hp_wmi
> [root@hp-z240-01 ~]# cat /etc/modprobe.d/blacklist-hp_wmi.conf
> blacklist hp_wmi
> 
> But blacklisting the snd_hda_intel driver didn't work: the 4.3.4-300.fc23
> kernel still hung about 11 seconds into boot.
> 
> 
> (In reply to Earl Ramirez from comment #20)
> > I saw in the upstream bug, at least one user mentioned that using
> > intel_idle.max_cstate=0 worked; maybe you can give that a try.
> 
> That worked too!
> 
> [root@hp-z240-01 ~]# uname -r
> 4.3.4-300.fc23.x86_64
> [root@hp-z240-01 ~]# grep -o 'intel_idle[^[:space:]]*' /proc/cmdline
> intel_idle.max_cstate=0
> 
> 
> 
> So, what do WMI hotkeys have to do with power-saving C-states?

I don't know to be honest; maybe Josh can shed some light on this
Comment 24 LukasH 2016-02-18 11:52:00 EST
I have a similar (or maybe the same) issue on IBM x3650 servers (Xeon E5-2609 v2, microcode: 0x428). With various 4.3 kernels (4.3.3-301, 4.3.4-300, 4.3.5-300) the system hangs. In fact it boots, but:
      * it's impossible to switch to text mode during graphical boot;
      * the Fedora logo hangs on the console after boot, and no login screen appears;
      * ports (such as 22/tcp ssh, 80/tcp http) appear open from the remote side, but if I try to connect via ssh, it hangs partway through and the corresponding operation (ssh login, http request) never completes. I waited about 30-40 minutes and nothing happened.

I'll try to grab some additional debug info some weekend night and post it here. I'll also try the intel_idle.max_cstate= workaround as a kernel parameter, as suggested above.

In any case, it has to be related to the 4.3 kernel. On 4.2.8-300 and any older kernel everything works like a charm on this IBM platform.
Comment 25 LukasH 2016-02-25 07:48:53 EST
I noticed the same issue on a Dell PowerEdge T710 (Xeon E5520, microcode: 0x19). I tried 'processor.max_cstate=1 intel_idle.max_cstate=0' as kernel parameters in grub, but the result was the same (and yes, on 4.2.8-300 and older kernels it works fine).

I noticed this in the /var/log/messages log:

Feb 25 05:25:36 maggie systemd-logind: Failed to abandon session scope: Connection reset by peer
Feb 25 05:25:38 maggie systemd-logind: Failed to abandon session scope: Transport endpoint is not connected
...
Feb 25 05:29:39 maggie systemd-logind: Failed to start user slice: Connection timed out


I'm really not sure whether this is a systemd(-logind) issue, but it may correspond to there being no login on the console (and to ssh being unusable: the session hangs after login at "debug1: Entering interactive session."). FTP connections are "frozen" in the same way. Interestingly, https (Apache) works well, and VirtualBox guest machines (which I can manage through the web console) also start and run without problems.
Comment 26 LukasH 2016-02-25 10:38:04 EST
Exactly the same issue is described here:

http://unix.stackexchange.com/questions/256804/cannot-boot-after-last-kernel-update-fedora-23
Comment 27 Laura Abbott 2016-09-23 15:35:54 EDT
*********** MASS BUG UPDATE **************
 
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs.
 
Fedora 23 has now been rebased to 4.7.4-100.fc23.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25.
 
If you experience different issues, please open a new bug report for those.
Comment 28 Earl Ramirez 2016-10-03 08:29:31 EDT
This bug was fixed in kernel 4.6, as promised by upstream. I'm currently on Fedora 24 and the issue no longer occurs, so it is safe to close the bug or mark it as resolved.
Comment 29 Laura Abbott 2016-10-03 10:53:11 EDT
Thank you for letting us know.
