Bug 1264584 - Kernel hangs in very early boot when Skylake system is given 64G memory
Status: CLOSED NEXTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 22
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2015-09-18 17:51 EDT by Andy Ross
Modified: 2015-11-10 08:26 EST
Doc Type: Bug Fix
Last Closed: 2015-11-10 08:26:27 EST
Type: Bug
Description Andy Ross 2015-09-18 17:51:44 EDT
Description of problem:

I have a Skylake (i7-6700K) system on an MSI Z170A M5 motherboard.

It works fine in Fedora when booted with less than 64G of memory, but
if I fill the board the kernel fails to initialize.  I see the grub
menu fine (though due to a BIOS bug where it seems to overwrite
BootOrder, I have to launch grub via the EFI shell) and can launch
Windows from it, but the kernel fails to come up in the max-memory
configuration.

I've also swapped sticks around in 32G configurations and failed to
reproduce the problem.  And of course Windows boots fine on the same
hardware with 64G.  I think this is a kernel issue.

Version-Release number of selected component (if applicable):

The current kernel, 4.1.6-200, and whatever kernel is on the install image show the same behavior.

How reproducible:

100%

Steps to Reproduce:
1.  Configure machine with 2x 16G DIMMs
2.  Install Fedora
3.  Add two more 16G DIMMs and try booting

Actual results:

Blank screen and hang.  Passing systemd.unit=rescue and removing rhgb/quiet has no effect.  No kernel logs of any type appear.  Note that grub boots fine.

Expected results:

Normal boot.

Additional info:
Comment 1 Joshua Rosen 2015-10-08 18:20:13 EDT
I have an i7-6700K on a Gigabyte GA-Z170XP-SLI with 64G of RAM. Fedora 22 and Fedora 23 Beta boot fine; however, reboot and shutdown -h don't work (I've filed a separate bug).

To rule out a memory problem, run sys_basher; it's in the repositories. To run the memory tests, do:

sys_basher -t 2048 -m >& log &

Run sys_basher by itself: if you have any VMs, shut them down, and don't run any other applications. Sys_basher produces a couple of files that are of interest, sys_basher.log and sys_basher.rpt. If you have a bad DIMM, sys_basher will identify it. Some problems can cause your system to crash while running sys_basher. If your system does crash, examine the log file to see what sys_basher was doing when it crashed. Sys_basher syncs the logs to the disk and records the CPU and motherboard temperatures after each test, so you'll be able to get a pretty good idea of what the problem is.

To run all of sys_basher's tests, including the CPU and disk tests, do:

sys_basher -t 2048 >& log &
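
For readers unfamiliar with the redirection in the commands above: `>& log` (bash and csh-style shells) sends both stdout and stderr to the file, and the trailing `&` runs the job in the background. A minimal sketch of the same pattern in portable sh syntax follows. It uses a stand-in function rather than sys_basher itself; the function name `noisy` and the file name `demo.log` are made up for illustration.

```shell
# Stand-in for a long-running tool (hypothetical; not sys_basher itself).
# It writes one line to stdout and one line to stderr.
noisy() {
  echo "progress on stdout"
  echo "warning on stderr" >&2
}

# Portable spelling of what ">& demo.log" does in bash and csh-style shells:
# send stdout to the file, then point stderr at the same place.
# The trailing "&" puts the job in the background.
noisy > demo.log 2>&1 &

# Wait for the background job to finish before reading the file.
wait

# Both lines should now be in the log.
cat demo.log
```

With a real multi-hour run there is no need to `wait` in the same shell; the log can be inspected at any time while the job is still going.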
Comment 2 Joshua Rosen 2015-10-08 18:34:22 EDT
BTW, sys_basher will take about 7 hours to run on a 64G i7-6700K system, so do it overnight.
Comment 3 Justin M. Forbes 2015-10-20 15:42:03 EDT
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.

Fedora 22 has now been rebased to 4.2.3-200.fc22.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.

If you experience different issues, please open a new bug report for those.
Comment 4 Andy Ross 2015-11-08 11:24:09 EST
I just had a chance to test this against Fedora 23, and it's working fine now.  The live image kernel and the most recent updated one boot just fine on exactly the same hardware and BIOS version where F22 failed.

(Hilariously, the NVIDIA proprietary drivers now hang the machine on boot after installation where they worked fine with the F22 kernel, but that's not your problem.)

Feel free to close this.
Comment 5 Joshua Rosen 2015-11-08 11:40:59 EST
I strongly suggest that you run sys_basher to check your RAM. MSI boards are particularly sensitive to RAM issues; I avoid them and stick with Gigabyte because I've run into memory problems on several MSI boards.

To run just the memory tests, do:

sys_basher -m -t 2048 >& log &
Comment 6 Andy Ross 2015-11-09 11:15:48 EST
Joshua: it's not the RAM.  As described, it is (was) a 100% reliable failure on a device that works without problem under Windows, and when subsets (I enumerated all of them) of the DIMMs are installed in 32G configurations.  And it was resolved completely by a software change.

That's not to say that hardware testing doesn't have value, but it's not a panacea either.  I do system level debugging of Linux kernel issues (under Android; apologies, heh) professionally; I really do know how to diagnose this stuff. :)
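
For reference, a rough sketch of what enumerating the 32G subsets involves, assuming the four 16G DIMMs from the steps to reproduce, so that every 32G configuration is a pair of sticks. The labels A through D are made up for illustration, and slot placement is ignored.

```shell
# Hypothetical labels for the four 16G DIMMs.
set -- A B C D

count=0
# Pair each DIMM with every DIMM that comes after it in the list,
# so each unordered pair is printed exactly once.
while [ $# -gt 1 ]; do
  first=$1
  shift
  for second in "$@"; do
    echo "$first + $second = 32G"
    count=$((count + 1))
  done
done

echo "$count configurations"
```

Under those assumptions this lists six pairs.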
Comment 7 Joshua Rosen 2015-11-09 12:50:58 EST
Andy,

You may be right; however, you shouldn't be so quick to rule out a RAM problem. I've been a computer hardware designer for 40 years. The first half of my career was as a CPU designer; now I do various high-performance chip designs, so I know what I'm talking about also. RAM problems are surprisingly common because the components get only minimal testing at best, since a comprehensive test is too costly. After having stability problems on a number of machines, I wrote sys_basher to find RAM, CPU and heat related problems. Sys_basher does very extensive memory tests over a period of hours; if your system passes, you can be confident that it's reliable. When I build a new box, the first thing I do after installing Linux is to run sys_basher overnight. I encourage you to do the same.
