Bug 1387919

Summary: Linux kernel 4.7.9-200 does not boot on Dell Inspiron 500m
Product: [Fedora] Fedora
Component: kernel
Version: 24
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Type: Bug
Reporter: Albert Flügel <af>
Assignee: Prarit Bhargava <prarit>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: af, gansalmon, george.sigut, ichavero, itamar, jonathan, kernel-maint, labbott, madhu.chinakonda, mchehab, nvlbox, prarit, robn, sajero, saxonm, slingamn
Last Closed: 2017-04-17 14:33:01 UTC
Attachments (flags: none for all):
lspci output of the machine with 4.7.9 failing
dmidecode output of the machine with 4.7.9 failing
lspci output of a Lenovo T500 with 4.7.9 working
dmidecode output of a Lenovo T500 with 4.7.9 working
lspci output of a box with Asus board and AMD processor (see #10)
dmidecode output of a box with Asus board and AMD processor (see #10)
Patch to make kernel 4.7.9 boot again on certain architectures
Patch adding "x86/smpboot: Init apic mapping before usage" to kernel build as Patch899

Description Albert Flügel 2016-10-23 15:51:06 UTC
Description of problem:
Upgraded kernel to current 4.7.9

Version-Release number of selected component (if applicable):
4.7.9-200

How reproducible:
Install kernel 4.7.9-200 and boot on a Dell Inspiron 500m, also called Latitude D500

Steps to Reproduce:
1. Install kernel 4.7.9
2. reboot


Actual results:
After Grub message "Loading initial ramdisk" nothing happens anymore

Expected results:
Linux kernel starts

Additional info:
I installed 4.7.9 twice, in case the initrd had been created incorrectly the first time, but this did not help.
Removing the kernel command-line arguments rhgb and quiet changed nothing: still no output at all, and no hint as to what fails.
Ctrl-Alt-Del does not work; I have to press the power button.
noapic or noapm or both do not help.
The CPU is an Intel(R) Pentium(R) M processor 2.10GHz, in case that matters, though I can't imagine it does. This laptop has run Fedora for many versions, and this is the first time the kernel does not start.
4.7.7-200 was fine.
There seems to be no version in between, e.g. 4.7.8.

Comment 1 Albert Flügel 2016-10-23 16:00:13 UTC
On a Lenovo T500, 4.7.9 starts normally. The problem might be limited to the 32-bit kernel?

Comment 2 Nick Lee 2016-10-24 18:06:04 UTC
Absolutely the same problem

Fedora24 x86_64 with Nouveau module

CPU Pentium(R) Dual-Core  E5200  @ 2.50GHz

lspci
00:00.0 Host bridge: NVIDIA Corporation MCP79 Host Bridge (rev b1)
00:00.1 RAM memory: NVIDIA Corporation MCP79 Memory Controller (rev b1)
00:03.0 ISA bridge: NVIDIA Corporation MCP79 LPC Bridge (rev b2)
00:03.1 RAM memory: NVIDIA Corporation MCP79 Memory Controller (rev b1)
00:03.2 SMBus: NVIDIA Corporation MCP79 SMBus (rev b1)
00:03.3 RAM memory: NVIDIA Corporation MCP79 Memory Controller (rev b1)
00:03.4 RAM memory: NVIDIA Corporation MCP79 Memory Controller (rev b1)
00:03.5 Co-processor: NVIDIA Corporation MCP79 Co-processor (rev b1)
00:04.0 USB controller: NVIDIA Corporation MCP79 OHCI USB 1.1 Controller (rev b1)
00:04.1 USB controller: NVIDIA Corporation MCP79 EHCI USB 2.0 Controller (rev b1)
00:06.0 USB controller: NVIDIA Corporation MCP79 OHCI USB 1.1 Controller (rev b1)
00:06.1 USB controller: NVIDIA Corporation MCP79 EHCI USB 2.0 Controller (rev b1)
00:08.0 Audio device: NVIDIA Corporation MCP79 High Definition Audio (rev b1)
00:09.0 PCI bridge: NVIDIA Corporation MCP79 PCI Bridge (rev b1)
00:0a.0 Ethernet controller: NVIDIA Corporation MCP79 Ethernet (rev b1)
00:0b.0 SATA controller: NVIDIA Corporation MCP79 AHCI Controller (rev b1)
00:0c.0 PCI bridge: NVIDIA Corporation MCP79 PCI Express Bridge (rev b1)
00:10.0 PCI bridge: NVIDIA Corporation MCP79 PCI Express Bridge (rev b1)
00:15.0 PCI bridge: NVIDIA Corporation MCP79 PCI Express Bridge (rev b1)
00:16.0 PCI bridge: NVIDIA Corporation MCP79 PCI Express Bridge (rev b1)
00:17.0 PCI bridge: NVIDIA Corporation MCP79 PCI Express Bridge (rev b1)
00:18.0 PCI bridge: NVIDIA Corporation MCP79 PCI Express Bridge (rev b1)
03:00.0 VGA compatible controller: NVIDIA Corporation C79 [GeForce 9300 / nForce 730i] (rev b1)

Comment 3 orejas 2016-10-24 19:31:25 UTC
No boot with 4.7.9 i686 on an HP Pavilion zt3000 with 512 MB RAM. Kernel 4.7.5 boots OK.
With 4.7.9, after removing quiet from the boot line, I see
"Probing EDD (edd=off to disable) ... ok"
and then it freezes.

Video: AMD/ATI RV250/M9 GL (Mobility FireGL 9000/Radeon 9000)
CPU: GenuineIntel, CPU family 6, model 9, Intel Pentium M processor 1500 MHz
Disk: 150 GB

Comment 4 Laura Abbott 2016-10-24 19:50:12 UTC
Can you test http://koji.fedoraproject.org/koji/taskinfo?taskID=16118947 ? A similar boot issue was reported in https://bugzilla.redhat.com/show_bug.cgi?id=1384238

Comment 5 Nick Lee 2016-10-24 20:48:05 UTC
4.7.8-200.rhbz1384238.fc24.x86_64 the same fail. After Grub nothing.

Comment 6 Al Schapira 2016-10-25 18:56:24 UTC
Same problem. On a Dell C840 (Pentium), 4.7.7 (F23) works fine, but after a yum update 4.7.8 failed to boot, and after a second yum update 4.7.9 fails to boot, both showing only the boot message on an otherwise black screen.

Comment 7 Nick Lee 2016-10-25 20:35:10 UTC
4.8.4-200.fc24.x86_64 the same fail. After Grub nothing.

Comment 8 Rob van Nieuwkerk 2016-10-26 01:43:59 UTC
I have a similar problem: on a Dell Inspiron 8600 laptop (Pentium M, 32-bit, 2GB RAM) kernel-4.7.9-200.fc24.i686 does not boot.

I have to revert to the kernel that came with the install: kernel-4.5.5-300.fc24.i686 (I just installed F24 on this laptop).

The processor does not support PAE. Another old laptop with PAE CPU *does* boot fine (with the PAE-version of the new kernel)

Comment 9 Albert Flügel 2016-10-26 07:40:31 UTC
Created attachment 1214193 [details]
lspci output of the machine with 4.7.9 failing

Comment 10 Albert Flügel 2016-10-26 07:41:00 UTC
Regarding #4: I cannot try this build, because it is for x86_64. The laptop on which 4.7.9 does not start is an i686. Is there also a build for i686?

I tried 4.8.4-200.fc24.i686: Same story, not a single message after the grub output.

Additional info, i don't know if this matters:
When I switch off the laptop in this hung state (even Ctrl-Alt-Del no longer works), the BIOS performs additional checks during the next boot, probably because it assumes that the previous POST did not finish. This seems weird to me, as at least the boot loader had started. The BIOS POST should be over by then, right?
So could this mean that the first pieces of kernel code confuse the BIOS?

Additional info:
4.7.9 starts without issue as a KVM/QEMU guest, also on a box with an AMD Phenom processor and an Asus M5A88-V EVO board.

I'll attach the lspci and dmidecode output for the different cases I've tested (except for the KVM/QEMU virtual machine) and would like to encourage others to do the same. This may give some clue as to the common denominator of the machines on which Linux does not start from version 4.7.8 upward.

Comment 11 Albert Flügel 2016-10-26 07:41:42 UTC
Created attachment 1214195 [details]
dmidecode output of the machine with 4.7.9 failing

Comment 12 Albert Flügel 2016-10-26 07:42:37 UTC
Created attachment 1214196 [details]
lspci output of a Lenovo T500 with 4.7.9 working

Comment 13 Albert Flügel 2016-10-26 07:43:15 UTC
Created attachment 1214197 [details]
dmidecode output of a Lenovo T500 with 4.7.9 working

Comment 14 Albert Flügel 2016-10-26 07:45:31 UTC
Created attachment 1214198 [details]
lspci output of a box with Asus board and AMD processor (see #10)

Comment 15 Albert Flügel 2016-10-26 07:46:16 UTC
Created attachment 1214199 [details]
dmidecode output of a box with Asus board and AMD processor (see #10)

Comment 17 Albert Flügel 2016-10-26 21:05:39 UTC
I tried to revert the patch mentioned in comment #16, but it did not help.
I had already looked at that patch and it looks clean to me. However, it does not seem complete; at least, reverting it with patch -R did not work. To get a consistent source state I had to do it manually. Unfortunately, as said, still no boot.

Comment 18 Shivaram Lingamneni 2016-10-27 18:33:55 UTC
I believe I'm affected by this bug. By removing `rhgb quiet` and adding `debug ignore_loglevel earlyprintk=vga,keep` to the kernel command line, I got a panic message:

https://i.imgur.com/PhFBU0Y.jpg

and with the further addition of `boot_delay=10`, I was able to get a video of the panic (trace starts around 1:20):

http://sendvid.com/v3uxqqtl

so this does look like it might be related to the APIC change.

Here's `lshw -sanitize` under the last working kernel, 4.7.5-200.fc24:

https://gist.github.com/slingamn/afac16aea8c17d37f95ebe2388779df2
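
The command-line change described in this comment can be sketched as a small helper. This is only an illustration of the transformation (drop `rhgb quiet`, add the debug options, optionally add `boot_delay`); the function name and the example command line are hypothetical, and in practice the line is edited by hand in the GRUB menu.

```python
def debug_cmdline(cmdline, boot_delay=None):
    """Return a kernel command line adjusted for early-boot debugging."""
    drop = {"rhgb", "quiet"}  # options that hide boot messages
    add = ["debug", "ignore_loglevel", "earlyprintk=vga,keep"]
    if boot_delay is not None:
        # Slows down message output so a panic can be filmed or read.
        add.append(f"boot_delay={boot_delay}")
    args = [arg for arg in cmdline.split() if arg not in drop]
    for option in add:
        if option not in args:
            args.append(option)
    return " ".join(args)


# Hypothetical starting command line, not taken from any affected machine.
example = "ro root=/dev/mapper/fedora-root rhgb quiet"
print(debug_cmdline(example, boot_delay=10))
```

The helper leaves all other arguments untouched and in their original order, so it can be applied to whatever line the boot loader currently shows.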

Comment 19 Albert Flügel 2016-10-27 20:58:33 UTC
Interesting findings. What do you mean by "this apic change"? There are not only the changes from vanilla kernel 4.7.7 to 4.7.8. As far as I have seen, several other APIC-related patches added by Red Hat are pulled in as well, and these might interfere with upstream changes or whatever.
So what I tried to do is build without these patches. Some time ago it was possible to comment the patches out in the spec file. Today, as usual, everything has changed and commenting them out does not help. There is a macro _with_vanilla in the spec file, but when I set it for rpmbuild -bb ..., I get an error message, so this does not work. It seems no one uses this option. So the next thing I would try is to reverse-engineer this patching mechanism and modify the spec file accordingly. But frankly, after the many hours I have spent on this, I'm really fed up. I wouldn't be surprised if this ended up as a wontfix because our hardware is too old ...

Comment 20 Al Schapira 2016-10-27 22:15:44 UTC
I hope "wontfix" won't follow "justbroke".

Comment 21 Shivaram Lingamneni 2016-10-27 22:30:25 UTC
Sorry, I spoke carelessly: I don't have any special insight into what APIC-related change might have caused the issue.

Do you have the same panic message as I do (with the addition of those command-line options)?

Comment 22 Albert Flügel 2016-10-28 08:33:01 UTC
Thank you very much for the arguments for this super-cool slow-motion boot! I didn't know these yet.
What I get can be seen here: http://www.muc.de/~af/scr.jpg .
I did not see the video, so I cannot compare now.
I see a null pointer memory access early in native_apic_mem_read. I can't check what this means now, because I'm on the way to vacation, away from computers for a few days.
It could have to do with the patch abolishing the apic version array, but not necessarily. I will look at the kernel code on Monday, if no one else has done so by then.

Comment 23 GMS 2016-10-28 19:43:31 UTC
(In reply to Albert Flügel from comment #0)
Same problem with IBM ThinkPad X31 (works with 4.7.6). Generation later
(IBM ThinkPad X60) is fine with 4.7.9. Both 32 bit.

Comment 24 Albert Flügel 2016-10-31 17:32:28 UTC
Created attachment 1215884 [details]
Patch to make kernel 4.7.9 boot again on certain architectures

Put the patch into SOURCES, add a line like this to kernel.spec:
Patch899: cpu_from_apic_too_early.patch
and rebuild as usual with rpmbuild -bb ...

Comment 25 Albert Flügel 2016-10-31 17:49:02 UTC
The problem, as can be seen here: http://www.muc.de/~af/scr4.jpg , is that hard_smp_processor_id is called by prefill_possible_map at a very early boot stage.
prefill_possible_map collects information about CPUs to fill some data structures. hard_smp_processor_id calls functions that access memory-mapped hardware registers, e.g. those of the APIC, through the fixmap area. At this early stage of booting, that mapping is not established yet, so the access cannot work. This causes the message
BUG: unable to handle kernel paging request at ffffc020
visible in http://www.muc.de/~af/scr.jpg . The named address is exactly where the APIC ID register is expected. It is also present in the EAX register, and the instruction at native_apic_mem_read + 17 is a mov from this address. So this access leads to an oops.
Looking at prefill_possible_map, one can see that the value obtained from hard_smp_processor_id is used only for an informational message. The attached patch comments out the call to hard_smp_processor_id and replaces the CPU identifier in the message with the already available APIC CPU id. This makes the kernel boot again.
The resulting output is probably not what the upstream maintainers want to see. However, for now it is IMO better to have somewhat inappropriate output than an unbootable Linux.

RPM packages to test on i686 (no PAE !) can be downloaded here:
http://www.muc.de/~af/linux

Feedback is welcome, upstream communication requested.
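
The failure and the workaround described in this comment can be illustrated with a toy user-space model. This is not kernel code: `ToyMachine`, `prefill_original`, `prefill_workaround` and the message text are hypothetical names made up for the sketch, and the 4 KiB page size is an assumption. Only the faulting address ffffc020 and the function name hard_smp_processor_id come from the report.

```python
class PageFault(Exception):
    """Toy stand-in for 'BUG: unable to handle kernel paging request'."""


APIC_REG_ADDR = 0xFFFFC020  # faulting address quoted in the oops
PAGE_MASK = ~0xFFF          # assumes 4 KiB pages


class ToyMachine:
    """Hypothetical model: a register read faults unless its page is mapped."""

    def __init__(self, apic_id):
        self.apic_id = apic_id     # the "already available" APIC CPU id
        self.mapped_pages = set()  # nothing is mapped this early in boot

    def init_apic_mapping(self):
        self.mapped_pages.add(APIC_REG_ADDR & PAGE_MASK)

    def hard_smp_processor_id(self):
        if (APIC_REG_ADDR & PAGE_MASK) not in self.mapped_pages:
            raise PageFault(
                f"unable to handle kernel paging request at {APIC_REG_ADDR:08x}"
            )
        return self.apic_id


def prefill_original(machine):
    # Reads the ID through the (possibly unmapped) register just to print it.
    return f"boot CPU id {machine.hard_smp_processor_id()}"


def prefill_workaround(machine):
    # Uses the value that is already known; no register access at all.
    return f"boot CPU id {machine.apic_id}"


machine = ToyMachine(apic_id=0)
try:
    prefill_original(machine)
except PageFault as fault:
    print("oops:", fault)
print(prefill_workaround(machine))
```

In this model the original path only succeeds once `init_apic_mapping()` has run, which is the ordering question discussed later in the thread.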

Comment 26 Albert Flügel 2016-11-02 07:28:57 UTC
The patch works also for 4.8.4. Packages for i686 also in http://www.muc.de/~af/linux

Comment 27 Albert Flügel 2016-11-02 11:02:36 UTC
The problem came in with commit 2a51fe083eba7f99cbda72f5ef90cdf2f4df882c .
However, it's just the call to hard_smp_processor_id that leads to oops.

Comment 28 Albert Flügel 2016-11-03 10:02:07 UTC
RPMs with the patch built into kernel 4.8.4 also for i686-PAE and the x86_64 architecture can be found in http://www.muc.de/~af/linux

Comment 29 Prarit Bhargava 2016-11-03 12:07:43 UTC
(In reply to Albert Flügel from comment #25)
Is there an upstream thread for this (on LKML or other?)?

P.

Comment 30 Prarit Bhargava 2016-11-03 12:10:56 UTC
Can you also please test with linux.git commit 1e90a13d0c3d ("x86/smpboot: Init apic mapping before usage")?

Thanks,

P.

Comment 31 Albert Flügel 2016-11-06 13:23:07 UTC
I don't know, whether there is any upstream thread. I did not initiate one.
The problem does not show up with 4.8.6-201.
Commit 1e90a13d0c3d makes sense to me in this context.
4.8.6-201 seems to fix it differently, by calling hard_smp_processor_id only if boot_cpu_has(X86_FEATURE_APIC) is true.
It can be done this way.
Building 4.8.4 with just this commit added would take my laptop another 5 hours, and if it does not work right away it would probably take me more hours to adapt the surrounding code so that it builds.

Comment 32 Rob van Nieuwkerk 2016-11-08 00:53:42 UTC
Kernel 4.8.6-201.fc24.i686 solves the problem for my Dell Inspiron 8600 laptop (non-PAE Pentium M, 32-bit, 2GB RAM): it boots fine again!  (see comment 8)

Comment 33 GMS 2016-11-09 19:30:37 UTC
(In reply to Albert Flügel from comment #31)
> I don't know, whether there is any upstream thread. I did not initiate one.
> The problem does not show up with 4.8.6-201.
...

Can confirm: Kernel 4.8.6-201 (32bit) is working on IBM ThinkPad X31.
(didn't try 4.8.4)

Comment 34 Shivaram Lingamneni 2016-11-09 20:15:40 UTC
+1, 4.8.4-200.fc24.i686 is broken and 4.8.6-201.fc24.i686 is working.

Comment 35 Nick Lee 2016-11-10 05:22:36 UTC
With 'APIC Mode - disabled' in the BIOS, kernel-4.8.6-201-x86_64 does not boot for me. Hardware as in comment #2.

Comment 36 Albert Flügel 2016-11-10 07:41:04 UTC
Nick, do you have a chance to try the same with the patched 4.8.4 from http://www.muc.de/~af/linux ?

Comment 37 Nick Lee 2016-11-10 19:49:12 UTC
Albert,
unfortunately your kernel does not boot either.
Please see my boot log:
http://imgur.com/a/PvCrO
http://imgur.com/a/CvrlA

Comment 38 Albert Flügel 2016-11-10 20:55:18 UTC
Interesting. I think I'll really give 4.8.4 + this commit 1e90a13d0c3d a try. Frankly, I can't say that I can really judge which way of structuring the functionality makes more sense here. This 1e90a13d0c3d looks clearer to me. Initializing the APIC access before using it seems more logical than initializing it later and skipping certain APIC-related consistency checks beforehand under conditions that do not seem clear. Some upstream maintainer familiar with the APIC code should have a look at this.
However, I'll try a build with 1e90a13d0c3d instead of the way 4.8.6-201 tries to avoid the problem. But this can take some time.

Comment 39 Albert Flügel 2016-11-11 14:37:45 UTC
I built a kernel 4.8.4 with "x86/smpboot: Init apic mapping before usage" (i find this as commit 0c524f819683e9f1c165d571256a9023b56f1f0c) included, currently only for x86_64 (and without the other patch that appeared in 4.8.6) as 4.8.4-301 in http://www.muc.de/~af/linux .

Builds for i686 and i686-PAE will follow as i find the time.

I had to modify the patch a bit and will attach it here. I added it as Patch899 in the SPEC file.

Comment 40 Albert Flügel 2016-11-11 14:39:52 UTC
Created attachment 1219816 [details]
Patch adding "x86/smpboot: Init apic mapping before usage" to kernel build as Patch899

Comment 41 Nick Lee 2016-11-12 05:31:46 UTC
Albert,

I got normal boot with your kernel-4.8.4-301. APIC disabled in BIOS.

Comment 42 Albert Flügel 2016-11-12 15:09:43 UTC
My laptop also boots with this source configuration 4.8.4-301 as outlined in comment 39. Packages for i686 are also in http://www.muc.de/~af/linux now. i686-PAE will follow tomorrow.

To me this seems the appropriate fix. The check boot_cpu_has(X86_FEATURE_APIC) in 4.8.6 probably yields true even when the APIC is switched off for whatever reason (probably because it was causing problems), and this breaks the boot.

The attached patch should be after Patch849 in the spec file.
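
The difference between the two approaches, as hypothesized in this comment, can be sketched with a toy model. It is not kernel code and does not verify the hypothesis: `ToyBoot`, `boot_with_guard`, `boot_with_early_init`, `try_boot` and the `mapping_ready` parameter are hypothetical names, and `feature_apic` merely stands in for the result of boot_cpu_has(X86_FEATURE_APIC).

```python
class PageFault(Exception):
    """Toy stand-in for the oops on an unmapped APIC register read."""


class ToyBoot:
    def __init__(self, feature_apic, mapping_ready):
        self.feature_apic = feature_apic  # stand-in for boot_cpu_has(X86_FEATURE_APIC)
        self.apic_mapped = mapping_ready  # assumed: mapping may or may not exist yet

    def init_apic_mapping(self):
        self.apic_mapped = True

    def hard_smp_processor_id(self):
        if not self.apic_mapped:
            raise PageFault("APIC register read before the mapping exists")
        return 0


def boot_with_guard(feature_apic, mapping_ready):
    # Approach attributed to 4.8.6-201: only read the ID if the feature bit is set.
    machine = ToyBoot(feature_apic, mapping_ready)
    if machine.feature_apic:
        machine.hard_smp_processor_id()
    machine.init_apic_mapping()
    return "booted"


def boot_with_early_init(feature_apic, mapping_ready):
    # Approach of "Init apic mapping before usage": set up the mapping first.
    machine = ToyBoot(feature_apic, mapping_ready)
    machine.init_apic_mapping()
    machine.hard_smp_processor_id()
    return "booted"


def try_boot(strategy, feature_apic, mapping_ready):
    try:
        return strategy(feature_apic, mapping_ready)
    except PageFault:
        return "oops"


# Hypothesized case: feature bit reads true, but no mapping exists yet.
print("guard:     ", try_boot(boot_with_guard, True, False))
print("early init:", try_boot(boot_with_early_init, True, False))
```

Under these assumptions the guard only helps when the feature bit is clear, while initializing the mapping first does not depend on the bit at all.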

Comment 43 Nick Lee 2016-11-20 07:28:27 UTC
4.8.7-200.fc24.x86_64 from fedora repo boot ok with APIC disabled in BIOS.

Comment 44 Al Schapira 2016-11-28 01:29:05 UTC
Kernel 4.8.8-100.fc23.i686 #1 SMP boots on my Dell C840!
A big THANK-YOU to all who made this happen.

Comment 45 Justin M. Forbes 2017-04-11 14:39:19 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.

Fedora 24 has now been rebased to 4.10.9-100.fc24.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

Comment 46 Albert Flügel 2017-04-14 17:06:41 UTC
Everything is fine with 4.10.9-100.fc24; this bug has been gone for several versions, thank you.