Bug 668825

Summary:

Server cannot boot with kernel-2.6.32-85

Product:

Red Hat Enterprise Linux 6

Reporter:

Yvugenfi <yvugenfi>

Component:

kernel

Assignee:

bob picco <bpicco>

Status:

CLOSED ERRATA

QA Contact:

Zhang Kexin <kzhang>

Severity:

high

Priority:

medium

Version:

6.3

CC:

arozansk, bpicco, dzickus, kzhang, moshiro, mst, mzywusko, pbunyan, peterm, yugzhang

Flags:

bpicco: needinfo-

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

ptam

Fixed In Version:

kernel-2.6.32-112.el6

Doc Type:

Bug Fix

Last Closed:

2011-05-23 20:35:40 UTC

Attachments:

Description	Flags
sosreport output	none
Boot output - first screen shot	none
Boot output - second screen shot	none
Boot output with "nousb" - first screen shot	none
Boot output with "nousb" - second screen shot	none
EFI RTL physical mode patch	none

Description Yvugenfi@redhat.com 2011-01-11 17:52:46 UTC

Description of problem:
When booting the server with kernel-2.6.32-85 or later, the boot process hangs.

Version-Release number of selected component (if applicable):
kernel-2.6.32-84 - boots OK
kernel-2.6.32-85 and later - boot hangs

How reproducible:
Each time.


Steps to Reproduce:
1. yum install -i kernel-firmware-2.6.32-85.el6.noarch.rpm
2. yum install -i kernel-2.6.32-85.el6.x86_64.rpm
3. reboot
  
Actual results:
The kernel does not boot; the boot process hangs.

Expected results:
Kernel should boot

Additional info:
Will add sosreport file from the server.

Comment 2 Yvugenfi@redhat.com 2011-01-11 17:54:52 UTC

Created attachment 472861 [details]
sosreport output

Comment 3 Don Zickus 2011-01-11 20:44:13 UTC

Hi Yan,

Is there any console output?  A sosreport won't show me what is going on with the hang.

Cheers,
Don

Comment 4 Yvugenfi@redhat.com 2011-01-11 22:11:21 UTC

Created attachment 472906 [details]
Boot output - first screen shot

Comment 5 Yvugenfi@redhat.com 2011-01-11 22:12:50 UTC

Created attachment 472907 [details]
Boot output - second screen shot

Comment 6 Yvugenfi@redhat.com 2011-01-11 22:14:00 UTC

Created attachment 472908 [details]
Boot output with "nousb" - first screen shot

Comment 7 Yvugenfi@redhat.com 2011-01-11 22:15:31 UTC

Created attachment 472909 [details]
Boot output with "nousb" - second screen shot

Comment 12 Takao Indoh 2011-01-13 23:00:31 UTC

In the page2_nousb.JPG, the following message can be seen.

EFI Variables Facility v0.08 2004-May-17
BUG: unable to handle kernel paging request at 00000000ffc0004c.

I guess that efivars_init() (drivers/firmware/efivars.c) tried to access an EFI
variable via a runtime service (get_next_variable) and the call failed.

The address 0xffc0004c seems to be a physical address. In EFI physical mode,
only EFI runtime code/data can be accessed by physical address. According
to var/log/dmesg in the sosreport, the following ranges are runtime
code/data. 0xffc0004c is outside all of these ranges, hence the page fault.

[runtime service code]
range=[0x000000007d5e6000-0x000000007d604000) (0MB)
range=[0x000000007d641000-0x000000007d65f000) (0MB)
range=[0x000000007f60c000-0x000000007f614000) (0MB)
range=[0x000000007f63f000-0x000000007f68f000) (0MB)

[runtime service data]
range=[0x000000007f5ef000-0x000000007f601000) (0MB)
range=[0x000000007f601000-0x000000007f60c000) (0MB)
range=[0x000000007f614000-0x000000007f63f000) (0MB)

Does anybody know what address 0xffc0004c is? Yan, could you get /proc/iomem from this server?
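
The range check described in this comment can be sketched as follows. This is a minimal illustration in Python, not kernel code. The ranges are copied from the dmesg excerpt in this comment; the helper name find_range is hypothetical and exists only for this example.

```python
# Runtime service ranges copied from the dmesg excerpt in this comment.
# Each range is half-open: [start, end).
RUNTIME_CODE = [
    (0x7D5E6000, 0x7D604000),
    (0x7D641000, 0x7D65F000),
    (0x7F60C000, 0x7F614000),
    (0x7F63F000, 0x7F68F000),
]
RUNTIME_DATA = [
    (0x7F5EF000, 0x7F601000),
    (0x7F601000, 0x7F60C000),
    (0x7F614000, 0x7F63F000),
]

# Faulting address from the "unable to handle kernel paging request" message.
FAULT_ADDR = 0xFFC0004C


def find_range(addr, ranges):
    """Return the (start, end) range that contains addr, or None."""
    for start, end in ranges:
        if start <= addr < end:
            return (start, end)
    return None


hit = find_range(FAULT_ADDR, RUNTIME_CODE + RUNTIME_DATA)
if hit is None:
    print(hex(FAULT_ADDR), "is outside every runtime code/data range")
else:
    print(hex(FAULT_ADDR), "is inside", [hex(x) for x in hit])
```

With the ranges listed here, the faulting address matches none of them, which agrees with the observation that it lies outside the runtime code/data regions.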

Comment 13 Michael S. Tsirkin 2011-01-14 06:34:29 UTC

00000000-00000fff : reserved
00001000-0006bfff : System RAM
0006c000-0006cfff : ACPI Non-volatile Storage
0006d000-0009efff : System RAM
0009f000-0009ffff : ACPI Non-volatile Storage
00100000-7d5e5fff : System RAM
  01000000-014cda67 : Kernel code
  014cda68-01ba586f : Kernel data
  01ce3000-01f9e077 : Kernel bss
  02000000-0a0fffff : Crash kernel
7d5e6000-7d603fff : reserved
7d604000-7d640fff : System RAM
7d641000-7d65efff : reserved
7d65f000-7d7dafff : System RAM
7d7db000-7d88afff : reserved
7d88b000-7f5eefff : System RAM
7f5ef000-7f6defff : reserved
7f6df000-7f7defff : ACPI Non-volatile Storage
7f7df000-7f7fefff : ACPI Tables
7f7ff000-7f7fffff : System RAM
7f800000-7fffffff : RAM buffer
80000000-8fffffff : PCI MMCONFIG 0 [00-ff]
  80000000-8fffffff : reserved
    80000000-8fffffff : pnp 00:0a
90000000-901fffff : PCI Bus 0000:0b
  90000000-901fffff : 0000:0b:00.0
92000000-95ffffff : PCI Bus 0000:10
  92000000-93ffffff : 0000:10:00.0
    92000000-93ffffff : bnx2
  94000000-95ffffff : 0000:10:00.1
    94000000-95ffffff : bnx2
96000000-96ffffff : PCI Bus 0000:06
  96000000-96ffffff : PCI Bus 0000:07
    96000000-96ffffff : 0000:07:00.0
      96000000-965fffff : efifb
97000000-978fffff : PCI Bus 0000:06
  97000000-978fffff : PCI Bus 0000:07
    97000000-977fffff : 0000:07:00.0
    97800000-97803fff : 0000:07:00.0
97900000-979fffff : PCI Bus 0000:0b
  97900000-9790ffff : 0000:0b:00.0
    97900000-9790ffff : mpt
  97910000-97913fff : 0000:0b:00.0
    97910000-97913fff : mpt
97a00000-97a03fff : 0000:00:16.0
  97a00000-97a03fff : ioatdma
97a04000-97a07fff : 0000:00:16.1
  97a04000-97a07fff : ioatdma
97a08000-97a0bfff : 0000:00:16.2
  97a08000-97a0bfff : ioatdma
97a0c000-97a0ffff : 0000:00:16.3
  97a0c000-97a0ffff : ioatdma
97a10000-97a13fff : 0000:00:16.4
  97a10000-97a13fff : ioatdma
97a14000-97a17fff : 0000:00:16.5
  97a14000-97a17fff : ioatdma
97a18000-97a1bfff : 0000:00:16.6
  97a18000-97a1bfff : ioatdma
97a1c000-97a1ffff : 0000:00:16.7
  97a1c000-97a1ffff : ioatdma
97a21000-97a213ff : 0000:00:1d.7
  97a21000-97a213ff : ehci_hcd
97a21400-97a217ff : 0000:00:1a.7
  97a21400-97a217ff : ehci_hcd
97a21800-97a218ff : 0000:00:1f.3
fc000000-fcffffff : pnp 00:0a
fe710000-fe711fff : pnp 00:0a
fe800000-fe9fffff : pnp 00:0a
fea00000-feafffff : pnp 00:0a
feb00000-febfffff : pnp 00:0a
fec00000-fec00fff : IOAPIC 0
fec80000-fec80fff : IOAPIC 1
fed00000-fed003ff : HPET 0
  fed00000-fed003ff : pnp 00:05
fed1c000-fed1ffff : reserved
  fed1c000-fed1ffff : pnp 00:0a
fee00000-feefffff : pnp 00:0a
  fee00000-fee00fff : Local APIC
ff800000-ffffffff : reserved
  ffc00000-ffffffff : pnp 00:0a
100000000-27fffffff : System RAM
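
To see which entries of this listing cover the faulting address 0xffc0004c from comment 12, a lookup along the following lines could be used. It is a rough sketch in Python that works on a short excerpt pasted from the listing above rather than on the live /proc/iomem file; the function name containing_regions is hypothetical.

```python
# A few lines copied from the /proc/iomem listing above.
# Indentation marks nesting; ranges in /proc/iomem are inclusive.
IOMEM_EXCERPT = """\
fee00000-feefffff : pnp 00:0a
  fee00000-fee00fff : Local APIC
ff800000-ffffffff : reserved
  ffc00000-ffffffff : pnp 00:0a
100000000-27fffffff : System RAM
"""


def containing_regions(iomem_text, addr):
    """Return the names of all regions containing addr, outermost first."""
    hits = []
    for line in iomem_text.splitlines():
        if " : " not in line:
            continue  # skip blank or malformed lines
        span, name = line.strip().split(" : ", 1)
        start, end = (int(part, 16) for part in span.split("-"))
        if start <= addr <= end:
            hits.append(name)
    return hits


print(containing_regions(IOMEM_EXCERPT, 0xFFC0004C))
```

For this excerpt the address falls in the "reserved" region ff800000-ffffffff and, nested inside it, the "pnp 00:0a" region ffc00000-ffffffff.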

Comment 14 bob picco 2011-01-14 12:34:34 UTC

(In reply to comment #12)
> In the page2_nousb.JPG, the following message can be seen.
> 
> EFI Variables Facility v0.08 2004-May-17
> BUG: unable to handle kernel paging request at 00000000ffc0004c.
> 
> I guess that efivars_init(drivers/firmware/efivars.c) tried to access EFI
> variable via runtime service(get_next_variable) and it failed.
> 
> The address 0xffc0004c seems to be physical address. In efi physical mode,
> only efi runtime code/data can be accessed with physical address. According
> to var/log/dmesg in sosreport, the following addresses are runtime
> code/data. 0xffc0004c is out of these ranges hence page fault occurred.
> 
> [runtime service code]
> range=[0x000000007d5e6000-0x000000007d604000) (0MB)
> range=[0x000000007d641000-0x000000007d65f000) (0MB)
> range=[0x000000007f60c000-0x000000007f614000) (0MB)
> range=[0x000000007f63f000-0x000000007f68f000) (0MB)
> 
> [runtime service data]
> range=[0x000000007f5ef000-0x000000007f601000) (0MB)
> range=[0x000000007f601000-0x000000007f60c000) (0MB)
> range=[0x000000007f614000-0x000000007f63f000) (0MB)
> 
> Anybody knows what address 0xffc0004c is? Yan, could you get /proc/iomem in
> this server?

I looked at dmesg on Wed or Thu and want to look at it again. The attribute in
the EFI memory descriptor marks the region as EFI runtime, and its memory type
is MMIO. This was my concern when I asked you about my patch back in the middle
of November. You didn't believe it to be an issue. Now I think it definitely is.
My patch for commit 1deea99897b17206b3069b0e5ede7dafa068d117 has caused this
problem. The commit corrected your use of EFI memory type but it exposed a
larger issue. I should have looked at this far more closely. Also, the page
table entries need to be uncached for MMIO.

IBM EFI seems to use this MMIO location in EFI RTL. I'm willing to bet this
could be true of other EFI implementations. I think you need to look at the EFI
virtual mode code, which I have looked at a little.

This analysis needs to be reverified. It was done very quickly.

Unfortunately I won't have much time to look further until possibly Tuesday
next week. I'm on the hook for another issue which is due on close of business
Monday and a totally new and frustrating area for me (not kernel).

bob

Comment 15 bob picco 2011-01-18 18:02:55 UTC

Okay. I verified this.

This IBM machine has three EFI memory descriptors for MMIO. Two are uncached and
one is cached(?). The region in question is uncached.
EFI: mem187: type=11, attr=0x8000000000000001, range=[0x00000000ff800000-0x0000000100000000) (8MB)
EFI: mem189: type=11, attr=0x8000000000000001, range=[0x0000000080000000-0x0000000090000000) (256MB)
EFI: mem190: type=11, attr=0x8000000000000000, range=[0x00000000fed1c000-0x00000000fed20000) (0MB)
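
The type and attr fields in these three lines can be decoded with a small script like the one below. It is a sketch in Python, not part of the patch. The constants follow the UEFI specification (memory type 11 is EfiMemoryMappedIO, bit 0 of the attribute is EFI_MEMORY_UC, bit 63 is EFI_MEMORY_RUNTIME); the regular expression and the decode function are hypothetical helpers that assume the dmesg line format shown above.

```python
import re

# Values defined by the UEFI specification.
EFI_MEMORY_MAPPED_IO = 11
ATTR_BITS = {
    0x1: "UC",                        # uncacheable
    0x2: "WC",                        # write-combining
    0x4: "WT",                        # write-through
    0x8: "WB",                        # write-back
    0x8000000000000000: "RUNTIME",    # needs a runtime mapping
}

# The three descriptor lines quoted above.
DMESG_LINES = [
    "EFI: mem187: type=11, attr=0x8000000000000001, range=[0x00000000ff800000-0x0000000100000000) (8MB)",
    "EFI: mem189: type=11, attr=0x8000000000000001, range=[0x0000000080000000-0x0000000090000000) (256MB)",
    "EFI: mem190: type=11, attr=0x8000000000000000, range=[0x00000000fed1c000-0x00000000fed20000) (0MB)",
]

# Assumed line format, matching the lines above.
LINE_RE = re.compile(
    r"EFI: (mem\d+): type=(\d+), attr=(0x[0-9a-f]+), "
    r"range=\[(0x[0-9a-f]+)-(0x[0-9a-f]+)\)"
)


def decode(line):
    """Parse one 'EFI: memN' dmesg line into a dict of decoded fields."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("unrecognized descriptor line: " + line)
    name, mem_type, attr, start, end = match.groups()
    attr = int(attr, 16)
    return {
        "name": name,
        "type": int(mem_type),
        "flags": [label for bit, label in ATTR_BITS.items() if attr & bit],
        "start": int(start, 16),
        "end": int(end, 16),  # exclusive
    }


for line in DMESG_LINES:
    d = decode(line)
    covers = d["start"] <= 0xFFC0004C < d["end"]
    print(d["name"], d["type"] == EFI_MEMORY_MAPPED_IO, d["flags"], covers)
```

Decoded this way, mem187 and mem189 carry UC plus RUNTIME, while mem190 carries RUNTIME with no cacheability bit set at all, and only mem187 covers the faulting address 0xffc0004c.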

I've done a brew build of the patch:
https://brewweb.devel.redhat.com/taskinfo?taskID=3043894
It has been boot tested on a DELL machine only. UEFI has to be installed in the
lab (at least it has been so far). I'm not physically there today and, with the
weather, maybe not tomorrow either.


thanx,

bob

Comment 16 bob picco 2011-01-18 18:04:53 UTC

Created attachment 474116 [details]
EFI RTL physical mode patch

Comment 17 Takao Indoh 2011-01-18 22:58:18 UTC

I tested your patch on Fujitsu PRIMEQUEST(UEFI/x86_64). It booted normally and kdump worked.

Comment 18 bob picco 2011-01-19 13:48:54 UTC

I verified this patch on boiler.eng.lab.tlv.redhat.com (an IBM UEFI blade), thanx to Michael. It booted twice. The second boot was to test an NVRAM update by efibootmgr, which I know little about, but the boot timeout remained.

Comment 19 Peter Martuccelli 2011-01-20 13:40:36 UTC

Reassigning to Picco as he has a solution to the regression.  Expect a patch posting this morning.

Comment 21 RHEL Program Management 2011-01-20 13:50:19 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 29 Aristeu Rozanski 2011-02-03 16:46:25 UTC

Patch(es) available on kernel-2.6.32-112.el6

Comment 37 errata-xmlrpc 2011-05-23 20:35:40 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html