Bug 240006 - mmap of VGA frame buffer causes MCA on ia64
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Jarod Wilson
QA Contact: Martin Jenner
Keywords: OtherQA, Regression
Duplicates: 243777 250841 252399
Blocks: 246139 222082 229340 245607 250288 252399 296411 372911 RHEL5u2_relnotes 420521 422431 422441 422491

Reported: 2007-05-14 06:55 EDT by Martin Poole
Modified: 2010-10-22 10:57 EDT
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 10:42:54 EDT

Attachments:
ioremap patch (3.47 KB, patch), 2007-05-14 06:55 EDT, Martin Poole
ia64-ioremap patch (4.70 KB, patch), 2007-05-14 06:58 EDT, Martin Poole
ioremap align patch (2.42 KB, patch), 2007-05-14 06:58 EDT, Martin Poole
legacy mem patch (917 bytes, patch), 2007-05-14 06:59 EDT, Martin Poole
mmap validation patch (2.59 KB, patch), 2007-05-14 07:00 EDT, Martin Poole
linux-2.6-ioremap-initial-support.patch (3.48 KB, patch), 2007-08-06 18:28 EDT, Jarod Wilson
linux-2.6-ia64-ioremap-variable-name-consistency.patch (2.57 KB, patch), 2007-08-06 18:29 EDT, Jarod Wilson
linux-2.6-ia64-ioremap-avoid-unsupported-attrs.patch (4.24 KB, patch), 2007-08-06 18:30 EDT, Jarod Wilson
linux-2.6-ia64-ioremap-allow-cacheable-mmaps-legacy_mem.patch (917 bytes, patch), 2007-08-06 18:30 EDT, Jarod Wilson
linux-2.6-ia64-ioremap-mmap-validation.patch (2.59 KB, patch), 2007-08-06 18:32 EDT, Jarod Wilson
fixed linux-2.6-ia64-ioremap-avoid-unsupported-attrs.patch (4.26 KB, patch), 2007-08-07 10:47 EDT, Jarod Wilson
errdump mca from zx2000 (13.70 KB, text/plain), 2007-08-16 13:58 EDT, Jarod Wilson
fix ioremap/xen bug (1.34 KB, patch), 2007-08-21 19:32 EDT, Bjorn Helgaas

Description Martin Poole 2007-05-14 06:55:34 EDT
Description of problem:


 This is the kernel piece.  The related user-side (X.org)
 piece is here:
https://enterprise.redhat.com/issue-tracker/?module=issues&action=view&tid=115881
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=233981

 The kernel problem is that we don't properly validate
 requests to mmap /dev/mem.  We sometimes perform mmaps
 that should fail, and we sometimes use the wrong memory
 attribute.  For example, we might use an uncacheable
 attribute for a region that only supports cacheable
 access.

Version-Release number of selected component (if applicable):


rhel5-ga

How reproducible:


 Every time.

Steps to Reproduce:
1.   Start X.org on HP rx8640.

Actual results:

  MCA (system crash).

Expected results:

X.org should succeed or exit gracefully with an mmap
 failure.

Additional info:
Comment 1 Martin Poole 2007-05-14 06:55:35 EDT
Created attachment 154633 [details]
ioremap patch
Comment 2 Martin Poole 2007-05-14 06:58:12 EDT
Created attachment 154634 [details]
ia64-ioremap patch
Comment 3 Martin Poole 2007-05-14 06:58:50 EDT
Created attachment 154635 [details]
ioremap align patch
Comment 4 Martin Poole 2007-05-14 06:59:22 EDT
Created attachment 154636 [details]
legacy mem patch
Comment 5 Martin Poole 2007-05-14 07:00:05 EDT
Created attachment 154637 [details]
mmap validation patch

Apply the patches in this order:

 ioremap
 ia64-ioremap-align
 ia64-ioremap
 ia64-legacy_mem-wb
 ia64-mmap-validation

These patches are already upstream.
Comment 6 RHEL Product and Program Management 2007-05-14 07:03:55 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 7 Jarod Wilson 2007-08-06 15:54:03 EDT
Ugh. These patches didn't apply to the latest RHEL5 tree due to some other
patches we've recently picked up. That much is understandable. But on the
not-so-understandable side, I doubt if they were ever compile-tested before
posting. They don't even compile once massaged into place, and the reasons why
were pretty obvious upon inspecting the patches. First up, the piece that
switched 'offset' to 'phys_addr' missed four occurrences of 'offset'. Second,
the variable 'size' is still used within the __ioremap function, but was removed
from the function args. 

Build failure excerpt:
--
arch/ia64/mm/ioremap.c: In function '__ioremap':
arch/ia64/mm/ioremap.c:22: error: 'offset' undeclared (first use in this function)
arch/ia64/mm/ioremap.c:22: error: (Each undeclared identifier is reported only once
arch/ia64/mm/ioremap.c:22: error: for each function it appears in.)
arch/ia64/mm/ioremap.c:22: error: 'size' undeclared (first use in this function)
--

Anyhow, I've fixed up the patches, and a test build seems to be getting past
where it was falling down on those bits now. But *please* at least compile-test
patches before posting them, or make it clear that they're completely untested...
Comment 8 Jarod Wilson 2007-08-06 18:28:58 EDT
Created attachment 160776 [details]
linux-2.6-ioremap-initial-support.patch
Comment 9 Jarod Wilson 2007-08-06 18:29:36 EDT
Created attachment 160777 [details]
linux-2.6-ia64-ioremap-variable-name-consistency.patch
Comment 10 Jarod Wilson 2007-08-06 18:30:02 EDT
Created attachment 160778 [details]
linux-2.6-ia64-ioremap-avoid-unsupported-attrs.patch
Comment 11 Jarod Wilson 2007-08-06 18:30:26 EDT
Created attachment 160779 [details]
linux-2.6-ia64-ioremap-allow-cacheable-mmaps-legacy_mem.patch
Comment 12 Jarod Wilson 2007-08-06 18:32:27 EDT
Created attachment 160780 [details]
linux-2.6-ia64-ioremap-mmap-validation.patch

Some of the replacement files are probably identical content-wise; at least
one, maybe two, are repaired so that they actually build on the rhel5 code base.
I've successfully built and tested these patches on multiple ia64 systems now,
with the desired good results.
Comment 13 Jarod Wilson 2007-08-07 10:47:07 EDT
Created attachment 160812 [details]
fixed linux-2.6-ia64-ioremap-avoid-unsupported-attrs.patch

From Akio Takebe w/Fujitsu:

----8<----
When dom0 boots up, the following call trace occurs.

Trying to vfree() bad address (a000000200193f28)
BUG: warning at mm/vmalloc.c:321/__vunmap() (Not tainted)

Call Trace:
 [<a00000010001d200>] show_stack+0x40/0xa0
				sp=a0000001007bfba0 bsp=a0000001007b9280
 [<a00000010001d290>] dump_stack+0x30/0x60
				sp=a0000001007bfd70 bsp=a0000001007b9268
 [<a0000001001475a0>] __vunmap+0x120/0x2a0
				sp=a0000001007bfd70 bsp=a0000001007b9238
 [<a0000001001477b0>] vunmap+0x90/0xc0
				sp=a0000001007bfd70 bsp=a0000001007b9218
 [<a000000100066e50>] iounmap+0x30/0x60
				sp=a0000001007bfd70 bsp=a0000001007b91f0
 [<a00000010031c160>] acpi_os_unmap_memory+0x20/0x40
				sp=a0000001007bfd70 bsp=a0000001007b91d0
 [<a000000100352620>] acpi_tb_get_table_header+0x160/0x1c0
				sp=a0000001007bfd70 bsp=a0000001007b9198
 [<a0000001003526b0>] acpi_tb_get_table+0x30/0xc0
				sp=a0000001007bfd80 bsp=a0000001007b9168
 [<a000000100352c00>] acpi_tb_get_table_rsdt+0x60/0x1e0
				sp=a0000001007bfdb0 bsp=a0000001007b9138
 [<a000000100353570>] acpi_load_tables+0xd0/0x200
				sp=a0000001007bfe00 bsp=a0000001007b9118
 [<a000000100778080>] acpi_early_init+0xa0/0x180
				sp=a0000001007bfe10 bsp=a0000001007b9100
 [<a000000100749a80>] start_kernel+0x840/0x880
				sp=a0000001007bfe20 bsp=a0000001007b90a0
 [<a000000100010600>] __end_ivt_text+0x6e0/0x700
				sp=a0000001007bfe30 bsp=a0000001007b90a0

(Though I haven't tested the below,)
I think this is caused by the following mistake.

Below is your patch.
===========================================================
+
+void
+iounmap (volatile void __iomem *addr)
+{
+	if (REGION_NUMBER(addr) == RGN_GATE)
+		vunmap((void __force *) addr);
+}
+EXPORT_SYMBOL(iounmap);
===========================================================

Below is kernel-2.6.22. The addr passed to vunmap() needs to be masked with PAGE_MASK.
===========================================================
void
iounmap (volatile void __iomem *addr)
{
	if (REGION_NUMBER(addr) == RGN_GATE)
		vunmap((void *) ((unsigned long) addr & PAGE_MASK));
}
===========================================================
----8<----

Somehow, we didn't pick up that part. Could have changed after the initial
backport was done, not sure. Attached patch updates that chunk to match
upstream, compiling for testing shortly.
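
To illustrate why the unmasked address trips the vfree() warning, here is a
minimal user-space sketch of the masking step. It assumes 16 KB pages
(PAGE_SHIFT = 14), which is only an assumption about this ia64 configuration;
the DEMO_* names and demo_* functions are hypothetical and not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 KB pages (PAGE_SHIFT = 14); adjust for other configs. */
#define DEMO_PAGE_SHIFT 14
#define DEMO_PAGE_SIZE  ((uint64_t)1 << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* Round an address down to the start of its page, as vunmap() expects. */
uint64_t demo_page_base(uint64_t addr)
{
	return addr & DEMO_PAGE_MASK;
}

/* Byte offset of the address within its page. */
uint64_t demo_page_offset(uint64_t addr)
{
	return addr & ~DEMO_PAGE_MASK;
}

/* Self-check using the address from the "bad address" warning above. */
void demo_page_check(void)
{
	assert(demo_page_base(0xa000000200193f28ULL) == 0xa000000200190000ULL);
	assert(demo_page_offset(0xa000000200193f28ULL) == 0x3f28ULL);
}
```

Under that assumption, the address in the warning sits 0x3f28 bytes into its
page, so passing it to vunmap() unmasked does not match the page-aligned start
of any vmalloc area.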
Comment 15 Don Zickus 2007-08-15 15:06:35 EDT
in 2.6.18-40.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 16 Jarod Wilson 2007-08-15 15:58:19 EDT
Swell. So as it turns out, the one ia64 box in-house that I didn't actually test
on with this patch set added is very unhappy with these changes when running a
xen kernel. It gets from the hypervisor into kernel-space, then:

Running on Xen! start_info_pfn=0x138e nr_pages=254460 flags=0x3
MCA related initialization done
Virtual mem_map starts at 0xa0007fffc7270000
SMP: Allowing 1 CPUs, 0 hotplug CPUs
Built 1 zonelists.  Total pages: 251136
Kernel command line: debug root=LABEL=/ ro
PID hash table entries: 4096 (order: 12, 32768 bytes)
CPU 0: base freq=200.000MHz, ITC ratio=9/2, ITC freq=900.000MHz+/-450ppm
Console: colour VGA+ 80x25
Memory: 3872512k/4018176k available (6383k code, 164160k reserved, 3955k data,
448k init)
Leaving McKinley Errata 9 workaround enabled
Calibrating delay loop... 1347.58 BogoMIPS (lpj=2695168)
Security Framework v1.0.0 initialized
SELinux:  Initializing.
SELinux:  Starting in permissive mode
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Dentry cache hash table entries: 524288 (order: 8, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 7, 2097152 bytes)
Mount-cache hash table entries: 1024
ACPI: Core revision 20060707
Boot processor id 0x0/0x0
Brought up 1 CPUs
Total of 1 processors activated (1347.58 BogoMIPS).
sizeof(vma)=176 bytes
sizeof(page)=56 bytes
sizeof(inode)=568 bytes
sizeof(dentry)=216 bytes
sizeof(ext3inode)=776 bytes
sizeof(buffer_head)=96 bytes
sizeof(skbuff)=240 bytes
checking if image is initramfs... it is
Freeing initrd memory: 3888kB freed
DMI 2.3 present.
grant table at e000010000104000
Grant tab1 0 0x0000A4 0x0000000000000000 start memory discovery
1 0 0x0000FB 0x0000000000000000 initialize memory only (don't test)
1 0 0x000081 0x0000000000000000 start I/O discovery

These last four lines are the firmware on the machine reinitializing after a
reset. Will open a new bug to track this in just a sec...
Comment 17 Adam Jackson 2007-08-15 16:35:10 EDT
*** Bug 250841 has been marked as a duplicate of this bug. ***
Comment 18 Don Zickus 2007-08-16 12:31:53 EDT
Reverting due to issues on the xen kernel side
Comment 19 Bjorn Helgaas 2007-08-16 13:43:00 EDT
The crash when running on xen is probably an MCA.  Can you collect
the MCA error records so I can try to figure out what happened?
On HP machines, I do this by:
  - "errdump all clear" at EFI shell
  - reproduce problem
  - "errdump all" at EFI shell
Capture all the console output while doing this.

What sort of machine is this?  Maybe I can reproduce it locally.
Comment 20 Jarod Wilson 2007-08-16 13:46:25 EDT
(In reply to comment #19)
> The crash when running on xen is probably an MCA.  Can you collect
> the MCA error records so I can try to figure out what happened?
> On HP machines, I do this by:
>   - "errdump all clear" at EFI shell
>   - reproduce problem
>   - "errdump all" at EFI shell
> Capture all the console output while doing this.

I'll attach that in just a few...

> What sort of machine is this?  Maybe I can reproduce it locally.

Take your pick of at least rx6600, rx4640, rx2600, rx2620, rx2660 and zx2000. :)
Comment 21 Jarod Wilson 2007-08-16 13:58:04 EDT
Created attachment 161671 [details]
errdump mca from zx2000

Here's the output of 'errdump mca' from an hp zx2000 after 'errdump mca clear'
and reproducing the bug.
Comment 22 Bjorn Helgaas 2007-08-16 14:37:32 EDT
The MCA is because address 0xf4f4f502 got routed to a PCI bus, but
no PCI device responded.  The IIP was 0xa00000010033ac20; if that's
in the static kernel, gdb should tell us what it is.

I'm xen-illiterate, but I'll ask Alex Williamson if he can help me
reproduce this.  I assume RHEL5 GA plus the kernel in comment #15 is
all we should need?  Do you have a VGA device in the zx2000?
Comment 23 Jarod Wilson 2007-08-16 14:56:24 EDT
> ...I assume RHEL5 GA plus the kernel in comment #15 is
> all we should need?

Yeah, I believe that should do the trick.

> Do you have a VGA device in the zx2000?

Yep:

00:00.0 VGA compatible controller: nVidia Corporation NV11GL [Quadro2 MXR/EX/Go]
(rev b2)
Comment 24 Doug Chapman 2007-08-17 11:40:50 EDT
(In reply to comment #22)
> The MCA is because address 0xf4f4f502 got routed to a PCI bus, but
> no PCI device responded.  The IIP was 0xa00000010033ac20; if that's
> in the static kernel, gdb should tell us what it is.
> 

Looks like this IP is in ACPI code:

(gdb) list *0xa00000010033ac20
0xa00000010033ac20 is in acpi_ex_system_memory_space_handler
(drivers/acpi/executer/exregion.c:158).
153                             window_size = ACPI_SYSMEM_REGION_WINDOW_SIZE;
154                     }
155
156                     /* Create a new mapping starting at the address given */
157
158                     status = acpi_os_map_memory(address, window_size,
159                                                 (void **)&mem_info->
160                                                 mapped_logical_address);
161                     if (ACPI_FAILURE(status)) {
162                             ACPI_ERROR((AE_INFO,
Comment 25 Doug Chapman 2007-08-17 11:48:41 EDT
Also, I should point out that the failure with the -40 kernel-xen is not always
an MCA.  On my rx2620 it fails with the following on bootup (and then just hangs):

ACPI: bus type pci registered
<G><2>mm.c:979:d0 efi_mmio: physaddr 0xffffffe0 size = 0x4000
(XEN) mm.c:681:d0 vcpu 0 iip 0xa00000010033ada0: bad mpa 0xffffffe0 (=> 0x1eceb000)
(XEN) mm.c:497:d0 Warning: UC to WB for mpaddr=ffffffe0
ACPI: Fatal opcode executed


I am wondering if the mm.c warning is significant.

Comment 26 Doug Chapman 2007-08-17 12:12:00 EDT
Not sure what this tells us but the mm.c Warning is pointing to a bit of code
close to where the MCA is triggered on other systems:

(gdb) list *0xa00000010033ada0
0xa00000010033ada0 is in acpi_ex_system_memory_space_handler
(drivers/acpi/executer/exregion.c:202).
197             case ACPI_READ:
198
199                     *value = 0;
200                     switch (bit_width) {
201                     case 8:
202                             *value = (acpi_integer) ACPI_GET8(logical_addr_ptr);
203                             break;
204
205                     case 16:
206                             *value = (acpi_integer)
ACPI_GET16(logical_addr_ptr);
(gdb)
Comment 27 Jarod Wilson 2007-08-17 12:46:45 EDT
These look possibly relevant:

[IA64] Issue ioremap hypercall in pci_acpi_scan_root()
http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg?rev/8f0c93df3e11

[IA64] Revert paravirtualization to ioremap /proc/pci
http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg?rev/6d84769b5256

I'll give 'em a go this afternoon...
Comment 28 Doug Chapman 2007-08-17 13:23:35 EDT
Jarod,

I tried the patches you point to in comment #27, no luck :(

- Doug
Comment 29 Doug Chapman 2007-08-17 15:25:25 EDT
I added a printk just before the call to acpi_os_map_memory as shown in comment
#24.  The value for "address" doesn't look valid to me:

last few lines with printk before the MCA:

apci_os_map_memory: 00000000ffffffe0, 8
<G><2>mm.c:979:d0 efi_mmio: physaddr 0xffffffe0 size = 0x4000
(XEN) mm.c:681:d0 vcpu 0 iip 0xa00000010033ace0: bad mpa 0xffffffe0 (=> 0x1ecdb000)
(XEN) mm.c:497:d0 Warning: UC to WB for mpaddr=ffffffe0
apci_os_map_memory: 00000000f4f4f502


Interestingly...
0xffffffe0 = -32 = -EPIPE

ok, that errno doesn't make sense, my guess is though that something is
returning either a 32bit address or an error which happens to be -32 and
whatever code is calling that doesn't interpret it correctly since it is looking
at a 32 bit addr in 64 bits.

Comment 30 Doug Chapman 2007-08-17 16:56:48 EDT
Looks like my assumption that the addresses "don't look right" in comment #29 is
incorrect.  I just booted a bare-metal kernel with my debug printf's and I get
the same addresses.

So, why would we get an MCA under xen but not under bare metal on these addresses?

Comment 31 Jarod Wilson 2007-08-21 16:38:41 EDT
*** Bug 243777 has been marked as a duplicate of this bug. ***
Comment 32 Bjorn Helgaas 2007-08-21 19:32:59 EDT
Created attachment 162020 [details]
fix ioremap/xen bug

We need "size" for the xen ioremap hypercall, but we rounded it up to
full pages before passing it to xen.  Use a different variable "page_size"
for the rounded-up value so we don't clobber "size".

This makes my rx1600 boot.  Without this patch, it hangs after
"ACPI: Fatal opcode executed" as Doug reported in comment #25.

I haven't seen the MCA (I've only tried an rx1600), so I don't
know whether it will fix that problem.
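
A rough user-space model of the size clobbering described in this comment,
for anyone following along. It is not the attached patch: the struct and
function names are hypothetical, 16 KB pages are assumed, and the only thing
taken from this bug is the idea of keeping "size" separate from the rounded
"page_size".

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE     ((uint64_t)1 << 14)   /* assumed 16 KB pages */
#define DEMO_PAGE_MASK     (~(DEMO_PAGE_SIZE - 1))
#define DEMO_PAGE_ALIGN(x) (((x) + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK)

struct demo_ioremap_sizes {
	uint64_t hypercall_size; /* size handed to the xen ioremap hypercall */
	uint64_t page_size;      /* page-rounded length used for the mapping */
};

/* Buggy shape: "size" is rounded up in place, so xen sees the rounded value. */
struct demo_ioremap_sizes demo_sizes_buggy(uint64_t phys_addr, uint64_t size)
{
	uint64_t offset = phys_addr & ~DEMO_PAGE_MASK;
	struct demo_ioremap_sizes s;

	size = DEMO_PAGE_ALIGN(size + offset);   /* clobbers the caller's size */
	s.hypercall_size = size;
	s.page_size = size;
	return s;
}

/* Fixed shape: the rounded value lives in its own variable. */
struct demo_ioremap_sizes demo_sizes_fixed(uint64_t phys_addr, uint64_t size)
{
	uint64_t offset = phys_addr & ~DEMO_PAGE_MASK;
	struct demo_ioremap_sizes s;

	assert(size != 0);
	s.page_size = DEMO_PAGE_ALIGN(size + offset);
	s.hypercall_size = size;                 /* original size is preserved */
	return s;
}
```

With phys_addr 0xffffffe0 and size 8, the buggy shape yields a hypercall size
of 0x4000, which is the size shown in the efi_mmio message in comment #25; the
fixed shape passes 8.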
Comment 33 Jarod Wilson 2007-08-21 20:54:31 EDT
I'm rolling a test kernel with the patch in comment #32 this very minute, will boot test on multiple systems 
known to previously fail with the ioremap patchset included.
Comment 34 Jarod Wilson 2007-08-21 21:23:20 EDT
I've got it booting on a previously failing HP Integrity rx8620, and Doug is working on a few other systems 
as well. So far, lookin' good...
Comment 35 Doug Chapman 2007-08-22 09:30:16 EDT
I have booted this on several systems where it would crash before: rx4640,
rx2620 and rx6600.  All boot fine with this patch.  I booted the non-xen kernel
on one also as a sanity test.

I have also confirmed that I can now run X on a full-virt guest.

Comment 36 Bjorn Helgaas 2007-08-22 12:56:21 EDT
[CC'd Isaku Yamahata.]

While working on this, Doug and I noticed several things we don't understand.  
I don't know enough about xen to know whether these are real problems or not.

In arch/ia64/mm/ioremap.c:__ioremap(),

        phys_addr = HYPERVISOR_ioremap(phys_addr, size);
        if (IS_ERR_VALUE(phys_addr))
                return (void __iomem*)phys_addr;

This looks like it returns a -ERRNO value, but ioremap() should return NULL on 
error, not an errno.

In include/asm/hypercall.h:HYPERVISOR_ioremap(), 

  static inline unsigned long
  HYPERVISOR_ioremap(unsigned long ioaddr, unsigned long size)
  {
        unsigned long ret = ioaddr;
        if (is_running_on_xen()) {
                ret = __HYPERVISOR_ioremap(ioaddr, size);
                if (unlikely(ret == -ENOSYS))
                        panic("hypercall %s failed with %ld. "
                              "Please check Xen and Linux config mismatch\n",
                              __func__, -ret);
                else if (unlikely(IS_ERR_VALUE(ret)))
                        ret = ioaddr;
        }
        return ret;
  }

This returns either the __HYPERVISOR_ioremap() result or the original ioaddr 
(if IS_ERR_VALUE(ret)).  But then __ioremap() checks for IS_ERR_VALUE() again, 
which seems questionable to me.

Doug observed that for ioremap(0xffffffe0, 0x8), the hypervisor 
returned -EINVAL, which caused HYPERVISOR_ioremap() to return 0xffffffe0 
(original ioaddr).  IS_ERR_VALUE(0xffffffe0) happens to be true, but only by 
coincidence -- 0xffffffe0 is clearly not an -ERRNO value in this case.
Comment 37 Isaku Yamahata 2007-08-23 02:15:17 EDT
(In reply to comment #36)
> [CC'd Isaku Yamahata.]
> 
> While working on this, Doug and I noticed several things we don't understand.
> I don't know enough about xen to know whether these are real problems or not.
> 
> In arch/ia64/mm/ioremap.c:__ioremap(),
> 
>         phys_addr = HYPERVISOR_ioremap(phys_addr, size);
>         if (IS_ERR_VALUE(phys_addr))
>                 return (void __iomem*)phys_addr;
> 
> This looks like it returns a -ERRNO value, but ioremap() should return NULL on
> error, not an errno.

Good catch. I sent out the patch for upstream.


> In include/asm/hypercall.h:HYPERVISOR_ioremap(), 
> 
>   static inline unsigned long
>   HYPERVISOR_ioremap(unsigned long ioaddr, unsigned long size)
>   {
>         unsigned long ret = ioaddr;
>         if (is_running_on_xen()) {
>                 ret = __HYPERVISOR_ioremap(ioaddr, size);
>                 if (unlikely(ret == -ENOSYS))
>                         panic("hypercall %s failed with %ld. "
>                               "Please check Xen and Linux config mismatch\n",
>                               __func__, -ret);
>                 else if (unlikely(IS_ERR_VALUE(ret)))
>                         ret = ioaddr;
>         }
>         return ret;
>   }
> 
> This returns either the __HYPERVISOR_ioremap() result or the original ioaddr
> (if IS_ERR_VALUE(ret)).  But then __ioremap() checks for IS_ERR_VALUE() again,
> which seems questionable to me.

Yes, that's somewhat bogus. In fact it's a workaround.
It would probably be better to add comments there.
At first it returned the __HYPERVISOR_ioremap() result,
however some device drivers use ioremap() and access the result for device
detection without checking for ioremap() failure.


> Doug observed that for ioremap(0xffffffe0, 0x8), the hypervisor 
> returned -EINVAL, which caused HYPERVISOR_ioremap() to return 0xffffffe0 
> (original ioaddr).  IS_ERR_VALUE(0xffffffe0) happens to be true, but only by 
> coincidence -- 0xffffffe0 is clearly not an -ERRNO value in this case.

I don't understand these. There are two points.

1st point.
The comment #29 says __HYPERVISOR_ioremap(0xffffffe0, PAGE_SIZE) fails
because it is beyond efi memory map boundary.
The patch of #32 fixes it. And it looks correct to me.
Does __HYPERVISOR_ioremap(0xffffffe0, 0x8) really return -EINVAL?
If so, what's the real efi memory map of the machine in question?

2nd point.
IS_ERR_VALUE(0xffffffe0) is false because it is treated as 64bit value.
(unsigned long)-MAX_ERRNO = (unsigned long)-4095 = 0xfffffffffffff001
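
The 64-bit versus 32-bit point can be checked with a small user-space model.
This is only a sketch: the demo_* functions are hypothetical stand-ins that
follow the MAX_ERRNO = 4095 threshold quoted above, not the kernel macro
itself.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* 64-bit model: only the top 4095 values of the address space are errors. */
int demo_is_err_value64(uint64_t x)
{
	return x >= (uint64_t)-DEMO_MAX_ERRNO;   /* threshold 0xfffffffffffff001 */
}

/* 32-bit model, to show how the same bit pattern looks when truncated. */
int demo_is_err_value32(uint32_t x)
{
	return x >= (uint32_t)-DEMO_MAX_ERRNO;   /* threshold 0xfffff001 */
}

/* Self-check with the address discussed in this bug. */
void demo_err_check(void)
{
	assert(!demo_is_err_value64(0xffffffe0ULL));   /* valid 64-bit address */
	assert(demo_is_err_value64((uint64_t)-22));    /* -22 is -EINVAL on Linux */
	assert(demo_is_err_value32(0xffffffe0U));      /* looks like -32 in 32 bits */
}
```

So 0xffffffe0 only resembles an errno when it is viewed as a 32-bit quantity,
which matches the -32 observation in comment #29.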
Comment 39 Bjorn Helgaas 2007-08-23 11:07:04 EDT
> Does __HYPERVISOR_ioremap(0xffffffe0, 0x8) really return -EINVAL?

Oops, I think you are right.  Doug saw the -EINVAL before the patch of comment 
#32, so I think it probably was the __HYPERVISOR_ioremap(0xffffffe0, 
PAGE_SIZE) that failed with -EINVAL.

And you're right about IS_ERR_VALUE(0xffffffe0) being false, too.  We were 
messing around with casts there to understand what was going on, and I didn't 
think about it long enough.
Comment 40 Doug Chapman 2007-08-23 11:36:57 EDT
(In reply to comment #39)
> > Does __HYPERVISOR_ioremap(0xffffffe0, 0x8) really return -EINVAL?
> 
> Oops, I think you are right.  Doug saw the -EINVAL before the patch of comment 
> #32, so I think it probably was the __HYPERVISOR_ioremap(0xffffffe0, 
> PAGE_SIZE) that failed with -EINVAL.
> 

Close but this is not exactly what I was seeing.  I did some debugging down in
the hypervisor itself and found that it was returning -EINVAL.  The bit of code
I have a problem with is in HYPERVISOR_ioremap:

                ret = __HYPERVISOR_ioremap(ioaddr, size);
                if (unlikely(ret == -ENOSYS))
                        panic("hypercall %s failed with %ld. "
                              "Please check Xen and Linux config mismatch\n",
                              __func__, -ret);
                else if (unlikely(IS_ERR_VALUE(ret)))
                        ret = ioaddr;

so, if __HYPERVISOR_ioremap returns an address then the address gets passed back
to the caller, if __HYPERVISOR_ioremap returns an error (other than -ENOSYS)
then the ret=ioaddr means the address STILL gets passed back to the caller.  So,
the caller of HYPERVISOR_ioremap has no way to know that an error occurred.

Comment 41 Isaku Yamahata 2007-08-23 22:16:39 EDT
(In reply to comment #40)
> Close but this is not exactly what I was seeing.  I did some debugging down in
> so, if __HYPERVISOR_ioremap returns an address then the address gets passed back
> to the caller, if __HYPERVISOR_ioremap returns an error (other than -ENOSYS)
> then the ret=ioaddr means the address STILL gets passed back to the caller.  So,
> the caller of HYPERVISOR_ioremap has no way to know that an error occurred.

As described in comment #37, it is a workaround for device drivers.
But I only remember it very vaguely. I don't remember which device driver
caused it.

With __ioremap() fixed to return NULL on error,
it might be possible to remove those lines. But I'm not sure; testing
will prove that.

-                else if (unlikely(IS_ERR_VALUE(ret)))
-                        ret = ioaddr;
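
A user-space model of the two behaviours under discussion may help. Everything
here is hypothetical: demo_hypercall_ioremap() is a stub that fails with
-EINVAL when the range crosses an arbitrary 4 GB boundary, the -ENOSYS panic
path is left out, and 0 stands in for NULL. It is a sketch of the idea, not
tested kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095
#define DEMO_BOUNDARY  0x100000000ULL   /* hypothetical end of mappable range */

static int demo_is_err_value(uint64_t x)
{
	return x >= (uint64_t)-DEMO_MAX_ERRNO;
}

/* Stub hypercall: rejects ranges that run past DEMO_BOUNDARY. */
uint64_t demo_hypercall_ioremap(uint64_t ioaddr, uint64_t size)
{
	if (ioaddr + size > DEMO_BOUNDARY)
		return (uint64_t)-EINVAL;
	return ioaddr;
}

/* Current shape: an error is replaced by the original address. */
uint64_t demo_wrapper_masking(uint64_t ioaddr, uint64_t size)
{
	uint64_t ret = demo_hypercall_ioremap(ioaddr, size);

	if (demo_is_err_value(ret))
		ret = ioaddr;           /* caller cannot see the failure */
	return ret;
}

/* Proposed shape: propagate the error, and let ioremap turn it into NULL. */
uint64_t demo_ioremap(uint64_t ioaddr, uint64_t size)
{
	uint64_t ret = demo_hypercall_ioremap(ioaddr, size);

	assert(size != 0);
	return demo_is_err_value(ret) ? 0 : ret;   /* 0 stands in for NULL */
}
```

In this model the masking wrapper hands back 0xffffffe0 for a failed
(0xffffffe0, 0x4000) request, while the proposed shape returns 0 so a caller
that checks for NULL can bail out. Drivers that never check the result would
still need the workaround, which is the open question above.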
Comment 42 Jarod Wilson 2007-08-24 11:52:05 EDT
*** Bug 252399 has been marked as a duplicate of this bug. ***
Comment 45 Don Zickus 2007-08-27 15:52:05 EDT
This patch was pulled a couple of weeks ago, need to set the flags to reflect
that to allow management to reconsider these patches for 5.1.
Comment 50 Issue Tracker 2007-08-28 09:36:39 EDT
But, Fujitsu thinks that this is a regression, since this didn't occur with
EL5.0.
Fujitsu saw that *the graphical installation doesn't work* with the ATI RAGE
XL card. This card is widely used by many customers, so this is
really critical because it will give our customers a very bad impression.
Are we going to force our customers to do a text install if they have this
graphics card? It's not good, I think.


This event sent from IssueTracker by mmatsuya 
 issue 128801
Comment 54 Larry Troan 2007-08-30 21:36:03 EDT
Since we appear averse to taking in this patch at this time....

1) Fujitsu claims it is a regression from GA though RH Engineering claims it was 
   also broken in GA. Do we know who is correct? It appears Fujitsu is correct
   because the REGRESSION keyword is still set.
2) If a regression, then we should consider fixing it in 5.1 ONLY IF there is no
   schedule delay.
3) Can the Driver Update Model provide a mechanism on top of 5.1 to make this
   fix available to Fujitsu customers? Otherwise, we will be faced with a
   potential HOTFIX and/or EUS requirement soon after shipping 5.1. 
Comment 66 Jarod Wilson 2007-09-04 22:53:18 EDT
I'm actually out of the office and mostly unavailable this week...

Prarit, can I talk you into overseeing this one until I get back in
town this weekend? I'll be checking in at least nightly...
Comment 67 Prarit Bhargava 2007-09-05 09:17:43 EDT

(In reply to comment #66)
> I'm actually out of the office and mostly unavailable this week...
> 
> Prarit, can I talk you into overseeing this one until I get back in
> town this weekend? I'll be checking in at least nightly...

np.  Will do.

P.
Comment 68 Keiichiro Tokunaga 2007-09-05 09:58:47 EDT
I tried to identify the version of kernel which introduced the X problem on
PRIMEQUEST and found out that X started fine even with the latest kernel
(-45.el5) on 5.0 GA distro.  Therefore, it may be a problem of other components
that 5.1 distro has.  If anyone needs access to the PRIMEQUEST, please let me
know.
Comment 69 Bjorn Helgaas 2007-09-05 11:08:23 EDT
There were several pieces of the ia64 ioremap fix.  Here's the outline from 
the original posting 
(http://www.gelato.unsw.edu.au/archives/linux-ia64/0703/20208.html):

  1  rename ioremap variables to match i386 (no functional change)
  2  use page table mappings in ioremap to avoid unsupported attributes
  3  allow WB mmap of /sys/.../legacy_mem
  4  fail mmaps that span areas with incompatible attributes
  5  update documentation & add tests

The one that conflicted with Xen was part 2 from the list above, so that one 
got dropped from RHEL5.1.  But the other ones may still be in RHEL5.1.  If so, 
part 4 will cause some mmaps that succeeded in RHEL5 to fail in RHEL5.1.

These newly-failing mmaps are of particular interest to X, so they could 
explain the PRIMEQUEST X problems.  Here's the original posting and changelog: 
http://www.gelato.unsw.edu.au/archives/linux-ia64/0703/20210.html
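
To make part 4 concrete, here is a small user-space sketch of the kind of
check involved: walk a memory map and fail a request whose range has no
attribute common to every byte. The real code consults the firmware memory
map; the table layout, the demo_* names and the sample addresses below are all
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_attr { DEMO_ATTR_NONE = 0, DEMO_ATTR_UC = 1, DEMO_ATTR_WB = 2 };

struct demo_region {
	uint64_t start;   /* inclusive */
	uint64_t end;     /* exclusive */
	unsigned attrs;   /* bitmask of enum demo_attr values */
};

/* Hypothetical map: WB RAM, a UC-only frame buffer, then a WB|UC region. */
const struct demo_region demo_map[] = {
	{ 0x00000, 0xa0000,  DEMO_ATTR_WB },
	{ 0xa0000, 0xc0000,  DEMO_ATTR_UC },
	{ 0xc0000, 0x100000, DEMO_ATTR_WB | DEMO_ATTR_UC },
};
const size_t demo_map_len = sizeof(demo_map) / sizeof(demo_map[0]);

/*
 * Return the attributes supported by every byte of [addr, addr + size), or
 * DEMO_ATTR_NONE if the range has a hole or no attribute in common.
 * The map is assumed sorted by start and non-overlapping.
 */
unsigned demo_range_attrs(const struct demo_region *map, size_t n,
			  uint64_t addr, uint64_t size)
{
	uint64_t end = addr + size;
	unsigned common = DEMO_ATTR_UC | DEMO_ATTR_WB;
	size_t i;

	assert(size != 0 && end > addr);
	for (i = 0; i < n && addr < end; i++) {
		if (map[i].end <= addr)
			continue;               /* region lies below the range */
		if (map[i].start > addr)
			return DEMO_ATTR_NONE;  /* hole in the map */
		common &= map[i].attrs;
		addr = map[i].end;              /* continue after this region */
	}
	if (addr < end)
		return DEMO_ATTR_NONE;          /* tail not described by the map */
	return common;
}
```

With this sample map, a request covering only 0xa0000-0xbffff gets UC, while
one starting at 0x90000 and running into the frame buffer gets nothing and
would be refused. That is the class of mmap that used to succeed and would
now fail.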
Comment 70 Jarod Wilson 2007-09-06 00:56:19 EDT
The entire ioremap patchset was both added and removed from the 5.1 kernel builds wholesale as a single 
patch, so at the moment, none of the bits are there.
Comment 76 Issue Tracker 2007-09-11 03:16:47 EDT
On systems with *less* than 128 PCI devices, Fujitsu saw this X problem
with EL5.1 beta.


This event sent from IssueTracker by mmatsuya 
 issue 128801
Comment 82 Luming Yu 2007-09-12 08:47:47 EDT
I would suggest fixing it in firmware by reserving the memory address for the
VGA device. Is that possible?
Comment 84 Issue Tracker 2007-09-12 14:47:53 EDT
For what it's worth, I just tried to reproduce the problem on a Hitachi
ColdFusion 3e here in Raleigh. RHEL5GA worked as expected (no surprise
there). Then I yum updated to RHEL5.1 beta snapshot 6, and both startx and
'init 5' worked there as well.


This event sent from IssueTracker by gcase 
 issue 127825
Comment 85 Linda Wang 2007-09-12 18:49:05 EDT
can someone try RHEL5.1 beta snapshot 7 on ColdFusion 3e to be sure?

The change (backing out the counterpart of the 240006 kernel patch in the X server
package) went into X server package (1.1.1-48.26.el5) in snapshot 7.
Comment 86 Jarod Wilson 2007-09-13 00:50:04 EDT
I've run the relevant snap 7 bits on a coldfusion 3e, and they do indeed provide working X there.
Comment 87 Larry Troan 2007-09-14 08:33:38 EDT
I've copied the ia64 xorg server packages in snapshot 7 to my people page in
case some of our partners want to test the change we've discussed above:
http://people.redhat.com/ltroan/fixes/.240006/

The snapshot7 packages include:
xorg-x11-server-sdk-1.1.1-48.26.el5.ia64.rpm
xorg-x11-server-Xdmx-1.1.1-48.26.el5.ia64.rpm
xorg-x11-server-Xephyr-1.1.1-48.26.el5.ia64.rpm
xorg-x11-server-Xnest-1.1.1-48.26.el5.ia64.rpm
xorg-x11-server-Xorg-1.1.1-48.26.el5.ia64.rpm
xorg-x11-server-Xvfb-1.1.1-48.26.el5.ia64.rpm

Please report results back in this bug or in your corresponding Issue Tracker. 
Comment 89 You, Yongkang 2007-09-17 02:01:43 EDT
(In reply to comment #87)
> I've copied the ia64 xorg server packages in snapshot 7 to my people page in
> case some of our partners want to test the change we've discussed above:
> http://people.redhat.com/ltroan/fixes/.240006/

I have tried these Xorg RPMs based on snapshot4 or snapshot6. X can work in
Tiger4 Dom0 and VTI domain. 
Comment 90 Ken Reilly 2007-09-17 10:44:34 EDT
As described in the comments above, Red Hat and its partners have made several
attempts since the GA release of RHEL5.0 to resolve both the kernel and X server
parts of this BZ. The patches, to date, have either proven too invasive and/or
have surfaced unacceptable side-effects during installation. As a result, we
have reverted back to the RHEL5.0 GA behavior and added relevant information in
the release notes as we prepare for the RC version of RHEL5.1.

Our current understanding is that the problem is only seen on Itanium systems
with greater than 128 PCI devices. We would appreciate receiving any additional
information on the impact of the problem.

    Red Hat will continue working with any impacted partners to
    resolve the problems, perform the necessary system testing,
    and also help them address issues that may arise through hardware
    certification testing.
Comment 94 RHEL Product and Program Management 2007-11-01 19:25:47 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 99 Don Zickus 2007-12-14 13:34:08 EST
in 2.6.18-60.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 100 Don Domingo 2007-12-18 22:05:55 EST
can we verify this as fixed for RHEL5.2? please advise so i can move it from
"Known Issues" to "Resolved Issues". 

making this bug block RHEL5.2 release notes tracking bug (BZ#391221). thanks!
Comment 101 Jarod Wilson 2007-12-18 23:44:53 EST
We have the kernel bits in, need to double-check with ajax to find out if we've
pulled the X server bits back in yet... (bug 233981)
Comment 104 Don Domingo 2008-02-07 00:21:38 EST
added to RHEL5.2 release notes under "Resolved Issues":

<quote>
(ia64) The X server no longer attempts to utilize memory regions incompatible with
its needs. This fixes a bug that previously caused a Machine Check Abort (MCA)
on some Itanium systems.
</quote>

please advise if any further revisions are required. thanks!
Comment 110 Issue Tracker 2008-03-13 23:04:10 EDT
Hello Okado-san, Nagata-san,

Can you please create the new IT ticket for the problem which you saw on
5.2 public beta? Otherwise clone this ticket. Then, we need to create the
new bugzilla. We will deal with this problem as the high priority and
regression problem on the bugzilla.

Best Regards,

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by mmatsuya 
 issue 128801
Comment 112 John Poelstra 2008-03-20 23:51:52 EDT
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot1--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you
Comment 113 Don Domingo 2008-04-01 22:10:23 EDT
Hi,
the RHEL5.2 release notes will be dropped to translation on April 15, 2008, at
which point no further additions or revisions will be entertained.

a mockup of the RHEL5.2 release notes can be viewed at the following link:
http://intranet.corp.redhat.com/ic/intranet/RHEL5u2relnotesmockup.html

please use the aforementioned link to verify if your bugzilla is already in the
release notes (if it needs to be). each item in the release notes contains a
link to its original bug; as such, you can search through the release notes by
bug number.

Cheers,
Don
Comment 114 John Poelstra 2008-04-02 17:34:30 EDT
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot3--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you
Comment 115 John Poelstra 2008-04-09 18:41:34 EDT
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot4--available now on partners.redhat.com.  

Please test and confirm that your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you
Comment 116 John Poelstra 2008-04-23 13:39:12 EDT
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot6--available now on partners.redhat.com.  

We are nearing GA for 5.2 so please test and confirm that your issue is fixed ASAP.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you
Comment 118 John Poelstra 2008-05-01 12:49:34 EDT
Greetings Red Hat Partner,

A fix for this issue should be included in the latest packages contained in
RHEL5.2-Snapshot7--available now on partners.redhat.com.  

We are nearing GA for 5.2--this is the last opportunity to test and confirm that
your issue is fixed.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to ASSIGNED.

If you are receiving this message in Issue Tracker, please reply with a message
to Issue Tracker about your results and I will update bugzilla for you.  If you
need assistance accessing ftp://partners.redhat.com, please contact your Partner
Manager.

Thank you
Comment 120 errata-xmlrpc 2008-05-21 10:42:54 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
