549465 – Cannot run NVIDIA display driver on 32-bit RHEL 5.3 or 5.4

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	i386
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	5.5
Assignee:	Danny Feng
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	540569

Reported:	2009-12-21 18:45 UTC by John Hubbard
Modified:	2018-10-27 15:13 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 07:13:10 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
ioremap change to address NVIDIA issue (2.62 KB, message/rfc822) 2010-02-18 21:17 UTC, Linda Wang

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0178	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update	2010-03-29 12:18:21 UTC

Description John Hubbard 2009-12-21 18:45:58 UTC

Description of problem: When someone tries to run a recent NVIDIA display driver on 32-bit RHEL 5.3 or 5.4, the kernel hangs, due to hitting a BUG() call in the __change_page_attr() routine.


Version-Release number of selected component (if applicable):


How reproducible: Every time.


Steps to Reproduce:
1. Install Red Hat Enterprise Linux 5.3 or 5.4, 32-bit
2. Install a recent NVIDIA card and driver
3. Run either a CUDA program or the X Window System. Either of these will load the NVIDIA display driver.

Actual results: The kernel hangs while loading the NVIDIA driver.


Expected results: System should run normally


Additional info: I found that this is due to the iounmap() routine trying to change page attributes on one page too many, due to not subtracting off the size of the guard page. This is a known bug that was fixed in the 2.6.23 release. Here is the commit (which went into mainline kernel) that fixes the problem:

commit 9585116ba09f1d8c52d0a1346e20bb9d443e9c02 
Author: Jeremy Fitzhardinge <jeremy> 
Date:   Sat Jul 21 17:11:35 2007 +0200 
 
    i386: fix iounmap's use of vm_struct's size field 
 
    get_vm_area always returns an area with an adjacent guard page.  That guard 
    page is included in vm_struct.size.  iounmap uses vm_struct.size to 
    determine how much address space needs to have change_page_attr applied to 
    it, which will BUG if applied to the guard page. 
 
    This patch adds a helper function - get_vm_area_size() in linux/vmalloc.h - 
    to return the actual size of a vm area, and uses it to make iounmap do the 
    right thing.  There are probably other places which should be using 
    get_vm_area_size(). 
 
    Thanks to Dave Young <hidave.darkstar> for debugging the 
    problem. 
 
    [ Andi, it wasn't clear to me whether x86_64 needs the same fix. ] 
 
    Signed-off-by: Jeremy Fitzhardinge <jeremy> 
    Cc: Dave Young <hidave.darkstar> 
    Cc: Chuck Ebbert <cebbert> 
    Signed-off-by: Andrew Morton <akpm> 
    Signed-off-by: Andi Kleen <ak> 
    Signed-off-by: Linus Torvalds <torvalds> 
 
diff --git a/arch/i386/mm/ioremap.c b/arch/i386/mm/ioremap.c 
index fff08ae..0b27831 100644 
--- a/arch/i386/mm/ioremap.c 
+++ b/arch/i386/mm/ioremap.c 
@@ -196,7 +196,7 @@ void iounmap(volatile void __iomem *addr) 
        /* Reset the direct mapping. Can block */ 
        if ((p->flags >> 20) && p->phys_addr < virt_to_phys(high_memory) - 1) { 
                change_page_attr(virt_to_page(__va(p->phys_addr)), 
-                                p->size >> PAGE_SHIFT, 
+                                get_vm_area_size(p) >> PAGE_SHIFT, 
                                 PAGE_KERNEL); 
                global_flush_tlb(); 
        } 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h 
index c2b10ca..89338b4 100644 
--- a/include/linux/vmalloc.h 
+++ b/include/linux/vmalloc.h 
@@ -58,6 +58,13 @@ void vmalloc_sync_all(void); 
 /* 
  *     Lowlevel-APIs (not for driver use!) 
  */ 
+ 
+static inline size_t get_vm_area_size(const struct vm_struct *area) 
+{ 
+       /* return actual size without guard page */ 
+       return area->size - PAGE_SIZE; 
+} 
+ 
 extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags); 
 extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags, 
                                        unsigned long start, unsigned long end);
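
The size arithmetic behind the patch can be sketched in user space. This is only an illustration, not kernel code: the 4 KiB page size, the cut-down struct vm_struct, and the helpers model_get_vm_area(), pages_before_fix() and pages_after_fix() are assumptions made for the example. Only get_vm_area_size() has the same body as the helper in the commit above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: 4 KiB pages, as on i386. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)

/* Hypothetical stand-in for the kernel's struct vm_struct.
 * Only the field that the patch reads is modelled. */
struct vm_struct {
    size_t size; /* bytes reserved, including one trailing guard page */
};

/* Same body as the helper added by the patch. */
static inline size_t get_vm_area_size(const struct vm_struct *area)
{
    /* return actual size without guard page */
    return area->size - PAGE_SIZE;
}

/* Hypothetical model of get_vm_area(): reserve nr_pages plus a guard page. */
static struct vm_struct model_get_vm_area(size_t nr_pages)
{
    struct vm_struct area;

    assert(nr_pages > 0);
    area.size = nr_pages * PAGE_SIZE + PAGE_SIZE;
    return area;
}

/* Page count the unpatched iounmap() hands to change_page_attr(). */
static size_t pages_before_fix(const struct vm_struct *area)
{
    return area->size >> PAGE_SHIFT;
}

/* Page count the patched iounmap() hands to change_page_attr(). */
static size_t pages_after_fix(const struct vm_struct *area)
{
    return get_vm_area_size(area) >> PAGE_SHIFT;
}
```

For a two-page mapping, pages_before_fix() yields 3 and pages_after_fix() yields 2. The extra page in the unpatched count is the one that corresponds to the guard page.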

Comment 1 John Hubbard 2009-12-21 23:16:18 UTC

I see that I failed to include a critical detail: so far, this has only been seen on X58 chipsets.

Comment 2 Russell Doty 2010-01-18 20:45:39 UTC

Does Red Hat have any X58 hardware?

Comment 3 Russell Doty 2010-01-18 21:40:22 UTC

Per Nvidia, this is occurring on an Intel Tylersburg platform using the Intel X58 chipset when running the Nvidia binary driver.

Comment 5 Issue Tracker 2010-01-28 17:45:20 UTC

Event posted on 01-28-2010 10:05am EST by jkachuck

Hello,
Please let me know whether you have been able to reproduce this on multiple
X58 systems.
If so, please let me know whether it is specific to a particular card and
revision of the X58 system.

At present, engineering is planning to move this request to RHEL 5.6.
Please let me know if this would be acceptable for you.

Thank You
Joe Kachuck




This event sent from IssueTracker by jkachuck 
 issue 378356

Comment 6 John Hubbard 2010-02-02 19:24:16 UTC

Hi Joe, sorry for the delay in responding. I just now located a different X58 system here, and readily reproduced the problem. I also verified that the patch fixes the problem, so we know it's the same issue.

So that makes two systems, which is a trend, right? :)

Anyway, we are concerned that this might be fairly widespread.  If you don't have any X58 systems to try this with, maybe we could send you one, I can check on that. Let me know.

thanks,
John Hubbard

Comment 7 Issue Tracker 2010-02-02 19:50:59 UTC

Event posted on 02-02-2010 02:45pm EST by jkachuck

Hello,
We have an X58 system; however, it does not reproduce this issue. The
X58 is also under NDA. If you have a system that reproduces this issue
and can send it to us, that would help.

Thank You
Joe Kachuck



Comment 8 Issue Tracker 2010-02-04 16:29:42 UTC

Event posted on 02-04-2010 11:29am EST by jkachuck

Hello,
I have received the hardware from Garrison Wo. With the install that was
on the system, I saw the issue. However, I then reinstalled the system with
32-bit RHEL 5.4, went to Nvidia.com, and downloaded the latest
driver. The driver installed fine, and I am unable to see any issues.

Please give me the exact steps required to reproduce this issue.
This is the kernel and Nvidia package that was used:
Linux dhcp211.gsslab.rdu.redhat.com 2.6.18-164.el5 #1 SMP Tue Aug 18
15:51:54 EDT 2009 i686 i686 i386 GNU/Linux
NVIDIA-Linux-x86-190.53-pkg1.run

Thank You
Joe Kachuck




Comment 9 John Hubbard 2010-02-04 17:27:18 UTC

Hi Joe, I see that you downloaded a 190.53 version of our driver, which doesn't reproduce the problem. The problem will reproduce with a 195 series driver, which I recall I placed in the /root directory. 

So, if you run:

# sh /root/NVIDIA-Linux-x86-195.36.03-pkg0.run -Nsb

...to install the driver, then you should be able to reproduce the problem. 

(I think we're doing some things a little differently between the 190 and 195 series drivers, evidently.)

Comment 10 Issue Tracker 2010-02-04 20:05:16 UTC

Event posted on 02-04-2010 03:05pm EST by jkachuck

Hello,
I am not sure if I can say this is a Red Hat issue. The current
stable driver (190.53) appears to work without issues. The new driver does
not appear to work correctly. This would suggest that something has changed
in how the install or driver is done. Would you be able to send us what
has changed with the new driver?

Thank You
Joe Kachuck




Comment 11 John Hubbard 2010-02-04 20:13:51 UTC

Joe, if you debug the kernel crash, you'll find that it really is due to the iounmap() bug that I described originally. Also, applying the patch that I supplied (which is taken directly from the mainline kernel) fixes the problem.

The difference between the 190 and 195 NVIDIA drivers has to do with a PCI access pattern that has changed. This involves ioremap and related calls, naturally.

This really is a [known] kernel bug, you just have to look a little more closely at what is happening during the crash. A high-level observation that a newer NVIDIA driver exposes the problem does not prove that the problem lies outside of the kernel.

thanks,
John H.

Comment 12 Issue Tracker 2010-02-04 21:20:44 UTC

Event posted on 02-04-2010 04:20pm EST by jkachuck

Hello,
I am seeing if we can push a little harder for the exception now.
Do you have a patch made up for a RHEL kernel yet? If so, which version?
I see the one in the summary; however, I would like to confirm the RHEL
kernel it should be used on before I test it.

Thank You
Joe Kachuck



Comment 13 John Hubbard 2010-02-04 21:39:35 UTC

Hi Joe,
Thanks for looking into this, and for pushing for an exception. To elaborate just slightly on what NVIDIA is doing differently: in our latest drivers, we are now mapping in a couple of PCIe chipset registers (depending on the chipset...so that's why it only happens, so far, on X58 systems). This is done fairly early in our driver initialization. After a few operations, we then unmap, and that's when the bug triggers.

Fortunately, this patch seems to be very easy to move between kernel versions. I'm pasting in the exact patch that I used on both the RHEL 5.3 and RHEL 5.4 kernels, below:

From 7457ae53bbbe1f2a19043bf1e85104d8ed460977 Mon Sep 17 00:00:00 2001
From: John F. Hubbard <jhubbard>
Date: Sat, 19 Dec 2009 18:47:43 -0800
Subject: [PATCH] Fixed iounmap's use of change_page_attr

---
 arch/i386/mm/ioremap.c  |    2 +-
 include/linux/vmalloc.h |    7 +++++++
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/i386/mm/ioremap.c b/arch/i386/mm/ioremap.c
index 247fde7..2e786b2 100644
--- a/arch/i386/mm/ioremap.c
+++ b/arch/i386/mm/ioremap.c
@@ -268,7 +268,7 @@ void iounmap(volatile void __iomem *addr)
        /* Reset the direct mapping. Can block */
        if ((p->flags >> 20) && p->phys_addr < virt_to_phys(high_memory) - 1) {
                change_page_attr(virt_to_page(__va(p->phys_addr)),
-                                p->size >> PAGE_SHIFT,
+                                get_vm_area_size(p) >> PAGE_SHIFT,
                                 PAGE_KERNEL);
                global_flush_tlb();
        }
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 71b6363..4965665 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,13 @@ extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 /*
  *     Lowlevel-APIs (not for driver use!)
  */
+
+static inline size_t get_vm_area_size(const struct vm_struct *area)
+{
+       /* return actual size without guard page */
+       return area->size - PAGE_SIZE;
+}
+
 extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
 extern struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
                                        unsigned long start, unsigned long end);
--
1.6.6-rc1.GIT
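
The map-then-unmap sequence described at the top of this comment can be modelled in user space as a rough sketch. Everything here is hypothetical and simplified: struct model_mem, model_ioremap() and model_iounmap() are invented for the example, the real __change_page_attr() logic is not reproduced, and the model collapses the virtual guard page and the direct-mapping pages into a single array. The assumption that resetting a page which was never remapped is what trips the BUG() is made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed for illustration: 4 KiB pages, as on i386. */
#define PAGE_SHIFT  12
#define PAGE_SIZE   ((size_t)1 << PAGE_SHIFT)
#define MODEL_PAGES 8 /* hypothetical size of the modelled memory */

/* Hypothetical model state: which pages had their attributes changed. */
struct model_mem {
    bool remapped[MODEL_PAGES];
    bool bug_hit; /* stands in for the kernel hitting BUG() */
};

/* Hypothetical, cut-down vm_struct. */
struct vm_struct {
    size_t first_page; /* index of the first mapped page */
    size_t size;       /* bytes, including one guard page */
};

/* Model of mapping nr_pages: mark them remapped, record size + guard page. */
static struct vm_struct model_ioremap(struct model_mem *mem,
                                      size_t first_page, size_t nr_pages)
{
    struct vm_struct area = { first_page, (nr_pages + 1) * PAGE_SIZE };
    size_t i;

    /* Keep the page just past the mapping inside the model array. */
    assert(first_page + nr_pages < MODEL_PAGES);
    for (i = 0; i < nr_pages; i++)
        mem->remapped[first_page + i] = true;
    return area;
}

/* Model of unmapping: reset attributes on the computed number of pages.
 * patched == false uses the full size, as the unpatched iounmap() does. */
static void model_iounmap(struct model_mem *mem, const struct vm_struct *area,
                          bool patched)
{
    size_t bytes = patched ? area->size - PAGE_SIZE : area->size;
    size_t nr_pages = bytes >> PAGE_SHIFT;
    size_t i;

    for (i = 0; i < nr_pages; i++) {
        size_t page = area->first_page + i;

        if (!mem->remapped[page]) {
            /* Resetting a page that was never changed: model this as BUG(). */
            mem->bug_hit = true;
            return;
        }
        mem->remapped[page] = false;
    }
}
```

In this model, mapping two pages and then unmapping with patched set to false walks three pages and sets bug_hit, while the same sequence with patched set to true resets exactly the two mapped pages.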

Comment 14 Danny Feng 2010-02-05 03:34:05 UTC

Hi John:

   I remember this commit from when I was looking into the RHEL 5 code, and I think it did fix the iounmap issue on the i386 kernel, but I still want to make sure this commit fixes your problem ;-)

   I'll create a build with this commit. Since we don't have an environment in which to reproduce and test this, could you please help me test the built rpm?

Comment 15 John Hubbard 2010-02-05 07:24:08 UTC

Hi Danny,

  Sure, it would be my pleasure to try out your rpm.  

Logistics: At the moment, my favorite system for reproducing this is with your engineer, Joe Kachuck (we shipped it over earlier this week).  I can probably locate another X58-based system, but the fastest thing to do would be to either ship that system back to me, or if Joe has the time, we could ask him to give it a quick try first.  Either way works for me.

thanks for looking at this!
John H.

Comment 17 John Hubbard 2010-02-05 07:59:49 UTC

Say, is that an internal server?  I can't resolve it from any of my machines.

Comment 18 Danny Feng 2010-02-05 08:45:01 UTC

(In reply to comment #17)
> Say, is that an internal server?  I can't resolve it from any of my machines.    

Oops, sorry, I forgot that this is an internal server. I'm applying for an external server, but I'm afraid I can only get it by tomorrow, so if Joe could upload the rpm to an external server, that would be better...

Comment 19 Issue Tracker 2010-02-05 15:49:24 UTC

Event posted on 02-05-2010 10:49am EST by jkachuck

Hello,
I have attached both kernels to the IT. I have also tested both kernels and
confirmed that they correct the issue.

Thank You
Joe Kachuck



Comment 20 John Hubbard 2010-02-05 18:39:10 UTC

>I have also tested both kernels and confirmed that they correct the issue.

That's great news! 

Now what? Ship the machine back to NVIDIA, and start pestering management to include the patch in RHEL 5.x (preferably, x == 5)?

Comment 27 Linda Wang 2010-02-18 21:17:10 UTC

Created attachment 394987
ioremap change to address NVIDIA issue

Comment 28 Jarod Wilson 2010-02-23 20:05:32 UTC

in kernel-2.6.18-190.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 31 Joseph Kachuck 2010-03-08 22:37:36 UTC

Tested, no issues:
Linux dhcp17.gsslab.rdu.redhat.com 2.6.18-191.el5PAE #1 SMP Mon Mar 1 16:07:17 EST 2010 i686 i686 i386 GNU/Linux

NVIDIA-Linux-x86-195.30-pkg1.run

Joe Kachuck

Comment 34 John Hubbard 2010-03-23 21:57:02 UTC

Tested at NVIDIA, using one of the original machines that reproduced the failure. Works great: the bug is fixed. This was with:

Linux version 2.6.18-191.el5PAE (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Mon Mar 1 16:07:17 EST 2010

and our latest internal build of r195_00:

NVIDIA-Linux-x86-rel_gpu_drv_r195_r195_00-20100323_5686548-pkg0.run

Comment 35 errata-xmlrpc 2010-03-30 07:13:10 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
