Bug 166594

Summary: cache aliasing with nvidia's driver
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Andrew Shewmaker <ashewmaker>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Brian Brock <bbrock>
Doc Type: Bug Fix
Last Closed: 2005-12-13 02:16:53 UTC

Description Andrew Shewmaker 2005-08-23 18:03:16 UTC
Description of problem:

I believe I am experiencing instability caused by the cache aliasing described
below.  Can I expect the change_page_attr accounting fixes to be backported to
the RHEL4 kernel, or should I upgrade the kernel to 2.6.11+ as nvidia suggests?

Thank you.

From nvidia's readme under Appendix L. Known Issues:

The X86-64 platform (AMD64/EM64T) and 2.6 kernels

    Many 2.4 and 2.6 x86_64 kernels have an accounting issue in their
    implementation of the change_page_attr kernel interface. Early 2.6
    kernels include a check that triggers a BUG() when this situation is
    encountered (triggering a BUG() results in the current application
    being killed by the kernel; this application would be your opengl
    application or potentially the X Server). The accounting issue has
    been resolved in the 2.6.11 kernel.

    We have added checks to recognize that the NVIDIA kernel module is
    being compiled for the x86-64 platform on a kernel between 2.6.0 and
    2.6.11. In this case, we will disable usage of the change_page_attr
    kernel interface. This will avoid the accounting issue but leaves the
    system in danger of cache aliasing (see entry on Cache Aliasing for
    more information about cache aliasing). Note that this
    change_page_attr accounting issue and BUG() can be triggered by other
    kernel subsystems that rely on this interface.

    If you are using a 2.6 x86_64 kernel, it is recommended that you
    upgrade to a 2.6.11 or later kernel.
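
The version rule described in this readme entry can be sketched as follows. This is only an illustration of the stated rule, not NVIDIA's build-time check: the function names are hypothetical, and the upper bound is assumed to exclude 2.6.11 because the readme says the accounting issue is resolved in that release.

```python
import platform
import re


def parse_kernel_version(release):
    """Extract (major, minor, patch) from a release string such as '2.6.9-22.ELsmp'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def change_page_attr_disabled(release, machine):
    """Return True if the readme's rule would disable change_page_attr.

    Assumption: the affected range is 2.6.0 <= version < 2.6.11 on x86_64.
    """
    version = parse_kernel_version(release)
    return machine == "x86_64" and (2, 6, 0) <= version < (2, 6, 11)


if __name__ == "__main__":
    # Report the rule's verdict for the kernel this script is running on.
    uname = platform.uname()
    print(uname.release, uname.machine,
          change_page_attr_disabled(uname.release, uname.machine))
```

RHEL4 kernels are based on 2.6.9, so they fall inside this range on x86_64. The version number alone does not show whether a vendor kernel carries backported fixes.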

Cache Aliasing

    Cache aliasing occurs when multiple mappings to a physical page of
    memory have conflicting caching states, such as cached and uncached.
    Due to these conflicting states, data in that physical page may become
    corrupted when the processor's cache is flushed. If that page is being
    used for dma by a driver such as NVIDIA's graphics driver, this can
    lead to hardware stability problems and system lockups.
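
The definition above can be illustrated with a toy model. It is a sketch for explanation only and does not reflect how the kernel or the driver tracks mappings; the tuple layout and the state names are assumptions.

```python
from collections import defaultdict


def find_aliased_pages(mappings):
    """Return {physical_page: cache_states} for pages mapped with conflicting states.

    `mappings` is an iterable of (virtual_address, physical_page, cache_state)
    tuples. The layout and the state names are illustrative only.
    """
    states = defaultdict(set)
    for _virtual, physical, cache_state in mappings:
        states[physical].add(cache_state)
    # A page is aliased when its mappings disagree on the caching state.
    return {page: seen for page, seen in states.items() if len(seen) > 1}
```

A page mapped once as "cached" and once as "uncached" is reported; a page whose mappings all agree is not.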

    NVIDIA has encountered bugs with some Linux kernel versions that lead
    to cache aliasing. Although some systems will run perfectly fine when
    cache aliasing occurs, other systems will experience severe stability
    problems, including random lockups. Users experiencing stability
    problems due to cache aliasing will benefit from updating to a kernel
    that does not cause cache aliasing to occur.

    NVIDIA has added driver logic to detect cache aliasing and to print a
    warning with a message similar to the following:

        NVRM: bad caching on address 0x1cdf000: actual 0x46 != expected 0x73

    If you see this message in your log files and are experiencing
    stability problems, you should update your kernel to the latest
    version.

    If the message persists after updating your kernel, please send a bug
    report to NVIDIA.
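
A minimal sketch for finding this warning in a log follows. The pattern is modeled on the single sample message quoted in the readme, so the exact format printed by a given driver version is an assumption, and the helper name is hypothetical.

```python
import re

# Modeled on the sample message from the readme; other driver versions
# may word the warning differently (assumption).
BAD_CACHING = re.compile(
    r"NVRM: bad caching on address (0x[0-9a-fA-F]+): "
    r"actual (0x[0-9a-fA-F]+) != expected (0x[0-9a-fA-F]+)"
)


def find_bad_caching(lines):
    """Yield (address, actual, expected) as integers for each matching line."""
    for line in lines:
        match = BAD_CACHING.search(line)
        if match:
            yield tuple(int(value, 16) for value in match.groups())
```

It accepts any iterable of lines, for example an open file object for the system log. The location of that log depends on the system's syslog configuration.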


Version-Release number of selected component (if applicable):  all RHEL4 kernels


How reproducible:  
A full system crash occurs within 2 to 12 restarts of the X server.


Steps to Reproduce:

Log in and log out through the window manager's menu, or ctrl-alt-backspace, or
"init 3; init 5".
  
Actual results:  Either a crash of just X or a full system crash.  A red and
green square might appear in the upper right hand corner of the vt.


Expected results:  No crash.


Additional info:

Comment 1 Jason Baron 2005-08-24 17:11:02 UTC
There were a number of changes to change_page_attr for the rhel4 u2 kernel. It may be
worth testing to see if it resolves your issue:

http://people.redhat.com/~jbaron/rhel4/

Comment 2 Jason Baron 2005-12-07 04:34:44 UTC
any update on this?

Comment 3 Andrew Shewmaker 2005-12-10 22:59:11 UTC
The problem still occurred with rhel4 u2 and vanilla 2.6.13.2 kernels.  After much experimentation, I 
found that the crashes don't seem to happen (or are at least much less frequent) if the card doesn't 
probe my display device's specific EDID.  I am not seeing the instability when I use the CustomEDID 
option of the nvidia driver (the same EDID it probes) while simultaneously using a video modem device 
that reports a different EDID and simply passes the video on to the device with the troublesome EDID.  
Very strange.

In short, the instability does not appear to have anything to do with change_page_attr, but with the 
probing of the EDID.  I do still see warnings from the nvidia driver when I use the rhel4 u2 kernel, but I 
haven't seen any instability so far.

I still think that this should just work without me going through hoops, but I believe I'll have to work 
directly with nvidia (which I have been) to help them replicate this.

This is nvidia's problem and not redhat's.  Thanks for being interested, though.

Comment 4 Jason Baron 2005-12-13 02:16:53 UTC
ok. closing then. please re-open if you think this is a Red Hat issue. thanks.