Description of problem:
I believe I am experiencing instability caused by the cache aliasing described
below. Can I expect the change_page_attr accounting fixes to be backported to
the RHEL4 kernel, or should I upgrade the kernel to 2.6.11+ as nvidia suggests?
From nvidia's readme under Appendix L. Known Issues:
The X86-64 platform (AMD64/EM64T) and 2.6 kernels
Many 2.4 and 2.6 x86_64 kernels have an accounting issue in their
implementation of the change_page_attr kernel interface. Early 2.6
kernels include a check that triggers a BUG() when this situation is
encountered (triggering a BUG() results in the current application
being killed by the kernel; this application would be your opengl
application or potentially the X Server). The accounting issue has
been resolved in the 2.6.11 kernel.
We have added checks to recognize that the NVIDIA kernel module is
being compiled for the x86-64 platform on a kernel between 2.6.0 and
2.6.11. In this case, we will disable usage of the change_page_attr
kernel interface. This will avoid the accounting issue but leaves the
system in danger of cache aliasing (see entry on Cache Aliasing for
more information about cache aliasing). Note that this
change_page_attr accounting issue and BUG() can be triggered by other
kernel subsystems that rely on this interface.
If you are using a 2.6 x86_64 kernel, it is recommended that you
upgrade to a 2.6.11 or later kernel.
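As a rough illustration of the version window the readme describes, the sketch below tests whether a kernel release string falls in it. This is not NVIDIA's actual build-time check. The function names are hypothetical, and treating 2.6.11 itself as outside the window is an assumption based on the readme saying the issue is resolved in 2.6.11. A distribution kernel may carry backported fixes, so the version number alone is not conclusive.

```python
import re


def parse_kernel_version(release):
    """Extract (major, minor, patch) from a release string such as '2.6.9-22.EL'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


def in_affected_range(release, machine):
    """True for x86_64 kernels from 2.6.0 up to, but not including, 2.6.11.

    The exclusive upper bound is an assumption drawn from the readme text.
    """
    if machine != "x86_64":
        return False
    return (2, 6, 0) <= parse_kernel_version(release) < (2, 6, 11)
```

On a running system this could be called as in_affected_range(platform.release(), platform.machine()) after importing the standard platform module.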
Cache aliasing occurs when multiple mappings to a physical page of
memory have conflicting caching states, such as cached and uncached.
Due to these conflicting states, data in that physical page may become
corrupted when the processor's cache is flushed. If that page is being
used for dma by a driver such as NVIDIA's graphics driver, this can
lead to hardware stability problems and system lockups.
NVIDIA has encountered bugs with some Linux kernel versions that lead
to cache aliasing. Although some systems will run perfectly fine when
cache aliasing occurs, other systems will experience severe stability
problems, including random lockups. Users experiencing stability
problems due to cache aliasing will benefit from updating to a kernel
that does not cause cache aliasing to occur.
NVIDIA has added driver logic to detect cache aliasing and to print a
warning with a message similar to the following:
NVRM: bad caching on address 0x1cdf000: actual 0x46 != expected 0x73
If you see this message in your log files and are experiencing
stability problems, you should update your kernel to the latest version.
If the message persists after updating your kernel, please send a bug
report to NVIDIA.
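To check a log for the warning quoted above, a minimal sketch along these lines may help. The pattern assumes the message matches the readme's example exactly, although the readme only says the message is "similar". The function name is hypothetical, and the location of the log file (for example /var/log/messages) is also an assumption.

```python
import re

# Pattern modeled on the single example message in the readme; real
# messages may differ in detail.
BAD_CACHING = re.compile(
    r"NVRM: bad caching on address (0x[0-9a-fA-F]+): "
    r"actual (0x[0-9a-fA-F]+) != expected (0x[0-9a-fA-F]+)"
)


def find_bad_caching(lines):
    """Return (address, actual, expected) integer tuples for matching lines."""
    hits = []
    for line in lines:
        m = BAD_CACHING.search(line)
        if m:
            hits.append(tuple(int(group, 16) for group in m.groups()))
    return hits
```

Passing an open log file to find_bad_caching returns one tuple per warning. An empty list only means that no line matched this particular pattern.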
Version-Release number of selected component (if applicable): all RHEL4 kernels
How reproducible: A full system crash occurs after 2 to 12 restarts of the X server.
Steps to Reproduce:
Log in and log out through the window manager's menu, or ctrl-alt-backspace, or
"init 3; init 5".
Actual results: Either a crash of just X or a full system crash. A red and
green square might appear in the upper right hand corner of the vt.
Expected results: No crash.
There were a number of changes to change_page_attr in the RHEL4 U2 kernel. It may be
worth testing to see if it resolves your issue:
Any update on this?
The problem still occurred with rhel4 u2 and vanilla 18.104.22.168 kernels. After much experimentation, I
found that the crashes don't seem to happen (or are at least much less frequent) if the card doesn't
probe my display device's specific EDID. I am not seeing the instability when I use the CustomEDID
option of the nvidia driver (the same EDID it probes) while simultaneously using a video modem device
that reports a different EDID and simply passes the video on to the device with the troublesome EDID.
In short, the instability does not appear to have anything to do with change_page_attr, but with the
probing of the EDID. I do still see warnings from the nvidia driver when I use the rhel4 u2 kernel, but I
haven't seen any instability so far.
I still think that this should just work without me jumping through hoops, but I believe I'll have to work
directly with nvidia (which I have been) to help them replicate this.
This is nvidia's problem and not redhat's. Thanks for being interested, though.
ok. closing then. please re-open if you think this is a Red Hat issue. thanks.