Bug 168936
Summary: | System instability when using the NVIDIA driver (i.e. bad caching on address) | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Etienne Clement <etienne.clement> |
Component: | kernel | Assignee: | Ingo Molnar <mingo> |
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.2 | CC: | bmaly, george.liu, jbaron, k.georgiou, lwang, lwoodman, netllama, rmuthu, tburke, tjb, tkincaid |
Target Milestone: | --- | Keywords: | Regression |
Target Release: | --- | ||
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHSA-2005-808 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-10-27 15:08:19 UTC | Type: | --- |
Description
Etienne Clement
2005-09-21 14:20:09 UTC
hmmm. so are you saying that the upstream > 2.6.11-rc1 kernels work fine?

also, is there any way to load the nvidia driver so that it uses change_page_attr? We have the upstream fixes in U2, thus it doesn't need to be disabled. Our kernel 'version' is 2.6.9, so i'm not sure how the nvidia driver deals with that. but if there is a way to 'trick' it, it would be an interesting test case.

(In reply to comment #1)
> hmmm. so are you saying that the upstream > 2.6.11-rc1 kernels work fine?

Hi Jason.

Yes, the NVIDIA error message disappears in kernel >= 2.6.11-rc1. I have made some more tests today and it seems that I might be chasing 2 different bugs. I think the random reboot and the kernel oops might have two different causes. Here is the latest information I have gathered about the bugs:

1- Random reboot:
-----------------
I was able to reproduce the random reboot problem in kernel 2.6.11-rc1, in which the NVIDIA error message is not present. I am now trying to reproduce the problem in kernel 2.6.11 and I haven't been able to reproduce it yet. I'll try to narrow it down some more.

2- Kernel oops:
---------------
So far I've seen this problem only in kernel 2.6.9-22. I know that it was not present in kernel 2.6.9-5 and I haven't seen it on any non-redhat kernels. This problem is reproducible every time. Therefore, I could try to reproduce it on other kernel versions if you'd like.

Enjoy,
Etienne

(In reply to comment #2)
> also, is there any way to load the nvidia driver so that it uses
> change_page_attr? We have the upstream fixes in U2, thus it doesn't need to be
> disabled. Our kernel 'version' is 2.6.9, so i'm not sure how the nvidia driver
> deals with that. but if there is a way to 'trick' it, it would be an interesting
> test case.

I'll try to get the information from NVIDIA.

cool. It might be the case that the 'old' driver is making pre-2.6.11 assumptions which are no longer true and this is leading to the BUG().
Thus, getting the driver to 'think' it's on 2.6.11+ *might* solve the oops. As another data point, if you don't mind, could you report on the following two kernels:

1) this kernel doesn't have the "fix NX text/large-page interaction" change (bz #163238):
http://people.redhat.com/~jbaron/tests/2.6.9-11.37.EL/

2) the U1 kernel:
http://people.redhat.com/~jbaron/tests/2.6.9-11.EL/

I really want to better understand which kernel changes during U2 are causing this changed behavior. thanks.

No problem, I'll run the tests right away and send you the results.

The random reboot problem seems to have been fixed in 2.6.11-rc3, since I can reproduce it in 2.6.11-rc2 but not in 2.6.11-rc3. However, I don't know if it's fixed in 2.6.9-22, since I always hit the oops first.

Etienne

Here are the results:

- 2.6.9-11.37.EL: No oops. No error message in the syslog.
- 2.6.9-11.EL: No oops. The following error message appears in the syslog:
  NVRM: bad caching on address 0xf5c35000: actual 0x163 != expected 0x173

Etienne

thanks. So 11.37 seems to be the best? Any observable problems with 11.37? Is the reboot problem present in -11 or -11.37.EL?

I haven't played much with -11.37 yet. However, I am currently trying to reproduce the reboot problem in -11 and haven't been able to yet (looks good). I am going to try to reproduce the reboot problem on -11 and -11.37 and let you know. -11 is the update 1 kernel and -11.37 is a post update 1 kernel, right?

Hi Jason,

-11 and -11.37 do not seem to exhibit the random reboot problem, and so far they have both run just fine. However, I'd be very curious to see if the problem is also fixed in -22. Therefore, let me know if you need anything else to help you find the oops problem in -22.

Etienne

thanks for the update. The problem in -22 is related to NX, or the non-execute bit, interacting with some changes we made there. An interesting experiment would be to try and disable nx (i think the only way to do this is via the BIOS), and see if -22 is stable.
Also, i believe the reboot problem is still present in -11 and -11.37, but just hard to hit....

What makes you think that the reboot problem has not been fixed in -11 and -11.37? Remember that the reboot problem might not be due to the nvidia driver.

I am not too sure what NX actually is. However, I am going to have a look in the BIOS to see if there are some settings for that and let you know the result.

Etienne

i think the reboot issue isn't fixed b/c i've seen reports for it in -11 and in -22.

NX is a flag that marks regions of memory as not executable, so that stack overflow exploits are not viable. In the BIOS on a box i have, it's called 'Execute Disable'.

I have just started trying to run this on a new system and had the same error messages for the NVIDIA driver. I upgraded my machine to the U2 beta channel and rebuilt the driver. Now when the NVIDIA module loads I get the message about needing kernel 2.6.11 and the machine hard locks (only the caps lock and scroll lock lights are on).

What is the graphics card you are using on your system?

ok. Can you try this kernel please:
http://people.redhat.com/~jbaron/nx/

Jason,

Can I please get the src rpm for the above kernel?

Hi Jason,

Here are the results:

2.6.9-22:
  BIOS: No execute mode mem protection [Disabled] -> OK
  BIOS: No execute mode mem protection [Enabled]  -> BUG

2.6.9-22.nx:
  BIOS: No execute mode mem protection [Disabled] -> OK
  BIOS: No execute mode mem protection [Enabled]  -> OK

So 2.6.9-22.nx seems to fix the problem =). Also, more good news: we haven't seen the random reboot problem since we started running 2.6.9-11.

Etienne

hi Etienne,

thanks for testing the nx fixes. i believe that this kernel does indeed fix both the intermittent reboot issues as well as the kernel panics. I think i previously said that the reboot problem might still be present but that is not the case. thanks again.
-Jason

Jason,

Here at a Red Hat customer site, we are using RHEL4 U2 with kernel-smp-2.6.9-22.EL on a DELL GX system with an nVidia card and have the exact same problem.

Steps to reproduce:
1. From a shell start glxgears: /usr/X11R6/bin/glxgears
2. Kill it using Ctrl+C

Actual Results: You get the OOPS in the description.

Expected Results: No OOPS.

I disabled NX (Execute Disabled) in the BIOS and it worked fine with no crash.

The customer is ready to roll out RHEL4 U2 on their desktops by October '05, so my question is: when can we expect a kernel update that has a fix for this issue? We don't want to be disabling NX in the BIOS.

-Raja Muthu
rmuthu

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-808.html
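
For readers trying to interpret the "NVRM: bad caching on address ...: actual 0x163 != expected 0x173" message reported earlier in this thread, here is a minimal sketch that decodes the two values. It assumes (this is not confirmed anywhere in the report) that they are the low flag bits of an x86 page-table entry; the bit names follow the standard x86 layout, and the helper names `decode` and `diff` are made up for illustration. Under that assumption the only differing bit is PCD (page-level cache disable), which would mean the driver expected an uncached mapping and found a cached one.

```python
# Standard x86 page-table entry flag bits (low 9 bits).
# Assumption: the values in the NVRM message are PTE flag bits.
PTE_FLAGS = {
    0x001: "PRESENT",
    0x002: "RW",
    0x004: "USER",
    0x008: "PWT",       # page-level write-through
    0x010: "PCD",       # page-level cache disable
    0x020: "ACCESSED",
    0x040: "DIRTY",
    0x080: "PSE/PAT",
    0x100: "GLOBAL",
}


def decode(flags):
    """Return the names of the flag bits set in `flags`, lowest bit first."""
    return [name for bit, name in sorted(PTE_FLAGS.items()) if flags & bit]


def diff(actual, expected):
    """Return the names of the flag bits that differ between the two values."""
    return decode(actual ^ expected)


if __name__ == "__main__":
    actual, expected = 0x163, 0x173  # values from the syslog message
    print("actual:  ", decode(actual))
    print("expected:", decode(expected))
    print("differs: ", diff(actual, expected))
```

If the assumption holds, this matches the direction of the discussion above: the driver's expectations about how the kernel sets page attributes (for example via change_page_attr) no longer lined up with what the kernel actually did.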