From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
We are experiencing some system instability when using the NVIDIA driver on an IBM 622322U machine. In the syslog the following error message appears several times:

NVRM: bad caching on address 0xf47a5000: actual 0x163 != expected 0x173

According to NVIDIA this is due to some change_page_attr bugs present in the Linux kernel which were resolved in kernel 2.6.11. Therefore, I compiled some vanilla kernels from kernel.org and observed that the error message is present in 2.6.10 but not in 2.6.11-rc1.

I also made some tests with different versions of the RHEL 4.0 kernels and made the following observations:

Using kernel-smp-2.6.9-5.EL:
----------------------------
The system has a tendency to reboot randomly. Unfortunately, it is not easy to reproduce, but it happens while using an OpenGL application.

Using kernel-smp-2.6.9-22.EL:
-----------------------------
I get the following OOPS when exiting an OpenGL application.

NVRM: loading NVIDIA Linux x86 NVIDIA Kernel Module 1.0-7676 Fri Jul 29 12:58:54 PDT 2005
NVRM: bad caching on address 0xf3365000: actual 0x63 != expected 0x163
NVRM: please see the README section on Cache Aliasing for more information
NVRM: bad caching on address 0xf3365000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf3366000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0xf3366000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0xf2c38000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c39000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c3a000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c3b000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c3c000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c3d000: actual 0x63 != expected 0x163
------------[ cut here ]------------
kernel BUG at arch/i386/mm/pageattr.c:155!
invalid operand: 0000 [#1]
SMP
Modules linked in: nvidia(U) netconsole netdump nfsd exportfs parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd sunrpc ide_scsi dm_mod button battery ac md5 ipv6 joydev wacom uhci_hcd ehci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore tg3 ext3 jbd aic79xx sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c011bbd1>] Tainted: P VLI
EFLAGS: 00010002 (2.6.9-22.ELsmp)
EIP is at __change_page_attr+0x332/0x400
eax: 00000080 ebx: 00000080 ecx: 00000000 edx: 00000000
esi: 00000163 edi: 80000000 ebp: c0007d88 esp: f2cb1bf0
ds: 007b es: 007b ss: 0068
Process glxgears (pid: 5057, threadinfo=f2cb1000 task=f3138e30)
Stack: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
       00000000 00000000 c1000000 c10000e0 f630f000 00000163 80000000 c16c61e0
       32cf2163 80000000 00000025 f90e0360 c16c61e0 00000000 00000000 00000001
Call Trace:
 [<f90e0360>] _nv002104rm+0x30/0x40 [nvidia]
 [<c011bccc>] change_page_attr+0x2d/0x50
 [<f92e39ed>] nv_vm_free_pages+0x8d/0xf0 [nvidia]
 [<f92e26b5>] nv_free_pages+0x2cf/0x2f5 [nvidia]
 [<f90de2ba>] _nv002097rm+0x4e/0x58 [nvidia]
 [<f90de299>] _nv002097rm+0x2d/0x58 [nvidia]
 [<f90be177>] _nv007945rm+0x13/0x38 [nvidia]
 [<f90c764e>] _nv001485rm+0x1be/0x1d4 [nvidia]
 [<f90c81ca>] _nv001474rm+0x8e/0x9c [nvidia]
 [<f90c741b>] _nv001487rm+0xa3/0x118 [nvidia]
 [<f920bf1a>] _nv004362rm+0x8a/0x94 [nvidia]
 [<f90c7155>] _nv001492rm+0x3d/0x260 [nvidia]
 [<f90e4a66>] rm_disable_interrupts+0x42/0x54 [nvidia]
 [<f90ddca3>] _nv002058rm+0x1b/0x20 [nvidia]
 [<f90e3d2f>] _nv001472rm+0x67/0x94 [nvidia]
 [<f90e3d22>] _nv001472rm+0x5a/0x94 [nvidia]
 [<f90de10a>] _nv002123rm+0x12/0x18 [nvidia]
 [<f90e4f5e>] rm_free_unused_clients+0x2e/0x88 [nvidia]
 [<f90e4f95>] rm_free_unused_clients+0x65/0x88 [nvidia]
 [<f90e4f81>] rm_free_unused_clients+0x51/0x88 [nvidia]
 [<c02cf6e3>] __cond_resched+0x14/0x39
 [<f92e1563>] nv_kern_ctl_close+0xa8/0xdf [nvidia]
 [<f92e0569>] nv_kern_close+0x41/0x193 [nvidia]
 [<c016f49e>] destroy_inode+0x3d/0x4c
 [<c015a8fe>] __fput+0x55/0x100
 [<c0159545>] filp_close+0x59/0x5f
 [<c01235cf>] put_files_struct+0x57/0xc0
 [<c01241e5>] do_exit+0x227/0x3de
 [<c0124487>] sys_exit_group+0x0/0xd
 [<c012c56b>] get_signal_to_deliver+0x350/0x378
 [<c0105ba4>] do_signal+0x55/0xd9
 [<c0129814>] del_timer+0x5d/0x65
 [<c016a866>] do_pollfd+0x54/0x77
 [<c0169fc0>] poll_freewait+0x33/0x38
 [<c016ab75>] sys_poll+0x240/0x24f
 [<c0169fc5>] __pollwait+0x0/0x95
 [<c0105c50>] do_notify_resume+0x28/0x38
 [<c02d111a>] work_notifysig+0x13/0x15
Code: 89 f0 09 da 89 54 24 44 09 c8 8b 4c 24 44 89 44 24 40 8b 5c 24 40 8b 07 8b 57 04 f0 0f c7 0f 75 f5 8b 44 24 2c f0 ff 48 04 eb 08 <0f> 0b 9b 00 d5 26 2e c0 a1 8c 04 32 c0 a8 08 0f 84 af 00 00 00

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-5.EL and kernel-smp-2.6.9-22.EL

How reproducible:
Always

Steps to Reproduce:
It is much easier to reproduce the problem on kernel-smp-2.6.9-22.EL.
Therefore, here are the steps to reproduce it on that particular kernel.
1. From a shell start glxgears: /usr/X11R6/bin/glxgears
2. Kill it using Ctrl+C

Actual Results:
You get the OOPS in the description.

Expected Results:
No OOPS.

Additional info:
- The machine has the following configuration:
  o 2 x 3.6GHz Intel Xeon
  o 3GB RAM
  o Quadro FX 3400 PCI-Express using driver version 7676
- The problem does not seem to be specific to the model of the graphics board, since we were able to reproduce it with different board models.
- The problem does not seem to be specific to the NVIDIA driver version, since we were able to reproduce it with different driver versions.
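Note on reading the NVRM messages: the "actual" and "expected" values in the report look like the low flag bits of an x86 page-table entry. That reading is an assumption; nothing in this report states it. Under that assumption, the small Python sketch below names the bits that differ between the two values. The helper names (flag_names, diff_flags) are made up for illustration and are not part of any NVIDIA or kernel tool.

```python
import re

# Low flag bits of an x86 page-table entry (assumed interpretation of the
# values printed by the NVRM "bad caching" message).
PTE_FLAGS = {
    0x001: "PRESENT",
    0x002: "RW",
    0x004: "USER",
    0x008: "PWT",
    0x010: "PCD",       # page-level cache disable
    0x020: "ACCESSED",
    0x040: "DIRTY",
    0x080: "PSE/PAT",
    0x100: "GLOBAL",
}

NVRM_RE = re.compile(
    r"bad caching on address (0x[0-9a-fA-F]+): "
    r"actual (0x[0-9a-fA-F]+) != expected (0x[0-9a-fA-F]+)"
)


def flag_names(value):
    """Return the names of the flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(PTE_FLAGS.items()) if value & bit]


def diff_flags(line):
    """Return the names of the bits that differ between actual and expected.

    Returns None if the line is not an NVRM "bad caching" message.
    """
    match = NVRM_RE.search(line)
    if match is None:
        return None
    actual = int(match.group(2), 16)
    expected = int(match.group(3), 16)
    return flag_names(actual ^ expected)


if __name__ == "__main__":
    for msg in (
        "NVRM: bad caching on address 0xf47a5000: actual 0x163 != expected 0x173",
        "NVRM: bad caching on address 0xf3365000: actual 0x63 != expected 0x163",
    ):
        print(diff_flags(msg))
```

Read this way, the 0x163 vs 0x173 messages differ only in the cache-disable bit, while the 0x63 vs 0x163 messages seen on -22 differ only in the global bit, which would be consistent with two distinct symptoms.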
hmmm. so are you saying that the upstream > 2.6.11-rc1 kernels work fine?
also, is there any way to load the nvidia driver so that it uses change_page_attr? We have the upstream fixes in U2, thus it doesn't need to be disabled. Our kernel 'version' is 2.6.9, so i'm not sure how the nvidia driver deals with that. but if there is a way to 'trick' it, it would be an interesting test case.
(In reply to comment #1)
> hmmm. so are you saying that the upstream > 2.6.11-rc1 kernels work fine?

Hi Jason.

Yes, the NVIDIA error message disappears in kernel >= 2.6.11-rc1.

I have made some more tests today and it seems that I might be chasing 2 different bugs. I think the random reboot and the kernel oops might have two different causes. Here is the latest information I have gathered about the bugs:

1- Random reboot:
-----------------
I was able to reproduce the random reboot problem in kernel 2.6.11-rc1, in which the NVIDIA error message is not present. I am now trying to reproduce the problem in kernel 2.6.11 and I haven't been able to reproduce it yet. I'll try to narrow it down some more.

2- Kernel oops:
---------------
So far I've seen this problem only in kernel 2.6.9-22. I know that it was not present in kernel 2.6.9-5 and I haven't seen it on any non-redhat kernels. This problem is reproducible every time. Therefore, I could try to reproduce it on other kernel versions if you'd like.

Enjoy,
Etienne
(In reply to comment #2)
> also, is there anyway to load the nvidia driver so that it uses
> change_page_attr. We have the upstream fixes in U2. thus, it doesn't need to be
> disabled. Our kernel 'version' is 2.6.9, so i'm not sure how the nvidia driver
> deals with that. but if there is a way to 'trick' it, it would be an interesting
> test case.

I'll try to get the information from NVIDIA.
cool. It might be the case that the 'old' driver is making pre-2.6.11 assumptions which are no longer true and this is leading to the BUG(). Thus, getting the driver to 'think' it's on 2.6.11+ *might* solve the oops.
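Note on the version question: this thread does not say how the NVIDIA driver decides which kernel it is on. One common mechanism for out-of-tree modules is a comparison of LINUX_VERSION_CODE against KERNEL_VERSION(2,6,11); treating that as the mechanism here is an assumption. The Python sketch below only mimics the arithmetic of the kernel's KERNEL_VERSION macro to show why a vendor kernel that carries backported fixes but still reports 2.6.9 would fail such a check. The function names parse_release and looks_like_2_6_11_or_later are hypothetical.

```python
def kernel_version(major, minor, patch):
    """Same arithmetic as the kernel's KERNEL_VERSION(a, b, c) macro."""
    return (major << 16) + (minor << 8) + patch


def parse_release(release):
    """Extract (major, minor, patch) from a release string.

    Anything after the first '-' (vendor build number, "rc" tag) is
    ignored, which is exactly why backported fixes are invisible here.
    """
    base = release.split("-", 1)[0]
    major, minor, patch = (int(part) for part in base.split(".")[:3])
    return major, minor, patch


def looks_like_2_6_11_or_later(release):
    """True if a plain version-code comparison would treat it as >= 2.6.11."""
    return kernel_version(*parse_release(release)) >= kernel_version(2, 6, 11)


if __name__ == "__main__":
    for release in ("2.6.9-22.ELsmp", "2.6.10", "2.6.11-rc1"):
        print(release, looks_like_2_6_11_or_later(release))
```

If the driver does gate on something like this, building it with the check forced on would be one way to run the 'trick' experiment, but that would need confirmation from NVIDIA.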
As another data point, if you don't mind, could you report on the following two kernels:

1) a kernel that doesn't have the "fix NX text/large-page interaction" change (bz #163238):
http://people.redhat.com/~jbaron/tests/2.6.9-11.37.EL/

2) the U1 kernel:
http://people.redhat.com/~jbaron/tests/2.6.9-11.EL/

I really want to better understand which kernel changes during U2 are causing this changed behavior. thanks.
No problem, I'll run the tests right away and send you the results.

The random reboot problem seems to have been fixed in 2.6.11-rc3, since I can reproduce it in 2.6.11-rc2 but not in 2.6.11-rc3. However, I don't know if it's fixed in 2.6.9-22, since I always hit the oops first.

Etienne
Here are the results:

- 2.6.9-11.37.EL: No oops. No error message in the syslog.
- 2.6.9-11.EL: No oops. The following error message appears in the syslog:
  NVRM: bad caching on address 0xf5c35000: actual 0x163 != expected 0x173

Etienne
thanks. So 11.37 seems to be the best? Any observable problems with 11.37? Is the reboot problem present in -11 or -11.37.EL?
I haven't played much with -11.37 yet. However, I am currently trying to reproduce the reboot problem in -11 and haven't been able to yet (looks good). I am going to try to reproduce the reboot problem on -11 and -11.37 and let you know. -11 is the update 1 kernel and -11.37 is a post update 1 kernel, right?
Hi Jason, -11 and -11.37 do not seem to exhibit the random reboot problem and so far they both ran just fine. However, I'd be very curious to see if the problem is also fixed in -22. Therefore, let me know if you need anything else to help you find the oops problem in -22. Etienne
thanks for the update. The problem in -22 is related to NX, the non-execute bit, interacting with some changes we made there. An interesting experiment would be to try and disable nx (i think the only way to do this is via the BIOS), and see if -22 is stable. Also, i believe the reboot problem is still present in -11 and -11.37, but just hard to hit....
What makes you think that the reboot problem has not been fixed in -11 and -11.37? Remember that the reboot problem might not be due to the nvidia driver.

I am not too sure what NX actually is. However, I am going to have a look in the BIOS to see if there is a setting for it and let you know the result.

Etienne
i think the reboot issue isn't fixed b/c i've seen reports for it in -11 and in -22. NX is a flag that marks regions of memory as not executable, so that stack overflow exploits are not viable. In the BIOS on a box i have it's called 'Execute Disable'
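Note on checking NX from the running system: on x86 Linux the CPU feature flags are listed in /proc/cpuinfo, and "nx" appears there when the processor advertises the feature. The Python sketch below parses that text. Whether the flag disappears when the BIOS option is switched off depends on the firmware, so treat the result as a hint rather than proof. The function names cpu_flags and nx_advertised are illustrative only.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def nx_advertised(cpuinfo_text):
    """True if the 'nx' feature flag is listed for the first CPU."""
    return "nx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    # Hand-written sample in the /proc/cpuinfo layout, not taken from the
    # machines discussed in this report.
    sample = "processor\t: 0\nflags\t\t: fpu vme de pse pae nx lm\n"
    print(nx_advertised(sample))
```

On a real machine, pass the contents of /proc/cpuinfo (for example open("/proc/cpuinfo").read()) instead of the sample string.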
I have just started trying to run this on a new system and had the same error messages for the NVIDIA driver. I upgraded my machine to the U2 beta channel and rebuilt the driver. Now when the NVIDIA module loads I get the message about needing kernel 2.6.11 and the machine hard locks (only the caps lock and scroll lock lights are on).
What is the graphics card you are using on your system?
ok. Can you try this kernel please: http://people.redhat.com/~jbaron/nx/
Jason, Can I please get the src rpm for that above kernel?
Hi Jason,

Here are the results:

2.6.9-22:
  BIOS: No execute mode mem protection [Disabled] -> OK
  BIOS: No execute mode mem protection [Enabled]  -> BUG

2.6.9-22.nx:
  BIOS: No execute mode mem protection [Disabled] -> OK
  BIOS: No execute mode mem protection [Enabled]  -> OK

So 2.6.9-22.nx seems to fix the problem =).

Also, another piece of good news is that we haven't seen the random reboot problem since we started running 2.6.9-11.

Etienne
hi Etienne, thanks for testing the nx fixes. i believe that this kernel does indeed fix both the intermittent reboot issues as well as the kernel panics. I think i previously said that the reboot problem might still be present but that is not the case. thanks again. -Jason
Jason,

Here at a Red Hat customer site, we are using RHEL4 U2 with kernel-smp-2.6.9-22.EL on a DELL GX system with an nVidia card and have the exact same problem:

1. From a shell start glxgears: /usr/X11R6/bin/glxgears
2. Kill it using Ctrl+C

Actual Results: You get the OOPS in the description.

Expected Results: No OOPS.

I disabled NX (Execute Disabled) in the BIOS and it worked fine with no crash.

The customer is ready to roll out RHEL4 U2 on their desktops by october'05, so my question is: when can we expect a kernel update that has a fix for this issue? We don't want to be disabling NX in the BIOS.

-Raja Muthu
rmuthu
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-808.html