Bug 168936

Summary:	System instability when using the NVIDIA driver (i.e bad caching on address)
Product:	Red Hat Enterprise Linux 4	Reporter:	Etienne Clement <etienne.clement>
Component:	kernel	Assignee:	Ingo Molnar <mingo>
Status:	CLOSED ERRATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.2	CC:	bmaly, george.liu, jbaron, k.georgiou, lwang, lwoodman, netllama, rmuthu, tburke, tjb, tkincaid
Target Milestone:	---	Keywords:	Regression
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:	RHSA-2005-808	Doc Type:	Bug Fix
Last Closed:	2005-10-27 15:08:19 UTC	Type:	---

Description Etienne Clement 2005-09-21 14:20:09 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
We are experiencing some system instability when using the NVIDIA driver on an IBM 622322U machine.

In the syslog, the following error message appears several times:

NVRM: bad caching on address 0xf47a5000: actual 0x163 != expected 0x173

According to NVIDIA, this is due to some change_page_attr bugs in the Linux kernel that were resolved in kernel 2.6.11. I therefore compiled some vanilla kernels from kernel.org and observed that the error message is present in 2.6.10 but not in 2.6.11-rc1.
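
For readers trying to interpret the "actual" and "expected" values in the NVRM message, here is a minimal sketch that parses such a line and decodes both values as the low flag bits of an x86 page-table entry. The bit names are the standard architectural ones; that the driver reports raw page-table flag bits in this message is an assumption, and the helper names (decode_flags, explain) are hypothetical.

```python
import re

# Low flag bits of an x86 page-table entry, as defined by the architecture.
PTE_FLAGS = {
    0x001: "PRESENT",
    0x002: "RW",
    0x004: "USER",
    0x008: "PWT",
    0x010: "PCD",
    0x020: "ACCESSED",
    0x040: "DIRTY",
    0x080: "PSE/PAT",
    0x100: "GLOBAL",
}

# Matches the message format seen in the syslog excerpts of this report.
NVRM_RE = re.compile(
    r"NVRM: bad caching on address (0x[0-9a-f]+): "
    r"actual (0x[0-9a-f]+) != expected (0x[0-9a-f]+)"
)


def decode_flags(value):
    """Return the names of the flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(PTE_FLAGS.items()) if value & bit]


def explain(line):
    """Parse one NVRM line; return None if the line does not match."""
    match = NVRM_RE.search(line)
    if match is None:
        return None
    address, actual, expected = (int(group, 16) for group in match.groups())
    return {
        "address": address,
        "actual": decode_flags(actual),
        "expected": decode_flags(expected),
        # XOR isolates the bits on which the two values disagree.
        "differing": decode_flags(actual ^ expected),
    }
```

Under that assumption, 0x163 and 0x173 differ only in the PCD (page cache disable) bit, which fits the "bad caching" wording, while 0x63 and 0x163 differ only in the GLOBAL bit.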

I also ran some tests with different versions of the RHEL 4.0 kernels and observed the following:

Using kernel-smp-2.6.9-5.EL:
----------------------------

The system has a tendency to reboot randomly. Unfortunately, it is not easy to reproduce, but it happens while using an OpenGL application.

Using kernel-smp-2.6.9-22.EL:
-----------------------------

I get the following OOPS when exiting an OpenGL application.

NVRM: loading NVIDIA Linux x86 NVIDIA Kernel Module  1.0-7676  Fri Jul 29 12:58:54 PDT 2005
NVRM: bad caching on address 0xf3365000: actual 0x63 != expected 0x163
NVRM: please see the README section on Cache Aliasing for more information
NVRM: bad caching on address 0xf3365000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf3366000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0xf3366000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0xf2c38000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c39000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c3a000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c3b000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c3c000: actual 0x63 != expected 0x163
NVRM: bad caching on address 0xf2c3d000: actual 0x63 != expected 0x163
------------[ cut here ]------------
kernel BUG at arch/i386/mm/pageattr.c:155!
invalid operand: 0000 [#1]
SMP 
Modules linked in: nvidia(U) netconsole netdump nfsd exportfs parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd sunrpc ide_scsi dm_mod button battery ac md5 ipv6 joydev wacom uhci_hcd ehci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore tg3 ext3 jbd aic79xx sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c011bbd1>]    Tainted: P      VLI
EFLAGS: 00010002   (2.6.9-22.ELsmp) 
EIP is at __change_page_attr+0x332/0x400
eax: 00000080   ebx: 00000080   ecx: 00000000   edx: 00000000
esi: 00000163   edi: 80000000   ebp: c0007d88   esp: f2cb1bf0
ds: 007b   es: 007b   ss: 0068
Process glxgears (pid: 5057, threadinfo=f2cb1000 task=f3138e30)
Stack: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
       00000000 00000000 c1000000 c10000e0 f630f000 00000163 80000000 c16c61e0 
       32cf2163 80000000 00000025 f90e0360 c16c61e0 00000000 00000000 00000001 
Call Trace:
 [<f90e0360>] _nv002104rm+0x30/0x40 [nvidia]
 [<c011bccc>] change_page_attr+0x2d/0x50
 [<f92e39ed>] nv_vm_free_pages+0x8d/0xf0 [nvidia]
 [<f92e26b5>] nv_free_pages+0x2cf/0x2f5 [nvidia]
 [<f90de2ba>] _nv002097rm+0x4e/0x58 [nvidia]
 [<f90de299>] _nv002097rm+0x2d/0x58 [nvidia]
 [<f90be177>] _nv007945rm+0x13/0x38 [nvidia]
 [<f90c764e>] _nv001485rm+0x1be/0x1d4 [nvidia]
 [<f90c81ca>] _nv001474rm+0x8e/0x9c [nvidia]
 [<f90c741b>] _nv001487rm+0xa3/0x118 [nvidia]
 [<f920bf1a>] _nv004362rm+0x8a/0x94 [nvidia]
 [<f90c7155>] _nv001492rm+0x3d/0x260 [nvidia]
 [<f90e4a66>] rm_disable_interrupts+0x42/0x54 [nvidia]
 [<f90ddca3>] _nv002058rm+0x1b/0x20 [nvidia]
 [<f90e3d2f>] _nv001472rm+0x67/0x94 [nvidia]
 [<f90e3d22>] _nv001472rm+0x5a/0x94 [nvidia]
 [<f90de10a>] _nv002123rm+0x12/0x18 [nvidia]
 [<f90e4f5e>] rm_free_unused_clients+0x2e/0x88 [nvidia]
 [<f90e4f95>] rm_free_unused_clients+0x65/0x88 [nvidia]
 [<f90e4f81>] rm_free_unused_clients+0x51/0x88 [nvidia]
 [<c02cf6e3>] __cond_resched+0x14/0x39
 [<f92e1563>] nv_kern_ctl_close+0xa8/0xdf [nvidia]
 [<f92e0569>] nv_kern_close+0x41/0x193 [nvidia]
 [<c016f49e>] destroy_inode+0x3d/0x4c
 [<c015a8fe>] __fput+0x55/0x100
 [<c0159545>] filp_close+0x59/0x5f
 [<c01235cf>] put_files_struct+0x57/0xc0
 [<c01241e5>] do_exit+0x227/0x3de
 [<c0124487>] sys_exit_group+0x0/0xd
 [<c012c56b>] get_signal_to_deliver+0x350/0x378
 [<c0105ba4>] do_signal+0x55/0xd9
 [<c0129814>] del_timer+0x5d/0x65
 [<c016a866>] do_pollfd+0x54/0x77
 [<c0169fc0>] poll_freewait+0x33/0x38
 [<c016ab75>] sys_poll+0x240/0x24f
 [<c0169fc5>] __pollwait+0x0/0x95
 [<c0105c50>] do_notify_resume+0x28/0x38
 [<c02d111a>] work_notifysig+0x13/0x15
Code: 89 f0 09 da 89 54 24 44 09 c8 8b 4c 24 44 89 44 24 40 8b 5c 24 40 8b 07 8b 57 04 f0 0f c7 0f 75 f5 8b 44 24 2c f0 ff 48 04 eb 08 <0f> 0b 9b 00 d5 26 2e c0 a1 8c 04 32 c0 a8 08 0f 84 af 00 00 00

Version-Release number of selected component (if applicable):
kernel-smp-2.6.9-5.EL and kernel-smp-2.6.9-22.EL

How reproducible:
Always

Steps to Reproduce:
It is much easier to reproduce the problem on kernel-smp-2.6.9-22.EL. Therefore, here are the steps to reproduce it on that particular kernel.

1. From a shell start glxgears: /usr/X11R6/bin/glxgears
2. Kill it using Ctrl+C
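
For repeated runs, the two steps can be scripted. This is only a sketch under assumptions: the function name and the timing defaults are hypothetical, and sending SIGINT to the process stands in for pressing Ctrl+C in the terminal.

```python
import signal
import subprocess
import time


def run_and_interrupt(command, run_seconds=5.0, wait_seconds=10.0):
    """Start command, let it run briefly, send SIGINT, return its exit status."""
    proc = subprocess.Popen(command)
    time.sleep(run_seconds)
    # SIGINT is the signal the terminal delivers on Ctrl+C.
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored SIGINT or is stuck; force it down.
        proc.kill()
        return proc.wait()


# Example, on the affected machine with an X session available:
#   run_and_interrupt(["/usr/X11R6/bin/glxgears"])
```

The script only automates the trigger; the OOPS itself still has to be checked in the syslog.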
  

Actual Results:  You get the OOPS in the description.

Expected Results:  No OOPS.

Additional info:

- The machine has the following configuration:
    o 2 x 3.6GHz Intel Xeon
    o 3GB RAM
    o Quadro FX 3400 PCI-Express using driver version 7676

- The problem does not seem to be specific to the model of the graphics board, since we were able to reproduce it with different board models.

- The problem does not seem to be specific to the NVIDIA driver version, since we were able to reproduce it with different driver versions.

Comment 1 Jason Baron 2005-09-21 20:42:04 UTC

hmmm. so are you saying that the upstream > 2.6.11-rc1 kernels work fine?

Comment 2 Jason Baron 2005-09-21 21:28:21 UTC

also, is there any way to load the nvidia driver so that it uses
change_page_attr? We have the upstream fixes in U2, so it doesn't need to be
disabled. Our kernel 'version' is 2.6.9, so i'm not sure how the nvidia driver
deals with that. but if there is a way to 'trick' it, it would be an interesting
test case.

Comment 3 Etienne Clement 2005-09-22 02:41:07 UTC

(In reply to comment #1)
> hmmm. so are you saying that the upstream > 2.6.11-rc1 kernels work fine?

Hi Jason.

Yes, the NVIDIA error message disappears in kernel >= 2.6.11-rc1.

I have made some more tests today and it seems that I might be chasing 2
different bugs. I think the random reboot and the kernel oops might have two
different causes. Here is the latest information I have gathered about the bugs:

1- Random reboot:
-----------------

I was able to reproduce the random reboot problem in kernel 2.6.11-rc1, in which
the NVIDIA error message is not present. I am now trying to reproduce the
problem in kernel 2.6.11 and I haven't been able to reproduce it yet. I'll try
to narrow it down some more.

2- Kernel oops:
---------------

So far I've seen this problem only in kernel 2.6.9-22. I know that it was not
present in kernel 2.6.9-5 and I haven't seen it on any non-redhat kernels. This
problem is reproducible every time. Therefore, I could try to reproduce it on
other kernel versions if you'd like.

Enjoy,
Etienne

Comment 4 Etienne Clement 2005-09-22 02:42:32 UTC

(In reply to comment #2)
> also, is there any way to load the nvidia driver so that it uses
> change_page_attr? We have the upstream fixes in U2, so it doesn't need to be
> disabled. Our kernel 'version' is 2.6.9, so i'm not sure how the nvidia driver
> deals with that. but if there is a way to 'trick' it, it would be an interesting
> test case.

I'll try to get the information from NVIDIA.

Comment 5 Jason Baron 2005-09-22 13:03:14 UTC

cool. It might be the case that the 'old' driver is making pre-2.6.11
assumptions which are no longer true, and this is leading to the BUG(). Thus,
getting the driver to 'think' it's on 2.6.11+ *might* solve the oops.
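
To illustrate why a version-number check can misjudge a kernel that carries backported fixes, here is a small sketch. kernel_version mirrors the kernel's KERNEL_VERSION(a, b, c) macro from linux/version.h; how the NVIDIA driver actually decides is not stated in this report, so the check itself (looks_at_least) is a hypothetical stand-in.

```python
def kernel_version(major, minor, patch):
    """Same arithmetic as the kernel's KERNEL_VERSION(a, b, c) macro."""
    return (major << 16) + (minor << 8) + patch


def parse_release(release):
    """Extract (major, minor, patch) from a string such as '2.6.9-22.ELsmp'."""
    base = release.split("-", 1)[0]
    major, minor, patch = (int(part) for part in base.split(".")[:3])
    return major, minor, patch


def looks_at_least(release, major, minor, patch):
    """Hypothetical check: does the release number alone reach the given version?"""
    current = kernel_version(*parse_release(release))
    return current >= kernel_version(major, minor, patch)
```

With these definitions, looks_at_least("2.6.9-22.ELsmp", 2, 6, 11) is False even though, per comment 2, the U2 kernel already contains the upstream change_page_attr fixes.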

Comment 6 Jason Baron 2005-09-22 16:12:11 UTC

As another data point, if you don't mind, could you report on the following two
kernels:

1)

this kernel doesn't have the 'fix NX text/large-page interaction' patch (bz #163238):
http://people.redhat.com/~jbaron/tests/2.6.9-11.37.EL/

2)

the U1 kernel:
http://people.redhat.com/~jbaron/tests/2.6.9-11.EL/

I really want to better understand which kernel changes during U2 are causing
this changed behavior. thanks.

Comment 7 Etienne Clement 2005-09-22 17:42:45 UTC

No problem, I'll run the tests right away and send you the results.

For the random reboot problem it seems to have been fixed in 2.6.11-rc3 since I
can reproduce it in 2.6.11-rc2 but not in 2.6.11-rc3. However, I don't know if
it's fixed in 2.6.9-22 since I always hit the oops before.

Etienne

Comment 8 Etienne Clement 2005-09-22 18:45:35 UTC

Here are the results:

- 2.6.9-11.37.EL: No oops. No error message in the syslog.

- 2.6.9-11.EL: No oops. The following error message appears in the syslog:
   NVRM: bad caching on address 0xf5c35000: actual 0x163 != expected 0x173

Etienne

Comment 9 Jason Baron 2005-09-22 19:04:36 UTC

thanks. So 11.37 seems to be the best? Any observable problems with 11.37?

Is the reboot problem present in -11 or -11.37.EL?

Comment 10 Etienne Clement 2005-09-22 19:23:06 UTC

I haven't played much with -11.37 yet. However, I am currently trying to
reproduce the reboot problem in -11 and haven't been able to yet (looks good). I am
going to try to reproduce the reboot problem on -11 and -11.37 and let you know.

-11 is the update 1 kernel and -11.37 is a post-update 1 kernel, right?

Comment 15 Etienne Clement 2005-09-23 11:47:45 UTC

Hi Jason,

-11 and -11.37 do not seem to exhibit the random reboot problem and so far they
both ran just fine. However, I'd be very curious to see if the problem is also
fixed in -22. Therefore, let me know if you need anything else to help you find
the oops problem in -22.

Etienne

Comment 16 Jason Baron 2005-09-23 17:56:05 UTC

thanks for the update. The problem in -22 is related to NX, the non-execute bit,
interacting with some changes we made there. An interesting experiment would be
to try to disable nx (i think the only way to do this is via the BIOS) and see
if -22 is stable.

Also, i believe the reboot problem is still present in -11 and -11.37, but just
hard to hit....

Comment 17 Etienne Clement 2005-09-23 18:03:51 UTC

What makes you think that the reboot problem has not been fixed in -11 and
-11.37? Remember that the reboot problem might not be due to the nvidia driver.

I am not too sure what NX actually is. However, I am going to have a look in the
BIOS to see if there are any settings for that and let you know the result.

Etienne

Comment 18 Jason Baron 2005-09-23 18:55:25 UTC

i think the reboot issue isn't fixed b/c i've seen reports of it in -11 and in -22.

NX is a flag that marks regions of memory as not executable, so that stack overflow
exploits are not viable. In the BIOS on a box i have, it's called 'Execute Disable'.
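
One way to see from a running system whether the CPU advertises NX is to look for the nx flag in /proc/cpuinfo. A minimal sketch follows; the function names are hypothetical, and it is an assumption that disabling the BIOS option hides the flag on a given machine (it typically does, but this report does not confirm it).

```python
def cpu_flags(cpuinfo_text):
    """Collect every entry from the 'flags' lines of /proc/cpuinfo content."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def nx_visible(cpuinfo_text):
    """True if any CPU listed in the text advertises the nx flag."""
    return "nx" in cpu_flags(cpuinfo_text)


# Example on a Linux machine:
#   with open("/proc/cpuinfo") as handle:
#       print(nx_visible(handle.read()))
```

The flag only shows what the CPU reports; whether the kernel actually enforces NX also depends on the kernel build and boot options.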

Comment 23 Brett Morrow 2005-09-28 19:25:25 UTC

I have just started trying to run this on a new system and had the same error
messages for the NVIDIA driver. 

I upgraded my machine to the U2 beta channel and rebuilt the driver.  Now when the
NVIDIA module loads I get the message about needing kernel 2.6.11 and the
machine hard locks (only the Caps Lock and Scroll Lock lights are on).

Comment 24 Linda Wang 2005-09-29 00:43:01 UTC

What is the graphics card you are using on your system?

Comment 25 Jason Baron 2005-09-30 14:21:55 UTC

ok. Can you try this kernel please: http://people.redhat.com/~jbaron/nx/

Comment 26 Chris Williams 2005-09-30 16:36:44 UTC

Jason,
Can I please get the src rpm for that above kernel?

Comment 27 Etienne Clement 2005-09-30 19:02:34 UTC

Hi Jason,

Here are the results:

2.6.9-22:

BIOS: No execute mode mem protection [Disabled] -> OK
BIOS: No execute mode mem protection [Enabled] -> BUG

2.6.9-22.nx

BIOS: No execute mode mem protection [Disabled] -> OK
BIOS: No execute mode mem protection [Enabled] -> OK

So 2.6.9-22.nx seems to fix the problem =). Another piece of good news is that we
haven't seen the random reboot problem since we started running 2.6.9-11.

Etienne

Comment 30 Jason Baron 2005-10-06 19:33:09 UTC

hi Etienne,

thanks for testing the nx fixes. i believe that this kernel does indeed fix both
the intermittent reboot issues as well as the kernel panics. I think i
previously said that the reboot problem might still be present but that is not
the case. thanks again.

-Jason

Comment 33 Raja 2005-10-11 21:21:42 UTC

Jason,
Here at a Red Hat customer site, we are using RHEL4 U2 on kernel-smp-2.6.9-22.EL
on a DELL GX system with an nVidia card and have the exact same problem:

1. From a shell start glxgears: /usr/X11R6/bin/glxgears
2. Kill it using Ctrl+C
  
Actual Results:  You get the OOPS in the description.

Expected Results:  No OOPS.

I disabled NX (Execute Disabled) in the BIOS and it worked fine with no crash.

The customer is ready to roll out RHEL4 U2 on their desktops by October '05, so my
question is: when can we expect a kernel update that has a fix for this issue?
We don't want to be disabling NX in the BIOS.

-Raja Muthu
rmuthu

Comment 37 Red Hat Bugzilla 2005-10-27 15:08:19 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-808.html