Bug 160135

Summary:

kernel panic in ioremap with four 1GB DIMMs (2.6.9-11.ELsmp)

Product:

Red Hat Enterprise Linux 4

Reporter:

Nate Faerber <nfaerber>

Component:

kernel

Assignee:

Jim Paradis <jparadis>

Status:

CLOSED ERRATA

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

4.0

CC:

bilbrey, brett.morrow, jparadis, knweiss, netllama, peterm, ppokorny, wusel+rhbug

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

RHSA-2005-808

Doc Type:

Bug Fix

Last Closed:

2005-10-27 15:06:56 UTC

Attachments:

Description	Flags
ioremap patch adapted from SLES9 SP2	none
linux-2.6.9-x86_64-ioremap	none
linux-2.6.9-x86_64-pfn_valid patch	none

Description Nate Faerber 2005-06-11 03:13:26 UTC

We have seen some motherboards, whose BIOS enables software memory remapping,
kernel panic during boot with a specific memory configuration.  With two 1GB
DIMMs (single banked) on each CPU, the BIOS configures the memory as 2G at 0x00
(CPU0) and 2G at 0x100000000 (4G boundary, CPU1).

Booting the RHEL v4 Update 1 kernel (2.6.9-11.ELsmp) causes the system to kernel
panic as below.  Adding 'mem=2G' to the kernel command line, or booting the UP
(uniprocessor) kernel, works around the problem.

The current SLES9 SP2 kernel has a patch that fixes this.  I have adapted that
patch to work on this RHEL kernel.  I have attached it below.

Unable to handle kernel paging request at 00000000000018f0 RIP:
<ffffffff80122f32>{ioremap_nocache+196}
PML4 17f048067 PGD 17f45f067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: tg3 ext3 jbd sata_nv libata sd_mod scsi_mod
Pid: 1179, comm: modprobe Not tainted 2.6.9-11.ELsmp
RIP: 0010:[<ffffffff80122f32>] <ffffffff80122f32>{ioremap_nocache+196}
RSP: 0018:0000010037d0fcb8  EFLAGS: 00010213
RAX: 00000100fe8f0000 RBX: 00000000fe8f0000 RCX: 0000000000000019
RDX: ffffffff7fffffff RSI: 0000010180000000 RDI: 0000000000000000
RBP: 0000000000010000 R08: 0000000000000008 R09: dead4ead00000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffffff0000020000
R13: 000001017fbbc000 R14: 0000000000010000 R15: 0000010102964840
FS:  0000002a9556db00(0000) GS:ffffffff804c1780(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000018f0 CR3: 000000007ff82000 CR4: 00000000000006e0
Process modprobe (pid: 1179, threadinfo 0000010037d0e000, task 00000100081cb030)
Stack: 000001017fbbc380 00000000fe8f0000 dead4ead00000001 ffffffffa007b924
       0000000000000004 0000010008026e40 000001007fe58930 ffffff000002f0a8
       0000010008026e40 0000000000000202
Call Trace:<ffffffffa007b924>{:tg3:tg3_init_one+687} <ffffffff801899f8>{new_ino
       <ffffffff801aac65>{sysfs_new_dirent+26} <ffffffff80187a9e>{dput+55}
       <ffffffff801aaeaa>{create_dir+305} <ffffffff801e5894>{pci_device_probe+1
       <ffffffff802388b1>{bus_match+57} <ffffffff802389af>{driver_attach+68}
       <ffffffff80238ccb>{bus_add_driver+143} <ffffffff801e5604>{pci_register_d
       <ffffffffa008900e>{:tg3:tg3_init+14} <ffffffff8014d52c>{sys_init_module+
       <ffffffff8011003e>{system_call+126}

Code: 48 8b 8f f0 18 00 00 76 10 48 b8 00 00 00 80 00 01 00 00 48
RIP <ffffffff80122f32>{ioremap_nocache+196} RSP <0000010037d0fcb8>
CR2: 00000000000018f0
 <0>Kernel panic - not syncing: Oops

Comment 1 Nate Faerber 2005-06-11 03:13:27 UTC

Created attachment 115323 [details]
ioremap patch adapted from SLES9 SP2

Comment 5 Jason Baron 2005-06-17 17:14:09 UTC

This patch has been added to a testing kernel. Any feedback is much appreciated:

http://people.redhat.com/~jbaron/rhel4/

Comment 8 Nate Faerber 2005-06-29 22:39:10 UTC

Jason,

The 2.6.9-11.20.ELsmp kernel that was in your rhel4 directory up until yesterday
still exhibited the problem.  Do you think the 11.21 version will work better?

Comment 9 Jason Baron 2005-06-29 23:46:21 UTC

Thanks for the feedback. No, 11.21 wouldn't make a difference. The patch from
comment #1 is basically in 11.20, so I'm a bit surprised this is still failing.
We'll have to dig deeper...

Comment 10 Nate Faerber 2005-06-30 15:57:03 UTC

I looked at the Source RPM and the patch you included
(linux-2.6.9-ioremap-fixes.patch) is not identical to the patch in comment #1. 
Was your patch supposed to fix other things as well as this problem?  It will
take me some time to figure out which part of the good patch is missing from yours.

I can assure you that applying the patch in comment #1 to a 2.6.9-11.EL kernel
will fix my problem.

Comment 11 Karsten Weiss 2005-07-15 10:33:49 UTC

We are getting the same kernel panic on a dual Opteron HP xw9300 with
4 GB on Red Hat Linux 4 Update 1 WS with kernel 2.6.9-11.ELsmp x86-64
when we boot with the kernel argument "acpi=off". (We get a different
kernel panic without acpi=off - see my other bug):

 audit(1121422223.323:0): initialized
Red Hat nash version 4.2.1.3 starting
File descriptor 3 left open
  Reading all physical volumes.  This may take a while...
  Found volume group "vg01" using metadata type lvm2
  Found volume group "vg00" using metadata type lvm2
File descriptor 3 left open
  2 logical volume(s) in volume group "vg00" now active
INIT: version 2.85 booting
                Welcome to Red Hat Enterprise Linux WS
                Press 'I' to enter interactive startup.
Starting udev:  [  OK  ]
Initializing hardware...  storage network
Unable to handle kernel paging request at 00000000000018f0 RIP:
<ffffffff80122f32>{ioremap_nocache+196}
PML4 17ed1f067 PGD 17f29f067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidd
Pid: 1069, comm: modprobe Not tainted 2.6.9-11.ELsmp
RIP: 0010:[<ffffffff80122f32>] <ffffffff80122f32>{ioremap_nocache+196}
RSP: 0018:0000010037cebd58  EFLAGS: 00010213
RAX: 00000100f2101000 RBX: 00000000f2101000 RCX: 0000000000000019
RDX: ffffffff7fffffff RSI: 0000010180000000 RDI: 0000000000000000
RBP: 0000000000001000 R08: 0000000000000008 R09: 0000000000000246
R10: 0000000000000000 R11: 0000000000000000 R12: ffffff0000018000
R13: dead4ead00000001 R14: dead4ead00000001 R15: 000001017fc14dc0
FS:  0000002a9557db00(0000) GS:ffffffff804c1780(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000018f0 CR3: 0000000008032000 CR4: 00000000000006e0
Process modprobe (pid: 1069, threadinfo 0000010037cea000, task 0000010037d467f0)
Stack: 0000000000000000 000001017eb81780 0000000000000004 ffffffffa0141a57
       000001017fc64cc0 ffffffff801c6041 000001017eb186e0 0000000000000246
       000001007ffd6400 000001017ffe9800
Call Trace:<ffffffffa0141a57>{:snd_intel8x0:snd_intel8x0_probe+527}
       <ffffffff801c6041>{selinux_inode_alloc_security+72}
       <ffffffff801aac65>{sysfs_new_dirent+26} <ffffffff80187a9e>{dput+55}
       <ffffffff801aaeaa>{create_dir+305} <ffffffff801e5894>{pci_device_probe+110}
       <ffffffff802388b1>{bus_match+57} <ffffffff802389af>{driver_attach+68}
       <ffffffff80238ccb>{bus_add_driver+143}
<ffffffff801e5604>{pci_register_driver+119}
       <ffffffffa014b00e>{:snd_intel8x0:alsa_card_intel8x0_init+14}
       <ffffffff8014d52c>{sys_init_module+316} <ffffffff8011003e>{system_call+126}


Code: 48 8b 8f f0 18 00 00 76 10 48 b8 00 00 00 80 00 01 00 00 48
RIP <ffffffff80122f32>{ioremap_nocache+196} RSP <0000010037cebd58>
CR2: 00000000000018f0
 <0>Kernel panic - not syncing: Oops

Comment 12 Karsten Weiss 2005-07-15 10:39:54 UTC

Here's the panic we get without acpi=off (but still smp kernel):

Unable to handle kernel paging request at 00000000000018f0 RIP:
<ffffffff80122f32>{ioremap_nocache+196}
PML4 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-11.ELsmp
RIP: 0010:[<ffffffff80122f32>] <ffffffff80122f32>{ioremap_nocache+196}
RSP: 0000:000001017ffb1f08  EFLAGS: 00010213
RAX: 00000100e0000000 RBX: 00000000e0000000 RCX: 0000000000000019
RDX: ffffffff7fffffff RSI: 0000010180000000 RDI: 0000000000000000
RBP: 0000000010000000 R08: 0000000000000008 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffff0000080000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff804c1700(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000018f0 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 000001017ffb0000, task 0000010037e4a7f0)
Stack: ffffffff804ee4c8 0000000000000000 0000000000000000 ffffffff804e4bea
       0000000000000246 ffffffff8010c3eb 0000000000000246 0000000000000000
       0000000000000000 ffffffff80110c8f
Call Trace:<ffffffff804e4bea>{pci_mmcfg_init+32} <ffffffff8010c3eb>{init+474}
       <ffffffff80110c8f>{child_rip+8} <ffffffff8010c211>{init+0}
       <ffffffff80110c87>{child_rip+0}

Code: 48 8b 8f f0 18 00 00 76 10 48 b8 00 00 00 80 00 01 00 00 48
RIP <ffffffff80122f32>{ioremap_nocache+196} RSP <000001017ffb1f08>
CR2: 00000000000018f0
 <0>Kernel panic - not syncing: Oops

Comment 13 Jason Baron 2005-07-15 21:38:33 UTC

Hmmm, I've dug into this one a bit. The crash here is happening in
'pfn_to_page', which is called by virt_to_page. It seems that NODE_DATA(nid) is
NULL, so the node_start_pfn lookup dereferences a NULL pointer. The reason, I
suspect, the patch in comment #1 works is that it no longer uses virt_to_page.
However, I think the underlying problem is still present. This seems to be an
issue with how NUMA is configured in RHEL4 x86_64. Consistent with this, if I
boot with 'numa=off', RHEL4 U1 seems to work fine. Thus, I would suggest that as
a temporary workaround until we can get to the bottom of this. Thanks.
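
The NULL NODE_DATA reading can be cross-checked against the "Code:" line of the oops. The following is a minimal sketch in Python, not kernel code: it decodes only the single instruction form seen at the faulting RIP (an optional REX prefix, opcode 8B, and a ModRM byte selecting base register plus 32-bit displacement). The helper name decode_mov_load is made up for this illustration, and reading 0x18f0 as the offset of node_start_pfn inside the node data structure is an inference from the analysis above, not something the oops states.

```python
# Register numbering used by x86-64 ModRM encoding (with REX extension bits).
REG_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_mov_load(code_bytes):
    """Decode 'mov disp32(base), dest' and return (dest, base, displacement).

    Hypothetical helper: handles only [REX] 8B /r with mod=10 and no SIB byte.
    """
    data = list(code_bytes)
    rex = 0
    if 0x40 <= data[0] <= 0x4F:          # optional REX prefix
        rex = data.pop(0)
    if data[0] != 0x8B:
        raise ValueError("not a MOV r64, r/m64 instruction")
    modrm = data[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("only the base + disp32 form without SIB is handled")
    reg |= ((rex >> 2) & 1) << 3         # REX.R extends the destination register
    rm |= (rex & 1) << 3                 # REX.B extends the base register
    disp = int.from_bytes(bytes(data[2:6]), "little", signed=True)
    return REG_NAMES[reg], REG_NAMES[rm], disp


# First bytes of the "Code:" line from the ioremap_nocache oops.
dest, base, disp = decode_mov_load(bytes.fromhex("48 8b 8f f0 18 00 00"))

# RDI is 0000000000000000 in the register dump of that oops.
registers = {"rdi": 0x0}
fault_address = registers[base] + disp
print(f"mov {hex(disp)}(%{base}), %{dest} -> fault at {fault_address:#018x}")
```

With a zero base register, the computed address is 0x18f0, which matches the CR2 value in every ioremap_nocache oops in this report.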

Comment 14 Karsten Weiss 2005-07-18 08:38:06 UTC

With 

kernel /vmlinuz-2.6.9-11.ELsmp ro root=/dev/vg00/root rhgb quiet numa=off

the kernel still panics. Unfortunately, I can't show you the oops because the
machine is at a remote location.

I will try acpi=off and numa=off next.

Comment 15 Karsten Weiss 2005-07-18 15:24:31 UTC

Okay, I did another test with acpi=off *and* numa=off. Now the smp kernel
finally boots. However, I get the following kernel message now:

Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 128 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 129 has empty cpu mask
                                               .
                                               .
                                               .
Jul 18 17:00:05 esw0001 kernel: k8-bus.c: bus 254 has empty cpu mask
Jul 18 17:00:05 esw0001 kernel: k8-bus.c: bus 255 has empty cpu mask

Here's the boot log:

Jul 18 17:00:01 esw0001 syslogd 1.4.1: restart.
Jul 18 17:00:01 esw0001 syslog: syslogd startup succeeded
Jul 18 17:00:02 esw0001 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jul 18 17:00:02 esw0001 kernel: Bootdata ok (command line is ro
root=/dev/vg00/root rhgb quiet acpi=off numa=off)
Jul 18 17:00:02 esw0001 kernel: Linux version 2.6.9-11.ELsmp
(bhcompile.redhat.com) (gcc version 3.4.3 20050227 (Red Hat 3.4.3-22)) #1 SMP
Fri May 20 18:25:30 EDT 2005
Jul 18 17:00:02 esw0001 kernel: BIOS-provided physical RAM map:
Jul 18 17:00:02 esw0001 kernel:  BIOS-e820: 0000000000000000 - 000000000009d400
(usable)
Jul 18 17:00:02 esw0001 kernel:  BIOS-e820: 000000000009d400 - 00000000000a0000
(reserved)
Jul 18 17:00:02 esw0001 kernel:  BIOS-e820: 00000000000e8000 - 0000000000100000
(reserved)
Jul 18 17:00:02 esw0001 kernel:  BIOS-e820: 0000000000100000 - 000000007fff9500
(usable)
Jul 18 17:00:02 esw0001 kernel:  BIOS-e820: 000000007fff9500 - 0000000080000000
(reserved)
Jul 18 17:00:02 esw0001 kernel:  BIOS-e820: 00000000fec00000 - 0000000100000000
(reserved)
Jul 18 17:00:02 esw0001 kernel:  BIOS-e820: 0000000100000000 - 0000000180000000
(usable)
Jul 18 17:00:02 esw0001 syslog: klogd startup succeeded
Jul 18 17:00:02 esw0001 irqbalance: irqbalance startup succeeded
Jul 18 17:00:02 esw0001 kernel: Warning: acpi_table_parse(ACPI_SLIT) returned 0!
Jul 18 17:00:02 esw0001 kernel: NUMA turned off
Jul 18 17:00:02 esw0001 kernel: Faking a node at 0000000000000000-0000000180000000
Jul 18 17:00:02 esw0001 kernel: Bootmem setup node 0
0000000000000000-0000000180000000
Jul 18 17:00:02 esw0001 kernel: No mptable found.
Jul 18 17:00:02 esw0001 kernel: Nvidia board detected. Ignoring ACPI timer override.
Jul 18 17:00:02 esw0001 kernel: Intel MultiProcessor Specification v1.4
Jul 18 17:00:02 esw0001 kernel:     Virtual Wire compatibility mode.
Jul 18 17:00:02 esw0001 kernel: OEM ID: HP       <6>Product ID: workstation 
<6>APIC at: 0xFEE00000
Jul 18 17:00:02 esw0001 kernel: Processor #0 15:5 APIC version 16
Jul 18 17:00:02 esw0001 kernel: Processor #1 15:5 APIC version 16
Jul 18 17:00:02 esw0001 kernel: I/O APIC #8 Version 17 at 0xFEC00000.
Jul 18 17:00:02 esw0001 kernel: I/O APIC #9 Version 17 at 0xF2600000.
Jul 18 17:00:02 esw0001 kernel: I/O APIC #10 Version 17 at 0xF2601000.
Jul 18 17:00:02 esw0001 kernel: I/O APIC #11 Version 17 at 0xF2700000.
Jul 18 17:00:02 esw0001 kernel: Processors: 2
Jul 18 17:00:02 esw0001 kernel: Checking aperture...
Jul 18 17:00:02 esw0001 kernel: CPU 0: aperture @ 8000000 size 32 MB
Jul 18 17:00:02 esw0001 kernel: Aperture from northbridge cpu 0 too small (32 MB)
Jul 18 17:00:02 esw0001 kernel: No AGP bridge found
Jul 18 17:00:02 esw0001 portmap: portmap startup succeeded
Jul 18 17:00:02 esw0001 kernel: Your BIOS doesn't leave a aperture memory hole
Jul 18 17:00:02 esw0001 kernel: Please enable the IOMMU option in the BIOS setup
Jul 18 17:00:02 esw0001 kernel: This costs you 64 MB of RAM
Jul 18 17:00:02 esw0001 kernel: Mapping aperture over 65536 KB of RAM @ 8000000
Jul 18 17:00:02 esw0001 kernel: Built 1 zonelists
Jul 18 17:00:02 esw0001 kernel: Kernel command line: ro root=/dev/vg00/root rhgb
quiet acpi=off numa=off console=tty0
Jul 18 17:00:02 esw0001 kernel: Initializing CPU#0
Jul 18 17:00:02 esw0001 kernel: PID hash table entries: 4096 (order: 12, 131072
bytes)
Jul 18 17:00:02 esw0001 kernel: time.c: Using 1.193182 MHz PIT timer.
Jul 18 17:00:02 esw0001 kernel: time.c: Detected 2593.109 MHz processor.
Jul 18 17:00:02 esw0001 rpc.statd[2261]: Version 1.0.6 Starting
Jul 18 17:00:02 esw0001 kernel: Console: colour VGA+ 80x25
Jul 18 17:00:02 esw0001 kernel: Dentry cache hash table entries: 1048576 (order:
11, 8388608 bytes)
Jul 18 17:00:02 esw0001 kernel: Inode-cache hash table entries: 524288 (order:
10, 4194304 bytes)
Jul 18 17:00:02 esw0001 kernel: Memory: 4023244k/6291456k available (2033k
kernel code, 0k reserved, 1252k data, 188k init)
Jul 18 17:00:02 esw0001 kernel: Security Scaffold v1.0.0 initialized
Jul 18 17:00:02 esw0001 kernel: SELinux:  Initializing.
Jul 18 17:00:02 esw0001 rpc.statd[2261]: gethostbyname error for esw0001
Jul 18 17:00:02 esw0001 kernel: SELinux:  Starting in permissive mode
Jul 18 17:00:02 esw0001 kernel: There is already a security framework
initialized, register_security failed.
Jul 18 17:00:02 esw0001 nfslock: rpc.statd startup succeeded
Jul 18 17:00:02 esw0001 kernel: selinux_register_security:  Registering
secondary module capability
Jul 18 17:00:02 esw0001 kernel: Capability LSM initialized as secondary
Jul 18 17:00:02 esw0001 kernel: Mount-cache hash table entries: 256 (order: 0,
4096 bytes)
Jul 18 17:00:02 esw0001 kernel: CPU: L1 I Cache: 64K (64 bytes/line), D cache
64K (64 bytes/line)
Jul 18 17:00:02 esw0001 kernel: CPU: L2 Cache: 1024K (64 bytes/line)
Jul 18 17:00:02 esw0001 kernel: Using local APIC NMI watchdog using perfctr0
Jul 18 17:00:02 esw0001 kernel: CPU: L1 I Cache: 64K (64 bytes/line), D cache
64K (64 bytes/line)
Jul 18 17:00:02 esw0001 kernel: CPU: L2 Cache: 1024K (64 bytes/line)
Jul 18 17:00:02 esw0001 kernel: CPU0: AMD Opteron(tm) Processor 252 stepping 01
Jul 18 17:00:02 esw0001 kernel: per-CPU timeslice cutoff: 1023.90 usecs.
Jul 18 17:00:02 esw0001 kernel: task migration cache decay timeout: 2 msecs.
Jul 18 17:00:02 esw0001 kernel: Booting processor 1/1 rip 6000 rsp 10006455f58
Jul 18 17:00:02 esw0001 kernel: Initializing CPU#1
Jul 18 17:00:02 esw0001 kernel: CPU: L1 I Cache: 64K (64 bytes/line), D cache
64K (64 bytes/line)
Jul 18 17:00:02 esw0001 kernel: CPU: L2 Cache: 1024K (64 bytes/line)
Jul 18 17:00:02 esw0001 kernel: AMD Opteron(tm) Processor 252 stepping 01
Jul 18 17:00:02 esw0001 kernel: Total of 2 processors activated (10305.53 BogoMIPS).
Jul 18 17:00:02 esw0001 kernel: Using IO-APIC 8
Jul 18 17:00:02 esw0001 kernel: Using IO-APIC 9
Jul 18 17:00:02 esw0001 kernel: Using IO-APIC 10
Jul 18 17:00:02 esw0001 kernel: Using IO-APIC 11
Jul 18 17:00:02 esw0001 kernel: Using local APIC timer interrupts.
Jul 18 17:00:02 esw0001 kernel: Detected 12.466 MHz APIC timer.
Jul 18 17:00:02 esw0001 kernel: checking TSC synchronization across 2 CPUs: passed.
Jul 18 17:00:02 esw0001 kernel: time.c: Using PIT/TSC based timekeeping.
Jul 18 17:00:02 esw0001 kernel: Brought up 2 CPUs
Jul 18 17:00:02 esw0001 kernel: checking if image is initramfs... it is
Jul 18 17:00:02 esw0001 kernel: NET: Registered protocol family 16
Jul 18 17:00:02 esw0001 kernel: PCI: Using configuration type 1
Jul 18 17:00:02 esw0001 kernel: mtrr: v2.0 (20020519)
Jul 18 17:00:02 esw0001 kernel: ACPI: Subsystem revision 20040816
Jul 18 17:00:02 esw0001 kernel: ACPI: Interpreter disabled.
Jul 18 17:00:02 esw0001 kernel: usbcore: registered new driver usbfs
Jul 18 17:00:02 esw0001 kernel: usbcore: registered new driver hub
Jul 18 17:00:02 esw0001 kernel: PCI: Probing PCI hardware
Jul 18 17:00:02 esw0001 kernel: PCI: Probing PCI hardware (bus 00)
Jul 18 17:00:02 esw0001 rpcidmapd: rpc.idmapd startup succeeded
Jul 18 17:00:02 esw0001 kernel: PCI: Transparent bridge - 0000:00:09.0
Jul 18 17:00:02 esw0001 kernel: PCI: Discovered primary peer bus 41 [IRQ]
Jul 18 17:00:02 esw0001 kernel: PCI: Discovered primary peer bus 61 [IRQ]
Jul 18 17:00:02 esw0001 kernel: PCI: Discovered primary peer bus 81 [IRQ]
Jul 18 17:00:02 esw0001 kernel: PCI: Using IRQ router default [10de/0051] at
0000:00:01.0
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B0,I1,P0) -> 11
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B0,I2,P0) -> 5
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B0,I2,P1) -> 10
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B0,I4,P0) -> 11
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B0,I7,P0) -> 5
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B0,I8,P0) -> 10
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B0,I10,P0) -> 11
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B5,I5,P0) -> 10
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B10,I0,P0) -> 5
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B97,I6,P0) -> 5
Jul 18 17:00:02 esw0001 kernel: PCI->APIC IRQ transform: (B97,I6,P1) -> 10
Jul 18 17:00:02 esw0001 kernel: PCI-DMA: Disabling AGP.
Jul 18 17:00:03 esw0001 kernel: PCI-DMA: aperture base @ 8000000 size 65536 KB
Jul 18 17:00:03 esw0001 kernel: PCI-DMA: Reserving 64MB of IOMMU area in the AGP
aperture
Jul 18 17:00:03 esw0001 netfs: Mounting other filesystems:  succeeded
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 128 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 129 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 130 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 131 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 132 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 133 has empty cpu mask
Jul 18 17:00:03 esw0001 rc: Starting lm_sensors:  succeeded
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 134 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 135 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 136 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 137 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 138 has empty cpu mask
Jul 18 17:00:03 esw0001 kernel: k8-bus.c: bus 139 has empty cpu mask
etc.
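
The e820 map in this boot log helps explain why the earlier ioremap calls on this machine ran into trouble. The sketch below is a rough illustration in Python and rests on two assumptions: that RBX in the earlier oopses holds the physical address being remapped (RAX equals RBX plus 0x10000000000 in each dump, which looks like a direct-mapping offset), and that the kernel only inspects page structures for addresses below the top of RAM. The names E820_USABLE, is_usable_ram and top_of_ram are invented for this example.

```python
# Usable ranges copied from the "BIOS-e820" lines above: (start, end), end exclusive.
E820_USABLE = [
    (0x0000000000000000, 0x000000000009d400),
    (0x0000000000100000, 0x000000007fff9500),
    (0x0000000100000000, 0x0000000180000000),
]

# RBX values from the two ioremap_nocache oopses reported on this machine
# (assumed here to be the physical addresses passed to ioremap).
IOREMAP_TARGETS = [0xf2101000, 0xe0000000]


def is_usable_ram(phys_addr, usable=E820_USABLE):
    """True if the address lies inside one of the usable e820 ranges."""
    return any(start <= phys_addr < end for start, end in usable)


def top_of_ram(usable=E820_USABLE):
    """Highest end address of any usable range."""
    return max(end for _, end in usable)


for target in IOREMAP_TARGETS:
    below_top = target < top_of_ram()
    in_ram = is_usable_ram(target)
    print(f"{target:#012x}: below top of RAM={below_top}, usable RAM={in_ram}")
```

Both addresses sit in the gap between the 2G and 4G marks: below the top of RAM, yet not backed by memory on either node. That is the situation in which a page lookup would find no node data, if the assumptions above hold.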

Comment 22 Jason Baron 2005-08-12 15:40:37 UTC

Could you please test the u2 beta? You should be able to boot with both acpi=off
and numa=off *NOT* set. http://people.redhat.com/~jbaron/rhel4/

thanks.

Comment 24 Karsten Weiss 2005-08-15 11:07:09 UTC

(In reply to comment #22)
> could you please test the u2 beta. you should be able to boot with both acpi=off
> and numa=off *NOT* set. http://people.redhat.com/~jbaron/rhel4/

Hi Jason!

Good news: I've tried your u2 beta kernel (x86_64). This kernel now boots fine
without the acpi=off and numa=off kernel parameters.

Bad news: It crashed as soon as I loaded the rebuilt nvidia kernel module
(NVIDIA-Linux-x86_64-1.0-7667-pkg2.run). Unfortunately, I can't give you more
details right now because the machine is at a remote location. (BTW: This is
a 3d graphics workstation and the NVIDIA driver is mandatory.)

Comment 25 Jim Paradis 2005-08-15 15:21:51 UTC

This may be a separate issue, then.  Please attach a console capture of the
crash at your earliest convenience; we may recommend filing a separate issue.

Comment 26 Karsten Weiss 2005-08-22 10:44:28 UTC

I'll try to capture a new oops as soon as possible.

Regarding NVIDIA: Even with 2.6.9-11.ELsmp I get the following warnings:

NVRM: loading NVIDIA Linux x86_64 NVIDIA Kernel Module  1.0-7676  Fri Jul 29
13:15:16 PDT 2005
NVRM: WARNING: Your Linux kernel has problems in its implementation of
NVRM: the change_page_attr kernel interface.  The NVIDIA kernel
NVRM: module will attempt to work around these problems, but
NVRM: system stability may be affected.  It is recommended that
NVRM: you update to a 2.6.11 or newer kernel.

NVRM: bad caching on address 0x1014fe2f000: actual 0x163 != expected 0x173
NVRM: please see the README section on Cache Aliasing for more information
NVRM: bad caching on address 0x10150c59000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x1016deec000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x1016deed000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x101648c4000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x101648c5000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x10173d76000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x10173d77000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x10169de8000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x10169de9000: actual 0x163 != expected 0x173

The NVIDIA README says

Cache Aliasing

    Cache aliasing occurs when multiple mappings to a physical page of
    memory have conflicting caching states, such as cached and uncached.
    Due to these conflicting states, data in that physical page may become
    corrupted when the processor's cache is flushed. If that page is being
    used for dma by a driver such as NVIDIA's graphics driver, this can
    lead to hardware stability problems and system lockups.

    NVIDIA has encountered bugs with some Linux kernel versions that lead
    to cache aliasing. Although some systems will run perfectly fine when
    cache aliasing occurs, other systems will experience severe stability
    problems, including random lockups. Users experiencing stability
    problems due to cache aliasing will benefit from updating to a kernel
    that does not cause cache aliasing to occur.

    NVIDIA has added driver logic to detect cache aliasing and to print a
    warning with a message similar to the following: NVRM: bad caching on
    address 0x1cdf000: actual 0x46 != expected 0x73 If you see this
    message in your log files and are experiencing stability problems, you
    should update your kernel to the latest version.

    If the message persists after updating your kernel, please send a bug
    report to NVIDIA.

Is there any chance that we'll get a kernel for RHEL4 without those Cache
Aliasing issues?
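
To make the "actual 0x163 != expected 0x173" warnings above easier to read, here is a small Python sketch that decodes the two values bit by bit. It assumes the numbers are the low flag bits of an x86 page-table entry, which the NVRM output does not state explicitly; the bit names follow the usual x86 paging layout, and the helper names are made up for this illustration.

```python
# Low flag bits of an x86 page-table entry (assumed meaning of the NVRM values).
PTE_FLAGS = {
    0x001: "PRESENT",
    0x002: "RW",
    0x004: "USER",
    0x008: "PWT",          # page-level write-through
    0x010: "PCD",          # page-level cache disable
    0x020: "ACCESSED",
    0x040: "DIRTY",
    0x080: "PSE_OR_PAT",   # meaning depends on the paging level
    0x100: "GLOBAL",
}


def decode_flags(value):
    """List the names of all flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(PTE_FLAGS.items()) if value & bit]


def differing_flags(actual, expected):
    """List the flag bits that differ between two attribute values."""
    return decode_flags(actual ^ expected)


actual, expected = 0x163, 0x173
print("actual:  ", decode_flags(actual))
print("expected:", decode_flags(expected))
print("differs: ", differing_flags(actual, expected))
```

Under that assumption, the only difference is the cache-disable bit: the driver expected an uncached mapping and found a cached one, which fits the README's description of cache aliasing.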

Comment 27 Karsten Weiss 2005-08-22 13:03:20 UTC

Okay, here are some new kernel oopses:

=======================================================================

This is 2.6.9-15.ELsmp booted with the following grub settings:

title Red Hat Enterprise Linux WS (2.6.9-15.ELsmp) Update2 Beta Kernel
        root (hd0,0)
        kernel /vmlinuz-2.6.9-15.ELsmp ro root=/dev/vg00/root rhgb quiet
console=tty0 console=ttyS0,38400n8
        initrd /initrd-2.6.9-15.ELsmp.img

This one boots fine. But when I recompile the NVIDIA-1.0-7676 kernel
module and load it into the kernel I get the following oops:

Unable to handle kernel paging request at 00000000000018f0 RIP:
<ffffffff80123d00>{iounmap+304}
PML4 1791fe067 PGD 177312067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: nvidia(U) nfs nfsd exportfs lockd md5 ipv6 i2c_dev i2c_core
sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd ehci_hcd sndd
Pid: 4839, comm: X Tainted: P      2.6.9-15.ELsmp
RIP: 0010:[<ffffffff80123d00>] <ffffffff80123d00>{iounmap+304}
RSP: 0018:0000010177673ab8  EFLAGS: 00010213
RAX: 00000100e0000000 RBX: 000001017fd9df00 RCX: 0000000000000019
RDX: ffffffff7fffffff RSI: 0000000000002000 RDI: 00000000e0000000
RBP: ffffff0000030000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000002a9557f3e0(0000) GS:ffffffff804d3300(0000) knlGS:00000000f7fd66c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000018f0 CR3: 0000000002c22000 CR4: 00000000000006e0
Process X (pid: 4839, threadinfo 0000010177672000, task 000001007f77c030)
Stack: ffffffff803cbda8 ffffff0000030000 000001007f124800 ffffffffa04c56da
       0000000000000080 ffffffffa02c60e0 000001006ec91500 00000000005e10de
       0000000000000000 ffffffffa02c2d18
Call Trace:<ffffffffa04c56da>{:nvidia:os_unmap_kernel_space+9}
       <ffffffffa02c60e0>{:nvidia:_nv002012rm+42}
<ffffffffa02c2d18>{:nvidia:_nv002316rm+208}
       <ffffffffa02c219b>{:nvidia:_nv002327rm+255}
<ffffffffa02c2370>{:nvidia:_nv002284rm+100}
       <ffffffffa02ba06b>{:nvidia:_nv002166rm+39}
<ffffffffa02c2458>{:nvidia:_nv002328rm+64}
       <ffffffffa02c8df7>{:nvidia:_nv003667rm+141}
<ffffffffa02c8d3b>{:nvidia:_nv003623rm+275}
       <ffffffffa043c778>{:nvidia:_nv003247rm+126}
<ffffffffa03ef948>{:nvidia:_nv004556rm+68}
       <ffffffffa03ef726>{:nvidia:_nv004385rm+104}
<ffffffffa02c8b04>{:nvidia:_nv001453rm+96}
       <ffffffffa03a0338>{:nvidia:_nv000393rm+20}
<ffffffffa03a04b3>{:nvidia:_nv000397rm+125}
       <ffffffffa02cb951>{:nvidia:_nv001426rm+141}
<ffffffffa02c9542>{:nvidia:_nv001458rm+668}
       <ffffffffa02cc8f4>{:nvidia:rm_init_adapter+104}
<ffffffffa04bf66f>{:nvidia:nv_kern_open+684}
       <ffffffff8017eb3c>{chrdev_open+412} <ffffffff801760a0>{dentry_open+223}
       <ffffffff801761db>{filp_open+62} <ffffffff801e9c85>{strncpy_from_user+74}
       <ffffffff801762cd>{get_unused_fd+230} <ffffffff801763bc>{sys_open+57}
       <ffffffff80110052>{system_call+126}

Code: 49 8b 88 f0 18 00 00 76 1b 48 b8 00 00 00 80 00 01 00 00 48
RIP <ffffffff80123d00>{iounmap+304} RSP <0000010177673ab8>
CR2: 00000000000018f0
 <0>Kernel panic - not syncing: Oops

==========================================================

This is 2.6.9-16.ELsmp booted with the following grub settings:

title Red Hat Enterprise Linux WS (2.6.9-16.ELsmp) Update2 Beta Kernel
        root (hd0,0)
        kernel /vmlinuz-2.6.9-16.ELsmp ro root=/dev/vg00/root rhgb quiet
console=tty0 console=ttyS0,38400n8
        initrd /initrd-2.6.9-16.ELsmp.img

(The line "Pid: 1, comm: swapper Not tainted 2.6.9-11.EL" really confuses me!)

Unable to handle kernel NULL pointer dereference at 0000000000000002 RIP:
<ffffffff80241567>{acpi_pci_root_add+296}
PML4 7fdcc067 PGD 0
Oops: 0000 [1]
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-11.EL
RIP: 0010:[<ffffffff80241567>] <ffffffff80241567>{acpi_pci_root_add+296}
RSP: 0018:000001007ff83e08  EFLAGS: 00010206
RAX: 0000000000ff0002 RBX: 000001007fdb72c0 RCX: ffffffff804d478c
RDX: 0000000000000002 RSI: 0000000000000001 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 000001007ffde828
R10: 0000000000000000 R11: 0000000000000000 R12: 00000100065a6c00
R13: ffffffff80432320 R14: 0000000000000000 R15: 0000010037ff0f00
FS:  0000000000000000(0000) GS:ffffffff8051e980(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000002 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 000001007ff82000, task 000001007ff81110)
Stack: 000000000000001a 0000000000000000 0000000000000000 00000000000000ff
       ffffffff80431d00 00000100065a6c00 ffffffff80431d00 ffffffff80245f64
       ffffffff80431d00 00000100065a6c00
Call Trace:<ffffffff80245f64>{acpi_bus_driver_init+49}
<ffffffff802471be>{acpi_bus_add+2715}
       <ffffffff80226639>{acpi_os_wait_semaphore+133}
<ffffffff8023c7e4>{acpi_ut_acquire_mutex+114}
       <ffffffff80539506>{acpi_scan_init+450} <ffffffff8010c3d9>{init+336}
       <ffffffff80111373>{child_rip+8} <ffffffff8010c289>{init+0}
       <ffffffff8011136b>{child_rip+0}

Code: 48 8b 02 0f 18 08 48 81 fa 10 1e 43 80 74 68 8b 43 18 39 42
RIP <ffffffff80241567>{acpi_pci_root_add+296} RSP <000001007ff83e08>
CR2: 0000000000000002
 <0>Kernel panic - not syncing: Oops

As you can see it crashes right at the beginning of the boot process.

Comment 28 Jim Paradis 2005-08-22 20:51:10 UTC

(The line "Pid: 1, comm: swapper Not tainted 2.6.9-11.EL" really confuses me!)

It means that you're running the wrong kernel.  On panic, the kernel prints the
version string that was compiled in; if you see the wrong one then it means
you're running the wrong kernel (you may have accidentally overwritten the 16.EL
file or somesuch).
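
The version check described above can be scripted when sifting through captured consoles. This is a minimal Python sketch: the regular expression is written against the "Pid:" lines quoted in this report, and the function name running_kernel and the expected version string are assumptions made for the example.

```python
import re

# Matches e.g. "Pid: 1, comm: swapper Not tainted 2.6.9-11.EL"
# and "Pid: 4839, comm: X Tainted: P      2.6.9-15.ELsmp".
PID_LINE = re.compile(
    r"^Pid: \d+, comm: \S+ (?:Not tainted|Tainted: \S+)\s+(\S+)$"
)


def running_kernel(oops_text):
    """Return the compiled-in version string from an oops, or None."""
    for line in oops_text.splitlines():
        match = PID_LINE.match(line.strip())
        if match:
            return match.group(1)
    return None


oops = """Oops: 0000 [1]
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.9-11.EL
"""

expected = "2.6.9-16.ELsmp"   # the kernel the grub entry claimed to boot
actual = running_kernel(oops)
if actual != expected:
    print(f"booted {actual}, expected {expected}: wrong kernel image?")
```

A mismatch like the one here points to the boot entry loading a different image than its title suggests.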

In the meantime, pull down the SMP test kernel from
http://people.redhat.com/~jparadis/numa and tell me how that works.  It's the
.16 kernel with an additional fix for the "Unable to handle kernel paging
request at 00000000000018f0" issue.

Comment 29 Karsten Weiss 2005-08-23 07:28:16 UTC

Hi Jim!

The 2.6.9-16.EL.root.numafixsmp kernel boots fine but it still oopses
immediately when I load the latest NVIDIA driver (1.0-7676).

BTW: I did the following to compile the NVIDIA driver because there
was no devel package for your root.numafix kernel:

root@esw0001 # cd /lib/modules/2.6.9-16.EL.root.numafixsmp/
root@esw0001 # ln -s /usr/src/kernels/2.6.9-16.EL-smp-x86_64/ build
root@esw0001 # ln -s build/ source

Here's the oops:

Unable to handle kernel paging request at 00000000000018f0 RIP:
<ffffffff80123cb0>{iounmap+304}
PML4 7a4cf067 PGD 7c910067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: nvidia(U) nfs nfsd exportfs lockd md5 ipv6 i2c_dev i2c_core
sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd ehci_hcd sndd
Pid: 4873, comm: X Tainted: P      2.6.9-16.EL.root.numafixsmp
RIP: 0010:[<ffffffff80123cb0>] <ffffffff80123cb0>{iounmap+304}
RSP: 0018:000001017ed1fab8  EFLAGS: 00010213
RAX: 00000100e0000000 RBX: 00000100082b0dc0 RCX: 0000000000000019
RDX: ffffffff7fffffff RSI: 0000000000002000 RDI: 00000000e0000000
RBP: ffffff0000030000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000002a9557f3e0(0000) GS:ffffffff804d3480(0000) knlGS:00000000f7fd66c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000018f0 CR3: 0000000000101000 CR4: 00000000000006e0
Process X (pid: 4873, threadinfo 000001017ed1e000, task 000001017b2cc030)
Stack: ffffffff803cbfa8 ffffff0000030000 000001007ef3e800 ffffffffa04c66da
       0000000000000080 ffffffffa02c70e0 0000010170e19500 00000000005e10de
       0000000000000000 ffffffffa02c3d18
Call Trace:<ffffffffa04c66da>{:nvidia:os_unmap_kernel_space+9}
       <ffffffffa02c70e0>{:nvidia:_nv002012rm+42}
<ffffffffa02c3d18>{:nvidia:_nv002316rm+208}
       <ffffffffa02c319b>{:nvidia:_nv002327rm+255}
<ffffffffa02c3370>{:nvidia:_nv002284rm+100}
       <ffffffffa02bb06b>{:nvidia:_nv002166rm+39}
<ffffffffa02c3458>{:nvidia:_nv002328rm+64}
       <ffffffffa02c9df7>{:nvidia:_nv003667rm+141}
<ffffffffa02c9d3b>{:nvidia:_nv003623rm+275}
       <ffffffffa043d778>{:nvidia:_nv003247rm+126}
<ffffffffa03f0948>{:nvidia:_nv004556rm+68}
       <ffffffffa03f0726>{:nvidia:_nv004385rm+104}
<ffffffffa02c9b04>{:nvidia:_nv001453rm+96}
       <ffffffffa03a1338>{:nvidia:_nv000393rm+20}
<ffffffffa03a14b3>{:nvidia:_nv000397rm+125}
       <ffffffffa02cc951>{:nvidia:_nv001426rm+141}
<ffffffffa02ca542>{:nvidia:_nv001458rm+668}
       <ffffffffa02cd8f4>{:nvidia:rm_init_adapter+104}
<ffffffffa04c066f>{:nvidia:nv_kern_open+684}
       <ffffffff8017eb50>{chrdev_open+412} <ffffffff801760b4>{dentry_open+223}
       <ffffffff801761ef>{filp_open+62} <ffffffff801e9cf5>{strncpy_from_user+74}
       <ffffffff801762e1>{get_unused_fd+230} <ffffffff801763d0>{sys_open+57}
       <ffffffff80110056>{system_call+126}

Code: 49 8b 88 f0 18 00 00 76 1b 48 b8 00 00 00 80 00 01 00 00 48
RIP <ffffffff80123cb0>{iounmap+304} RSP <000001017ed1fab8>
CR2: 00000000000018f0
 <0>Kernel panic - not syncing: Oops

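A rough reading of the "Code:" line above, assuming (as the matching CR2 value suggests, but the log does not state) that the bytes start at the faulting instruction: `49 8b 88 f0 18 00 00` decodes as a 64-bit load from `[r8 + 0x18f0]`, and R8 is zero in the register dump, which gives exactly the address in CR2. The Python sketch below handles only this one simple encoding (REX prefix, opcode 0x8b, ModRM with a 32-bit displacement and no SIB byte); it is not a general disassembler.

```python
import struct

# 64-bit register names in x86-64 encoding order.
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

# First seven bytes of the "Code:" line from the oops.
code = bytes.fromhex("498b88f0180000")

rex, opcode, modrm = code[0], code[1], code[2]
assert rex & 0xF0 == 0x40, "expected a REX prefix"
assert opcode == 0x8B, "expected MOV r64, r/m64"

rex_w = (rex >> 3) & 1   # 64-bit operand size
rex_r = (rex >> 2) & 1   # extends the ModRM reg field
rex_b = rex & 1          # extends the ModRM rm field
mod = modrm >> 6
reg = ((modrm >> 3) & 7) | (rex_r << 3)
rm = (modrm & 7) | (rex_b << 3)
# Only the simple case is handled: disp32 addressing, no SIB byte.
assert rex_w == 1 and mod == 2 and (modrm & 7) != 4

disp = struct.unpack_from("<i", code, 3)[0]  # little-endian 32-bit displacement
dest, base = REGS[reg], REGS[rm]

r8_value = 0x0  # R8 as shown in the register dump of the oops
fault_addr = r8_value + disp

print(f"mov {dest}, [{base} + {disp:#x}]")
print(f"faulting address: {fault_addr:#018x}")
```

A zero base register with a fixed displacement is the usual signature of reading a structure field through a null pointer; which structure that is cannot be determined from the oops alone.
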
Comment 30 Jim Paradis 2005-08-25 21:19:53 UTC

Just for yuks, I put the kernel-smp-devel package up on my people page as well.
 Could you undo the symlinks you have above, install the devel package, and try
again?  I'd like to make sure we're using all the right bits before digging into
this...

Comment 31 Jim Paradis 2005-09-08 20:01:58 UTC

This *looks* like Bug 160230, but it turns out to be the same issue as Bug
166785, for which a patch has been submitted.

*** This bug has been marked as a duplicate of 166785 ***

Comment 32 Jim Paradis 2005-09-09 20:45:28 UTC

For those who can't access Bug 166785, here's a summary:  We discovered a bug in
the computation of the address-hash function with which the kernel accesses the
memnodemap[] table (this is a table that maps address ranges to NUMA nodes). 
This bug is benign if the highest physical memory address in the system is less
than 4G.  Once it exceeds this point (either due to large memory config or
memory hoisting) we overrun the table and things get very bad.  We pulled in a
couple of fixes from upstream that fix the problem (specifically new
implementations of compute_hash_shift() and pfn_valid()).
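
To make the failure mode concrete, here is a toy Python model of a hash-indexed address-to-node table. It is not the kernel's code: `NODEMAPSIZE`, `min_safe_shift()`, `build_table()` and `lookup()` are names invented for this sketch, and the "shift sized for 4G" case is only an assumed stand-in for the real miscalculation. The memory layout is the one from the original report (2G at 0, 2G at the 4G boundary).

```python
GIB = 1 << 30
NODEMAPSIZE = 256  # hypothetical table size, chosen only for illustration

def min_safe_shift(max_addr, table_size=NODEMAPSIZE):
    """Smallest shift for which every address below max_addr indexes inside the table."""
    shift = 0
    while ((max_addr - 1) >> shift) >= table_size:
        shift += 1
    return shift

def build_table(ranges, shift, table_size=NODEMAPSIZE):
    """Fill table slots with node ids; slots beyond the table are silently lost."""
    table = [None] * table_size
    for start, end, node in ranges:
        last = min(((end - 1) >> shift) + 1, table_size)
        for idx in range(start >> shift, last):
            table[idx] = node
    return table

def lookup(table, addr, shift):
    """Bounds-checked lookup; an unchecked C array read would run past the table."""
    idx = addr >> shift
    if idx >= len(table):
        raise IndexError(f"index {idx} overruns table of {len(table)} slots")
    return table[idx]

# Layout from the original report: node 0 at 0..2G, node 1 at 4G..6G.
ranges = [(0, 2 * GIB, 0), (4 * GIB, 6 * GIB, 1)]

bad_shift = min_safe_shift(4 * GIB)    # assumes memory ends at 4G (the assumed bug)
good_shift = min_safe_shift(6 * GIB)   # sized for the real top of memory
bad_table = build_table(ranges, bad_shift)
good_table = build_table(ranges, good_shift)

print("good shift:", good_shift, "node for 5G:", lookup(good_table, 5 * GIB, good_shift))
try:
    lookup(bad_table, 5 * GIB, bad_shift)
except IndexError as exc:
    print("bad shift:", bad_shift, "->", exc)
```

With all memory below 4G both shifts behave identically, which matches the description of the bug as benign on such systems.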

Comment 33 Karsten Weiss 2005-09-12 06:03:14 UTC

Jim, is there a new kernel we could test to make sure it really fixes our problem?
(I was on holiday the last couple of days and couldn't do the test with the
kernel-smp-devel package that you've requested yet.)

Comment 34 Jim Paradis 2005-09-12 20:54:09 UTC

There is a test kernel at:
people.redhat.com/~jparadis/numa/kernel-smp-2.6.9-18.EL.jparadis.x86_64.rpm that
you can try.  Please treat it as a test kernel only; it is *not* to be used for
any production purpose.  I have verified, however, that the fixes I have made
are slated for release.

Comment 35 Karsten Weiss 2005-09-13 06:29:25 UTC

Could you please provide a -devel package for it, too?

Comment 36 Jim Paradis 2005-09-13 19:50:09 UTC

-devel package has been uploaded to the same place.

Comment 37 Lonni J Friedman 2005-09-13 20:23:59 UTC

Jim, 
Is the -devel package that you posted the one that corresponds with the smp kernel?

thanks!

Comment 38 Jim Paradis 2005-09-13 21:06:58 UTC

Oops... sorry.  I uploaded the UP devel pkg by mistake.  I just uploaded the smp
-devel package.  Try it now.

Comment 39 Lonni J Friedman 2005-09-13 22:00:48 UTC

Thanks Jim.  Unfortunately, this new kernel is still exhibiting the same Oops
with the 1.0-7676 NVIDIA driver as with the -16 SMP kernel.  Do you have any
additional suggestions?

Comment 40 Jim Paradis 2005-09-13 22:04:45 UTC

Is this a closed-source driver?  It might need to be rebuilt to take advantage
of the fixed macro in mmzone.h...

Comment 41 Lonni J Friedman 2005-09-13 22:09:22 UTC

Yes, this is the closed source driver.  

By 'rebuilt' do you mean reinstalled so that its kernel module is rebuilt, or do
you mean rebuilt by NVIDIA developers?

I've reinstalled the driver, so the nvidia.ko kernel module is current with
respect to the new kernel.

Thanks.

Comment 42 Jim Paradis 2005-09-13 22:14:52 UTC

I mean it must be recompiled by nVidia, or we need to come up with another
solution...

Comment 43 Lonni J Friedman 2005-09-13 22:42:48 UTC

The nvidia driver doesn't include any Linux kernel headers, so shouldn't any
changes to mmzone.h be picked up automatically?

Comment 44 Jim Paradis 2005-09-13 22:54:03 UTC

No, a driver picks up the Linux kernel headers from the system it was *built*
on.  Changing the headers on the runtime system does nothing.

Comment 45 Lonni J Friedman 2005-09-13 23:09:05 UTC

The nvidia kernel module (nvidia.ko) is built on the system when it is
installed.   The driver package doesn't ship with a pre-compiled nvidia.ko for
every kernel in existence. 

Can you elaborate on what was changed/fixed in mmzone.h to address this bug?

Comment 46 Lonni J Friedman 2005-09-18 19:36:55 UTC

Jim,
This is a known kernel bug that was fixed many months ago:
http://linux.bkbits.net:8080/linux-2.6/diffs/arch/x86_64/mm/ioremap.c@1.23?nav=index.html|src/|src/arch|src/arch/x86_64|src/arch/x86_64/mm|hist/arch/x86_64/mm/ioremap.c

Can you please merge this change into the kernel?

thanks,
Lonni

Comment 47 Jim Paradis 2005-09-19 17:52:10 UTC

Lonni,

That patch makes sense, but we need to know: does that patch fix your problem?

Comment 48 Lonni J Friedman 2005-09-19 23:01:48 UTC

Hi Jim,
We just completed testing, and the short answer is yes, it resolves the problem.

Details:
applied the patch to the 2.6.9-17.EL kernel and rebuilt it, but saw another
crash in remap_page_range(); this crash also reproduced without the nvidia
driver (i.e. with just 'nv') with RedHat's 2.6.9-17.EL build, apparently trying
to map VGA registers via /dev/mem.  checked 2.6.9-18.EL.jparadis and found that
it doesn't crash in remap_page_range().  diff'd 2.6.9-18.EL.jparadis and
2.6.9-17.EL's asm-x86_64/mmzone.h (this is the file you stated held the fix for
this bugzilla bug) and rebuilt 2.6.9-17.EL again with a NUMA related change to
pfn_valid() included. With the two patches applied, X comes up fine. 

I'll attach the two patches we generated.

Comment 49 Lonni J Friedman 2005-09-19 23:02:38 UTC

Created attachment 119005 [details]
linux-2.6.9-x86_64-ioremap

Comment 50 Lonni J Friedman 2005-09-19 23:03:40 UTC

Created attachment 119006 [details]
linux-2.6.9-x86_64-pfn_valid patch

Comment 51 Lonni J Friedman 2005-09-21 16:25:22 UTC

Jim,
Do you need anything else from me to integrate the patches I attached?  Is there
a kernel RPM available that already has them integrated that I could test?

thanks,
Lonni

Comment 52 Karsten Weiss 2005-09-22 06:25:04 UTC

I would like to test a new kernel with those fixes, too.

Comment 53 Karsten Weiss 2005-09-22 06:28:18 UTC

BTW: Jim, do you have any comment regarding the Cache Aliasing Issue? See
comment #26 above.

Comment 55 Lonni J Friedman 2005-09-28 23:17:14 UTC

Jim,
Do you need anything else from me to integrate the patches I attached?  Is there
a kernel RPM available that already has them integrated that I could test?

thanks,
Lonni

Comment 56 Brett Morrow 2005-09-29 21:52:28 UTC

I have tried the .22 kernel and it fixed the other problems I have had on AMD
systems. Any word on a test kernel with these patches to fix the NVIDIA problems?
Anyone tried adding them to the .22 release?

Comment 57 Brett Morrow 2005-09-30 00:43:03 UTC

Thank you for the patches.  I got the .22 kernel source and applied the patches
through the SPEC file.  Built the new kernel and now the NVIDIA drivers work.

Comment 61 Brett Morrow 2005-10-06 02:27:42 UTC

Anyone know if the new kernel 2.6.9-22 released in WS4 Update 2 has the fixes in
it?  I hope to test soon.

Comment 62 Lonni J Friedman 2005-10-06 03:05:21 UTC

2.6.9-22 does not fix the bug that impacts the NVIDIA driver.  I've been told,
unofficially, that Red Hat will not be fixing this bug prior to the RHEL4-U2
final release.

Comment 64 Brett Morrow 2005-10-07 14:39:19 UTC

Well, just tried the RHEL4-U2 kernel.  (did a fresh install) and the bug is not
fixed. Here is the output:


NVRM: loading NVIDIA Linux x86_64 NVIDIA Kernel Module  1.0-8163  Wed Sep 21 12:54:25 PDT 2005
Unable to handle kernel paging request at 00000000000018f0 RIP:
<ffffffff80123c18>{iounmap+304}
PML4 2a090067 PGD 2b1ca067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: nvidia(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev
i2c_core sunrpc ds yenta_socket pcmcia_core dm_mirror dm_multipath dm_mod joydev
button battery ac ohci_hcd ehci_hcd shpchp snd_emu10k1 snd_rawmidi snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer snd_seq_device snd_ac97_codec snd_page_alloc
snd_util_mem snd_hwdep snd soundcore forcedeth floppy ext3 jbd sata_nv libata
sd_mod scsi_mod
Pid: 4506, comm: X Tainted: P      2.6.9-22.ELsmp
RIP: 0010:[<ffffffff80123c18>] <ffffffff80123c18>{iounmap+304}
RSP: 0018:000001011f4afbf8  EFLAGS: 00010213
RAX: 00000100e0000000 RBX: 000001007feeea40 RCX: 0000000000000019
RDX: ffffffff7fffffff RSI: 0000000000002000 RDI: 00000000e0000000
RBP: ffffff000053c000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000000000 R14: 0000000000000000 R15: 000001012812d680
FS:  0000002a95586920(0000) GS:ffffffff804d3100(0000) knlGS:00000000f7fcf6c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000018f0 CR3: 0000000000101000 CR4: 00000000000006e0
Process X (pid: 4506, threadinfo 000001011f4ae000, task 0000010037c45030)
Stack: ffffffff803cbba8 ffffff000053c000 0000000000000000 ffffffffa048e8c7
       0000000000000080 ffffffffa026affa 000001017fc20400 00000000005e10de
       0000000000000000 ffffffffa026529a
Call Trace:<ffffffffa048e8c7>{:nvidia:os_unmap_kernel_space+9}
       <ffffffffa026affa>{:nvidia:_nv002222rm+42} <ffffffffa026529a>{:nvidia:_nv002554rm+208}
       <ffffffffa0264687>{:nvidia:_nv002563rm+255} <ffffffffa026485c>{:nvidia:_nv002520rm+100}
       <ffffffffa0264944>{:nvidia:_nv002564rm+64} <ffffffffa0257e9d>{:nvidia:_nv001643rm+351}
       <ffffffffa026d9af>{:nvidia:_nv002285rm+45} <ffffffffa026e4e0>{:nvidia:_nv001653rm+368}
       <ffffffff80113555>{setup_irq+194} <ffffffffa0272492>{:nvidia:rm_init_adapter+104}
       <ffffffffa048869b>{:nvidia:nv_kern_open+697} <ffffffff8017eed0>{chrdev_open+412}
       <ffffffff80176434>{dentry_open+223} <ffffffff8017656f>{filp_open+62}
       <ffffffff801ea045>{strncpy_from_user+74} <ffffffff80176661>{get_unused_fd+230}
       <ffffffff80176750>{sys_open+57} <ffffffff80110052>{system_call+126}


Code: 49 8b 88 f0 18 00 00 76 1b 48 b8 00 00 00 80 00 01 00 00 48
RIP <ffffffff80123c18>{iounmap+304} RSP <000001011f4afbf8>
CR2: 00000000000018f0
 <0>Kernel panic - not syncing: Oops

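This oops and the one in comment 29 can be compared mechanically. The Python sketch below pulls the faulting address, the RIP symbol and the kernel version out of the key lines of each report; `summarize()` is a helper written for this illustration, and the regular expressions only target the oops format seen in this bug, not kernel oopses in general.

```python
import re

# Key lines copied from the two oopses (comment 29 and comment 64).
OOPS_COMMENT_29 = """\
Unable to handle kernel paging request at 00000000000018f0 RIP:
<ffffffff80123cb0>{iounmap+304}
Pid: 4873, comm: X Tainted: P      2.6.9-16.EL.root.numafixsmp
"""

OOPS_COMMENT_64 = """\
Unable to handle kernel paging request at 00000000000018f0 RIP:
<ffffffff80123c18>{iounmap+304}
Pid: 4506, comm: X Tainted: P      2.6.9-22.ELsmp
"""

def summarize(oops):
    """Extract faulting address, RIP, symbol+offset and kernel version."""
    fault = re.search(r"paging request at ([0-9a-f]+)", oops)
    rip = re.search(r"<([0-9a-f]{16})>\{([^}]+)\}", oops)
    kernel = re.search(r"Tainted: \S+\s+(\S+)", oops)
    return {
        "fault_addr": int(fault.group(1), 16),
        "rip": int(rip.group(1), 16),
        "symbol": rip.group(2),
        "kernel": kernel.group(1),
    }

first = summarize(OOPS_COMMENT_29)
second = summarize(OOPS_COMMENT_64)

# Same fault signature if both the address and the symbol+offset match.
same_fault = (first["fault_addr"], first["symbol"]) == (
    second["fault_addr"], second["symbol"])

for s in (first, second):
    print(f'{s["kernel"]}: fault at {s["fault_addr"]:#x} in {s["symbol"]} '
          f'(RIP {s["rip"]:#x})')
print("same fault signature:", same_fault)
```

Both reports fault at the same address and at the same offset within iounmap, even though the absolute RIP differs between the two kernel builds.
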
Comment 66 Edgar Villanueva 2005-10-11 02:03:51 UTC

I've got an H8DCE motherboard with one dual-core Opteron and 2x2GB DIMMs.  I get
intermittent kernel panics with this configuration.
If I remove one of the 2GB sticks the machine is stable for weeks (I have never
left it up longer than that).

I'd like to put the kernel dumps on the list to help troubleshoot the core issue.
Should I log them here, or is there another bug ID that would be better?

Also, what is the best way to grab the kernel panic?  Is it stored somewhere?

Comment 67 Karsten Weiss 2005-10-12 09:08:20 UTC

Brett reports the problem persists with 2.6.9-22 (which doesn't have the
available patches). Can we please get this fixed?

Comment 68 Lonni J Friedman 2005-10-12 14:16:21 UTC

The attached patches should apply on 2.6.9-22 as well.  Do they not work for you?

Comment 70 Jason Baron 2005-10-12 17:16:42 UTC

this should be fixed in -22.3.EL, see: http://people.redhat.com/~jbaron/rhel4/

Comment 71 Brett Morrow 2005-10-12 21:42:17 UTC

Just got the chance to install the new kernel.  Rebuilt and loaded the NVIDIA
drivers and "IT WORKS!!! :)"

Are we going to see this in a release soon???? Please :)

Comment 73 Karsten Weiss 2005-10-17 09:36:27 UTC

I finally was able to test the latest beta kernel 2.6.9-22.3.ELsmp on our 64-bit
hp xw9300 RHEL4-Update2 system, too. Here are the test results:

1. I was able to boot the kernel without acpi=off and numa=off.

2. It was possible to load the nvidia 1.0-7676 driver without the oops.
   However, I immediately got the following kernel messages:

NVRM: loading NVIDIA Linux x86_64 NVIDIA Kernel Module  1.0-7676  Fri Jul 29 13:
NVRM: WARNING: Your Linux kernel has problems in its implementation of
NVRM: the change_page_attr kernel interface.  The NVIDIA kernel
NVRM: module will attempt to work around these problems, but
NVRM: system stability may be affected.  It is recommended that
NVRM: you update to a 2.6.11 or newer kernel.

NVRM: bad caching on address 0x10074b08000: actual 0x163 != expected 0x173
NVRM: please see the README section on Cache Aliasing for more information
NVRM: bad caching on address 0x10074a0b000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x10073d09000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x10074258000: actual 0x163 != expected 0x173
NVRM: bad caching on address 0x10074259000: actual 0x163 != expected 0x173

3. Then I tried to reload the nvidia kernel module with the option
NVreg_UseCPA=1 which forces the module to use the kernel's change_page_attr()
api. When I use this module option the nvidia module loads without those
warnings. We are using the machine with this setup now and we'll see how
stable it runs.
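
The two values in the "bad caching" lines above differ in a single bit. Assuming they are raw x86 page-table entry flag bits (the log does not say so explicitly), the Python sketch below names the bits that are set and the one that differs; `flag_names()` is a helper made up for this illustration.

```python
# Low flag bits of an x86 page-table entry, by bit position.
PTE_FLAGS = {
    0: "P",       # present
    1: "RW",      # writable
    2: "US",      # user accessible
    3: "PWT",     # page-level write-through
    4: "PCD",     # page-level cache disable
    5: "A",       # accessed
    6: "D",       # dirty
    7: "PAT/PS",  # PAT in a 4K PTE, page size in higher-level entries
    8: "G",       # global
}

def flag_names(value):
    """Return the names of the flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(PTE_FLAGS.items()) if (value >> bit) & 1]

actual, expected = 0x163, 0x173  # values from the NVRM messages

differing = flag_names(actual ^ expected)

print("actual:  ", flag_names(actual))
print("expected:", flag_names(expected))
print("differs: ", differing)
```

Under that assumption the driver expected the cache-disable bit to be set on these pages and found it clear, which fits the README's "Cache Aliasing" wording.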

Does anybody know if the two nvidia driver warnings regarding change_page_attr()
and cache aliasing are false warnings with the 2.6.9-22.3 kernel+1.0-7676 driver
or if these kernel issues still persist?

Comment 74 Lonni J Friedman 2005-10-17 14:16:52 UTC

The bad caching warnings occur automatically with 1.0-7xxx nvidia drivers and
Red Hat kernels.

Comment 75 Jason Baron 2005-10-19 17:36:51 UTC

As far as I know, all the kernel issues with change_page_attr are resolved in
2.6.9-22.3, so as long as the nvidia 7676 driver is given the NVreg_UseCPA=1
option everything should work fine. My understanding is that the nvidia 8163+
drivers will automatically detect the change_page_attr changes and thus
shouldn't require the command line options.

Comment 78 Karsten Weiss 2005-10-24 12:54:40 UTC

So when can we expect an official kernel errata rpm >= 2.6.9-22.3?

Side note: *Please* make sure that the next official kernel errata rpm addresses
all known kernel bugs mentioned in the very nice summary from NVIDIA which can
be found at http://www.nvnews.net/vbulletin/showthread.php?t=58498.

Comment 79 Red Hat Bugzilla 2005-10-27 15:06:56 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-808.html

Comment 80 Brian Bilbrey 2005-12-30 17:07:25 UTC

(In reply to comment #70)
> this should be fixed in -22.3.EL, see: http://people.redhat.com/~jbaron/rhel4/

Please re-open this bug. Configuration and circumstances follow...

Intel Server Motherboard SE7520JR2
  Dual Xeon 3.4GHz
  4 x 1GB RAM

RHEL4 ES Update 2 stock kernel-smp-2.6.9-22.EL 
   boot fails unless memory remap disabled in BIOS.

Recommended fix: Errata kernel-smp-2.6.9-22.0.1.EL
   boot fails unless memory remap disabled in BIOS.

~jbaron kernel-smp-2.6.9-24.1.EL of 12-Dec-2005 16:40
   boots fine with memory remap enabled, full 4G of RAM usable.