Bug 156023

Summary: Memory corruption
Product: Red Hat Enterprise Linux 3
Reporter: Greg Marsden <greg.marsden>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: anderson, aviro, bill.irwin, peterm, petrides, rkenna, tao, wim.coekaerts
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2005-05-25 16:42:38 UTC
Bug Blocks: 156321

Attachments:
Patch to prevent using pte_clear when the valid bit is set on an x86 in PAE mode. (flags: none)

Description Greg Marsden 2005-04-26 18:17:49 UTC

Comment 1 Larry Woodman 2005-04-26 18:21:41 UTC

[...network console shutdown...]
[...network console startup...]
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c01545b1
*pde = 129f6001
*pte = 00000000
Oops: 0002
netconsole soundcore ide-cd cdrom cpqci usbserial lp parport autofs4 nfs lockd
sunrpc e1000 tg3 floppy sg microcode keybdev mousedev hid input usb-ohci usbcor
CPU:    0
EIP:    0060:[<c01545b1>]    Tainted: P  
EFLAGS: 00010246

EIP is at launder_page [kernel] 0x81 (2.4.21-27.0.1.ELsmp/i686)
eax: c03a8240   ebx: c2ef0fe0   ecx: c03a7080   edx: 00000000
esi: c2ef0ffc   edi: 00000001   ebp: c03a7080   esp: c8e73f64
ds: 0068   es: 0068   ss: 0068
Process kswapd (pid: 11, stackpage=c8e73000)
Stack: 00000001 0000001e c2ef0ff4 00000000 000668de c03a7080 00000001 00000040  
      c015657b c03a7080 000001d0 c2ef0fe0 c03a8240 00000004 c03a7080 0001aad8  
      00000001 00000040 c0156b7b c03a7080 00000100 000001d0 0001b86b 00000000  
Call Trace:   [<c015657b>] rebalance_dirty_zone [kernel] 0xab (0xc8e73f84)
[<c0156b7b>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc8e73fac)
[<c0156ca8>] kswapd [kernel] 0x68 (0xc8e73fd0)
[<c0156c40>] kswapd [kernel] 0x0 (0xc8e73fe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc8e73ff0)

Code: 89 02 c7 46 04 00 00 00 00 c7 43 1c 00 00 00 00 f0 0f ba 73

CPU#0 is executing netdump.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
< netdump activated - performing handshake with the client. >

Pid/TGid: 11/11, comm:               kswapd
EIP: 0060:[<c01545b1>] CPU: 0
EIP is at launder_page [kernel] 0x81 (2.4.21-27.0.1.ELsmp)
ESP: 001e:00000001 EFLAGS: 00010246    Tainted: P  
EAX: c03a8240 EBX: c2ef0fe0 ECX: c03a7080 EDX: 00000000
ESI: c2ef0ffc EDI: 00000001 EBP: c03a7080 DS: 0068 ES: 0068 FS: 0000 GS: 0000
CR0: 8005003b CR2: 00000000 CR3: 19db1b80 CR4: 000006f0
Call Trace:   [<c015657b>] rebalance_dirty_zone [kernel] 0xab (0xc8e73f84)
[<c0156b7b>] do_try_to_free_pages_kswapd [kernel] 0x1eb (0xc8e73fac)
[<c0156ca8>] kswapd [kernel] 0x68 (0xc8e73fd0)
[<c0156c40>] kswapd [kernel] 0x0 (0xc8e73fe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc8e73ff0)

Comment 2 Greg Marsden 2005-04-26 18:29:31 UTC

------------[ cut here ]------------
kernel BUG at slab.c:1957!
invalid operand: 0000
cpqci usbserial lp parport netconsole autofs4 nfs lockd sunrpc e1000 tg3
floppy sg microcode keybdev mousedev hid input usb-ohci usbcore ext3 jbd
cciss sd_mod
CPU:    1
EIP:    0060:[<c0152765>]    Tainted: P
EFLAGS: 00010016
.
EIP is at kmem_cache_reap [kernel] 0x335 (2.4.21-27.0.1.ELsmp/i686)
eax: 0047004e   ebx: 00000000   ecx: 0000001e   edx: c8e0f450
esi: 0000837c   edi: c8e0f450   ebp: c8e0f450   esp: f2d5df10
ds: 0068   es: 0068   ss: 0068
Process bgscollect (pid: 29941, stackpage=f2d5d000)
Stack: ecc4fa00 c02bd1ed 000003f0 000000fc 00000000 00000080 00000000
00000000
       00000001 c0000000 00000462 00000246 f17bdc6c 00000000 00000000
00000000
       ecc4fa00 c8e0f450 000009ea c0185213 ecc4fa00 c8e0f450 f2d5df74
ecc4fa18
Call Trace:   [<c0185213>] seq_read [kernel] 0x173 (0xf2d5df5c)
[<c0164087>] sys_read [kernel] 0x97 (0xf2d5df94)
.
Code: 0f 0b a5 07 b7 d1 2b c0 ff 44 24 28 01 ce 8b 00 39 e8 75 e7
.
CPU#0 is frozen.
CPU#1 is executing netdump.
CPU#2 is frozen.
CPU#3 is frozen.
< netdump activated - performing handshake with the client. >
.
**********
mtrr: no more MTRRs available
mtrr: no more MTRRs available
------------[ cut here ]------------
kernel BUG at vmscan.c:795!
invalid operand: 0000
cpqci usbserial lp parport netconsole autofs4 nfs lockd sunrpc e1000 tg3
floppy sg microcode keybdev mousedev hid input usb-ohci usbcore ext3 jbd
cciss sd_mod
CPU:    0
EIP:    0060:[<c01564b0>]    Tainted: P
EFLAGS: 00010246
.
EIP is at rebalance_laundry_zone [kernel] 0x960 (2.4.21-27.0.1.ELsmp/i686)
eax: 00000001   ebx: c4dbf12c   ecx: c03a8250   edx: c5291bf0
esi: c4dbf110   edi: 00000010   ebp: c03a7080   esp: c8e73f80
ds: 0068   es: 0068   ss: 0068
Process kswapd (pid: 11, stackpage=c8e73000)
Stack: 00000000 00000001 00000000 c03a8248 00000000 00000000 0000002f
c03a7080
       0001ac3b 00000000 00000040 c0156b94 c03a7080 00000040 00000000
00000000
       00000000 00004160 00000001 00000000 c0156ca8 000001d0 00000002
000001d0
Call Trace:   [<c0156b94>] do_try_to_free_pages_kswapd [kernel] 0x204
(0xc8e73fac)
[<c0156ca8>] kswapd [kernel] 0x68 (0xc8e73fd0)
[<c0156c40>] kswapd [kernel] 0x0 (0xc8e73fe4)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc8e73ff0)
.
Code: 0f 0b 1b 03 10 d2 2b c0 e9 2a f7 ff ff b8 04 00 00 00 e8 49
.
CPU#0 is executing netdump.
CPU#1 is frozen.
CPU#2 is frozen.
CPU#3 is frozen.
< netdump activated - performing handshake with the client. >
.
***********
.
[...network console startup...]
mtrr: no more MTRRs available
mtrr: no more MTRRs available
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0155a9c
*pde = 3389e001
*pte = 00000000
Oops: 0000
cpqci usbserial lp parport netconsole autofs4 nfs lockd sunrpc e1000 tg3
floppy sg microcode keybdev mousedev hid input usb-ohci usbcore ext3 jbd
cciss sd_mod
CPU:    3
EIP:    0060:[<c0155a9c>]    Tainted: P
EFLAGS: 00010247
.
EIP is at scan_active_list [kernel] 0xcc (2.4.21-27.0.1.ELsmp/i686)
eax: 00000011   ebx: c39ff030   ecx: c39ff030   edx: 00000000
esi: 00000000   edi: 00000003   ebp: c03a8158   esp: c8e71fa8
ds: 0068   es: 0068   ss: 0068
Process kscand (pid: 12, stackpage=c8e71000)
Stack: c2740f98 00000003 00000000 00000000 c03a7080 c03a8158 00000003
c8e70000
       c0157240 c03a7080 00000003 c03a8158 c8e70000 00000000 c01571a0
00000000
       00000000 00000000 c01095ad 00000000 00000000 00000000
Call Trace:   [<c0157240>] kscand [kernel] 0xa0 (0xc8e71fc8)
[<c01571a0>] kscand [kernel] 0x0 (0xc8e71fe0)
[<c01095ad>] kernel_thread_helper [kernel] 0x5 (0xc8e71ff0)
.
Code: 8b 36 39 ea 0f 85 7d ff ff ff 8b 44 24 24 85 c0 74 2e b0 01

Comment 3 Greg Marsden 2005-04-26 18:30:30 UTC

*** WIRWIN  04/13/05 03:25 pm ***
If this isn't bad RAM, there's a good chance it's a use-after-free with some
scribbling past the end of the allocated buffer. kmem_cache_reap() runs in
the background, so it's stumbled upon slab state corruption dissociated in
time from the use of slab entrypoints. Polling /proc/slabinfo has at most
some effect on timing as it performs read-only traversals with some locking.

Comment 4 William Lee Irwin III 2005-04-26 22:17:07 UTC

A strong potential for use-after-free exists in prune_icache() vs.
invalidate_inodes().  However, it's not yet clear how flags usage (e.g. I_LOCK)
was intended to prevent this or how to show that it fails to do so.  It's not a
terribly promising lead, as this general idiom has been unchanged for several
major releases.

Comment 5 Dave Anderson 2005-04-27 14:21:00 UTC

Analysis of this particular vmcore shows that the dentry_hashtable[]
array has been corrupted:

crash> sys
      KERNEL: /usr/tmp/vmlinux-2.4.21-31.EL.VMTESTsmp
   DEBUGINFO: /usr/tmp/vmlinux-2.4.21-31.EL.VMTESTsmp.debug
    DUMPFILE: vmcore-4-26-05
        CPUS: 4
        DATE: Tue Apr 26 01:50:54 2005
      UPTIME: 3 days, 09:02:06
LOAD AVERAGE: 11.88, 4.67, 3.39
       TASKS: 790
    NODENAME: spplap04
     RELEASE: 2.4.21-31.EL.VMTESTsmp
     VERSION: #1 SMP Fri Apr 22 14:07:41 EDT 2005
     MACHINE: i686  (3049 Mhz)
      MEMORY: 5 GB
       PANIC: "Oops: 0002" (check log for details)
crash> bt
PID: 13164  TASK: dc2ee000  CPU: 3   COMMAND: "perl"
 #0 [dc2efe04] netconsole_netdump at f8c1f703
 #1 [dc2efe18] try_crashdump at c0128ed3
 #2 [dc2efe28] die at c010c682
 #3 [dc2efe3c] do_page_fault at c0120229
 #4 [dc2eff00] error_code (via page_fault) at c03f41c0
    EAX: 00000000  EBX: d9b0b400  ECX: c6b3d0a0  EDX: d9b0b410  EBP: f3d39400
    DS:  0068      ESI: ee601f00  ES:  0068      EDI: d9b0b400
    CS:  0060      EIP: c017ebac  ERR: ffffffff  EFLAGS: 00010246
 #5 [dc2eff3c] d_rehash at c017ebac
 #6 [dc2eff44] do_pipe at c017255f
 #7 [dc2effa0] sys_pipe at c0113612
 #8 [dc2effc0] system_call at c03f4068
    EAX: 0000002a  EBX: bfffec60  ECX: 00000000  EDX: bfffec60
    DS:  002b      ESI: 00000000  ES:  002b      EDI: 08154bc4
    SS:  002b      ESP: bfffec1c  EBP: bfffec68
    CS:  0023      EIP: b7536b2d  ERR: 0000002a  EFLAGS: 00000246
crash>

The actual dentry_hashtable[] list_head that's being bumped into
by d_rehash() is at c6b3d0a0.  The corruption starts at c6b3d000,
and extends to c6b3d100, although there seems to be proper list_head
values intermingled near the end of the corruption:

c6b3cfc0:  c6b3cfc0 c6b3cfc0 c6b3cfc8 c6b3cfc8   ................
c6b3cfd0:  c6b3cfd0 c6b3cfd0 c6b3cfd8 c6b3cfd8   ................
c6b3cfe0:  c6b3cfe0 c6b3cfe0 c6b3cfe8 c6b3cfe8   ................
c6b3cff0:  e3f9ed10 e3f9ed10 c6b3cff8 c6b3cff8   ................
c6b3d000:  00000001 afac31f8 92c19fd0 00000000   .....1..........
c6b3d010:  00000000 00000000 00000000 00000000   ................
c6b3d020:  00000000 00000000 00000000 00000000   ................
c6b3d030:  00000000 00000000 00000000 00000000   ................
c6b3d040:  00000000 00000000 00000000 00000000   ................
c6b3d050:  00000000 8ed260f0 8ed26100 00000000   .....`...a......
c6b3d060:  00000000 00000000 00000000 00000000   ................
c6b3d070:  00000000 00000000 00000000 00000000   ................
c6b3d080:  00000000 00000000 00000000 00000000   ................
c6b3d090:  00000000 00000000 00000000 00000000   ................
c6b3d0a0:  00000000 00000000 f6290f10 f6290f10   ..........)...).
c6b3d0b0:  f35d2a90 e5346a90 c6b3d0b8 c6b3d0b8   .*]..j4.........
c6b3d0c0:  f4af4a10 daa14c10 c6b3d0c8 c6b3d0c8   .J...L..........
c6b3d0d0:  c6b3d0d0 c6b3d0d0 c6b3d0d8 c6b3d0d8   ................
c6b3d0e0:  e1365f10 e1365f10 c6b3d0e8 c6b3d0e8   ._6.._6.........
c6b3d0f0:  00000001 aed255e8 00000001 af6f64a0   .....U.......do.
c6b3d100:  00000001 aed25370 00000001 8ed26000   ....pS.......`..
c6b3d110:  ec9ff390 cf7e0890 c6b3d118 c6b3d118   ......~.........
c6b3d120:  c6b3d120 c6b3d120 c6b3d128 c6b3d128    ... ...(...(...
c6b3d130:  f14dc210 f14dc210 c6b3d138 c6b3d138   ..M...M.8...8...
c6b3d140:  c6b3d140 c6b3d140 c6b3d148 c6b3d148   @...@...H...H...

We've seen this in other cases, where corruption occurs always starting
at a page address, for the first few hundred bytes of the page.  The 
corruption typically has a "1" in the first word, and the second word
usually contains an address (like afac31f8 above) that appears to be
a user virtual address that is *only* found in the virtual address
space of java processes.  The associated physical address is usually
shared among several java processes, with typically a handful
of those shared physical addresses mapped to many java instances
of that same virtual address.  Usually, but not in this case, the data
following the "1" and the java virtual address consists of what looks
like java ASCII/unicode, with two bytes per ASCII character.

This case shows the corruption in the middle of the dentry_hashtable[]
array, which is allocated via alloc_bootmem().  In other cases we've seen
the same type of corruption in the dentry_hashtable[], in the
page_hash_table[] array, in the mem_map[] array, and most commonly, in an
inode or dentry slab page.
It always occurs at the beginning of a page.  And the page is not mapped
(mistakenly) into any user virtual address space.
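
The dump above can be checked mechanically.  What follows is only a rough
user-space sketch, not a crash(8) feature: it treats each 8-byte pair as a
list_head and flags pairs that are neither self-pointing (an empty bucket)
nor a pair of kernel pointers.  The names classify_bucket(), first_suspect()
and KERNEL_BASE are hypothetical, and KERNEL_BASE assumes the usual 3G/1G
split of the smp kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 3G/1G split, so kernel virtual addresses start here. */
#define KERNEL_BASE 0xc0000000u

enum bucket_state { BUCKET_EMPTY, BUCKET_IN_USE, BUCKET_SUSPECT };

/* Classify one list_head (next, prev) stored at kernel address addr. */
static enum bucket_state classify_bucket(uint32_t addr, uint32_t next,
                                         uint32_t prev)
{
    if (next == addr && prev == addr)
        return BUCKET_EMPTY;    /* an empty list points back at itself */
    if (next >= KERNEL_BASE && prev >= KERNEL_BASE)
        return BUCKET_IN_USE;   /* plausible pointers into kernel memory */
    return BUCKET_SUSPECT;      /* zeros, small integers, user addresses */
}

/* Return the address of the first suspect bucket in a dump, or 0 if none. */
static uint32_t first_suspect(uint32_t base, const uint32_t *words,
                              size_t nwords)
{
    assert(nwords % 2 == 0);    /* a list_head is two 32-bit words */
    for (size_t i = 0; i < nwords; i += 2) {
        uint32_t addr = base + (uint32_t)(i * sizeof(uint32_t));

        if (classify_bucket(addr, words[i], words[i + 1]) == BUCKET_SUSPECT)
            return addr;
    }
    return 0;
}
```

Fed the words from c6b3cfe0 onward, this sketch reports c6b3d000 as the first
suspect bucket, which matches the page-aligned start of the corruption
described above.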

Comment 8 Larry Woodman 2005-05-03 13:26:48 UTC

Currently we are still investigating the cause of these memory corruption
issues.  The latest thought is that there is a race between inode freeing and
reuse within the inode cache.  We are running two patches in a test kernel at
the customer site that has been experiencing these crashes.  Both are described
below.

At this point I *think* what might be happening is the automount daemon running
at the same time as prune_icache() (or any other caller of __refile_inode()) and
hitting a race between dispose_list() and __refile_inode().  In the example
below, CPU0 is in prune_icache() called by kswapd and CPU1 is in
invalidate_inodes() called by the automount daemon.

1.) CPU0: prune_icache() sets the I_LOCK bit in an inode on the
inode_unused_pagecache list, releases the inode_lock and calls
invalidate_inode_pages.

2.) CPU1: invalidate_inodes() calls invalidate_list() for the
inode_unused_pagecache list with the inode_lock held and sets the I_FREEING bit
in the inode->i_state.

3.) CPU0: prune_icache() acquires the inode_lock and clears the I_LOCK bit in
the inode->i_state.

4.) CPU1: dispose_list() calls clear_inode() without the inode_lock held.  Since
the I_LOCK bit is clear, clear_inode() sets inode->i_state = I_CLEAR, clearing
the I_FREEING bit.

5.) CPU0: prune_icache() calls __refile_inode() because clear_inode() cleared
I_FREEING without holding the inode_lock.  This inode is no longer on the
inode_unused_pagecache list, which results in that inode being placed on the
inode_unused list.

6.) CPU1: dispose_list() calls destroy_inode() which kmem_cache_free()s an inode
that is also on the inode_unused list.


At this point there is an inode that has been kmem_cache_free()'d and is also on
the inode_unused list.
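
The six steps above can be replayed in a single thread to see where the state
goes wrong.  This is only a toy model under stated assumptions: struct
toy_inode, toy_refile() and replay_race() are hypothetical names, the flag
values are illustrative, and the model tracks nothing but i_state (it
deliberately ignores i_hash and the real list linkage).  Passing true models
clear_inode() taking the inode_lock, which forces its store to wait until
CPU0 has finished its locked section.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; only their distinctness matters here. */
#define I_LOCK    0x08u
#define I_FREEING 0x10u
#define I_CLEAR   0x20u

struct toy_inode {
    unsigned i_state;
    bool on_unused_list;   /* linked on inode_unused */
    bool freed;            /* handed to kmem_cache_free() */
};

/* Toy __refile_inode(): only the I_FREEING test is modelled. */
static void toy_refile(struct toy_inode *in)
{
    if (in->i_state & I_FREEING)
        return;
    in->on_unused_list = true;
}

/* Toy clear_inode(): it only proceeds once I_LOCK is clear. */
static void toy_clear_inode(struct toy_inode *in)
{
    assert(!(in->i_state & I_LOCK));
    in->i_state = I_CLEAR;          /* this store also wipes I_FREEING */
}

/* Replay steps 1-6; returns true if a freed inode is left on a list. */
static bool replay_race(bool clear_inode_takes_lock)
{
    struct toy_inode in = { 0, false, false };

    in.i_state |= I_LOCK;           /* 1: CPU0, then drops inode_lock */
    in.i_state |= I_FREEING;        /* 2: CPU1, under inode_lock */
    in.i_state &= ~I_LOCK;          /* 3: CPU0 re-takes inode_lock */
    if (clear_inode_takes_lock) {
        toy_refile(&in);            /* 5: CPU0 still sees I_FREEING */
        toy_clear_inode(&in);       /* 4: CPU1 had to wait for the lock */
    } else {
        toy_clear_inode(&in);       /* 4: unlocked store slips in */
        toy_refile(&in);            /* 5: I_FREEING is gone, inode refiled */
    }
    in.freed = true;                /* 6: CPU1 destroy_inode() */
    return in.freed && in.on_unused_list;
}
```

In this model replay_race(false) leaves a freed inode on the list and
replay_race(true) does not.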

This patch to clear_inode() acquires the inode_lock before manipulating the
inode->i_state field.  This is the only place in the kernel that manipulates the
inode without holding the inode_lock.
----------------------------------------------------------------------
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -636,7 +636,9 @@ void clear_inode(struct inode *inode)
 		cdput(inode->i_cdev);
 		inode->i_cdev = NULL;
 	}
+	spin_lock(&inode_lock);
 	inode->i_state = I_CLEAR;
+	spin_unlock(&inode_lock);
 }
 
 /*
---------------------------------------------------------------------


We are seeing inode list corruption that has been chased down to a race between
iput() and __refile_inode() working on the same inode.  cpu0 is running
kupdated, writing old buffers and cpu1 is unlinking a file.

1.) cpu0 is in __sync_one() just about to call __refile_inode() after taking the
inode_lock and clearing I_LOCK.
---------------------------------------------------------
       spin_lock(&inode_lock);
       inode->i_state &= ~I_LOCK;
       if (!(inode->i_state & I_FREEING))
               __refile_inode(inode);
       wake_up(&inode->i_wait);
---------------------------------------------------------

2.) cpu1 is in iput() where it has dropped the inode_lock and calls
clear_inode().  It doesn't block because I_LOCK is clear so it sets the inode state.
------------------------------------------------------
   void clear_inode(struct inode *inode)
   {
         ...
       wait_on_inode(inode);
         ...
       inode->i_state = I_CLEAR;
         ...
   }
------------------------------------------------------

3.) cpu0 calls __refile_inode which places it on one of the four possible inode
lists
-------------------------------------------------------------------------
   static inline void __refile_inode(struct inode *inode)
   {
       if (inode->i_state & I_DIRTY)
               to = &inode->i_sb->s_dirty;
       else if (atomic_read(&inode->i_count))
               to = &inode_in_use;
       else if (inode->i_data.nrpages)
               to = &inode_unused_pagecache;
       else
               to = &inode_unused;
       list_del(&inode->i_list);
       list_add(&inode->i_list, to);
   }
---------------------------------------------------------------------

4.) cpu1 returns from clear_inode() then calls destroy_inode() which
kmem_cache_free()s it.
--------------------------------------------------------------------
   static void destroy_inode(struct inode *inode)
   {                                            
       if (inode->i_sb->s_op->destroy_inode)
               inode->i_sb->s_op->destroy_inode(inode);
       else
               kmem_cache_free(inode_cachep, inode);
   }
--------------------------------------------------------------------

5.) at this point we have an inode that has been kmem_cache_free()'d that is
also sitting on one of the lists determined by __refile_inode(), that can't be
good!!!   Also, the code looks the same in RHEL4.

This patch stops this from happening but I don't know enough about inode races to
feel comfortable with it.
---------------------------------------------------------------------
--- linux-2.4.21/fs/inode.c.orig
+++ linux-2.4.21/fs/inode.c
@@ -296,7 +296,7 @@ static inline void __refile_inode(struct
 {
 	struct list_head *to;
 	
-	if (inode->i_state & I_FREEING)
+	if (inode->i_state & (I_FREEING|I_CLEAR))
 		return;
 	if (list_empty(&inode->i_hash))
 		return;
-----------------------------------------------------------------------
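
The effect of the one-line change can be looked at in isolation.  The sketch
below is a hypothetical helper, not kernel code: refile_allowed() only
reproduces the state test at the top of __refile_inode(), with the flag
values chosen for illustration.  It shows why an inode whose state has
already been overwritten with I_CLEAR passes the original test but is
rejected by the patched one.

```c
#include <stdbool.h>

/* Illustrative flag values; only their distinctness matters here. */
#define I_FREEING 0x10u
#define I_CLEAR   0x20u

/*
 * Model of the early-return test in __refile_inode().
 * patched == false: original test, I_FREEING only.
 * patched == true:  test from the diff above, I_FREEING|I_CLEAR.
 */
static bool refile_allowed(unsigned i_state, bool patched)
{
    unsigned skip = patched ? (I_FREEING | I_CLEAR) : I_FREEING;

    return (i_state & skip) == 0;
}
```

With i_state == I_CLEAR, the original test lets the refile go ahead and the
patched test skips it.  An inode with neither flag set is refiled in both
cases.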

Comment 9 Zach Brown 2005-05-04 18:55:26 UTC

Greg asked me to take a peek at this bug.

Larry, how do i_hash and i_list operate in these scenarios?  It looks to me like
__refile_inode() won't add the inode back on any lists because whoever set
I_FREEING will also have removed i_hash.  (the paste of __refile_inode() doesn't
have the i_hash test, but the patch towards the bottom does..)  It also seems
that an inode is only returned to one caller of invalidate_list() because i_list
is removed in the process.

I'm being deliberately vague both because I'm not intimately familiar with the
VFS locking labyrinth and because there has been some confusion in getting the
exact source that you're working against.  Can someone point me in the direction
of code that we're triply sure is involved in the bug?

Comment 10 Larry Woodman 2005-05-04 19:22:09 UTC

I'm thinking that the problem is that prune_icache() (and other callers of
__refile_inode()) release the inode_lock, do something, then re-acquire the
inode_lock, assuming that the inode being operated on is still allocated (not
kmem_cache_free()'d yet).  If some other cpu invalidates the inode (via autofs
timeout) while prune_icache() doesn't hold the lock, prune_icache() might refile
an inode that has been freed!

Larry

Comment 11 Larry Woodman 2005-05-06 20:06:47 UTC

Greg, after looking at the changes made to that kernel the problem appears to be
pte_clear() on the x86 with PAE.  It is now called from establish_pte as of
RHEL3-U4.

-------------------------------------------------------------------------------
#define pte_clear(xp)   do { set_pte(xp, __pte(0)); } while (0)

static inline void set_pte(pte_t *ptep, pte_t pte)
{
       ptep->pte_high = pte.pte_high;
       smp_wmb();
       ptep->pte_low = pte.pte_low;
}
-------------------------------------------------------------------------------

When clearing the pte, if we clear the high 4 bytes then execute a memory
barrier before clearing the low 4 bytes we end up with a transient pte with
a zeroed out upper half and a random lower half with valid bit set!!!

We want to do this in exactly reverse order when clearing a PAE pte.
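
The ordering problem can be shown with a small user-space model.  This is a
sketch under assumptions, not the kernel's code: struct pae_pte and the
helper names are hypothetical, and the "observed" value stands in for what a
hardware page-table walk on another CPU could read between the two 32-bit
stores.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_PRESENT 0x1u   /* bit 0 of the low word on x86 */

struct pae_pte {
    uint32_t pte_low;       /* flags and low bits of the frame address */
    uint32_t pte_high;      /* frame address bits 32 and up */
};

/* The 64-bit entry as another CPU would see it at this instant. */
static uint64_t observed(const struct pae_pte *p)
{
    return ((uint64_t)p->pte_high << 32) | p->pte_low;
}

static bool is_present(uint64_t pte)
{
    return (pte & PAGE_PRESENT) != 0;
}

/* Order used by set_pte(): returns the value visible between the stores. */
static uint64_t clear_high_first(struct pae_pte *p)
{
    uint64_t mid;

    p->pte_high = 0;
    mid = observed(p);      /* still present, but frame address truncated */
    p->pte_low = 0;
    return mid;
}

/* Reverse order: the present bit is gone before the high half changes. */
static uint64_t clear_low_first(struct pae_pte *p)
{
    uint64_t mid;

    p->pte_low = 0;
    mid = observed(p);      /* not present, so the high half is ignored */
    p->pte_high = 0;
    return mid;
}
```

For an entry mapping a frame above 4 GB, say pte_high = 1 and pte_low =
0x12345067, the high-first order passes through 0x0000000012345067, a valid
entry naming a different physical frame below 4 GB.  The low-first order
passes through 0x0000000100000000, which is not present.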

This patch fixes this problem, however it still needs testing:
-----------------------------------------------------------------------------
--- linux-2.4.21/include/asm-i386/pgtable-2level.h.orig
+++ linux-2.4.21/include/asm-i386/pgtable-2level.h
@@ -40,6 +40,7 @@ static inline int pgd_present(pgd_t pgd)
  * hook is made available.
  */
 #define set_pte(pteptr, pteval) (*(pteptr) = pteval)
+#define set_pte_zero(pteptr, pteval) (*(pteptr) = pteval)
 #define set_pte_atomic(pteptr, pteval) (*(pteptr) = pteval)
 
 /*
--- linux-2.4.21/include/asm-i386/pgtable-3level.h.orig
+++ linux-2.4.21/include/asm-i386/pgtable-3level.h
@@ -51,6 +51,17 @@ static inline void set_pte(pte_t *ptep, 
 	smp_wmb();
 	ptep->pte_low = pte.pte_low;
 }
+
+/* Called by pte_clear, we need to clear the low half first so
+ * the present bit gets cleared before clearing the upper half.
+ */
+static inline void set_pte_zero(pte_t *ptep, pte_t pte)
+{
+	ptep->pte_low = pte.pte_low;
+	smp_wmb();
+	ptep->pte_high = pte.pte_high;
+}
+
 #define set_pmd(pmdptr,pmdval) \
 		set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
 #define set_pgd(pgdptr,pgdval) \
--- linux-2.4.21/include/asm-i386/pgtable.h.orig
+++ linux-2.4.21/include/asm-i386/pgtable.h
@@ -325,7 +325,7 @@ extern unsigned long pg0[1024];
 
 #define pte_present(x)	((x).pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE))
 #define pte_user(x)	((x).pte_low & _PAGE_USER)
-#define pte_clear(xp)	do { set_pte(xp, __pte(0)); } while (0)
+#define pte_clear(xp)	do { set_pte_zero(xp, __pte(0)); } while (0)
 
 #define pmd_none(x)	(!pmd_val(x))
 #define pmd_present(x)	(pmd_val(x) & _PAGE_PRESENT)

Comment 12 William Lee Irwin III 2005-05-07 03:11:55 UTC

This was actually found a while ago by other persons and it's unclear how and
why the fix failed to get propagated. Could you please send this on to 2.4.x and
2.6.x mainline ASAP? This could resolve a number of pending bugreports against
those trees.

Comment 13 Larry Woodman 2005-05-07 11:08:48 UTC

Sure, as soon as the official patch is complete and tested.  We have a few
disagreements as to the exact implementation that are being worked out as we
speak.  I'll post it as soon as we all agree.

Thanks Bill.


Larry

Comment 14 Larry Woodman 2005-05-11 13:47:03 UTC

Created attachment 114246 [details]
Patch to prevent using pte_clear when the valid bit is set on an x86 in PAE mode.


The final version of this patch changes unmap_hugepage_range() and
establish_pte() to use ptep_get_and_clear() instead of pte_clear().  In these 2
paths the pte is valid and pte_clear() is not safe while ptep_get_and_clear()
is safe.
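
Why a get-and-clear primitive is safe on a live entry can be sketched in user
space with C11 atomics.  This is an assumption about the general approach,
not a copy of the kernel's ptep_get_and_clear(): the struct and function
names are hypothetical, and atomic_exchange() stands in for the xchg
instruction.  The low word, which holds the present bit, is swapped to zero
in one atomic step before the high word is touched.

```c
#include <stdatomic.h>
#include <stdint.h>

struct pae_pte_slot {
    _Atomic uint32_t pte_low;   /* holds the present bit */
    uint32_t pte_high;
};

struct pae_pte_val {
    uint32_t pte_low;
    uint32_t pte_high;
};

/* Fetch the old entry and leave the slot cleared, low half first. */
static struct pae_pte_val get_and_clear(struct pae_pte_slot *slot)
{
    struct pae_pte_val old;

    /* Atomic swap: entry becomes not-present, later stores stay ordered. */
    old.pte_low = atomic_exchange(&slot->pte_low, 0);
    old.pte_high = slot->pte_high;
    slot->pte_high = 0;
    return old;
}
```

The caller also gets the old entry back, so nothing that was in it at the
moment of the swap is lost.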

If Oracle is experiencing this problem and cannot wait until the official RHEL3
kernel is available, please use this patch.


Larry Woodman

Comment 15 Ernie Petrides 2005-05-14 05:19:13 UTC

A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.4.EL).

Comment 18 Ernie Petrides 2005-05-17 22:14:39 UTC

A fix for this problem has also been committed to the RHEL3 E6
patch pool this evening (in kernel version 2.4.21-32.0.1.EL).

Comment 24 Josh Bressers 2005-05-25 16:42:38 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-472.html

Comment 33 Larry Woodman 2005-07-07 18:27:55 UTC

*** Bug 161986 has been marked as a duplicate of this bug. ***

Comment 34 Ernie Petrides 2005-07-13 23:33:47 UTC

For the record, bug 161986 is not a dup of this one.

Comment 37 Ernie Petrides 2005-08-10 21:38:44 UTC

Who sent this HOTFIX request?  Please get your facts straight.  This bug
has been fixed in a released RHEL3 kernel.  See comment #24.

Comment 38 Johnray Fuller 2005-08-10 21:44:51 UTC

Ernie,

Apologies for the confusion...

There was a HOTFIX issued for a bonding driver multicast issue (2.4.21-32.2)
that did *not* contain the fix in 32.0.1 (also in 2.4.21-32.4).

The customer using 2.4.21-32.2 experienced the PTE issue, so we issued the
HOTFIX which included *both* the bonding fix and the PTE fix (2.4.21-34).

Thanks,
J

Comment 40 Johnray Fuller 2005-08-10 22:26:06 UTC

Not a problem. It was a confusing situation.

J