Description of problem:
This bug is fixed in the latest 2.4 and 2.6 bk trees.
We might get into the page fault handler even if the region 5 address is
valid, because the VHPT walker can insert a not-present translation
that becomes stale. And as the page fault handler in EL3 doesn't handle
not-present translations for region 5, it will oops.
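For readers unfamiliar with IA-64 regions: the region number is taken from the top three bits (63..61) of a 64-bit virtual address, so region 5 covers addresses starting at 0xa000000000000000. A minimal sketch of that check follows. The helper names are made up for illustration and are not kernel identifiers.

```python
REGION_SHIFT = 61           # IA-64: bits 63..61 of a virtual address select the region
MASK_64 = (1 << 64) - 1     # keep the value within 64 bits


def region_number(vaddr: int) -> int:
    """Return the region number (0-7) of a 64-bit virtual address."""
    return (vaddr & MASK_64) >> REGION_SHIFT


def is_region5(vaddr: int) -> bool:
    """True if the address falls in region 5 (hypothetical helper name)."""
    return region_number(vaddr) == 5
```

With this helper, an address such as 0xa000000000012345 is classified as region 5, while 0xe000000000000000 falls in region 7.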
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. The kernel will fail to boot if there is a lot of interrupt activity
handled by modules (vmalloc'd text).
Created attachment 103049
Patch handling not-present faults for region 5
Patch is straight from bkbits.
Suresh, pardon my ignorance here, but how does this happen? If the
kernel only performs atomic updates to the ptes (never clears one bit
at a time, leaving the pte in an inconsistent state), how does the
VHPT walker insert a TLB entry that's half-baked? If there are cases
where the pte is in some inconsistent/interim state, should we fix that?
Here is the failing sequence:
t0: On cpu1, while the kernel is servicing requests from driver
module A, the hardware VHPT walker inserts empty ptes (page not
present entries) around the module code address 'A' into the TLBs.
t1: On cpu0, as part of loading new module 'B', vmalloc_area_pages()
sets up the ptes for module 'B' in swapper_pg_dir without doing
flush_tlb_all(). (This is OK because we do flush_tlb_all() in
vmfree_area_pages().) But the module 'B' address happens to be the same
as that of the empty ptes (page not present entries) that got loaded
into cpu1's TLBs in step 't0' above.
t2: When the module 'B' code starts executing on cpu1, the
page not present entries in cpu1's TLB cause a page_not_present
fault. And as the page_fault handler doesn't handle faults in
region 5, it simply oopses.
As the page_not_present handler purges the corresponding not-present TLB
entry, the next page rewalk will succeed.
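The t0/t1/t2 sequence can be mimicked with a toy model. This is only an illustrative sketch, not kernel code and not the attached patch: the class and function names, the 16KB page size and the example address are assumptions, and the walker is simplified to cache exactly one page rather than the neighbouring entries.

```python
class Oops(Exception):
    """Stands in for a kernel oops."""


class Cpu:
    def __init__(self, page_table, handle_region5_not_present):
        self.page_table = page_table   # shared by all CPUs, like swapper_pg_dir
        self.tlb = {}                  # per-CPU cache: page -> present flag
        self.patched = handle_region5_not_present

    def walk(self, page):
        # Walker model: caches whatever the pte says, even "not present".
        self.tlb[page] = self.page_table.get(page, False)

    def access(self, page):
        if page not in self.tlb:
            self.walk(page)
        if self.tlb[page]:
            return "ok"
        # Page-not-present fault: the stale TLB entry is purged.
        del self.tlb[page]
        if self.patched and self.page_table.get(page, False):
            # Patched behaviour: return and retry; the rewalk sees the valid pte.
            return self.access(page)
        raise Oops("unhandled not-present fault")


def run_sequence(patched):
    page_table = {}
    cpu1 = Cpu(page_table, patched)
    page = 0xA000000000200000 >> 14    # hypothetical region 5 page (16KB pages assumed)
    cpu1.walk(page)                    # t0: cpu1 caches a not-present entry
    page_table[page] = True            # t1: cpu0 maps module 'B', no TLB flush
    try:
        return cpu1.access(page)       # t2: cpu1 starts executing module 'B'
    except Oops:
        return "oops"
```

In this model, run_sequence(False) ends in the oops path because the stale entry is hit and the fault is not handled, while run_sequence(True) succeeds because the purge followed by a retry picks up the pte that cpu0 installed at t1.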
Either my patch:
or Norm Murray's patch:
will address this issue. Norm's was generated from an LLNL IT, but
is identical except for the addition of a KERN_CRIT to the beginning
of a printk() in do_page_fault().
I can't access the above-mentioned post-office URL. Please let me
know if you need any more info or if you think the patch posted in
comment #1 isn't enough.
Hi, Suresh. The URLs in comment #5 are restricted to Red Hat.
A minor variation of your patch (due to a RHEL3 porting issue)
is on track for U4. I'll update this bug report when the patch
is committed (in the next day or two).
Thanks for isolating the problem and providing the patch.
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.6.EL).
An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.