From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

Description of problem:
This bug is fixed in the latest 2.4 and 2.6 bktrees. We can end up in the page fault handler even when the region 5 address is valid, because the VHPT walker inserted a not-present translation that has since become stale. Since the page fault handler in EL3 doesn't handle not-present translations for region 5, the kernel will oops.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. The kernel will fail to boot if there is a lot of interrupt activity handled by modules (vmalloc'd text).

Additional info:
Created attachment 103049
Patch handling not-present faults for region 5

The patch is straight from bkbits: http://linux.bkbits.net:8080/linux-2.4/gnupatch@3ec5621bdgHJtDWJBf1fOJP3ZZA8hA
Suresh, pardon my ignorance here, but how does this happen? If the kernel only performs atomic updates to the ptes (never clearing one bit at a time and leaving the pte in an inconsistent state), how does the VHPT walker insert a TLB entry that's half-baked? If there are cases where the pte is in some inconsistent/interim state, should we fix that instead? Thanks, Larry
Here is the failing sequence:

t0: On cpu1, while the kernel is servicing requests from driver module 'A', the hardware VHPT walker inserts the empty ptes (page-not-present entries) around the module 'A' code address into cpu1's TLB.

t1: On cpu0, as part of loading a new module 'B', vmalloc_area_pages() sets up the ptes for module 'B' in swapper_pg_dir without doing flush_tlb_all(). (This is OK because we do flush_tlb_all() in vmfree_area_pages().) But the module 'B' address happens to be the same as that of the empty ptes (page-not-present entries) that got loaded into cpu1's TLB in step t0 above.

t2: When the module 'B' code starts executing on cpu1, it gets a page_not_present fault because of the page-not-present entries in cpu1's TLB. And as the page fault handler doesn't handle faults in region 5, it simply oopses.

Since the page_not_present handler purges the corresponding not-present TLB entry, the next walk of the page will succeed.
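For readers following the sequence above, here is a minimal toy model of it in C. It is only a sketch under stated assumptions: every name in it (pte_model, tlb_model, vhpt_walk, map_page_no_flush, access_page, the handler_rechecks_pte flag) is hypothetical and made up for illustration. None of it is kernel code, and it is not the attached patch. It only shows why a cached not-present entry goes stale, and why a fault handler that rechecks the pte can simply return instead of oopsing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 8   /* hypothetical size of the modelled region 5 range */
#define NCPUS  2

/* Toy structures, not kernel types. */
typedef struct { bool present; } pte_model;

typedef struct {
    bool cached[NPAGES];   /* does this CPU's TLB hold an entry for the page? */
    bool present[NPAGES];  /* the present bit as copied at insertion time */
} tlb_model;

typedef enum { ACCESS_OK, ACCESS_OOPS } access_result;

pte_model page_table[NPAGES];  /* shared by all CPUs, like swapper_pg_dir */
tlb_model tlb[NCPUS];          /* one private TLB per CPU */

void reset_model(void)
{
    memset(page_table, 0, sizeof page_table);
    memset(tlb, 0, sizeof tlb);
}

/* Models the hardware walker: it caches whatever the pte says right now,
 * including a not-present entry (step t0). */
void vhpt_walk(int cpu, size_t page)
{
    assert(cpu >= 0 && cpu < NCPUS && page < NPAGES);
    tlb[cpu].cached[page] = true;
    tlb[cpu].present[page] = page_table[page].present;
}

/* Models step t1: the pte becomes present, but no TLB is flushed. */
void map_page_no_flush(size_t page)
{
    assert(page < NPAGES);
    page_table[page].present = true;
}

/* Models step t2: an access on one CPU. */
access_result access_page(int cpu, size_t page, bool handler_rechecks_pte)
{
    assert(cpu >= 0 && cpu < NCPUS && page < NPAGES);
    if (!tlb[cpu].cached[page])
        vhpt_walk(cpu, page);
    if (tlb[cpu].present[page])
        return ACCESS_OK;

    /* Not-present fault: the low-level handler purges the TLB entry. */
    tlb[cpu].cached[page] = false;

    /* Assumed shape of a fix: if the pte is valid by now, return and retry. */
    if (handler_rechecks_pte && page_table[page].present) {
        vhpt_walk(cpu, page);
        return ACCESS_OK;
    }
    return ACCESS_OOPS;
}
```

In this model, calling vhpt_walk(1, 3), then map_page_no_flush(3), then access_page(1, 3, false) reproduces the oops, while the same sequence with handler_rechecks_pte set to true succeeds. An address whose pte really is not present still ends in ACCESS_OOPS either way.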
OK. Larry
Either my patch: http://post-office.corp.redhat.com/archives/rhkernel-list/2004-August/msg00394.html or Norm Murray's patch: http://post-office.corp.redhat.com/archives/rhkernel-list/2004-August/msg00405.html will address this issue. Norm's was generated from an LLNL IT, but is identical except for the addition of a KERN_CRIT to the beginning of a printk() in do_page_fault().
I can't access the post-office URLs mentioned above. Please let me know if you need any more info or if you think the patch posted in comment #1 isn't enough.
Hi, Suresh. The URLs in comment #5 are restricted to Red Hat. A minor variation of your patch (due to a RHEL3 porting issue) is on track for U4. I'll update this bug report when the patch is committed (in the next day or two). Thanks for isolating the problem and providing the patch.
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.6.EL).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html