Bug 127515 (mark_clean)

Summary: undef ENABLE_MARK_CLEAN in arch/ia64/hp/common/sba_iommu.c?
Product: Red Hat Enterprise Linux 2.1
Reporter: Don Howard <dhoward>
Component: kernel
Assignee: Don Howard <dhoward>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 2.1
CC: cww, jparadis, ltroan, mike.miller, riel, tao
Target Milestone: ---   
Target Release: ---   
Hardware: ia64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-04-28 15:10:06 UTC Type: ---
Bug Blocks: 132992    
Attachments (description, flags):
untested backport of updated locking from 2.6 (flags: none)
Disable mark_clean in sba_iommu.c to avoid memory corruption. (flags: none)

Description Don Howard 2004-07-09 07:38:18 UTC
RHN: 
  Customer: BTI - Australia 
   Platform: ia64 (rx2600) with RHAS2.1 and 2.4.9-e.31smp /  
                 2.4.18-e.43smp 
  Host prod1 product id is 94ad-5abf-40b5-1119  
  Host prod2 product id is 40db-6f7b-b4b1-a274 
  RHN username is Worldcare 
- Quantity 2: MCT0219US RHEL 2.1 for IPF Level 3 (9-9 Monday-Friday) 
 
Problem: 
Context: two rx2600, RH EL AS 2.1, 2.4.9-e.31smp, cciss, in 
production for the past two or three weeks. Systems hang on heavy 
system load, no ping, no console response, no console output, 
management processor event log inconclusive, SysRq and serial 
console enabled. 
 
Regards, 
Dilip Daya. 
---------- 
Action by: ddaya 
Hi Chris, 
 
Latest update: 
 
1./ HP-Australia has involved Red Hat Support in Australia. Please 
confirm that you and your Australian Red Hat engineers are in 
sync on this issue. I was not given an issue number or a contact 
person within Red Hat/Australia. 
 
2./ The customer rx2600 system kernel hangs were occurring on 
2.4.18-e.31smp and on 2.4.18-e.43 (as shown in the gmiller.2380 
sysreport), but since then the customer has been willing to try an HP 
kernel change provided by their CCISS driver engineer. As of 
9:15am (Australian time) Tuesday, the customer reports that the systems 
have not hung with the 2.4.18-e.31.hp kernel and Oracle combination. 
The kernel modification and explanation from the CCISS driver 
developer within HP are as follows: 
 
--- The following is also related to Red Hat Issues: 37897 _and_  
34923 and this issue: 42071 ---- 
 
Even though this is an unsupported configuration, it is important that 
we try this workaround. If it proves to help as it did in the db lab, 
then we can go back to RH for a fix. 
 
So, I reviewed the source code: 
/usr/src/linux-2.4.18-e.41/arch/ia64/hp/common/sba_iommu.c 
 
--- 
 
Change line #43: 
 
 43 #define ENABLE_MARK_CLEAN 
 
To: 
 
43 #undef ENABLE_MARK_CLEAN 
 
--- 
 
...and recompile the kernel and test... 
 
--- 
 
=> HP-Australia made the kernel modification and provided the 
customer with "2.4.18-e.31.hp". Since running this kernel, the 
customer's systems have been up and fine without any kernel 
panics, hangs, or crashes. So, could RH Support Engineering review the 
above code modification and reply with any implications or another 
workaround?
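
For readers unfamiliar with this kind of compile-time switch, the following is a minimal user-space sketch of how a macro such as ENABLE_MARK_CLEAN can gate a call at build time. It is not the code from sba_iommu.c: struct demo_page, demo_mark_clean() and demo_dma_unmap() are hypothetical names invented for illustration, and the real driver may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Change to "#define ENABLE_MARK_CLEAN" to model the original build. */
#undef ENABLE_MARK_CLEAN

/* Hypothetical stand-in for per-page state; not the kernel's struct page. */
struct demo_page {
    bool arch_clean; /* models the "i-cache clean" page flag */
};

#ifdef ENABLE_MARK_CLEAN
/* Only compiled when the macro is defined. */
static void demo_mark_clean(struct demo_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].arch_clean = true;
}
#endif

/* Models the tail of a DMA unmap; returns how many pages ended up marked. */
size_t demo_dma_unmap(struct demo_page *pages, size_t n)
{
    size_t marked = 0;

    assert(pages != NULL);
#ifdef ENABLE_MARK_CLEAN
    demo_mark_clean(pages, n); /* call site disappears when the macro is undefined */
#endif
    for (size_t i = 0; i < n; i++)
        if (pages[i].arch_clean)
            marked++;
    return marked;
}
```

With the macro undefined, as in the proposed workaround, no page is ever marked clean in this model, so every executable mapping would take the normal flush path.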

Comment 1 Don Howard 2004-07-09 07:44:08 UTC
<0>Kernel panic: not continuing 
In interrupt handler - not syncing 
 <6>Syncing device 68:04 ... kernel BUG at sched.c:834! 
Unable to handle kernel NULL pointer dereference
cp[2095]: Oops 11003706212352 
--> schedule [kernel] 0x81 <-- 
 
Pid: 2095, comm:                   cp 
psr : 0000121008026018 ifs : 8000000000000813 ip  : 
[<e000000004470781>]    Not tainted 
unat: 0000000000000000 pfs : 0000000000000813 rsc : 0000000000000003 
rnat: 0000000000001000 bsps: e0000040dadf8000 pr  : 80000000ff615565 
ldrs: 0000000000000000 ccv : 000000007fffffff fpsr: 0009804c8a70033f 
b0  : e000000004470780 b6  : e0000000045e8d40 b7  : e00000000440e2b0 
f6  : 0fffbccccccccc8c00000 f7  : 0ffdca200000000000000 
f8  : 100028000000000000000 f9  : 10002a000000000000000 
r1  : e000000004bb2310 r2  : 00000000000051d7 r3  : e00000000485d2d5 
r8  : 000000000000001b r9  : 0000000000000000 r10 : 0000000000000000 
r11 : 80000000ff611a65 r12 : e0000040d674f970 r13 : e0000040d6748000 
r14 : 0000000000000000 r15 : e00000000495da20 r16 : e00000000495da08 
r17 : 0000000000000000 r18 : 0000000000000001 r19 : e000000004a13750 
r20 : e000000004a13748 r21 : e0000000049bdc58 r22 : 000000000000ffff 
r23 : 0000000000000000 r24 : 0000000000000058 r25 : 0000000000000059 
r26 : 000000000000005a r27 : 00000000000000e0 r28 : 0000000000000000 
r29 : 0000000000000001 r30 : 0000000000000005 r31 : 0000000000000894 
 
Call Trace: [<e000000004412d90>] sp=0xe0000040d674f560 
bsp=0xe0000040d67498d0 decoded to show_stack [kernel] 0x50 
[<e0000000044135c0>] sp=0xe0000040d674f720 bsp=0xe0000040d6749878 
decoded to show_regs [kernel] 0x7c0 
[<e00000000442c7e0>] sp=0xe0000040d674f740 bsp=0xe0000040d6749850 
decoded to die [kernel] 0x120 
[<e00000000444bc20>] sp=0xe0000040d674f740 bsp=0xe0000040d67497e8 
decoded to ia64_do_page_fault [kernel] 0x780 
[<e00000000440dce0>] sp=0xe0000040d674f7d0 bsp=0xe0000040d67497e8 
decoded to ia64_leave_kernel [kernel] 0x0 
[<e000000004470780>] sp=0xe0000040d674f970 bsp=0xe0000040d6749750 
decoded to schedule [kernel] 0x80 
[<e0000000044d9750>] sp=0xe0000040d674f980 bsp=0xe0000040d6749710 
decoded to __wait_on_buffer [kernel] 0xf0 
[<e0000000044dc800>] sp=0xe0000040d674f9b0 bsp=0xe0000040d67496e8 
decoded to bread [kernel] 0xe0 
[<e000000004540540>] sp=0xe0000040d674f9c0 bsp=0xe0000040d6749688 
decoded to ext2_update_inode [kernel] 0x2e0 
[<e000000004540cb0>] sp=0xe0000040d674f9d0 bsp=0xe0000040d6749668 
decoded to ext2_write_inode [kernel] 0x30 
[<e000000004507d60>] sp=0xe0000040d674f9d0 bsp=0xe0000040d6749600 
decoded to sync_inodes_sb [kernel] 0x2c0 
[<e000000004508740>] sp=0xe0000040d674f9d0 bsp=0xe0000040d67495e0 
decoded to sync_inodes [kernel] 0x60 
[<e0000000044da260>] sp=0xe0000040d674f9d0 bsp=0xe0000040d67495c8 
decoded to fsync_dev [kernel] 0x40 
[<e0000000045ef270>] sp=0xe0000040d674f9d0 bsp=0xe0000040d6749598 
decoded to go_sync [kernel] 0x370 
[<e0000000045ef460>] sp=0xe0000040d674f9e0 bsp=0xe0000040d6749568 
decoded to do_emergency_sync [kernel] 0x180 
[<e000000004478320>] sp=0xe0000040d674f9e0 bsp=0xe0000040d6749510 
decoded to panic [kernel] 0x300 
[<e00000000442c870>] sp=0xe0000040d674fa20 bsp=0xe0000040d67494e8 
decoded to die [kernel] 0x1b0 
[<e00000404764b580>] sp=0xe0000040d674fa20 bsp=0xe0000040d67494c0 
decoded to vlan_ioctl_hook_R570c4b11 [] 0x42c8dae0 
[<e00000404764b580>] sp=0xe0000040d674fa20 bsp=0xe0000040d6749498 
decoded to vlan_ioctl_hook_R570c4b11 [] 0x42c8dae0 
[<e00000404764b580>] sp=0xe0000040d674fa20 bsp=0xe0000040d6749470 
decoded to vlan_ioctl_hook_R570c4b11 [] 0x42c8dae0 
... 

Comment 2 Don Howard 2004-07-09 07:45:55 UTC
Unable to handle kernel paging request at virtual address 
3030203030303030 
kswapd[5]: Oops 8813272891392 
--> kmem_cache_reap [kernel] 0x570 <-- 
 
Pid: 5, comm:               kswapd 
psr : 0000101008022038 ifs : 8000000000000c1a ip  : 
[<e0000000044d0b50>]    Not tainted 
unat: 0000000000000000 pfs : 0000000000000c1a rsc : 0000000000000003 
rnat: 80000000ff602939 bsps: e0000000044141c0 pr  : 80000000ff602979 
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a74433f 
b0  : e0000000044d0970 b6  : e0000000044141c0 b7  : e00000000440d990 
f6  : 0fff6fffffffff0000000 f7  : 0ffe7b800000000000000 
f8  : 1000bb800000000000000 f9  : 100078000000000000000 
r1  : e000000004cf5760 r2  : 0000000000000000 r3  : e000004046a37d98 
r8  : 0000000000000017 r9  : ffffffffffffffff r10 : 0000000000000000 
r11 : 0000000000000a98 r12 : e000004046a37e20 r13 : e000004046a30000 
r14 : 3030203030303030 r15 : e00000404722b210 r16 : 0000000000000032 
r17 : 000000000000242b r18 : e0000040d3608008 r19 : 0000000000000000 
r20 : 0000000000000000 r21 : 0000000066666667 r22 : 0000000000000000 
r23 : 0000000000000000 r24 : ffffffffffff781a r25 : e0000040fef68058 
r26 : 0000000000000000 r27 : e000004046a37e30 r28 : e000004046a37e38 
r29 : 0000000000000001 r30 : 0000000000000000 r31 : 0000000000000000 
 
Call Trace: [<e000000004414910>] sp=0xe000004046a37a10 
bsp=0xe000004046a313a0 
decoded to show_stack [kernel] 0x50 
[<e000000004415140>] sp=0xe000004046a37bd0 bsp=0xe000004046a31348 
decoded to show_regs [kernel] 0x7c0 
[<e00000000442fad0>] sp=0xe000004046a37bf0 bsp=0xe000004046a31320 
decoded to die [kernel] 0x190 
[<e000000004452580>] sp=0xe000004046a37bf0 bsp=0xe000004046a312c0 
decoded to ia64_do_page_fault [kernel] 0x780 
[<e00000000440df20>] sp=0xe000004046a37c80 bsp=0xe000004046a312c0 
decoded to ia64_leave_kernel [kernel] 0x0 
[<e0000000044d0b50>] sp=0xe000004046a37e20 bsp=0xe000004046a311e8 
decoded to kmem_cache_reap [kernel] 0x570 
[<e0000000044d8820>] sp=0xe000004046a37e30 bsp=0xe000004046a311c8 
decoded to do_try_to_free_pages [kernel] 0xa0 
[<e0000000044d9150>] sp=0xe000004046a37e30 bsp=0xe000004046a311a0 
decoded to kswapd [kernel] 0x330 
[<e000000004415f30>] sp=0xe000004046a37e50 bsp=0xe000004046a31168 
decoded to arch_kernel_thread [kernel] 0x70 
[<e000000004484010>] sp=0xe000004046a37e50 bsp=0xe000004046a31138 
decoded to kernel_thread [kernel] 0xd0 
[<e0000000048caf30>] sp=0xe000004046a37e50 bsp=0xe000004046a31128 
decoded to kswapd_init [kernel] 0x50 
... 
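
One observation on the oops above, offered as an inference rather than something stated in this report: the faulting address 3030203030303030 (also the value in r14) consists entirely of byte values that are printable ASCII, a pattern that often suggests text data has overwritten a pointer. The small helper below decodes such a value; demo_value_to_ascii() is a hypothetical name, and the byte order assumes a little-endian memory layout, which is how Linux on ia64 normally runs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write the 8 bytes of `value` into `out` as characters, lowest byte first
 * (the order they would occupy in memory on a little-endian system).
 * Non-printable bytes become '.'. `out` must hold at least 9 chars.
 */
void demo_value_to_ascii(uint64_t value, char out[9])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        unsigned char byte = (unsigned char)((value >> (8 * i)) & 0xff);
        out[i] = isprint(byte) ? (char)byte : '.';
    }
    out[8] = '\0';
}
```

Passing 0x3030203030303030 yields the string "00000 00": seven '0' characters and one space.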

Comment 3 Larry Troan 2004-07-13 19:13:04 UTC
Opening bug to HP per Summer to request help from HP Engineering.

Comment 4 Larry Troan 2004-07-14 13:49:04 UTC
Reference Issue Trackers 42071 (HP L3 escalation), 44090 (HP-IPF)

Comment 5 Mike Miller (OS Dev) 2004-07-27 23:41:00 UTC
ia64 i-caches are not coherent with respect to processor stores.  So 
in general, when mapping an executable page, we have to flush the 
i-cache to avoid executing stale instructions.  This flush normally 
happens in update_mmu_cache(). 
 
However, the i-cache IS coherent with respect to DMA.  So if we DMA 
over an entire page and subsequently map it as executable, we can 
skip the flush.  mark_clean() performs this optimization by setting 
the PG_arch_1 bit.  update_mmu_cache() skips the i-cache flush if 
PG_arch_1 is set. 
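
The interaction described above can be modelled in user space roughly as follows. This is only a sketch: struct demo_exec_page and the demo_* functions are hypothetical stand-ins, a counter replaces the real i-cache flush, and the actual kernel code operates on struct page flags and PTEs rather than on a toy struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: one flag modelling PG_arch_1 plus a flush counter for observation. */
struct demo_exec_page {
    bool arch_1;         /* true: i-cache is known to be coherent for this page */
    int  icache_flushes; /* how many (simulated) i-cache flushes were performed */
};

/* Models mark_clean(): a DMA over the whole page leaves the i-cache coherent. */
void demo_mark_clean(struct demo_exec_page *pg)
{
    assert(pg != NULL);
    pg->arch_1 = true;
}

/* Models the check made when a page is mapped, as update_mmu_cache() does. */
void demo_update_mmu_cache(struct demo_exec_page *pg, bool executable)
{
    assert(pg != NULL);
    if (!executable)
        return;            /* non-executable mappings never need the flush */
    if (pg->arch_1)
        return;            /* already marked clean: skip the flush */
    pg->icache_flushes++;  /* stands in for the real i-cache flush */
    pg->arch_1 = true;     /* remember that the page is clean now */
}
```

In this model a page that went through demo_mark_clean() is mapped executable with zero flushes, while an unmarked page pays for exactly one flush on its first executable mapping.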
 
I expect that the effectiveness of this optimization depends on the 
percentage of DMA-read pages that are subsequently mapped executable. 
If very few of them are ever executed (as is probably the case for 
Oracle), the time spent doing mark_clean() is wasted. 
 
On the other hand, if we're often reading executable pages from the 
disk, we can do a lot of mark_clean()s for the cost of a cache flush, 
so it's probably a win overall. 
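
The trade-off in the last two paragraphs can be written as a simple break-even test. This is a rough model with made-up parameter names, not a measurement: exec_fraction is the share of DMA-read pages later mapped executable, and the two costs are in whatever common unit is convenient (cycles, for instance). No real cost figures are implied.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Every DMA-read page pays cost_mark_clean; only the fraction that is later
 * mapped executable saves cost_flush. Marking pays off when the expected
 * saving per page exceeds the per-page marking cost.
 */
bool demo_mark_clean_pays_off(double exec_fraction,
                              double cost_mark_clean,
                              double cost_flush)
{
    assert(exec_fraction >= 0.0 && exec_fraction <= 1.0);
    return exec_fraction * cost_flush > cost_mark_clean;
}
```

With a flush assumed to cost 100 times as much as a mark, marking pays off once more than 1% of DMA-read pages are executed; a workload that executes almost none of them falls below that line.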
 
The system should operate correctly either with or without 
mark_clean().  For general-purpose use, I think we want to keep it, 
but it might be worthwhile to consider a tunable for systems where 
almost all DMA reads are for non-executable data. 
 
While looking at the code, I noticed that RHEL3 U3 calls mark_clean() 
while holding ioc->res_lock, which is not needed (this is already 
fixed in 2.6).  Before adding a tunable, I'd propose moving the 
mark_clean() call outside the critical section to make sure it's not 
just a lock contention problem they're seeing. 
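
The reordering proposed above can be pictured with the following user-space sketch. It is an assumption-laden model, not the driver code: struct demo_ioc, its fields and all demo_* functions are hypothetical, a boolean stands in for the spinlock, and a counter records how often the mark-clean step ran while the lock was held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_ioc {
    bool   res_lock_held;    /* models ioc->res_lock being held */
    size_t free_entries;     /* resource-map state the lock protects */
    size_t marks_under_lock; /* mark-clean calls made while the lock was held */
};

static void demo_lock(struct demo_ioc *ioc)
{
    assert(!ioc->res_lock_held);
    ioc->res_lock_held = true;
}

static void demo_unlock(struct demo_ioc *ioc)
{
    assert(ioc->res_lock_held);
    ioc->res_lock_held = false;
}

/* Stand-in for mark_clean(); it does not touch the resource map at all. */
static void demo_mark_clean(struct demo_ioc *ioc)
{
    if (ioc->res_lock_held)
        ioc->marks_under_lock++;
}

/* Ordering as described for RHEL3 U3: mark-clean inside the critical section. */
void demo_unmap_inside(struct demo_ioc *ioc)
{
    demo_lock(ioc);
    ioc->free_entries++;   /* release the mapping's resource-map entry */
    demo_mark_clean(ioc);
    demo_unlock(ioc);
}

/* Proposed ordering: only the resource-map update is done under the lock. */
void demo_unmap_outside(struct demo_ioc *ioc)
{
    demo_lock(ioc);
    ioc->free_entries++;
    demo_unlock(ioc);
    demo_mark_clean(ioc);  /* no shared state touched, so no lock needed */
}
```

Both variants leave the resource map in the same state; the second simply shortens the time the lock is held, which is what would matter if contention were the problem.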
 
Hope this helps. 

Comment 13 Bastien Nocera 2004-12-06 12:34:26 UTC
Do we have any preliminary patches for us to build test kernels with,
for the customer to test, apart from Don's untested patch?

Comment 14 Don Howard 2004-12-07 07:26:55 UTC
Created attachment 108021 [details]
Disable mark_clean in sba_iommu.c to avoid memory corruption.

Comment 16 Don Howard 2005-01-26 17:19:38 UTC
I am currently working to get this included in the U7 update.

Comment 17 Don Howard 2005-02-01 07:18:31 UTC
This seems interesting: derry isn't specifying GFP_DMA when allocating pages
that are used for exactly that purpose: DMA.  I would expect I/O failures rather
than the hangs and oopses described, so I'm unsure whether it relates here.

[snipped from derry vs taroon diff of sba_iommu.c]

@@ -947,7 +972,7 @@ sba_alloc_consistent(struct pci_dev *hwd
                return 0;
        }

-        ret = (void *) __get_free_pages(GFP_ATOMIC, get_order(size));
+       ret = (void *) __get_free_pages(GFP_ATOMIC|GFP_DMA, get_order(size));

        if (ret) {
                memset(ret, 0, size);
@

Comment 20 Don Howard 2005-02-09 18:01:54 UTC
Larry Woodman pointed out to me that omission of GFP_DMA could
potentially cause trouble on ia64 machines that lack an iommu, but
most (all?) of the reports I've seen of this are on hp hardware that
has an iommu.


Comment 21 Larry Troan 2005-03-01 20:33:48 UTC
Is this IN or OUT of U7?

Comment 22 Don Howard 2005-03-01 21:33:23 UTC
The patch that disables mark_clean() is in U7.

Comment 23 Jim Paradis 2005-03-03 04:22:32 UTC
A fix for this problem has just been committed to the RHEL2.1 U7
patch pool this evening (in kernel version 2.4.18-54.1).


Comment 24 Jim Paradis 2005-03-03 04:25:46 UTC
Make that kernel version 2.4.18-55...

Comment 25 John Flanagan 2005-04-28 15:10:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-284.html