Bug 236087

Summary: GFS2: mmap problems with distributed test cases
Product: Red Hat Enterprise Linux 5
Reporter: Nate Straz <nstraz>
Component: kernel
Assignee: Don Zickus <dzickus>
Status: CLOSED ERRATA
QA Contact: Dean Jansa <djansa>
Severity: medium
Docs Contact:
Priority: high
Version: 5.0
CC: jbacik, kanderso, lwang, swhiteho
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0959 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-11-07 19:46:35 UTC Type: ---
Attachments (description, flags):
  Patch that was applied to gfs2 when we ran this test. (flags: none)
  Patch to attempt to fix the problem encountered (flags: none)
  Patch for RHEL 5.1 (flags: none)

Description Nate Straz 2007-04-11 21:30:35 UTC
Description of problem:

While running dd_io on an experimental upstream kernel, the d_mmap* test cases
caused kernel panics.


Version-Release number of selected component (if applicable):
2.6.21-rc6

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.
  
Actual results:
BUG: unable to handle kernel paging request at virtual address 40000000
 printing eip:
e0534bee
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: lock_dlm gfs2 dlm configfs lpfc
CPU:    0
EIP:    0060:[<e0534bee>]    Not tainted VLI
EFLAGS: 00010293   (2.6.21-rc6 #2)
EIP is at gfs2_readpage+0x65/0x17a [gfs2]
eax: 000016cf   ebx: cafa607c   ecx: 40000000   edx: 40000000
esi: 00000480   edi: dce82740   ebp: cabbd030   esp: cae49e90
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process d_doio (pid: 5839, ti=cae48000 task=cabd15b0 task.ti=cae48000)
Stack: 40000000 c11428c0 cad7b000 cabbd0d4 00000000 00000000 c11321e0 cabbd0d4 
       dce82740 c0137404 00000480 cabbd0d4 c11428c0 00000480 dce82740 cabbd0d4 
       c0139441 00000020 0004f9c4 00000000 00000000 cae49f60 d372cf44 dce82788 
Call Trace:
 [<c0137404>] page_cache_read+0x97/0xa0
 [<c0139441>] filemap_nopage+0x231/0x2f7
 [<c0141b36>] __handle_mm_fault+0x15d/0x7c9
 [<c0105098>] do_IRQ+0x7e/0x92
 [<c040a73d>] __sched_text_start+0x715/0x7c4
 [<c040d418>] do_page_fault+0x213/0x510
 [<c040d205>] do_page_fault+0x0/0x510
 [<c040bebc>] error_code+0x7c/0x84
 [<c0400000>] rpcauth_marshcred+0x39/0x52
 =======================



Expected results:


Additional info:

Comment 1 Nate Straz 2007-04-11 21:33:37 UTC
A secondary oops on a different node:

BUG: unable to handle kernel paging request at virtual address 40000001
 printing eip:
e05fabee
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: lock_dlm gfs2 dlm configfs qla2xxx
CPU:    0
EIP:    0060:[<e05fabee>]    Not tainted VLI
EFLAGS: 00010297   (2.6.21-rc6 #2)
EIP is at gfs2_readpage+0x65/0x17a [gfs2]
eax: 000015d8   ebx: ca8ee07c   ecx: 40000001   edx: 40000001
esi: c115e7a0   edi: 0001e9ec   ebp: cafa2030   esp: cb045dcc
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process d_doio (pid: 5592, ti=cb044000 task=cb297a30 task.ti=cb044000)
Stack: 000004de c115e7a0 d18c2000 c115e7a0 c115e7a0 cafa20d4 00000000 c0137335 
       000201d2 c115e7a0 0001e9ec 00000000 000201d2 c115e7a0 0001e9ec 000004de 
       c0137866 000000d0 00000020 df70dec0 df70df08 cafa20d4 cafa2030 00001387 
Call Trace:
 [<c0137335>] add_to_page_cache+0x60/0x70
 [<c0137866>] do_generic_mapping_read+0x209/0x43b
 [<c013967a>] generic_file_aio_read+0x173/0x1a5
 [<c0136f78>] file_read_actor+0x0/0xd1
 [<c015165b>] do_sync_read+0xc7/0x10a
 [<c01297a5>] autoremove_wake_function+0x0/0x35
 [<c0151594>] do_sync_read+0x0/0x10a
 [<c0151dbe>] vfs_read+0x88/0x10a
 [<c01521bc>] sys_read+0x41/0x67
 [<c01030d8>] sysenter_past_esp+0x5d/0x81
 =======================
Code: 85 d1 00 00 00 8b 9d 90 01 00 00 8d 43 1c e8 38 10 e1 df 8b 53 38 eb 13 64
a1 08 00 00 00 8b 80 a4 00 00 00 39 42 0c 74 4f 89 ca <8b> 0a 0f 18 01 90 8d 43
38 39 c2 75 e0 31 c0 c6 43 1c 01 85 c0 
EIP: [<e05fabee>] gfs2_readpage+0x65/0x17a [gfs2] SS:ESP 0068:cb045dcc
dlm: connecting to 1

Comment 2 Abhijith Das 2007-04-11 21:37:31 UTC
Created attachment 152323 [details]
Patch that was applied to gfs2 when we ran this test.

Comment 3 Nate Straz 2007-04-12 17:04:11 UTC
We reverted the portion of the patch from gfs2_readpage, but were still able
to hit panics with d_mmap1. 

kernel BUG at fs/gfs2/ops_address.c:200!
invalid opcode: 0000 [#1]
SMP 
Modules linked in: qla2xxx lock_dlm gfs2 dlm configfs
CPU:    0
EIP:    0060:[<e0533336>]    Not tainted VLI
EFLAGS: 00010202   (2.6.21-rc6 #2)
EIP is at stuffed_readpage+0x15/0xe4 [gfs2]
eax: c7a91678   ebx: c7a91678   ecx: 0000018a   edx: c134f0e0
esi: 00000000   edi: c134f0e0   ebp: c66b1dd8   esp: c66b1d9c
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process d_doio (pid: 4167, ti=c66b0000 task=c15ebab0 task.ti=c66b0000)
Stack: 00000000 00000003 00000000 0000019d c134f0e0 c7a91678 dcac7a30 c7a91678 
       00000000 c134f0e0 c66b1dd8 e0533992 c66b1dd8 00000001 dc8ed000 c7331dd4 
       c7331dd4 c7331d9c 00001047 00000003 00000202 00000000 000000c2 e0533960 
Call Trace:
 [<e0533992>] gfs2_readpage+0x83/0xef [gfs2]
 [<e0533960>] gfs2_readpage+0x51/0xef [gfs2]
 [<c0137866>] do_generic_mapping_read+0x209/0x43b
 [<c013967a>] generic_file_aio_read+0x173/0x1a5
 [<c0136f78>] file_read_actor+0x0/0xd1
 [<c015165b>] do_sync_read+0xc7/0x10a
 [<c01297a5>] autoremove_wake_function+0x0/0x35
 [<c0151594>] do_sync_read+0x0/0x10a
 [<c0151dbe>] vfs_read+0x88/0x10a
 [<c01521bc>] sys_read+0x41/0x67
 [<c01030d8>] sysenter_past_esp+0x5d/0x81
 [<c0400000>] rpcauth_marshcred+0x39/0x52
 =======================
Code: ee 0f 95 c2 85 db 0f 94 c0 01 fb 08 c2 75 cb 5b 5e 5b 5e 5f 5d c3 55 57 56
53 83 ec 1c 89 44 24 14 89 54 24 10 83 7a 14 00 74 04 <0f> 0b eb fe 8b 44 24 14
8b 90 48 01 00 00 8b 88 4c 01 00 00 8d 
EIP: [<e0533336>] stuffed_readpage+0x15/0xe4 [gfs2] SS:ESP 0068:c66b1d9c

Comment 4 RHEL Program Management 2007-04-13 18:45:34 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Steve Whitehouse 2007-04-19 08:35:19 UTC
It's interesting to note that Ken had left a case to cover this particular
occurrence in the code, which was then removed:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=61057c6bb3a3d14cf2bea6ca20dc6d367e1d852e

The remaining question, then, is why this is apparently a valid code path, since
we'd expect all inodes to be converted from being stuffed upon mmap().


Comment 6 Steve Whitehouse 2007-04-19 16:13:30 UTC
Now I see what's happening. It's related to the ordering in the page fault path.
We call the VFS's nopage (which calls gfs2_readpage()) before we add the buffers
to the page (since before readpage, there is no page). As a result, if the first
page fault on a stuffed file which has been extended is above the initial page
mark, then we can come down this code path.

So we need to add back Ken's fix for this, but with an appropriate
flush_dcache_page() for the less coherent architectures.
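
For illustration only, the access pattern described above can be sketched from
user space roughly as follows. This is not the d_mmap test case and it has not
been confirmed as a reproducer; the function name and the choice of
ftruncate() as the extension step are assumptions made for the sketch, and
whether the inode is still stuffed at fault time depends on GFS2 internals.
It only shows the ordering: a small file, an extension, and a first page
fault that lands above page 0.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and approach are assumptions, not from the
 * test suite). Creates a tiny file, extends it to `pages` pages, maps it,
 * and takes the FIRST page fault on the last page rather than page 0.
 * Returns the byte read there (0 for a hole) or -1 on any failure.
 * The scratch file at `path` is removed before returning.
 */
int first_fault_above_page_zero(const char *path, long pages)
{
    long psz = sysconf(_SC_PAGESIZE);
    if (psz <= 0 || pages < 2)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int result = -1;
    off_t len = (off_t)psz * pages;

    /* A few bytes only, so the file starts out very small. */
    if (write(fd, "stuffed", 7) == 7 && ftruncate(fd, len) == 0) {
        unsigned char *map = mmap(NULL, (size_t)len, PROT_READ,
                                  MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            /* First touch is on the last page; page 0 is never read. */
            result = ((volatile unsigned char *)map)[len - 1];
            munmap(map, (size_t)len);
        }
    }

    close(fd);
    unlink(path);
    return result;
}
```

On a filesystem that handles this path correctly, a call such as
first_fault_above_page_zero("/mnt/gfs2/scratch", 4) (the path is a
placeholder) would be expected to return 0, since the touched page lies in
the zero-filled extension.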


Comment 7 Steve Whitehouse 2007-04-19 16:24:34 UTC
Created attachment 153023 [details]
Patch to attempt to fix the problem encountered


When testing this patch, the one from bz #236039 should also be applied (i.e.
the cleaned up version of the other patch attached to this bug). I'll push this
patch upstream shortly.

Josef, Nate, is one of you in a position to give this a test run with the QE
test suite?

I'm pretty sure this is the right fix.

Comment 8 Steve Whitehouse 2007-04-24 10:59:36 UTC
Patch has been pushed upstream now into the -nmw git tree.


Comment 9 Steve Whitehouse 2007-05-10 10:23:50 UTC
Created attachment 154462 [details]
Patch for RHEL 5.1

The attached patch takes RHEL 5.1 up to the same level as upstream.

Again, this is one I'd like to get in, even if we need further changes later on,
since Don is off on holiday shortly. Please open another bug if we need more
changes in this area, rather than getting this one back from POST.

Comment 10 Don Zickus 2007-05-11 22:07:26 UTC
in 2.6.18-19.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Nate Straz 2007-07-03 22:18:34 UTC
I have not been able to hit this with recent builds (-32).

Comment 12 Don Zickus 2007-07-13 17:37:04 UTC
moving to MODIFIED for errata tool
QE note this bz has already been verified.

Comment 15 errata-xmlrpc 2007-11-07 19:46:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html