236087 – GFS2: mmap problems with distributed test cases

Summary: GFS2: mmap problems with distributed test cases

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Don Zickus
QA Contact:	Dean Jansa
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-04-11 21:30 UTC by Nate Straz
Modified:	2007-11-30 22:07 UTC
CC List:	4 users
Fixed In Version:	RHBA-2007-0959
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-07 19:46:35 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Patch that was applied to gfs2 when we ran this test. (1.90 KB, patch) 2007-04-11 21:37 UTC, Abhijith Das
Patch to attempt to fix the problem encountered (1.06 KB, patch) 2007-04-19 16:24 UTC, Steve Whitehouse
Patch for RHEL 5.1 (1.61 KB, patch) 2007-05-10 10:23 UTC, Steve Whitehouse

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2007:0959	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5 Update 1	2007-11-08 00:47:37 UTC

Description Nate Straz 2007-04-11 21:30:35 UTC

Description of problem:

While running dd_io on an experimental upstream kernel the d_mmap* test cases
were causing panics.


Version-Release number of selected component (if applicable):
2.6.21-rc6

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.
  
Actual results:
BUG: unable to handle kernel paging request at virtual address 40000000
 printing eip:
e0534bee
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: lock_dlm gfs2 dlm configfs lpfc
CPU:    0
EIP:    0060:[<e0534bee>]    Not tainted VLI
EFLAGS: 00010293   (2.6.21-rc6 #2)
EIP is at gfs2_readpage+0x65/0x17a [gfs2]
eax: 000016cf   ebx: cafa607c   ecx: 40000000   edx: 40000000
esi: 00000480   edi: dce82740   ebp: cabbd030   esp: cae49e90
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process d_doio (pid: 5839, ti=cae48000 task=cabd15b0 task.ti=cae48000)
Stack: 40000000 c11428c0 cad7b000 cabbd0d4 00000000 00000000 c11321e0 cabbd0d4 
       dce82740 c0137404 00000480 cabbd0d4 c11428c0 00000480 dce82740 cabbd0d4 
       c0139441 00000020 0004f9c4 00000000 00000000 cae49f60 d372cf44 dce82788 
Call Trace:
 [<c0137404>] page_cache_read+0x97/0xa0
 [<c0139441>] filemap_nopage+0x231/0x2f7
 [<c0141b36>] __handle_mm_fault+0x15d/0x7c9
 [<c0105098>] do_IRQ+0x7e/0x92
 [<c040a73d>] __sched_text_start+0x715/0x7c4
 [<c040d418>] do_page_fault+0x213/0x510
 [<c040d205>] do_page_fault+0x0/0x510
 [<c040bebc>] error_code+0x7c/0x84
 [<c0400000>] rpcauth_marshcred+0x39/0x52
 =======================



Expected results:


Additional info:

Comment 1 Nate Straz 2007-04-11 21:33:37 UTC

A secondary oops on a different node:

BUG: unable to handle kernel paging request at virtual address 40000001
 printing eip:
e05fabee
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: lock_dlm gfs2 dlm configfs qla2xxx
CPU:    0
EIP:    0060:[<e05fabee>]    Not tainted VLI
EFLAGS: 00010297   (2.6.21-rc6 #2)
EIP is at gfs2_readpage+0x65/0x17a [gfs2]
eax: 000015d8   ebx: ca8ee07c   ecx: 40000001   edx: 40000001
esi: c115e7a0   edi: 0001e9ec   ebp: cafa2030   esp: cb045dcc
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process d_doio (pid: 5592, ti=cb044000 task=cb297a30 task.ti=cb044000)
Stack: 000004de c115e7a0 d18c2000 c115e7a0 c115e7a0 cafa20d4 00000000 c0137335 
       000201d2 c115e7a0 0001e9ec 00000000 000201d2 c115e7a0 0001e9ec 000004de 
       c0137866 000000d0 00000020 df70dec0 df70df08 cafa20d4 cafa2030 00001387 
Call Trace:
 [<c0137335>] add_to_page_cache+0x60/0x70
 [<c0137866>] do_generic_mapping_read+0x209/0x43b
 [<c013967a>] generic_file_aio_read+0x173/0x1a5
 [<c0136f78>] file_read_actor+0x0/0xd1
 [<c015165b>] do_sync_read+0xc7/0x10a
 [<c01297a5>] autoremove_wake_function+0x0/0x35
 [<c0151594>] do_sync_read+0x0/0x10a
 [<c0151dbe>] vfs_read+0x88/0x10a
 [<c01521bc>] sys_read+0x41/0x67
 [<c01030d8>] sysenter_past_esp+0x5d/0x81
 =======================
Code: 85 d1 00 00 00 8b 9d 90 01 00 00 8d 43 1c e8 38 10 e1 df 8b 53 38 eb 13 64
a1 08 00 00 00 8b 80 a4 00 00 00 39 42 0c 74 4f 89 ca <8b> 0a 0f 18 01 90 8d 43
38 39 c2 75 e0 31 c0 c6 43 1c 01 85 c0 
EIP: [<e05fabee>] gfs2_readpage+0x65/0x17a [gfs2] SS:ESP 0068:cb045dcc
dlm: connecting to 1

Comment 2 Abhijith Das 2007-04-11 21:37:31 UTC

Created attachment 152323
Patch that was applied to gfs2 when we ran this test.

Comment 3 Nate Straz 2007-04-12 17:04:11 UTC

We reverted the portion of the patch from gfs2_readpage, but were still able
to hit panics with d_mmap1. 

kernel BUG at fs/gfs2/ops_address.c:200!
invalid opcode: 0000 [#1]
SMP 
Modules linked in: qla2xxx lock_dlm gfs2 dlm configfs
CPU:    0
EIP:    0060:[<e0533336>]    Not tainted VLI
EFLAGS: 00010202   (2.6.21-rc6 #2)
EIP is at stuffed_readpage+0x15/0xe4 [gfs2]
eax: c7a91678   ebx: c7a91678   ecx: 0000018a   edx: c134f0e0
esi: 00000000   edi: c134f0e0   ebp: c66b1dd8   esp: c66b1d9c
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process d_doio (pid: 4167, ti=c66b0000 task=c15ebab0 task.ti=c66b0000)
Stack: 00000000 00000003 00000000 0000019d c134f0e0 c7a91678 dcac7a30 c7a91678 
       00000000 c134f0e0 c66b1dd8 e0533992 c66b1dd8 00000001 dc8ed000 c7331dd4 
       c7331dd4 c7331d9c 00001047 00000003 00000202 00000000 000000c2 e0533960 
Call Trace:
 [<e0533992>] gfs2_readpage+0x83/0xef [gfs2]
 [<e0533960>] gfs2_readpage+0x51/0xef [gfs2]
 [<c0137866>] do_generic_mapping_read+0x209/0x43b
 [<c013967a>] generic_file_aio_read+0x173/0x1a5
 [<c0136f78>] file_read_actor+0x0/0xd1
 [<c015165b>] do_sync_read+0xc7/0x10a
 [<c01297a5>] autoremove_wake_function+0x0/0x35
 [<c0151594>] do_sync_read+0x0/0x10a
 [<c0151dbe>] vfs_read+0x88/0x10a
 [<c01521bc>] sys_read+0x41/0x67
 [<c01030d8>] sysenter_past_esp+0x5d/0x81
 [<c0400000>] rpcauth_marshcred+0x39/0x52
 =======================
Code: ee 0f 95 c2 85 db 0f 94 c0 01 fb 08 c2 75 cb 5b 5e 5b 5e 5f 5d c3 55 57 56
53 83 ec 1c 89 44 24 14 89 54 24 10 83 7a 14 00 74 04 <0f> 0b eb fe 8b 44 24 14
8b 90 48 01 00 00 8b 88 4c 01 00 00 8d 
EIP: [<e0533336>] stuffed_readpage+0x15/0xe4 [gfs2] SS:ESP 0068:c66b1d9c

Comment 4 RHEL Program Management 2007-04-13 18:45:34 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Steve Whitehouse 2007-04-19 08:35:19 UTC

It's interesting to note that Ken had left a case to cover this particular
occurrence in the code, which was then removed:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=61057c6bb3a3d14cf2bea6ca20dc6d367e1d852e

The remaining question, then, is why this is apparently a valid code path, since
we'd expect all inodes to be converted from being stuffed upon mmap().

Comment 6 Steve Whitehouse 2007-04-19 16:13:30 UTC

Now I see what's happening. It's related to the ordering in the page fault path.
We call the VFS's nopage (which calls gfs2_readpage()) before we add the buffers
to the page (since before readpage, there is no page). As a result, if the first
page fault on a stuffed file which has been extended lands beyond the first
page, then we can come down this code path.

So we need to add back Ken's fix for this, but with an appropriate
flush_dcache_page() for the less coherent architectures.
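
For anyone following along, here is a rough userspace model of the behaviour described above. It is only a sketch and not the attached patch: struct model_page, struct model_inode, MODEL_PAGE_SIZE and model_stuffed_readpage() are hypothetical stand-ins invented for illustration, and the real kernel code works on struct page and would call flush_dcache_page() where the comment indicates. The idea it shows is that a readpage request for a stuffed file at a non-zero page index is answered with a zero-filled, up-to-date page instead of being treated as a bug.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u  /* assumed stand-in for the kernel page size */

/* Hypothetical userspace stand-in for the kernel's struct page. */
struct model_page {
    unsigned long index;                  /* page offset within the file */
    int uptodate;                         /* models the page's uptodate flag */
    unsigned char data[MODEL_PAGE_SIZE];
};

/* Hypothetical stand-in for a stuffed inode (data held inline). */
struct model_inode {
    const unsigned char *inline_data;
    size_t size;                          /* inline bytes, at most one page */
};

/*
 * Model of a stuffed readpage that tolerates a non-zero page index.
 * Returns 0 on success, like a readpage operation.
 */
int model_stuffed_readpage(const struct model_inode *ip,
                           struct model_page *page)
{
    /* Stuffed data always fits inside the first page. */
    assert(ip->size <= MODEL_PAGE_SIZE);

    if (page->index != 0) {
        /*
         * The file was extended but not yet unstuffed when the fault or
         * read arrived: there is no data beyond page 0, so hand back zeros.
         */
        memset(page->data, 0, MODEL_PAGE_SIZE);
        /* Kernel code would call flush_dcache_page() here. */
        page->uptodate = 1;
        return 0;
    }

    /* Page 0: copy the inline data and zero the remainder of the page. */
    memcpy(page->data, ip->inline_data, ip->size);
    memset(page->data + ip->size, 0, MODEL_PAGE_SIZE - ip->size);
    page->uptodate = 1;
    return 0;
}
```

Calling model_stuffed_readpage() with index set to 1 gives back a page of zeros marked up to date, while index 0 gives the inline data followed by zero padding.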

Comment 7 Steve Whitehouse 2007-04-19 16:24:34 UTC

Created attachment 153023
Patch to attempt to fix the problem encountered


When testing this patch, the one from bz #236039 should also be applied (i.e.
the cleaned up version of the other patch attached to this bug). I'll push this
patch upstream shortly.

Josef, Nate, is one of you in a position to give this a test run with the QE
test suite?

I'm pretty sure this is the right fix.

Comment 8 Steve Whitehouse 2007-04-24 10:59:36 UTC

Patch has been pushed upstream now into the -nmw git tree.

Comment 9 Steve Whitehouse 2007-05-10 10:23:50 UTC

Created attachment 154462
Patch for RHEL 5.1

The attached patch takes RHEL 5.1 up to the same level as upstream.

Again, this is one I'd like to get in even if we need further changes later on,
since Don is off on holiday shortly. If we need more changes in this area,
please open another bug rather than moving this one back from POST.

Comment 10 Don Zickus 2007-05-11 22:07:26 UTC

in 2.6.18-19.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Nate Straz 2007-07-03 22:18:34 UTC

I have not been able to hit this with recent builds (-32).

Comment 12 Don Zickus 2007-07-13 17:37:04 UTC

Moving to MODIFIED for the errata tool.
QE: note that this bz has already been verified.

Comment 15 errata-xmlrpc 2007-11-07 19:46:35 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
