When using gfs2 as the rootfs, I get the following oops, I believe on mount. This is with 2.6.17-1.2510.fc6xen on x86_64 as a domU:

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/gfs2/glock.c:1173
invalid opcode: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: dm_emc dm_round_robin dm_multipath dm_snapshot dm_mirror dm_zero dm_mod xfs jfs reiserfs lock_nolock gfs2 ext3 jbd msdos raid1 raid0 xenblk xennet iscsi_tcp libiscsi scsi_transport_iscsi sr_mod sd_mod scsi_mod ide_cd cdrom ipv6 squashfs pcspkr loop nfs nfs_acl fscache lockd sunrpc vfat fat cramfs
Pid: 339, comm: anaconda Not tainted 2.6.17-1.2510.fc6xen #1
RIP: e030:[<ffffffff882a2360>] [<ffffffff882a2360>] :gfs2:gfs2_glock_nq+0x9d/0x184
RSP: e02b:ffff88000dce39d8  EFLAGS: 00010296
RAX: 0000000000000029 RBX: ffff88000dce3af8 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff804b9780
RBP: ffff880009dbfe20 R08: ffffffff804b9798 R09: ffff88000dce3658
R10: 0000000000000003 R11: 0000000000000000 R12: ffff880009dbfe20
R13: 0000000000000000 R14: ffffc2000029d000 R15: ffff880009dbfe20
FS:  00002aaaabbea120(0000) GS:ffffffff8063b000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process anaconda (pid: 339, threadinfo ffff88000dce2000, task ffff88000eb3a7f0)
Stack: ffffc2000029d330 ffff88000dce3af8 ffff8800098242b8 ffffc2000029d000
 ffff8800098242b8 ffffffff882a3df8 00000e1000000003 ffff880009dbfe00
 0000420209dbfe00 ffffffff802614cc
Call Trace:
 [<ffffffff882a3df8>] :gfs2:gfs2_glock_nq_atime+0xf4/0x2a2
 [<ffffffff802614cc>] __mutex_lock_slowpath+0x27a/0x285
 [<ffffffff882ab0d6>] :gfs2:gfs2_readpages+0x75/0x1d3
 [<ffffffff882a1728>] :gfs2:gfs2_glock_put+0x92/0x99
 [<ffffffff802b3847>] __rmqueue+0x4a/0xe8
 [<ffffffff8020aade>] get_page_from_freelist+0x231/0x408
 [<ffffffff882ab0c9>] :gfs2:gfs2_readpages+0x68/0x1d3
 [<ffffffff802131c6>] __do_page_cache_readahead+0x145/0x218
 [<ffffffff8026262c>] _spin_lock_irqsave+0x26/0x2b
 [<ffffffff802227ac>] __up_read+0x19/0x7f
 [<ffffffff883d3b82>] :dm_mod:dm_any_congested+0x3b/0x42
 [<ffffffff80213aaa>] filemap_nopage+0x14a/0x34f
 [<ffffffff882b062b>] :gfs2:gfs2_sharewrite_nopage+0xcc/0x2ee
 [<ffffffff882b05b1>] :gfs2:gfs2_sharewrite_nopage+0x52/0x2ee
 [<ffffffff802614cc>] __mutex_lock_slowpath+0x27a/0x285
 [<ffffffff80208f3a>] __handle_mm_fault+0x65d/0xf5e
 [<ffffffff80264f3c>] do_page_fault+0xe69/0x1203
 [<ffffffff8020e2d4>] do_mmap_pgoff+0x608/0x773
 [<ffffffff8026262c>] _spin_lock_irqsave+0x26/0x2b
 [<ffffffff8023135a>] __up_write+0x27/0xf2
 [<ffffffff8025e173>] error_exit+0x0/0x6e
Code: 0f 0b 68 5a 99 2b 88 c2 95 04 48 8b 73 18 49 8b 84 24 90 00
RIP  [<ffffffff882a2360>] :gfs2:gfs2_glock_nq+0x9d/0x184
 RSP <ffff88000dce39d8>
Happens on i386 as well. Things seem fine if I do a gfs2 /scratch, though.
Also reproducible with gfs2 as /usr, although we then start to get some bits installed before things blow up:

------------[ cut here ]------------
kernel BUG at fs/gfs2/glock.c:1173!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /block/xvda/dev
Modules linked in: dm_emc dm_round_robin dm_multipath dm_snapshot dm_mirror dm_zero dm_mod xfs jfs reiserfs lock_nolock gfs2 ext3 jbd msdos raid1 raid0 xenblk xennet iscsi_tcp libiscsi scsi_transport_iscsi sr_mod sd_mod scsi_mod ide_cd cdrom ipv6 squashfs pcspkr loop nfs nfs_acl fscache lockd sunrpc vfat fat cramfs
CPU:    0
EIP:    0061:[<d92bc16d>]    Not tainted VLI
EFLAGS: 00210296   (2.6.17-1.2510.fc6xen #1)
EIP is at gfs2_glock_nq+0x8f/0x14c [gfs2]
eax: 00000029   ebx: ccbe5ca4   ecx: ccbe5a80   edx: d92d1be1
esi: d38f63d4   edi: d38f63d4   ebp: d38f640c   esp: ccbe5bfc
ds: 007b   es: 007b   ss: 0069
Process build-locale-ar (pid: 424, ti=ccbe5000 task=c0c88930 task.ti=ccbe5000)
Stack: d95f0000 00000000 d95f02f0 ccbe5ca4 d35a9aa8 d95f0000 d92bd847 d35a9aa8
 d35a9f38 00000000 d38f63d4 00000003 00000e10 00000000 d3ae7000 00004202
 c0430a75 ccbe5ca4 d35a9b90 d35a9aa8 d35a9b80 d92c46af ccbe5ca4 ccbe5d4c
Call Trace:
 [<d92bd847>] gfs2_glock_nq_atime+0xd7/0x2a5 [gfs2]
 [<c0430a75>] init_waitqueue_head+0x12/0x1d
 [<d92c46af>] gfs2_readpages+0x5a/0x199 [gfs2]
 [<d92c2187>] gfs2_meta_reread+0x59/0xc2 [gfs2]
 [<c044bcb1>] get_page_from_freelist+0x1f2/0x380
 [<d92c46a3>] gfs2_readpages+0x4e/0x199 [gfs2]
 [<d92c4655>] gfs2_readpages+0x0/0x199 [gfs2]
 [<c044d38b>] __do_page_cache_readahead+0x120/0x1c0
 [<d92bb726>] gfs2_glock_put+0x7b/0x81 [gfs2]
 [<c05f094e>] _spin_unlock_irq+0x5/0x27
 [<c044a0c1>] filemap_nopage+0x150/0x333
 [<d92c962d>] gfs2_sharewrite_nopage+0xb5/0x29e [gfs2]
 [<d92c95b4>] gfs2_sharewrite_nopage+0x3c/0x29e [gfs2]
 [<c0453c37>] __handle_mm_fault+0x64c/0x1076
 [<c05ef82c>] __mutex_unlock_slowpath+0xb0/0x10f
 [<c05efaad>] __mutex_lock_slowpath+0x21d/0x225
 [<d92bb8e7>] gfs2_glmutex_lock+0x72/0x78 [gfs2]
 [<c04b6a6e>] selinux_vm_enough_memory+0x3b/0x51
 [<c0458804>] __vm_enough_memory+0xc/0xd0
 [<c0457566>] expand_stack+0x10f/0x118
 [<c05f1e4c>] do_page_fault+0x704/0xc07
 [<c044fd10>] vma_prio_tree_insert+0x17/0x2a
 [<c0458e15>] do_mmap_pgoff+0x54d/0x6a0
 [<c05f1748>] do_page_fault+0x0/0xc07
 [<c0404e9b>] error_code+0x2b/0x30
Code: 0c 74 0e 89 d0 8b 10 0f 18 02 90 39 e8 75 ef eb 22 8b 50 3c b8 d0 1b 2d d9 e8 cf da 17 e7 8b 53 3c b8 e1 1b 2d d9 e8 c2 da 17 e7 <0f> 0b 95 04 2b 1b 2d d9 8b 6b 0c 8d 4f 50 8b 47 50 eb 07 39 68
EIP: [<d92bc16d>] gfs2_glock_nq+0x8f/0x14c [gfs2] SS:ESP 0069:ccbe5bfc
The important information is the "new:" and "original:" lines, which should have been printed right before the stack trace. Can you get those for me please? The problem is due to an attempt at recursive locking (which the glock layer no longer allows), and those lines will tell me which locks were involved.
I'm not seeing anything along the lines of "new:" or "original:" logged anywhere. But it happens trivially just by booting with today's rawhide (or the test2 candidate tree) with 'linux gfs2' and then selecting gfs2 as the fs type to use.
I think I know why this happens. I believe it's related to taking page faults in an mmap()ed area of memory. I'm surprised that you don't see the printks, though, as lines 1171-1173 of glock.c read:

    print_symbol(KERN_WARNING "original: %s\n", existing->gh_ip);
    print_symbol(KERN_WARNING "new: %s\n", gh->gh_ip);
    BUG();

Perhaps warnings get put somewhere different and I should use another log level? In the meantime I'm looking at the path from fs/gfs2/ops_vm.c:gfs2_sharewrite_nopage through to ops_address.c:readpage(s), and wondering how best to indicate to the latter that they've been called via this path so that they don't do their own locking as they usually would. I think this is the correct solution to the problem.
Created attachment 133633 [details] Test patch to fix this bug Once I've confirmed that this is indeed the correct fix with a bit more testing, I'll commit it to the git tree for gfs2.
I've done some more testing and it looks like it's the right fix, but I've run into what looks like another manifestation of bug #201082, so I'm going to commit the patch as it is and then transfer further work to that bug unless anybody finds any evidence otherwise.