808111 – GFS2: Oops: BUG: unable to handle kernel NULL pointer dereference at (null) (TAINTED: Nvidia)

Bug 808111 - GFS2: Oops: BUG: unable to handle kernel NULL pointer dereference at (null) (TAINTED: Nvidia)

Summary: GFS2: Oops: BUG: unable to handle kernel NULL pointer dereference at ...

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	16
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Ben Marzinski
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-03-29 15:29 UTC by thomas
Modified:	2012-11-19 09:42 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-11-19 09:42:17 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description thomas 2012-03-29 15:29:56 UTC

Description of problem:

Host system with VMs stored on GFS2 file systems (iSCSI backed block devices) panics with "BUG: Unable to handle kernel NULL pointer dereference..." during elevated level(s) of IO in a VM (i.e. during installation of a new VM, fsck'ing a VM file system, etc).


Version-Release number of selected component (if applicable):

rpm -qa | grep qemu:
	qemu-common-1.0-7.fc16.x86_64
	qemu-kvm-1.0-7.fc16.x86_64
	qemu-system-x86-1.0-7.fc16.x86_64

uname -r: 
	3.3.0-4.fc16.x86_64

rpm -qa | grep cluster:
	cluster-glue-libs-1.0.6-8.fc16.x86_64
	lvm2-cluster-2.02.86-6.fc16.x86_64
	cluster-snmp-0.18.7-3.fc16.x86_64
	modcluster-0.18.7-3.fc16.x86_64
	cluster-cim-0.18.7-3.fc16.x86_64
	gfs2-cluster-3.1.3-1.fc16.x86_64
	clusterssh-3.28-3.fc15.noarch
	cluster-glue-1.0.6-8.fc16.x86_64
	clusterlib-3.1.7-1.fc16.x86_64
	cman-3.1.7-1.fc16.x86_64

How reproducible:

Fairly consistently.

Steps to Reproduce:
1.Boot host system w/GFS2 file system and start VM(s) hosted on GFS2 
2. Run increased ("intense") I/O read/write operations in the VM/guest
3. Eventually, the system will panic.
  
Actual results:

System panics & gets fenced/restarts (after ensuring kernel.panic_on_oops sysctl is set to 1)

Expected results:

IO operations continue without causing panic.

Additional info:

Stack trace:

Mar 28 09:34:24 virt1-backup corosync[2251]:   [IPC   ] Invalid IPC credentials.
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192091] BUG: unable to handle kernel NULL pointer dereference at           (null)
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192097] IP: [<ffffffff811b1019>] unlock_buffer+0x9/0x20
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192103] PGD 0 
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192105] Oops: 0002 [#1] SMP 
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192108] CPU 1 
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192109] Modules linked in: gfs2 ppdev parport_pc lp parport bnep bluetooth rfkill fuse dlm ebtable_nat ebtables dm_round_robin lockd be2iscsi iscsi_boot_sysfs bnx2i c
nic uio cxgb4i cxgb4 ip_vs_rr cxgb3i libcxgbi cxgb3 mdio ip_vs libcrc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge stp llc ip6t_REJECT 
nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables nvidia(PO) snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep vhost_net joydev snd_seq snd_seq_device macvtap macvlan s
nd_pcm tun e1000e iTCO_wdt snd_timer snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support kvm_intel kvm i2c_core serio_raw uinput microcode sunrpc firewire_ohci firewire_core usb_storage crc_itu_t sata_sil
24 dm_multipath [last unloaded: scsi_wait_scan]
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192154] 
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192156] Pid: 22897, comm: qemu-kvm Tainted: P           O 3.3.0-4.fc16.x86_64 #1 HP-Pavilion KT608AV-ABA d5000t/Burbank
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192160] RIP: 0010:[<ffffffff811b1019>]  [<ffffffff811b1019>] unlock_buffer+0x9/0x20
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192163] RSP: 0018:ffff8800a9375838  EFLAGS: 00010286
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192165] RAX: 00000000c055c055 RBX: ffff8801ce7dc8c0 RCX: 000000000000006d
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192167] RDX: 000000000000c055 RSI: ffff8800a93757e0 RDI: 0000000000000000
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192169] RBP: ffff8800a9375838 R08: ffff8800a9374000 R09: 00000000000005de
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192171] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801dc059000
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192173] R13: ffff880067c604e0 R14: ffff880067ca8ea0 R15: 0000000000000002
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192175] FS:  00007ff6fbfff700(0000) GS:ffff88022fc80000q(0000) knlGS:0000000000000000
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192177] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192179] CR2: 0000000000000000 CR3: 00000001bfbfd000 CR4: 00000000000026e0
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192181] DR0: 0000000000000003 DR1: 00000000000000b0 DR2: 0000000000000001
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192183] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192185] Process qemu-kvm (pid: 22897, threadinfo ffff8800a9374000, task ffff880224165cc0)
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192187] Stack:
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192188]  ffff8800a9375868 ffffffffa1096613 ffff8801c23a4ea0 ffff880067c604e0
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192192]  ffff8801dc059000 0000000000000003 ffff8800a9375888 ffffffffa10abc03
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192195]  ffff8801c223d6f8 0000000000000003 ffff8800a9375a08 ffffffffa1084da5
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192198] Call Trace:
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192208]  [<ffffffffa1096613>] buf_lo_add+0x63/0x160 [gfs2]
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192217]  [<ffffffffa10abc03>] gfs2_trans_add_bh+0x43/0x70 [gfs2]
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192221]  [<ffffffffa1084da5>] gfs2_block_map+0x5f5/0xf50 [gfs2]
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192226]  [<ffffffff8116e29b>] ? kmem_cache_alloc+0x11b/0x150
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192229]  [<ffffffff811b286e>] __block_write_begin+0x1ce/0x520
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192234]  [<ffffffffa10847b0>] ? gfs2_unstuff_dinode+0x490/0x490 [gfs2]
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192237]  [<ffffffff81120ebe>] ? grab_cache_page_write_begin+0x7e/0xe0
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192244]  [<ffffffffa1098216>] gfs2_write_begin+0x356/0x470 [gfs2]
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192247]  [<ffffffff81120662>] generic_file_buffered_write+0x112/0x290
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192250]  [<ffffffff81121ed9>] __generic_file_aio_write+0x229/0x440
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192253]  [<ffffffff810aeb55>] ? futex_wait+0x1c5/0x2a0
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192256]  [<ffffffff81122162>] generic_file_aio_write+0x72/0xe0
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192263]  [<ffffffffa109a3e0>] gfs2_file_aio_write+0x80/0xb0 [gfs2]
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192265]  [<ffffffff810aed81>] ? futex_wake+0x101/0x120
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192268]  [<ffffffff81180e12>] do_sync_write+0xd2/0x110
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192272]  [<ffffffff8126f828>] ? selinux_file_permission+0xe8/0x130
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192275]  [<ffffffff812698bc>] ? security_file_permission+0x2c/0xb0
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192278]  [<ffffffff81181391>] ? rw_verify_area+0x61/0xf0
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192280]  [<ffffffff811816f3>] vfs_write+0xb3/0x180
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192282]  [<ffffffff81181bb2>] sys_pwrite64+0xa2/0xb0
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192286]  [<ffffffff815fc029>] system_call_fastpath+0x16/0x1b
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192287] Code: ff 48 8b 5d f0 4c 8b 65 f8 c9 c3 0f 1f 80 00 00 00 00 48 89 df e8 88 f2 ff ff eb d9 66 0f 1f 44 00 00 55 48 89 e5 66 66 66 66 90 <f0> 80 27 fb be 02 00 00 00 e8 89 93 ec ff 5d c3 0f 1f 80 00 00 
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192313] RIP  [<ffffffff811b1019>] unlock_buffer+0x9/0x20
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192316]  RSP <ffff8800a9375838>
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192317] CR2: 0000000000000000
Mar 28 09:34:36 virt1-backup kernel: [ 3667.192320] ---[ end trace 99743908f7036675 ]---

Comment 1 Steve Whitehouse 2012-03-30 14:50:55 UTC

Something rather odd going on here... why does dereferencing bd->bd_bh explode when unlocking the buffer when it was ok, only a few lines above that? I've not seen any other similar reports recently at all.

Comment 2 Steve Whitehouse 2012-03-30 14:54:25 UTC

Ah, one thought... does this still happen if you turn off selinux? Just to note that selinux won't work correctly with GFS2 currently, so it is probably best to turn it off as it will be faster without that turned on.

Comment 3 thomas 2012-03-30 19:26:00 UTC

SELinux is running in permissive mode, so I believe(d?) that it should not be affecting GFS2 in any meaningful way (other than possibly a slight perf degradation)?

Comment 4 Steve Whitehouse 2012-04-03 10:08:00 UTC

Yes, that should be the case, however we don't generally test with selinux, since it seems kind of pointless when it doesn't work correctly and its only function is therefore to degrade performance.

I'm really just wondering what variables we can eliminate since we've not seen anything like this recently, and I'm just trying to think of any reason why we might not have seen it before.

Comment 5 Elvis Pranskevichus 2012-04-27 22:50:01 UTC

I'm getting this too on a vanilla 3.3.3 on bare-bones Gentoo.  No SELinux on this box.

Below is a log of GFS crashing on database dump restore:

Apr 27 18:30:01 bombadil kernel: BUG: unable to handle kernel NULL pointer dereference at           (null)
Apr 27 18:30:01 bombadil kernel: IP: [<ffffffff8112b064>] unlock_buffer+0x4/0x20
Apr 27 18:30:01 bombadil kernel: PGD 292e73067 PUD 2c3a95067 PMD 0 
Apr 27 18:30:01 bombadil kernel: Oops: 0002 [#1] SMP 
Apr 27 18:30:01 bombadil kernel: CPU 10 
Apr 27 18:30:01 bombadil kernel: Pid: 19064, comm: postgres Not tainted 3.3.3 #4 Supermicro X8DTU/X8DTU
Apr 27 18:30:01 bombadil kernel: RIP: 0010:[<ffffffff8112b064>]  [<ffffffff8112b064>] unlock_buffer+0x4/0x20
Apr 27 18:30:01 bombadil kernel: RSP: 0018:ffff880292ed5808  EFLAGS: 00010286
Apr 27 18:30:01 bombadil kernel: RAX: ffff880230c84970 RBX: ffff880230c84980 RCX: 0000000000000046
Apr 27 18:30:01 bombadil kernel: RDX: 00000000000000dd RSI: ffff880230c84980 RDI: 0000000000000000
Apr 27 18:30:01 bombadil kernel: RBP: ffff880292ed5808 R08: fcd12a65fd8f8e02 R09: a80032521c000000
Apr 27 18:30:01 bombadil kernel: R10: 0000000000000004 R11: 0000000000000000 R12: ffff88032ba31000
Apr 27 18:30:01 bombadil kernel: R13: ffff880230c84960 R14: 0000000000000001 R15: 0000000000000001
Apr 27 18:30:01 bombadil kernel: FS:  00007fd8aad08700(0000) GS:ffff88033fd40000(0000) knlGS:0000000000000000
Apr 27 18:30:01 bombadil kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 27 18:30:01 bombadil kernel: CR2: 0000000000000000 CR3: 00000002f3b8c000 CR4: 00000000000006e0
Apr 27 18:30:01 bombadil kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 27 18:30:01 bombadil kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 27 18:30:01 bombadil kernel: Process postgres (pid: 19064, threadinfo ffff880292ed4000, task ffff8802c7e4c350)
Apr 27 18:30:01 bombadil kernel: Stack:
Apr 27 18:30:01 bombadil kernel: ffff880292ed5838 ffffffff8136d4fa ffff880324256050 ffff88022f6f31a0
Apr 27 18:30:01 bombadil kernel: ffff88032ba31000 ffff88022658ec30 ffff880292ed5858 ffffffff81382ebf
Apr 27 18:30:01 bombadil kernel: 0000000000000002 ffff8802cb4aba80 ffff880292ed59d8 ffffffff813607e6
Apr 27 18:30:01 bombadil kernel: Call Trace:
Apr 27 18:30:01 bombadil kernel: [<ffffffff8136d4fa>] buf_lo_add+0x5a/0x160
Apr 27 18:30:01 bombadil kernel: [<ffffffff81382ebf>] gfs2_trans_add_bh+0x3f/0x70
Apr 27 18:30:01 bombadil kernel: [<ffffffff813607e6>] gfs2_block_map+0x4e6/0xde0
Apr 27 18:30:01 bombadil kernel: [<ffffffff810f8f37>] ? kmem_cache_alloc+0x97/0xb0
Apr 27 18:30:01 bombadil kernel: [<ffffffff8112dcc3>] __block_write_begin+0x1e3/0x550
Apr 27 18:30:01 bombadil kernel: [<ffffffff81360300>] ? gfs2_unstuff_dinode+0x490/0x490
Apr 27 18:30:01 bombadil kernel: [<ffffffff810c7307>] ? grab_cache_page_write_begin+0x87/0xf0
Apr 27 18:30:01 bombadil kernel: [<ffffffff8136ff06>] gfs2_write_begin+0x346/0x460
Apr 27 18:30:01 bombadil kernel: [<ffffffff810c7480>] generic_file_buffered_write+0x110/0x280
Apr 27 18:30:01 bombadil kernel: [<ffffffff8169db10>] ? _raw_spin_unlock_bh+0x10/0x20
Apr 27 18:30:01 bombadil kernel: [<ffffffff810c7821>] __generic_file_aio_write+0x231/0x440
Apr 27 18:30:01 bombadil kernel: [<ffffffff815d8136>] ? inet_recvmsg+0x66/0x80
Apr 27 18:30:01 bombadil kernel: [<ffffffff810c7aa1>] generic_file_aio_write+0x71/0xe0
Apr 27 18:30:01 bombadil kernel: [<ffffffff81372109>] gfs2_file_aio_write+0x79/0xb0
Apr 27 18:30:01 bombadil kernel: [<ffffffff8107d6fa>] ? wake_up_bit+0x2a/0x40
Apr 27 18:30:01 bombadil kernel: [<ffffffff81367daa>] ? gfs2_holder_wake+0x1a/0x20
Apr 27 18:30:01 bombadil kernel: [<ffffffff813680a6>] ? do_promote+0x216/0x230
Apr 27 18:30:01 bombadil kernel: [<ffffffff811025ca>] do_sync_write+0xda/0x120
Apr 27 18:30:01 bombadil kernel: [<ffffffff81368c7e>] ? gfs2_holder_uninit+0x1e/0x40
Apr 27 18:30:01 bombadil kernel: [<ffffffff81369f79>] ? gfs2_glock_dq_uninit+0x19/0x20
Apr 27 18:30:01 bombadil kernel: [<ffffffff813721e2>] ? gfs2_llseek+0xa2/0xd0
Apr 27 18:30:01 bombadil kernel: [<ffffffff811026d6>] vfs_write+0xc6/0x180
Apr 27 18:30:01 bombadil kernel: [<ffffffff8110286c>] sys_write+0x4c/0x90
Apr 27 18:30:01 bombadil kernel: [<ffffffff8169e522>] system_call_fastpath+0x16/0x1b
Apr 27 18:30:01 bombadil kernel: Code: 19 c0 85 c0 75 09 c9 c3 0f 1f 80 00 00 00 00 b9 02 00 00 00 48 c7 c2 d0 ae 12 81 be 02 00 00 00 e8 22 03 57 00 c9 c3 55 48 89 e5 <f0> 80 27 fb be 02 00 00 00 e8 5e 26 f5 ff c9 c3 66 66 66 2e 0f 
Apr 27 18:30:01 bombadil kernel: RIP  [<ffffffff8112b064>] unlock_buffer+0x4/0x20
Apr 27 18:30:01 bombadil kernel: RSP <ffff880292ed5808>
Apr 27 18:30:01 bombadil kernel: CR2: 0000000000000000
Apr 27 18:30:01 bombadil kernel: ---[ end trace be4d0d224a472da2 ]---

Comment 6 Steve Whitehouse 2012-06-08 14:55:09 UTC

This is a rather odd bug. The call to unlock_buffer() in buf_lo_add() is made after a number of other references to the buffer in question. So I wonder why does it always seem to cause problems at this particular point? There is no obvious reason that I can see.

Nothing should be able to free the buffer here since we have already pinned the buffer in memory only a few lines before.

So I have to admit that I'm no further forward on this one currently. This is still the only report of this particular bug I've received.

Just wondering if you can still reproduce on 3.4 now?

Comment 7 Dave Jones 2012-10-23 15:32:31 UTC

# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).

Comment 8 Steve Whitehouse 2012-11-08 13:38:23 UTC

Fix posted for 3.7:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=96e5d1d3adf56f1c7eeb07258f6a1a0a7ae9c489

Comment 9 Steve Whitehouse 2012-11-08 14:08:58 UTC

Fix also posted for 3.6 stable tree

Dave, we are working on this and have posted the fix to all the right places, so this is all in hand now and should be done shortly.

Comment 10 Steve Whitehouse 2012-11-14 10:08:21 UTC

Now accepted for the 3.6 -stable tree (it isn't at kernel.org yet, but I assume it will appear shortly)

Comment 11 Steve Whitehouse 2012-11-19 09:42:17 UTC

Now in stable kernel 3.6.7

Note You need to log in before you can comment on or make changes to this bug.