Hide Forgot
Description of problem: Host system with VMs stored on GFS2 file systems (iSCSI backed block devices) panics with "BUG: Unable to handle kernel NULL pointer dereference..." during elevated level(s) of IO in a VM (i.e. during installation of a new VM, fsck'ing a VM file system, etc). Version-Release number of selected component (if applicable): rpm -qa | grep qemu: qemu-common-1.0-7.fc16.x86_64 qemu-kvm-1.0-7.fc16.x86_64 qemu-system-x86-1.0-7.fc16.x86_64 uname -r: 3.3.0-4.fc16.x86_64 rpm -qa | grep cluster: cluster-glue-libs-1.0.6-8.fc16.x86_64 lvm2-cluster-2.02.86-6.fc16.x86_64 cluster-snmp-0.18.7-3.fc16.x86_64 modcluster-0.18.7-3.fc16.x86_64 cluster-cim-0.18.7-3.fc16.x86_64 gfs2-cluster-3.1.3-1.fc16.x86_64 clusterssh-3.28-3.fc15.noarch cluster-glue-1.0.6-8.fc16.x86_64 clusterlib-3.1.7-1.fc16.x86_64 cman-3.1.7-1.fc16.x86_64 How reproducible: Fairly consistently. Steps to Reproduce: 1.Boot host system w/GFS2 file system and start VM(s) hosted on GFS2 2. Run increased ("intense") I/O read/write operations in the VM/guest 3. Eventually, the system will panic. Actual results: System panics & gets fenced/restarts (after ensuring kernel.panic_on_oops sysctl is set to 1) Expected results: IO operations continue without causing panic. Additional info: Stack trace: Mar 28 09:34:24 virt1-backup corosync[2251]: [IPC ] Invalid IPC credentials. Mar 28 09:34:36 virt1-backup kernel: [ 3667.192091] BUG: unable to handle kernel NULL pointer dereference at (null) Mar 28 09:34:36 virt1-backup kernel: [ 3667.192097] IP: [<ffffffff811b1019>] unlock_buffer+0x9/0x20 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192103] PGD 0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192105] Oops: 0002 [#1] SMP Mar 28 09:34:36 virt1-backup kernel: [ 3667.192108] CPU 1 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192109] Modules linked in: gfs2 ppdev parport_pc lp parport bnep bluetooth rfkill fuse dlm ebtable_nat ebtables dm_round_robin lockd be2iscsi iscsi_boot_sysfs bnx2i c nic uio cxgb4i cxgb4 ip_vs_rr cxgb3i libcxgbi cxgb3 mdio ip_vs libcrc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables nvidia(PO) snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep vhost_net joydev snd_seq snd_seq_device macvtap macvlan s nd_pcm tun e1000e iTCO_wdt snd_timer snd soundcore snd_page_alloc i2c_i801 iTCO_vendor_support kvm_intel kvm i2c_core serio_raw uinput microcode sunrpc firewire_ohci firewire_core usb_storage crc_itu_t sata_sil 24 dm_multipath [last unloaded: scsi_wait_scan] Mar 28 09:34:36 virt1-backup kernel: [ 3667.192154] Mar 28 09:34:36 virt1-backup kernel: [ 3667.192156] Pid: 22897, comm: qemu-kvm Tainted: P O 3.3.0-4.fc16.x86_64 #1 HP-Pavilion KT608AV-ABA d5000t/Burbank Mar 28 09:34:36 virt1-backup kernel: [ 3667.192160] RIP: 0010:[<ffffffff811b1019>] [<ffffffff811b1019>] unlock_buffer+0x9/0x20 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192163] RSP: 0018:ffff8800a9375838 EFLAGS: 00010286 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192165] RAX: 00000000c055c055 RBX: ffff8801ce7dc8c0 RCX: 000000000000006d Mar 28 09:34:36 virt1-backup kernel: [ 3667.192167] RDX: 000000000000c055 RSI: ffff8800a93757e0 RDI: 0000000000000000 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192169] RBP: ffff8800a9375838 R08: ffff8800a9374000 R09: 00000000000005de Mar 28 09:34:36 virt1-backup kernel: [ 3667.192171] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801dc059000 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192173] R13: ffff880067c604e0 R14: ffff880067ca8ea0 R15: 0000000000000002 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192175] FS: 00007ff6fbfff700(0000) GS:ffff88022fc80000q(0000) knlGS:0000000000000000 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192177] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Mar 28 09:34:36 virt1-backup kernel: [ 3667.192179] CR2: 0000000000000000 CR3: 00000001bfbfd000 CR4: 00000000000026e0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192181] DR0: 0000000000000003 DR1: 00000000000000b0 DR2: 0000000000000001 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192183] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192185] Process qemu-kvm (pid: 22897, threadinfo ffff8800a9374000, task ffff880224165cc0) Mar 28 09:34:36 virt1-backup kernel: [ 3667.192187] Stack: Mar 28 09:34:36 virt1-backup kernel: [ 3667.192188] ffff8800a9375868 ffffffffa1096613 ffff8801c23a4ea0 ffff880067c604e0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192192] ffff8801dc059000 0000000000000003 ffff8800a9375888 ffffffffa10abc03 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192195] ffff8801c223d6f8 0000000000000003 ffff8800a9375a08 ffffffffa1084da5 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192198] Call Trace: Mar 28 09:34:36 virt1-backup kernel: [ 3667.192208] [<ffffffffa1096613>] buf_lo_add+0x63/0x160 [gfs2] Mar 28 09:34:36 virt1-backup kernel: [ 3667.192217] [<ffffffffa10abc03>] gfs2_trans_add_bh+0x43/0x70 [gfs2] Mar 28 09:34:36 virt1-backup kernel: [ 3667.192221] [<ffffffffa1084da5>] gfs2_block_map+0x5f5/0xf50 [gfs2] Mar 28 09:34:36 virt1-backup kernel: [ 3667.192226] [<ffffffff8116e29b>] ? kmem_cache_alloc+0x11b/0x150 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192229] [<ffffffff811b286e>] __block_write_begin+0x1ce/0x520 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192234] [<ffffffffa10847b0>] ? gfs2_unstuff_dinode+0x490/0x490 [gfs2] Mar 28 09:34:36 virt1-backup kernel: [ 3667.192237] [<ffffffff81120ebe>] ? grab_cache_page_write_begin+0x7e/0xe0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192244] [<ffffffffa1098216>] gfs2_write_begin+0x356/0x470 [gfs2] Mar 28 09:34:36 virt1-backup kernel: [ 3667.192247] [<ffffffff81120662>] generic_file_buffered_write+0x112/0x290 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192250] [<ffffffff81121ed9>] __generic_file_aio_write+0x229/0x440 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192253] [<ffffffff810aeb55>] ? futex_wait+0x1c5/0x2a0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192256] [<ffffffff81122162>] generic_file_aio_write+0x72/0xe0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192263] [<ffffffffa109a3e0>] gfs2_file_aio_write+0x80/0xb0 [gfs2] Mar 28 09:34:36 virt1-backup kernel: [ 3667.192265] [<ffffffff810aed81>] ? futex_wake+0x101/0x120 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192268] [<ffffffff81180e12>] do_sync_write+0xd2/0x110 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192272] [<ffffffff8126f828>] ? selinux_file_permission+0xe8/0x130 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192275] [<ffffffff812698bc>] ? security_file_permission+0x2c/0xb0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192278] [<ffffffff81181391>] ? rw_verify_area+0x61/0xf0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192280] [<ffffffff811816f3>] vfs_write+0xb3/0x180 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192282] [<ffffffff81181bb2>] sys_pwrite64+0xa2/0xb0 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192286] [<ffffffff815fc029>] system_call_fastpath+0x16/0x1b Mar 28 09:34:36 virt1-backup kernel: [ 3667.192287] Code: ff 48 8b 5d f0 4c 8b 65 f8 c9 c3 0f 1f 80 00 00 00 00 48 89 df e8 88 f2 ff ff eb d9 66 0f 1f 44 00 00 55 48 89 e5 66 66 66 66 90 <f0> 80 27 fb be 02 00 00 00 e8 89 93 ec ff 5d c3 0f 1f 80 00 00 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192313] RIP [<ffffffff811b1019>] unlock_buffer+0x9/0x20 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192316] RSP <ffff8800a9375838> Mar 28 09:34:36 virt1-backup kernel: [ 3667.192317] CR2: 0000000000000000 Mar 28 09:34:36 virt1-backup kernel: [ 3667.192320] ---[ end trace 99743908f7036675 ]---
Something rather odd going on here... why does dereferencing bd->bd_bh explode when unlocking the buffer when it was ok, only a few lines above that? I've not seen any other similar reports recently at all.
Ah, one thought... does this still happen if you turn off selinux? Just to note that selinux won't work correctly with GFS2 currently, so it is probably best to turn it off as it will be faster without that turned on.
SELinux is running in permissive mode, so I believe(d?) that it should not be affecting GFS2 in any meaningful way (other than possibly a slight perf degradation)?
Yes, that should be the case, however we don't generally test with selinux, since it seems kind of pointless when it doesn't work correctly and its only function is therefore to degrade performance. I'm really just wondering what variables we can eliminate since we've not seen anything like this recently, and I'm just trying to think of any reason why we might not have seen it before.
I'm getting this too on a vanilla 3.3.3 on bare-bones Gentoo. No SELinux on this box. Below is a log of GFS crashing on database dump restore: Apr 27 18:30:01 bombadil kernel: BUG: unable to handle kernel NULL pointer dereference at (null) Apr 27 18:30:01 bombadil kernel: IP: [<ffffffff8112b064>] unlock_buffer+0x4/0x20 Apr 27 18:30:01 bombadil kernel: PGD 292e73067 PUD 2c3a95067 PMD 0 Apr 27 18:30:01 bombadil kernel: Oops: 0002 [#1] SMP Apr 27 18:30:01 bombadil kernel: CPU 10 Apr 27 18:30:01 bombadil kernel: Pid: 19064, comm: postgres Not tainted 3.3.3 #4 Supermicro X8DTU/X8DTU Apr 27 18:30:01 bombadil kernel: RIP: 0010:[<ffffffff8112b064>] [<ffffffff8112b064>] unlock_buffer+0x4/0x20 Apr 27 18:30:01 bombadil kernel: RSP: 0018:ffff880292ed5808 EFLAGS: 00010286 Apr 27 18:30:01 bombadil kernel: RAX: ffff880230c84970 RBX: ffff880230c84980 RCX: 0000000000000046 Apr 27 18:30:01 bombadil kernel: RDX: 00000000000000dd RSI: ffff880230c84980 RDI: 0000000000000000 Apr 27 18:30:01 bombadil kernel: RBP: ffff880292ed5808 R08: fcd12a65fd8f8e02 R09: a80032521c000000 Apr 27 18:30:01 bombadil kernel: R10: 0000000000000004 R11: 0000000000000000 R12: ffff88032ba31000 Apr 27 18:30:01 bombadil kernel: R13: ffff880230c84960 R14: 0000000000000001 R15: 0000000000000001 Apr 27 18:30:01 bombadil kernel: FS: 00007fd8aad08700(0000) GS:ffff88033fd40000(0000) knlGS:0000000000000000 Apr 27 18:30:01 bombadil kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Apr 27 18:30:01 bombadil kernel: CR2: 0000000000000000 CR3: 00000002f3b8c000 CR4: 00000000000006e0 Apr 27 18:30:01 bombadil kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Apr 27 18:30:01 bombadil kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Apr 27 18:30:01 bombadil kernel: Process postgres (pid: 19064, threadinfo ffff880292ed4000, task ffff8802c7e4c350) Apr 27 18:30:01 bombadil kernel: Stack: Apr 27 18:30:01 bombadil kernel: ffff880292ed5838 ffffffff8136d4fa ffff880324256050 ffff88022f6f31a0 Apr 27 18:30:01 bombadil kernel: ffff88032ba31000 ffff88022658ec30 ffff880292ed5858 ffffffff81382ebf Apr 27 18:30:01 bombadil kernel: 0000000000000002 ffff8802cb4aba80 ffff880292ed59d8 ffffffff813607e6 Apr 27 18:30:01 bombadil kernel: Call Trace: Apr 27 18:30:01 bombadil kernel: [<ffffffff8136d4fa>] buf_lo_add+0x5a/0x160 Apr 27 18:30:01 bombadil kernel: [<ffffffff81382ebf>] gfs2_trans_add_bh+0x3f/0x70 Apr 27 18:30:01 bombadil kernel: [<ffffffff813607e6>] gfs2_block_map+0x4e6/0xde0 Apr 27 18:30:01 bombadil kernel: [<ffffffff810f8f37>] ? kmem_cache_alloc+0x97/0xb0 Apr 27 18:30:01 bombadil kernel: [<ffffffff8112dcc3>] __block_write_begin+0x1e3/0x550 Apr 27 18:30:01 bombadil kernel: [<ffffffff81360300>] ? gfs2_unstuff_dinode+0x490/0x490 Apr 27 18:30:01 bombadil kernel: [<ffffffff810c7307>] ? grab_cache_page_write_begin+0x87/0xf0 Apr 27 18:30:01 bombadil kernel: [<ffffffff8136ff06>] gfs2_write_begin+0x346/0x460 Apr 27 18:30:01 bombadil kernel: [<ffffffff810c7480>] generic_file_buffered_write+0x110/0x280 Apr 27 18:30:01 bombadil kernel: [<ffffffff8169db10>] ? _raw_spin_unlock_bh+0x10/0x20 Apr 27 18:30:01 bombadil kernel: [<ffffffff810c7821>] __generic_file_aio_write+0x231/0x440 Apr 27 18:30:01 bombadil kernel: [<ffffffff815d8136>] ? inet_recvmsg+0x66/0x80 Apr 27 18:30:01 bombadil kernel: [<ffffffff810c7aa1>] generic_file_aio_write+0x71/0xe0 Apr 27 18:30:01 bombadil kernel: [<ffffffff81372109>] gfs2_file_aio_write+0x79/0xb0 Apr 27 18:30:01 bombadil kernel: [<ffffffff8107d6fa>] ? wake_up_bit+0x2a/0x40 Apr 27 18:30:01 bombadil kernel: [<ffffffff81367daa>] ? gfs2_holder_wake+0x1a/0x20 Apr 27 18:30:01 bombadil kernel: [<ffffffff813680a6>] ? do_promote+0x216/0x230 Apr 27 18:30:01 bombadil kernel: [<ffffffff811025ca>] do_sync_write+0xda/0x120 Apr 27 18:30:01 bombadil kernel: [<ffffffff81368c7e>] ? gfs2_holder_uninit+0x1e/0x40 Apr 27 18:30:01 bombadil kernel: [<ffffffff81369f79>] ? gfs2_glock_dq_uninit+0x19/0x20 Apr 27 18:30:01 bombadil kernel: [<ffffffff813721e2>] ? gfs2_llseek+0xa2/0xd0 Apr 27 18:30:01 bombadil kernel: [<ffffffff811026d6>] vfs_write+0xc6/0x180 Apr 27 18:30:01 bombadil kernel: [<ffffffff8110286c>] sys_write+0x4c/0x90 Apr 27 18:30:01 bombadil kernel: [<ffffffff8169e522>] system_call_fastpath+0x16/0x1b Apr 27 18:30:01 bombadil kernel: Code: 19 c0 85 c0 75 09 c9 c3 0f 1f 80 00 00 00 00 b9 02 00 00 00 48 c7 c2 d0 ae 12 81 be 02 00 00 00 e8 22 03 57 00 c9 c3 55 48 89 e5 <f0> 80 27 fb be 02 00 00 00 e8 5e 26 f5 ff c9 c3 66 66 66 2e 0f Apr 27 18:30:01 bombadil kernel: RIP [<ffffffff8112b064>] unlock_buffer+0x4/0x20 Apr 27 18:30:01 bombadil kernel: RSP <ffff880292ed5808> Apr 27 18:30:01 bombadil kernel: CR2: 0000000000000000 Apr 27 18:30:01 bombadil kernel: ---[ end trace be4d0d224a472da2 ]---
This is a rather odd bug. The call to unlock_buffer() in buf_lo_add() is made after a number of other references to the buffer in question. So I wonder why does it always seem to cause problems at this particular point? There is no obvious reason that I can see. Nothing should be able to free the buffer here since we have already pinned the buffer in memory only a few lines before. So I have to admit that I'm no further forward on this one currently. This is still the only report of this particular bug I've received. Just wondering if you can still reproduce on 3.4 now?
# Mass update to all open bugs. Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed. In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered. If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).
Fix posted for 3.7: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=96e5d1d3adf56f1c7eeb07258f6a1a0a7ae9c489
Fix also posted for 3.6 stable tree Dave, we are working on this and have posted the fix to all the right places, so this is all in hand now and should be done shortly.
Now accepted for the 3.6 -stable tree (it isn't at kernel.org yet, but I assume it will appear shortly)
Now in stable kernel 3.6.7