Created attachment 331578 [details] Sysrq-t output and glock dumps from dash cluster Description of problem: While trying to verify a fix for bug 460218, I started hitting this panic right away. Unable to handle kernel NULL pointer dereference at 0000000000000108 RIP: [<ffffffff885b5687>] :gfs2:gfs2_getbuf+0x2f/0x115 PGD 2baea067 PUD 2a862067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /fs/gfs2/dash:dash0/lock_module/recover_done CPU 0 Modules linked in: sctp hfsplus dm_round_robin lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i2c_i801 i2c_core tg3 serio_raw qla2xxx i3000_edac ide_cd pcspkr libphy edac_mc scsi_transport_fc shpchp cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 16775, comm: d_doio Tainted: G 2.6.18-129.gfs2abhi.001 #1 RIP: 0010:[<ffffffff885b5687>] [<ffffffff885b5687>] :gfs2:gfs2_getbuf+0x2f/0x115 RSP: 0018:ffff81001df99ab8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff81001e24d810 RCX: ffff81001df99b28 RDX: 0000000000000001 RSI: 0000000026c94d2e RDI: ffff81001e24d810 RBP: ffff81001df99b28 R08: ffff81001df99b68 R09: 0000000026c94d2e R10: ffff81003fd68020 R11: ffffffff8002c196 R12: 0000000026c94d2e R13: ffff81002c37e000 R14: 0000000026c94d2e R15: 0000000000000000 FS: 00002b347c6d3020(0000) GS:ffffffff803ac000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000108 CR3: 000000002bae4000 CR4: 00000000000006e0 Process d_doio (pid: 16775, threadinfo ffff81001df98000, task ffff81003d690080) Stack: 0000000300000000 ffff81001e24d810 ffff81001df99b28 0000000000000010 ffff81001df99b68 ffff8100255a5c78 ffffffff885a8b9b ffffffff885b5784 00000000ffffffff ffff810036b3f688 ffff81002c37e000 ffffffff885b5b82 Call Trace: [<ffffffff885a8b9b>] :gfs2:gfs2_dirent_find+0x0/0x4e [<ffffffff885b5784>] :gfs2:gfs2_meta_read+0x17/0x65 [<ffffffff885b5b82>] :gfs2:gfs2_meta_indirect_buffer+0xba/0x160 [<ffffffff885a88f9>] :gfs2:gfs2_dirent_search+0xf9/0x167 [<ffffffff885a9e17>] :gfs2:gfs2_dir_search+0x21/0x73 [<ffffffff8000d62d>] permission+0x81/0xc8 [<ffffffff885b0797>] :gfs2:gfs2_lookupi+0x12e/0x16b [<ffffffff885b0755>] :gfs2:gfs2_lookupi+0xec/0x16b [<ffffffff885bc57f>] :gfs2:gfs2_lookup+0x26/0xa7 [<ffffffff80022223>] d_alloc+0x174/0x1a9 [<ffffffff8000cc36>] do_lookup+0xe5/0x1e6 [<ffffffff80009fc7>] __link_path_walk+0xa01/0xf42 [<ffffffff8000e80a>] link_path_walk+0x5c/0xe5 [<ffffffff800b4635>] audit_syscall_entry+0x16e/0x1a1 [<ffffffff8000c9d5>] do_path_lookup+0x270/0x2e8 [<ffffffff800123f7>] getname+0x15b/0x1c1 [<ffffffff800232a0>] __user_walk_fd+0x37/0x4c [<ffffffff80032057>] sys_faccessat+0xe4/0x18d [<ffffffff800b4635>] audit_syscall_entry+0x16e/0x1a1 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 48 8b 98 08 01 00 00 44 88 f9 49 d3 ec 85 d2 74 23 8b 53 60 RIP [<ffffffff885b5687>] :gfs2:gfs2_getbuf+0x2f/0x115 RSP <ffff81001df99ab8> CR2: 0000000000000108 <0>Kernel panic - not syncing: Fatal exception Version-Release number of selected component (if applicable): kernel-2.6.18-129.gfs2abhi.001 How reproducible: Easily Steps to Reproduce: 1. Run d_metaverify with memory pressure Actual results: See above Expected results: No panic Additional info:
The offset of 108 suggests that the problem relates to a buffer_head rather than a page since the struct page is fairly small and almost certainly less than 108 bytes. The function create_empty_buffers() ought not to return until it has actually managed to create a set of buffers. It should be impossible for the created buffers to disappear since we hold the page lock during the subsequent operations. The other thing which makes me think that this refers to a buffer_head is that the offset within the function is fairly high, suggesting that the problem is towards the end of that function. So this is a bit of a mystery at the moment.
Even more of a mystery - buffer_heads are so small that an offset of 108 would be beyond the end of it. There are no other data structures used in that function that are even remotely close to that size. Also if the buffers are causing the issue, then that should have been caught in create_empty_buffers() rather than in gfs2_getbuf. So I'm still not quite sure what is going on here. I have a hunch it might be related to the patch for bz #273001 but I can't say for certain yet.
I was able to recreate this with 2.6.18-140.gfs2abhi.004. I have a test case which creates a ton of files on each node, continuously lists all files on the file system, and applies memory pressure every five minutes. GFS2: fsid=dash:dash0.1: warning: assertion "gl" failed GFS2: fsid=dash:dash0.1: function = gfs2_glock_cb, file = fs/gfs2/glock.c, line = 1370 Call Trace: [<ffffffff88635256>] :gfs2:gfs2_assert_warn_i+0xa1/0xc3 [<ffffffff8009de6e>] keventd_create_kthread+0x0/0xc4 [<ffffffff8861ea3e>] :gfs2:gfs2_glock_cb+0x83/0x122 [<ffffffff8869785c>] :lock_dlm:gdlm_ast+0x306/0x311 [<ffffffff886972c1>] :lock_dlm:gdlm_bast+0x0/0x8d [<ffffffff885ed20e>] :dlm:dlm_astd+0xd7/0x14f [<ffffffff885ed137>] :dlm:dlm_astd+0x0/0x14f [<ffffffff80032bf2>] kthread+0xfe/0x132 [<ffffffff8005dfb1>] child_rip+0xa/0x11 [<ffffffff8009de6e>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032af4>] kthread+0x0/0x132 [<ffffffff8005dfa7>] child_rip+0x0/0x11 Unable to handle kernel NULL pointer dereference at 0000000000000108 RIP: [<ffffffff8862569f>] :gfs2:gfs2_getbuf+0x2f/0x115 PGD 12c7d0067 PUD 12b92f067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /fs/gfs2/dash:dash0/lock_module/recover_done CPU 0 Modules linked in: deflate zlib_deflate ccm serpent blowfish twofish ecb xcbc crypto_hash cbc md5 sha256 sha512 des aes_generic testmgr_cipher testmgr crypto_blkcipher aes_x86_64 ipcomp6 ipcomp ah6 ah4 esp6 xfrm6 _esp esp4 xfrm4_esp aead crypto_algapi xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_tunnel xfrm6_tunnel tunnel6 af_key autofs4 hidp rfcomm l2cap bluetooth dm_log_clu stered(U) lock_dlm gfs2 dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath scsi_dh video hwmon backlight s bs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i3000_edac shpchp tg3 edac_mc libphy i2c_i801 i2c_core pcspkr serio_raw qla2xxx ide_cd scsi_transport_fc cdrom dm_raid45 dm_message d m_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 9981, comm: ls Tainted: G 2.6.18-140.gfs2abhi.004 #1 RIP: 0010:[<ffffffff8862569f>] [<ffffffff8862569f>] :gfs2:gfs2_getbuf+0x2f/0x115 RSP: 0018:ffff810132985ae8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff81012d81baa0 RCX: ffff810132985b58 RDX: 0000000000000001 RSI: 0000000013648d5c RDI: ffff81012d81baa0 RBP: ffff810132985b58 R08: ffff810132985bb8 R09: 0000000013648d5c R10: ffff81014b1f3018 R11: ffffffff8002c7df R12: 0000000013648d5c R13: ffff8101372f2000 R14: 0000000013648d5c R15: 0000000000000000 FS: 00002af796d92c10(0000) GS:ffffffff803b0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000108 CR3: 000000012c7e2000 CR4: 00000000000006e0 Process ls (pid: 9981, threadinfo ffff810132984000, task ffff81012c4c70c0) Stack: 0000000000000080 ffff81012d81baa0 ffff810132985b58 0000000000000010 ffff810132985bb8 0000000000000008 ffff8101372f2000 ffffffff8862579c 00000000ffffffff ffff810132985c68 ffff8101372f2000 ffffffff88625b9a Call Trace: [<ffffffff8862579c>] :gfs2:gfs2_meta_read+0x17/0x65 [<ffffffff88625b9a>] :gfs2:gfs2_meta_indirect_buffer+0xba/0x160 [<ffffffff8861853e>] :gfs2:gfs2_dir_read_data+0x7f/0x2c8 [<ffffffff88625776>] :gfs2:gfs2_getbuf+0x106/0x115 [<ffffffff88618b9b>] :gfs2:gfs2_dirent_find+0x0/0x4e [<ffffffff886187ae>] :gfs2:get_leaf_nr+0x27/0x48 [<ffffffff88618b9b>] :gfs2:gfs2_dirent_find+0x0/0x4e [<ffffffff886187e5>] :gfs2:get_first_leaf+0x16/0x31 [<ffffffff8002d440>] wake_up_bit+0x11/0x22 [<ffffffff88618883>] :gfs2:gfs2_dirent_search+0x83/0x167 [<ffffffff88619e8d>] :gfs2:gfs2_dir_check+0x24/0xb9 [<ffffffff88627c20>] :gfs2:gfs2_drevalidate+0x128/0x214 [<ffffffff88627bf5>] :gfs2:gfs2_drevalidate+0xfd/0x214 [<ffffffff80009cc4>] __d_lookup+0xb0/0xff [<ffffffff8000d3de>] do_lookup+0x18f/0x1e6 [<ffffffff8000a714>] __link_path_walk+0xa01/0xf42 [<ffffffff8000edf3>] link_path_walk+0x42/0xb2 [<ffffffff8000d0d4>] do_path_lookup+0x270/0x2e7 [<ffffffff80012aa6>] getname+0x15b/0x1c2 [<ffffffff8002392f>] __user_walk_fd+0x37/0x4c [<ffffffff800ed9f6>] sys_getxattr+0x2b/0x62 [<ffffffff8005d229>] tracesys+0x71/0xe0 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 48 8b 98 08 01 00 00 44 88 f9 49 d3 ec 85 d2 74 23 8b 53 60 RIP [<ffffffff8862569f>] :gfs2:gfs2_getbuf+0x2f/0x115
That makes a lot more sense than the last trace... I can start to see what is going on now. That assertion (only a warning in fact) is telling us that a reply has been received from the dlm via lock_dlm for which the glock appears to have gone away. That should be impossible since we hold ref counts for glocks when there are outstanding requests. I suspect that gfs2_getbuf has failed because the ref count is no longer correct on the glock and it has already gone when gfs2_getbuf is called. So it looks like if we find the ref count issue, then we have solved this one.
Looking at the differences between upstream and RHEL gfs2_glock_put(), I can see this: upstream: /* 1 for being hashed, 1 for having state != LM_ST_UNLOCKED */ if (atomic_read(&gl->gl_ref) == 2) gfs2_glock_schedule_for_reclaim(gl); write_unlock(gl_lock_addr(gl->gl_hash)); out: return rv; rhel: write_unlock(gl_lock_addr(gl->gl_hash)); /* 1 for being hashed, 1 for having state != LM_ST_UNLOCKED */ if (atomic_read(&gl->gl_ref) == 2) gfs2_glock_schedule_for_reclaim(gl); out: return rv; So it might have been an error while converting the scand&glockd patch to the RHEL kernel perhaps. The rhel kernel might land up accessing the ref count after the glock has been freed if there is a race here. Its possible that it isn't the issue that we are chasing, but it needs fixing anyway.
Created attachment 341724 [details] Proposed fix Lets see if this makes any difference...
That looks to have moved the panic to somewhere else. Running kernel-2.6.18-142.gfs2abhi.001 G: s:SH n:5/18951 f:lDp t:UN d:UN/166000 l:0 a:0 r:0 ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at fs/gfs2/glock.c:247 invalid opcode: 0000 [1] SMP last sysfs file: /kernel/dlm/dash0/control CPU 0 Modules linked in: nls_utf8 hfsplus deflate zlib_deflate ccm serpent blowfish twofish ecb xcbc crypto_hash cbc md5 sha256 sha512 des aes_generic testmgr_cipher testmgr crypto_blkcipher aes_x86_64 ipcomp6 ipcomp a h6 ah4 esp6 xfrm6_esp esp4 xfrm4_esp aead crypto_algapi xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_tunnel xfrm6_tunnel tunnel6 af_key autofs4 hidp rfcomm l2cap blu etooth dm_log_clustered(U) lock_dlm gfs2 dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg i3000_edac qla2xxx i2c_i801 pcspkr scsi_transport_fc i2c_core tg3 edac_mc ide_cd cdrom shpchp serio_raw dm_raid45 dm _message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 8466, comm: glock_workqueue Tainted: G 2.6.18-142.gfs2abhi.001 #1 RIP: 0010:[<ffffffff885ed550>] [<ffffffff885ed550>] :gfs2:gfs2_glock_put+0xba/0x176 RSP: 0000:ffff810074561dd0 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff8100082381a8 RCX: 0000000000000282 RDX: 00000000ffffffff RSI: 0000000000000046 RDI: ffffffff802fdadc RBP: ffff810008238240 R08: 000000000000000d R09: 000000000000003f R10: ffffffff803e0520 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: ffffffff886048e0 R15: ffffffff885ee793 FS: 0000000000000000(0000) GS:ffffffff803b3000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00002ab8d935e000 CR3: 0000000066236000 CR4: 00000000000006e0 Process glock_workqueue (pid: 8466, threadinfo ffff810074560000, task ffff81007da077e0) Stack: 0000000000000000 ffff8100082381a8 ffff810008238298 ffffffff885edd27 ffff8100082381a8 ffff810008238298 ffff81007631c9c0 0000000000000246 ffff8100082381a8 ffffffff885ee7b0 ffff810068b691a8 ffff810008238290 Call Trace: [<ffffffff885edd27>] :gfs2:finish_xmote+0x48/0x2b7 [<ffffffff885ee7b0>] :gfs2:glock_work_func+0x1d/0xb8 [<ffffffff8004db3b>] run_workqueue+0x94/0xe4 [<ffffffff8004a3a6>] worker_thread+0x0/0x122 [<ffffffff8009ee57>] keventd_create_kthread+0x0/0xc4 [<ffffffff8004a496>] worker_thread+0xf0/0x122 [<ffffffff8008b81c>] default_wake_function+0x0/0xe [<ffffffff8009ee57>] keventd_create_kthread+0x0/0xc4 [<ffffffff8009ee57>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032d4e>] kthread+0xfe/0x132 [<ffffffff8005efb1>] child_rip+0xa/0x11 [<ffffffff8009ee57>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032c50>] kthread+0x0/0x132 [<ffffffff8005efa7>] child_rip+0x0/0x11 Code: 0f 0b 68 d8 62 60 88 c2 f7 00 48 39 ab 98 00 00 00 74 14 48 RIP [<ffffffff885ed550>] :gfs2:gfs2_glock_put+0xba/0x176 RSP <ffff810074561dd0> <0>Kernel panic - not syncing: Fatal exception
Well there is no doubt that this is a bug relating to glock ref counts as this very clearly shows that. There are three calls to gfs2_glock_put() from finish_xmote if you include the indirect one via state_change(). One of them is easy to rule out - its after we clear the lock bit on the glock. One of them is more tricky to rule out, but I think we can do so due to a lack of holders on this glock. That leaves only the gfs2_glock_put() that is in state_change(). At that point there should always be a count of at least 1 since we hold a ref count from about half way through do_xmote() until the end of finish_xmote(). That was specifically to deal with this case of demoting a lock, since otherwise we have a ref count from the holder when we are doing promotion of the glock. There is a strong possibility that the error is elsewhere though, and it just happens to show up at this point because we are one short on the ref counting.
Created attachment 341936 [details] Updated patch This is now two lines, rather than one. I think I've found the real bug this time. This would explain why its tricky to hit in "normal" operation, why its more likely to happen with memory pressure, and why we see it in the places its shown up far. The bug is also in upstream, so I'll post a patch for it to cluster-devel shortly. Fingers crossed...
I hit this while running on abhi's 2.6.18-142.gfs2abhi.002 which includes the patch from comment #9. G: s:EX n:2/29b1fbc f:lDpr t:UN d:UN/18003000 l:0 a:00 ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at fs/gfs2/glock.c:247 invalid opcode: 0000 [1] SMP last sysfs file: /fs/gfs2/dash:dash0/lock_module/recover_done CPU 0 Modules linked in: deflate zlib_deflate ccm serpent blowfish twofish ecb xcbc cd Pid: 173, comm: kswapd0 Tainted: G 2.6.18-142.gfs2abhi.002 #1 RIP: 0010:[<ffffffff885e4550>] [<ffffffff885e4550>] :gfs2:gfs2_glock_put+0xba/6 RSP: 0000:ffff81007fb93d40 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff810066769810 RCX: 0000000000000282 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff8100667698a8 R08: 000000000000000d R09: 0000000000000000 R10: ffffffff803e0520 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000077 R14: 0000000000009e04 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffffffff803b3000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00002abc0c0eb000 CR3: 000000001ebe1000 CR4: 00000000000006e0 Process kswapd0 (pid: 173, threadinfo ffff81007fb92000, task ffff810037ecf100) Stack: ffff8100667698a8 ffff8100667698a8 ffff810066769810 ffffffff885e5a94 ffff810062721388 ffff810066015dc8 ffff810079853f00 0000000000013f24 0000000000000092 00000000000000d0 0000000000000000 ffffffff8003f7c1 Call Trace: [<ffffffff885e5a94>] :gfs2:gfs2_shrink_glock_memory+0x158/0x1fa [<ffffffff8003f7c1>] shrink_slab+0xdc/0x153 [<ffffffff80058089>] kswapd+0x35a/0x45a [<ffffffff8009f06f>] autoremove_wake_function+0x0/0x2e [<ffffffff8009ee57>] keventd_create_kthread+0x0/0xc4 [<ffffffff80057d2f>] kswapd+0x0/0x45a [<ffffffff8009ee57>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032d4e>] kthread+0xfe/0x132 [<ffffffff8009bcde>] request_module+0x0/0x14d [<ffffffff8005efb1>] child_rip+0xa/0x11 [<ffffffff8009ee57>] keventd_create_kthread+0x0/0xc4 [<ffffffff80032c50>] kthread+0x0/0x132 [<ffffffff8005efa7>] child_rip+0x0/0x11 Code: 0f 0b 68 d8 d2 5f 88 c2 f7 00 48 39 ab 98 00 00 00 74 14 48 RIP [<ffffffff885e4550>] :gfs2:gfs2_glock_put+0xba/0x176 RSP <ffff81007fb93d40> <0>Kernel panic - not syncing: Fatal exception
Results from comment #10 were without the final patch for this bug, due to a mix up. The final patch has been tested and does work. This is a regression introduced by the recent scand & glockd patch and needs to be fixed before 5.4.
Please can those of you with flag setting powers, set the various required flags. Thanks, Steve.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-146.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
I ran the test case for about 18 hours and I was not able to reproduce this panic. Marking VERIFIED based on kernel-2.6.18-154.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html