Bug 589253 - Kernel BUG at lib/list_debug.c:65 -- backtrace implicates dlm
Summary: Kernel BUG at lib/list_debug.c:65 -- backtrace implicates dlm
Keywords:
Status: CLOSED DUPLICATE of bug 555754
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Assignee: Abhijith Das
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-05-05 16:35 UTC by Scooter Morris
Modified: 2012-12-12 10:07 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-03 16:20:45 UTC
Target Upstream Version:
Embargoed:



Description Scooter Morris 2010-05-05 16:35:34 UTC
Description of problem: Kernel oops on a 3-node cluster with mixed gfs2 and ext3 file systems, including one gfs2 file system mounted with quota=on.  Tape dumps were running at the time, and in the period leading up to the crash we saw some blocked-task messages.


Version-Release number of selected component (if applicable): 2.6.18-194.el5 (5.5)


How reproducible: Random


Additional info:

Task block messages: 
[2010-05-05 08:52:41]INFO: task tar:767 blocked for more than 120 seconds.
[2010-05-05 08:54:41]"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2010-05-05 08:54:42]tar           D ffff810001047820     0   767    766   768               (NOTLB)
[2010-05-05 08:54:42] ffff810290337d58 0000000000000086 0000000000000001 ffff81037c449b08
[2010-05-05 08:54:42] ffff81082d6b3968 0000000000000007 ffff8105ca75d0c0 ffff81011cb35080
[2010-05-05 08:54:42] 0001b7586c8d037e 0000000000024586 ffff8105ca75d2a8 000000088807aa5a
[2010-05-05 08:54:42]Call Trace:
[2010-05-05 08:54:42] [<ffffffff80064167>] wait_for_completion+0x79/0xa2
[2010-05-05 08:54:42] [<ffffffff8008e16d>] default_wake_function+0x0/0xe
[2010-05-05 08:54:42] [<ffffffff88399e85>] :st:st_do_scsi+0x1f4/0x221
[2010-05-05 08:54:42] [<ffffffff8839d3ed>] :st:st_write+0x5b9/0xaac
[2010-05-05 08:54:42] [<ffffffff8002e511>] __wake_up+0x38/0x4f
[2010-05-05 08:54:42] [<ffffffff80016a49>] vfs_write+0xce/0x174
[2010-05-05 08:54:42] [<ffffffff80017316>] sys_write+0x45/0x6e
[2010-05-05 08:54:42] [<ffffffff8005e28d>] tracesys+0xd5/0xe0

Kernel backtrace:

[2010-05-05 09:08:02]list_del corruption. prev->next should be ffff81068e28e040, but was 000000008e28e040
[2010-05-05 09:08:02]----------- [cut here ] --------- [please bite here ] ---------
[2010-05-05 09:08:02]Kernel BUG at lib/list_debug.c:65
[2010-05-05 09:08:02]invalid opcode: 0000 [1] SMP
[2010-05-05 09:08:02]last sysfs file: /devices/pci0000:00/0000:00:06.0/0000:0b:00.0/0000:0c:09.0/0000:0d:00.0/host0/rport-0:0-5/target0:0:4/0:0:4:7/timeout
[2010-05-05 09:08:02]CPU 7
[2010-05-05 09:08:02]Modules linked in: ipt_MASQUERADE iptable_nat ip_nat bridge autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2 dlm configfs lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle arptable_filter arp_tables x_tables ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st ide_cd hpilo sg cdrom bnx2 pcspkr serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
[2010-05-05 09:08:03]Pid: 7262, comm: dlm_astd Not tainted 2.6.18-194.el5 #1
[2010-05-05 09:08:03]RIP: 0010:[<ffffffff80154d76>]  [<ffffffff80154d76>] list_del+0x21/0x71
[2010-05-05 09:08:03]RSP: 0018:ffff81081c169dd0  EFLAGS: 00010082
[2010-05-05 09:08:03]RAX: 0000000000000058 RBX: ffff81068e28e040 RCX: ffffffff80312da8
[2010-05-05 09:08:03]RDX: ffffffff80312da8 RSI: 0000000000000000 RDI: ffffffff80312da0
[2010-05-05 09:08:03]RBP: ffff81082cd6de40 R08: ffffffff80312da8 R09: 0000000000000001
[2010-05-05 09:08:03]R10: 0000000000000000 R11: 0000000000000280 R12: ffff81081f93f200
[2010-05-05 09:08:03]R13: ffff81068e28edd8 R14: 000000000000003b R15: 0000000000000000
[2010-05-05 09:08:04]FS:  0000000000000000(0000) GS:ffff81082fead340(0000) knlGS:0000000000000000
[2010-05-05 09:08:04]CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[2010-05-05 09:08:04]CR2: 00002aaaac0010a8 CR3: 000000080ec73000 CR4: 00000000000006e0
[2010-05-05 09:08:04]Process dlm_astd (pid: 7262, threadinfo ffff81081c168000, task ffff81082df087e0)
[2010-05-05 09:08:04]Stack:  ffff81068e28e040 ffffffff800dcc8b ffff81011cb24000 0000003cfedb36a8
[2010-05-05 09:08:04] ffff81082d2f1818 000000000000003c ffff81082d2f1800 ffff81082cd6de40
[2010-05-05 09:08:04] 0000000000000000 ffff81081f93f200 ffffffff800a198c ffffffff800dce4d
[2010-05-05 09:08:04]Call Trace:
[2010-05-05 09:08:04] [<ffffffff800dcc8b>] free_block+0xb5/0x143
[2010-05-05 09:08:04] [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[2010-05-05 09:08:04] [<ffffffff800dce4d>] cache_flusharray+0x74/0xa3
[2010-05-05 09:08:04] [<ffffffff80007684>] kmem_cache_free+0x1c2/0x1dd
[2010-05-05 09:08:04] [<ffffffff887c56ce>] :dlm:__put_lkb+0xe8/0x106
[2010-05-05 09:08:04] [<ffffffff8886c2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
[2010-05-05 09:08:04] [<ffffffff887c0240>] :dlm:dlm_astd+0x109/0x14f
[2010-05-05 09:08:04] [<ffffffff887c0137>] :dlm:dlm_astd+0x0/0x14f
[2010-05-05 09:08:04] [<ffffffff80032bdc>] kthread+0xfe/0x132
[2010-05-05 09:08:04] [<ffffffff8005efb1>] child_rip+0xa/0x11
[2010-05-05 09:08:05] [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[2010-05-05 09:08:05] [<ffffffff80032ade>] kthread+0x0/0x132
[2010-05-05 09:08:05] [<ffffffff8005efa7>] child_rip+0x0/0x11
[2010-05-05 09:08:05]
[2010-05-05 09:08:05]
[2010-05-05 09:08:05]Code: 0f 0b 68 a2 ca 2b 80 c2 41 00 48 8b 03 48 8b 50 08 48 39 da
[2010-05-05 09:08:05]RIP  [<ffffffff80154d76>] list_del+0x21/0x71
[2010-05-05 09:08:05] RSP <ffff81081c169dd0>
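
One detail in the "list_del corruption" message above: the value found in prev->next has the same low 32 bits as the expected pointer, and its upper 32 bits are zero. The short Python check below only redoes that arithmetic on the two values copied from the log. It describes the symptom and does not by itself identify which code did the overwrite.

```python
# Values copied from the "list_del corruption" line in the backtrace above.
expected = 0xffff81068e28e040  # "should be" value: the entry being deleted (also in RBX)
found = 0x000000008e28e040     # what prev->next actually contained

diff = expected ^ found  # bits that differ between the two values
low32_match = (expected & 0xffffffff) == (found & 0xffffffff)
high32_cleared = (found >> 32) == 0

print(f"differing bits:    {diff:#018x}")
print(f"low 32 bits match: {low32_match}")
print(f"high 32 bits zero: {high32_cleared}")
```

A pointer whose upper half has been zeroed is consistent with something writing a 32-bit zero over half of the list pointer, though that reading is an assumption and other causes are possible.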

Comment 1 Scooter Morris 2010-05-06 04:52:49 UTC
This is now happening repeatedly.  Not sure what's changed on our systems, but we've seen several crashes today.  Here is the most recent example:

[2010-05-05 20:25:47]Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[2010-05-05 20:25:47] [<ffffffff80154d5d>] list_del+0x8/0x71
[2010-05-05 20:25:47]PGD 7fda51067 PUD 7ffcb0067 PMD 0
[2010-05-05 20:25:47]Oops: 0000 [1] SMP
[2010-05-05 20:25:47]last sysfs file: /devices/pci0000:04/0000:04:09.0/0000:05:0d.0/host0/rport-0:0-3/target0:0:3/0:0:3:7/timeout
[2010-05-05 20:25:47]CPU 4
[2010-05-05 20:25:47]Modules linked in: ipt_MASQUERADE iptable_nat ip_nat bridge autofs4 vmnet(U) vmblock(U) vmci(U) vmmon(U) hidp l2cap bluetooth lock_dlm gfs2 dlm configfs lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle arptable_filter arp_tables x_tables cpufreq_ondemand powernow_k8 freq_table ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_amd756 ide_cd k8_edac floppy i2c_core hpilo k8temp tg3 edac_mc serio_raw pcspkr hwmon cdrom amd_rng sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
[2010-05-05 20:25:48]Pid: 5639, comm: gfs2_quotad Tainted: G      2.6.18-194.el5 #1
[2010-05-05 20:25:48]RIP: 0010:[<ffffffff80154d5d>]  [<ffffffff80154d5d>] list_del+0x8/0x71
[2010-05-05 20:25:49]RSP: 0018:ffff8101ea3479a0  EFLAGS: 00010002
[2010-05-05 20:25:49]RAX: 0000000000000000 RBX: ffff8102ce42f000 RCX: 0000000000000002
[2010-05-05 20:25:49]RDX: 0000000000000000 RSI: ffff8102ce42f000 RDI: ffff8102ce42f000
[2010-05-05 20:25:49]RBP: ffff8103ffcc4a40 R08: 0000000000000001 R09: ffff8107ff3f27e0
[2010-05-05 20:25:49]R10: ffff810508de6630 R11: 0000000000000000 R12: 0000000000000000
[2010-05-05 20:25:49]R13: 0000000000000001 R14: 0000000000011220 R15: ffffffff800154ce
[2010-05-05 20:25:49]FS:  00002b425a0592d0(0000) GS:ffff8104071573c0(0000) knlGS:0000000000000000
[2010-05-05 20:25:49]CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[2010-05-05 20:25:49]CR2: 0000000000000000 CR3: 00000007ffd40000 CR4: 00000000000006e0
[2010-05-05 20:25:49]Process gfs2_quotad (pid: 5639, threadinfo ffff8101ea346000, task ffff8107ff3f27e0)
[2010-05-05 20:25:49]Stack:  ffff8102ce42f000 ffffffff800dc624 0000000000000046 0000000000011220
[2010-05-05 20:25:49] ffff8107ffa094c0 0000000000011220 ffff8105ffdcc048 ffffffff8000abeb
[2010-05-05 20:25:49] 0000000000000220 ffff8106073dfdc0 ffff8105fff294a8 ffffffff800231ce
[2010-05-05 20:25:49]Call Trace:
[2010-05-05 20:25:49] [<ffffffff800dc624>] __cache_alloc_node+0x88/0xd2
[2010-05-05 20:25:49] [<ffffffff8000abeb>] kmem_cache_alloc+0x34/0x76
[2010-05-05 20:25:50] [<ffffffff800231ce>] mempool_alloc+0x31/0xe7
[2010-05-05 20:25:50] [<ffffffff88075dd3>] :scsi_mod:scsi_get_command+0xe1/0x102
[2010-05-05 20:25:50] [<ffffffff8807b4a5>] :scsi_mod:scsi_prep_fn+0x262/0x3fb
[2010-05-05 20:25:50] [<ffffffff80143c57>] elv_next_request+0xb7/0x178
[2010-05-05 20:25:50] [<ffffffff8807af1d>] :scsi_mod:scsi_request_fn+0x6a/0x390
[2010-05-05 20:25:50] [<ffffffff8005addf>] generic_unplug_device+0x22/0x32
[2010-05-05 20:25:50] [<ffffffff88212c2c>] :dm_mod:dm_table_unplug_all+0x3f/0x83
[2010-05-05 20:25:50] [<ffffffff88211955>] :dm_mod:dm_request+0x11d/0x124
[2010-05-05 20:25:50] [<ffffffff88210d80>] :dm_mod:dm_unplug_all+0x1d/0x28
[2010-05-05 20:25:50] [<ffffffff80015504>] sync_buffer+0x36/0x3f
[2010-05-05 20:25:50] [<ffffffff80064a16>] __wait_on_bit+0x40/0x6e
[2010-05-05 20:25:50] [<ffffffff800154ce>] sync_buffer+0x0/0x3f
[2010-05-05 20:25:50] [<ffffffff80064ab0>] out_of_line_wait_on_bit+0x6c/0x78
[2010-05-05 20:25:50] [<ffffffff800a1bd2>] wake_bit_function+0x0/0x23
[2010-05-05 20:25:50] [<ffffffff8003aca8>] sync_dirty_buffer+0x96/0xcb
[2010-05-05 20:25:50] [<ffffffff8880ddc8>] :gfs2:log_write_header+0x10e/0x336
[2010-05-05 20:25:50] [<ffffffff8880e3ac>] :gfs2:gfs2_log_flush+0x3bc/0x472
[2010-05-05 20:25:50] [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
[2010-05-05 20:25:50] [<ffffffff8881a9c0>] :gfs2:do_sync+0x59e/0x5bb
[2010-05-05 20:25:51] [<ffffffff8881b3fa>] :gfs2:gfs2_quota_sync+0x1fd/0x268
[2010-05-05 20:25:51] [<ffffffff888193bf>] :gfs2:quotad_check_timeo+0x20/0x60
[2010-05-05 20:25:51] [<ffffffff8881aea2>] :gfs2:gfs2_quotad+0x105/0x214
[2010-05-05 20:25:51] [<ffffffff800a1ba4>] autoremove_wake_function+0x0/0x2e
[2010-05-05 20:25:51] [<ffffffff8881ad9d>] :gfs2:gfs2_quotad+0x0/0x214
[2010-05-05 20:25:51] [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[2010-05-05 20:25:51] [<ffffffff80032bdc>] kthread+0xfe/0x132
[2010-05-05 20:25:51] [<ffffffff8005efb1>] child_rip+0xa/0x11
[2010-05-05 20:25:51] [<ffffffff800a198c>] keventd_create_kthread+0x0/0xc4
[2010-05-05 20:25:51] [<ffffffff80032ade>] kthread+0x0/0x132
[2010-05-05 20:25:51] [<ffffffff8005efa7>] child_rip+0x0/0x11
[2010-05-05 20:25:51]
[2010-05-05 20:25:51]
[2010-05-05 20:25:51]Code: 48 8b 10 48 39 fa 74 1b 48 89 fe 31 c0 48 c7 c7 65 ca 2b 80
[2010-05-05 20:25:51]RIP  [<ffffffff80154d5d>] list_del+0x8/0x71
[2010-05-05 20:25:51] RSP <ffff8101ea3479a0>
[2010-05-05 20:25:51]CR2: 0000000000000000


Any suggestions would be really appreciated!
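
When comparing crashes whose signatures differ, it can help to tally which modules appear in each call trace. The sketch below is a hypothetical triage helper, not an existing tool: the function name `modules_in_trace` and the label "vmlinux" for core-kernel frames are made up for illustration, and the regular expression assumes the frame format seen in the logs above.

```python
import re
from collections import Counter

# Matches frames in the format used by the logs above, for example:
#   [<ffffffff887c56ce>] :dlm:__put_lkb+0xe8/0x106
#   [<ffffffff800dcc8b>] free_block+0xb5/0x143
FRAME = re.compile(
    r"\[<([0-9a-f]{16})>\]\s+(?::(\w+):)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def modules_in_trace(log_text):
    """Count frames per module; frames with no module prefix count as 'vmlinux'."""
    counts = Counter()
    for match in FRAME.finditer(log_text):
        counts[match.group(2) or "vmlinux"] += 1
    return counts

# A few frames copied from the second oops above.
sample = """
[2010-05-05 20:25:50] [<ffffffff88075dd3>] :scsi_mod:scsi_get_command+0xe1/0x102
[2010-05-05 20:25:50] [<ffffffff8003aca8>] sync_dirty_buffer+0x96/0xcb
[2010-05-05 20:25:50] [<ffffffff8880ddc8>] :gfs2:log_write_header+0x10e/0x336
[2010-05-05 20:25:50] [<ffffffff8880e3ac>] :gfs2:gfs2_log_flush+0x3bc/0x472
"""
print(modules_in_trace(sample))
```

Reading the two traces this way, the first goes through dlm and lock_dlm while the second goes through gfs2, dm_mod and scsi_mod. In both, the faulting list_del is reached from the slab allocator (free_block in the first, __cache_alloc_node in the second).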

Comment 2 Scooter Morris 2010-05-13 22:55:06 UTC
Further information.  The cluster continued to crash, with one of the three nodes crashing with a signature similar to (but not always the same as) the one above.  We finally brought the entire cluster down and ran fsck.gfs2 on all of the gfs2 filesystems.  We found that two filesystems were seriously corrupted: /usr/local, and /home/socr/nobackup (used for Time Machine backups via Samba).  fsck.gfs2 was exceedingly slow on one other filesystem -- to the extent that we found it quicker to copy the file system to a new disk than to fsck it.

In case it was related, we also disabled quotas on all gfs2 filesystems (/home/socr/nobackup had quotas enabled, but /usr/local did not).  After rebooting all nodes, the cluster returned to stable operation.  We've been up for 7 days.

Comment 4 Abhijith Das 2011-03-03 16:20:45 UTC
I agree with Steve here; this looks like bug 555754. The patch is available in kernel-2.6.18-198.el5 and later. I'm marking this as a duplicate of that bug. Please reopen if the problem persists with a newer kernel.

*** This bug has been marked as a duplicate of bug 555754 ***
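
For anyone checking whether a node already runs a kernel with the fix, the comparison comes down to the build number after "2.6.18-": the reporter's kernel is build 194 and the fix is stated to be in build 198 and later. A minimal sketch, assuming release strings shaped like the output of `uname -r` on RHEL 5; the helper names `rhel5_build` and `has_fix` are hypothetical.

```python
import re

FIXED_BUILD = 198  # from comment 4: patch is in kernel-2.6.18-198.el5 and later

def rhel5_build(release):
    """Return the build number from a release string such as '2.6.18-194.el5'."""
    m = re.match(r"2\.6\.18-(\d+)(?:\.[\d.]+)?\.el5", release)
    if m is None:
        raise ValueError(f"not a RHEL 5 2.6.18 kernel release: {release!r}")
    return int(m.group(1))

def has_fix(release):
    """True if the main build number is at or beyond the first fixed build."""
    return rhel5_build(release) >= FIXED_BUILD

print(has_fix("2.6.18-194.el5"))  # the kernel named in this report
```

This only looks at the main build number. Errata kernels with a suffix such as 2.6.18-194.x.y.el5 may or may not carry a backport of the patch, so for those the package changelog is the place to check.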

