Created attachment 915160 [details] Comment (This comment was longer than 65,535 characters and has been moved to an attachment by Red Hat Bugzilla).
From the bug message:

G: s:UN n:2/7cd19 f: t:UN d:EX/0 l:0 a:0 r:0

we can see that this is an inode glock. We can also see which inode it refers to (not that useful in this case, I suspect) and that it looks like it's newly created and thus unused at the point where the problem occurs. The ref count is certainly zero, so it is probably an imbalance in get/put for this glock. I've looked at this patch:

http://git.engineering.redhat.com/?p=users/jwilson/rhel5/kernel;a=commitdiff;h=fb5f8cb678bd05943c62be1abea2e2917895cc85

which is the one which adds the code in question (a relatively recent addition), but it is not obvious to me where the error is at the moment.
Could this be how we're ending up in gfs2_glock_hold from gfs2_process_unlinked_inode?

gfs2_process_unlinked_inode() at inode.c:262
 -> gfs2_glock_get() at glock.c:802
 -> glock_work_func() at glock.c:758
 -> finish_xmote() at glock.c:522 (R)
 -> state_change() at glock.c:487
 -> gfs2_glock_hold() at glock.c:207
 -> GLOCK_BUG_ON()

I don't know if it's helpful at all, but I was trying to figure out how we were getting to gfs2_glock_hold.
RE: Comment #2. I don't think that is very likely. A more likely explanation is that somewhere we have an unbalanced use of ref counts, and that has caused a problem elsewhere in the code. The chances are that the call to gfs2_glock_hold is from the holder init code, judging by what is in the functions further up the stack.
I think we might have tracked down the cause of the problem, and it does look like my original idea in comment #2 is correct.

When GFS2 does block allocation, it also performs a check for inodes which are (a) unlinked and (b) still listed as allocated. These occasionally occur in the filesystem if there has been a recovery situation. GFS2 checks for them at block allocation time because at that point the check is virtually "free", in the sense that the data structures required are already cached in memory during the search for free blocks. This is basically the equivalent of ext3's orphan list, and it is perfectly normal for an inode to be found and processed in this way.

The issue introduced in the earlier patch is that the lookup of the inode which is scheduled to be deallocated should have occurred under the resource group lock, to prevent multiple processes trying to do the same thing at once. Instead the inode number was passed out to the deallocation code, so the required lock was not being held at the right time.

There is a possible workaround. Running a recent fsck.gfs2 (after unmounting the fs on all nodes) will remove any unlinked but still allocated inodes, meaning that the cluster should continue running without hitting this bug. This might be useful until we can get the fix ready. If you need any more info, please let us know.
Did the customer have any nodes reboot between the fsck run and the following failure? If so, then my original thoughts on what was causing this may be right, but if not then I'll have to think again, since it would have been impossible for my original theory to be correct.
Is the customer in a position to try a test kernel? I don't have one ready yet, but it would be useful to know in case we can debug it that way.
Here is the customer's response to your question in comment 9. Note that I had them run the fsck.gfs2 from Bob's people page:

"None of the nodes were rebooted between the time I ran Bob's fsck.gfs2 and the time I ran my test. Here is a detailed step-by-step of what I did:
1. Reboot all nodes in the cluster (it was still at the hung state from the last test).
2. Manually stop application cluster services (using clusvcadm) on all nodes.
3. Manually unmount all GFS2 filesystems (umount -t gfs2 -a) on all nodes.
4. Run Bob's fsck.gfs2 -y /device_name on all GFS2 FSs from node 01. The check found lots of inode inconsistencies on all filesystems and fixed them. Ran it again and it came back clean.
5. Re-mount all GFS2 mounts.
6. Manually restart application cluster services and start my test.
Some time later (3.5 hrs or so), node 01 panicked."

To address: "Is the customer in a position to try a test kernel? I don't have one ready yet, but it would be useful to know in case we can debug it that way." >> The customer has a reproducer that hits this issue within hours 100% of the time. They are up for testing any package and have even volunteered access for any RH personnel who want it, if that will help resolve this issue.
Created attachment 455771 [details] First attempt at a crosswrite This is a RHEL5.6 crosswrite of the previous patch. I threw it together fast, so it should be double-checked.
Here is the trace from the log:

sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
    Add. Sense: No defect spare location available
Info fld=0x74c8abb
end_request: I/O error, dev sda, sector 122456763
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
    Add. Sense: No defect spare location available
Info fld=0x74c8abb
end_request: I/O error, dev sda, sector 122456763
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
    Add. Sense: No defect spare location available
Info fld=0x74c8abb
end_request: I/O error, dev sda, sector 122456763
G: s:SH n:5/7ffa5 f:Dq t:SH d:UN/0 l:0 a:0 r:4
 H: s:SH f:EH e:0 p:19436 [(ended)] gfs2_inode_lookup+0x112/0x21b [gfs2]
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/gfs2/glock.c:743
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:07.0/0000:05:00.0/0000:06:00.0/irq
CPU 1
Modules linked in: ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge nfs fscache nfsd exportfs nfs_acl auth_rpcgss autofs4 hidp l2cap bluetooth dm_log_clustered(U) lock_dlm gfs2 dlm configfs lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ksm(U) kvm_intel(U) kvm(U) i5000_edac tpm_tis tpm edac_mc pcspkr tpm_bios bnx2 dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod mppVhba(U) usb_storage qla2xxx scsi_transport_fc ata_piix libata shpchp mptsas mptscsih mptbase scsi_transport_sas mppUpper(U) sg sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4383, comm: delete_workqueu Tainted: G      2.6.18-228.00333124.1.el5 #1
RIP: 0010:[<ffffffff8884ada4>]  [<ffffffff8884ada4>] :gfs2:delete_work_func+0x22/0x82
RSP: 0018:ffff81082038be20  EFLAGS: 00010282
RAX: 0000000000000000 RBX: 000000000007ffa5 RCX: ffffffff80318f28
RDX: ffffffff80318f28 RSI: 0000000000000082 RDI: ffffffff80318f20
RBP: ffff810802df2c30 R08: ffffffff80318f28 R09: 0000000000000001
R10: ffffffff88862e93 R11: ffff81082038bcb0 R12: ffff81081d72f000
R13: 0000000000000282 R14: ffff810802df2c30 R15: ffffffff8884ad82
FS:  0000000000000000(0000) GS:ffff81011cb767c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b837c8172ec CR3: 0000000824209000 CR4: 00000000000026e0
Process delete_workqueu (pid: 4383, threadinfo ffff81082038a000, task ffff8108287c70c0)
Stack:  ffff810802df2d78 ffff810802df2d80 ffff81081f7c7ec0 ffffffff8004d7aa
 ffff81082038be80 ffff81081f7c7ec0 ffffffff80049ff2 ffff810820091d98
 0000000000000282 ffff810820091d88 ffffffff800a267e ffffffff8004a0e2
Call Trace:
 [<ffffffff8004d7aa>] run_workqueue+0x99/0xf6
 [<ffffffff80049ff2>] worker_thread+0x0/0x122
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8004a0e2>] worker_thread+0xf0/0x122
 [<ffffffff8008e414>] default_wake_function+0x0/0xe
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032968>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a267e>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003286a>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: 0f 0b 68 38 2e 86 88 c2 e7 02 48 83 bd 90 00 00 00 00 74 0e
RIP  [<ffffffff8884ada4>] :gfs2:delete_work_func+0x22/0x82
 RSP <ffff81082038be20>
The messages at the top of comment #16 appear to suggest that there is a hardware problem on the array. In that case all bets are off. Can you confirm whether that is correct?
Yeah, I saw those messages too; that's why I was sure to include them. This is the first time I have seen those come up in a core. I will follow up on the SCSI errors and the health of the array.
I followed up with the customer; it looks like those errors are on their local disk. There are 4 nodes in this cluster, and we are going to attempt to retest without this particular node. Note from the customer here: "We considered that option and I'm looking into it. We have four nodes and I plan to check if the HW error is only showing on one. If so, I'll run the test with the other three nodes. I'll check on this and get back to you." That said, do you think that a problem on the local disk could cause a panic in fs/gfs2/glock.c?
It looks to me like the kernel panic in comment #16 is the fault of the patch. It's panicking in delete_work_func() on:

GLOCK_BUG_ON(gl, gl->gl_name.ln_type != LM_TYPE_INODE);

But this work is only queued up by iopen_go_callback(), which means the glock should be of type LM_TYPE_IOPEN, not LM_TYPE_INODE. Looking at the glock dump:

G: s:SH n:5/7ffa5 f:Dq t:SH d:UN/0 l:0 a:0 r:4

that's exactly what you see (type 5 is an iopen glock). I need to understand the patch better in order to know what the right thing to do is, but since this panic is at fs/gfs2/glock.c:743, it looks like a patch-induced error.
Created attachment 456362 [details] modified crosswrite patch. The only difference between this and Bob's original patch is that delete_work_func() now checks to make sure that it has a LM_TYPE_IOPEN type lock, which is what it will always have. I haven't finished completely reviewing the patch, but it seems pretty obvious that this is what was intended.
Ignore that last attachment. It won't work any better than the original. That's what I get for messing with a patch before I've completely looked through it.
Created attachment 456544 [details] Yet another version of the crosswrite patch This version removes the GLOCK_BUG_ON() check altogether.
Any news? Did the updated patch work?
So far so good, we'll keep you updated.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Based on the testing performed, the issue appears to be resolved with this patch.
Posted to rhkernel-list
Created attachment 459442 [details] Yet a better version of the crosswrite patch Revised slightly due to formatting issue
in kernel-2.6.18-232.el5. You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcome.
Marking verified based on customer testing.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html