Created attachment 356047: Patch to try. This patch fixed the problem on my cluster. I'd like the users to try it and report whether it worked properly for them.
Setting NEEDINFO flag until I hear back on the results from the patch in comment #1.
It's been six months and I still have not heard whether the patch fixes the customer's problem. I'll close this as INSUFFICIENT_DATA for now. If the results come in, we can re-open it.
We were not able to reproduce the issue using the newest Red Hat-provided RPMs for RHEL4, so the problem seems to be fixed.
A small addendum to my comment #5: with the "patch to try" and the newest Red Hat-provided packages, we were not able to reproduce the issue.
I'll try to get this patch into 4.9 then. Requesting ack flags accordingly.
The patch was pushed to the RHEL4 and RHEL49 branches of the cluster git tree for inclusion into 4.9. It was tested by me a long time ago on the trin cluster, and by various customers as shown in comment #6 above. Changing status to POST. Chris Feist does the builds for RHEL4 so I'm reassigning to him to get this into a build.
I wrote a new regression test and was able to recreate the bug using RHEL 4.7. I will let the regression test run on 4.9 over the weekend before marking this verified.
I hit the get_leaf assertion while running the new regression test.

GFS: fsid=dash-cluster:dash-cluster0.2: fatal: invalid metadata block
GFS: fsid=dash-cluster:dash-cluster0.2: bh = 654416609 (type: exp=6, found=0)
GFS: fsid=dash-cluster:dash-cluster0.2: function = get_leaf
GFS: fsid=dash-cluster:dash-cluster0.2: file = /builddir/build/BUILD/gfs-kernel-2.6.9-87/up/src/gfs/dir.c, line = 438
GFS: fsid=dash-cluster:dash-cluster0.2: time = 1295140811
GFS: fsid=dash-cluster:dash-cluster0.2: about to withdraw from the cluster
GFS: fsid=dash-cluster:dash-cluster0.2: waiting for outstanding I/O
------------[ cut here ]------------
kernel BUG at /builddir/build/BUILD/gfs-kernel-2.6.9-87/up/src/gfs/lm.c:190!
invalid operand: 0000 [#1]
Modules linked in: vfat fat nfs nfsd exportfs lockd nfs_acl lock_dlm(U) dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U) lock_harness(U) dlm(U) cman(U) parport_pc lp parport autofs4 i2c_dev i2c_core md5 ipv6 sunrpc cpufreq_powersave button battery ac uhci_hcd ehci_hcd i3000_edac edac_mc tg3 qla2400 qla2xxx scsi_transport_fc dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata sd_mod scsi_mod
CPU: 0
EIP: 0060:[<f912546c>] Not tainted VLI
EFLAGS: 00010202 (2.6.9-94.EL)
EIP is at gfs_lm_withdraw+0x50/0xbc [gfs]
eax: 00000044 ebx: f916f94c ecx: f9148456 edx: dfb09da4
esi: f915b000 edi: 00000000 ebp: f915b000 esp: dfb09db8
ds: 007b es: 007b ss: 0068
Process find (pid: 10901, threadinfo=dfb09000 task=f5b412a0)
Stack: f916f94c cb3fd400 f9144647 f915b000 f914cb87 f916f94c f916f94c 27019ae1
       00000000 00000006 00000000 f916f94c f9144a0f f916f94c f91464be 000001b6
       f916f94c 4d3247cb ecb7cf1c f910e544 00000000 f9144a0f f91464be 000001b6
Call Trace:
 [<f9144647>] gfs_metatype_check_ii+0x34/0x3f [gfs]
 [<f910e544>] get_leaf+0xc1/0xd5 [gfs]
 [<f911051d>] dir_e_read+0x1f2/0x2c9 [gfs]
 [<f9110c24>] gfs_dir_read+0x18/0x25 [gfs]
 [<f9131a9d>] filldir_reg_func+0x0/0x12c [gfs]
 [<f9131cd3>] readdir_reg+0x10a/0x12c [gfs]
 [<f9131a9d>] filldir_reg_func+0x0/0x12c [gfs]
 [<c0183d99>] filldir64+0x0/0x11a
 [<c0183d99>] filldir64+0x0/0x11a
 [<c0183d99>] filldir64+0x0/0x11a
 [<f9132098>] gfs_readdir+0x4e/0x5b [gfs]
 [<c0183a02>] vfs_readdir+0x8a/0xb7
 [<c018404f>] sys_getdents64+0x80/0xba
 [<c03246eb>] syscall_call+0x7/0xb
 [<c032007b>] packet_recvmsg+0xef/0x11a
We discussed this problem in our weekly meeting. We decided that the patch makes things better, not worse, so although the problem apparently isn't completely fixed, shipping the patch in 4.9 is better than not shipping it. Bug #674403 was opened to address any ongoing issues. Changing status to ON_QA.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2011-0276.html