Bug 612608 - GFS2: kernel BUG at fs/gfs2/glock.c:173! running brawl w/flocks
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: 6.0
Assigned To: Steve Whitehouse
QA Contact: Cluster QE
Depends On: 604244
Blocks:
 
Reported: 2010-07-08 11:35 EDT by Nate Straz
Modified: 2011-12-06 07:24 EST
CC: 10 users

See Also:
Fixed In Version: kernel-2.6.32-160.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 604244
Environment:
Last Closed: 2011-12-06 07:24:56 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
One possible fix (may have perf implications) (445 bytes, patch)
2010-07-16 06:34 EDT, Steve Whitehouse
RHEL6 version of patch (445 bytes, patch)
2011-06-13 11:52 EDT, Steve Whitehouse


External Trackers
Tracker ID: Red Hat Product Errata RHSA-2011:1530
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise Linux 6 kernel security, bug fix and enhancement update
Last Updated: 2011-12-05 20:45:35 EST

Description Nate Straz 2010-07-08 11:35:15 EDT
+++ This bug was initially created as a clone of Bug #604244 +++

Description of problem:

I hit this BUG with kernel 2.6.32-42.el6.x86_64.  It is the same backtrace as bug 610136, which was duped to this bz.  It was hit while running brawl with a 1k file system block size.  The flock below corresponds to a file generated by the test program accordion.

3689713 -rw-rw-r--. 1 root root     27189 Jul  7 23:34 accrdfile2l


 G:  s:UN n:6/384cf1 f:I t:UN d:EX/0 a:0 r:0
------------[ cut here ]------------
kernel BUG at fs/gfs2/glock.c:173!
invalid opcode: 0000 [#1]
Modules linked in: sctp libcrc32c gfs2 dlm configfs sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log dcdbas k8temp hwmon serio_raw amd64_edac_mod edac_core edac_mce_amd tg3 sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom qla2xxx scsi_transport_fc scsi_tgt sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: configfs]
Pid: 6793, comm: dlm_astd Not tainted 2.6.32-42.el6.x86_64 #1 PowerEdge SC1435
RIP: 0010:[<ffffffffa0435680>]  [<ffffffffa0435680>] gfs2_glock_hold+0x20/0x30 [gfs2]
RSP: 0018:ffff88011a075e10  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8801fa45ba28 RCX: 000000000000264e
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffff88011a075e10 R08: ffffffff818bb9c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000001 R15: ffff88011a12f000
FS:  00007f1497a47700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000002886000 CR3: 00000001bdbc4000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_astd (pid: 6793, threadinfo ffff88011a074000, task ffff8801186c7580)
Stack:
 ffff88011a075e40 ffffffffa0436141 0000000000000001 0000000000000000
<0> 0000000000000001 ffff880107b80000 ffff88011a075e60 ffffffffa0453a5d
<0> ffffffffa0416aa8 ffff8801d229a078 ffff88011a075ee0 ffffffffa03f93dd
Call Trace:
 [<ffffffffa0436141>] gfs2_glock_complete+0x31/0xd0 [gfs2]
 [<ffffffffa0453a5d>] gdlm_ast+0xfd/0x110 [gfs2]
 [<ffffffffa03f93dd>] dlm_astd+0x25d/0x2b0 [dlm]
 [<ffffffffa0453860>] ? gdlm_bast+0x0/0x50 [gfs2]
 [<ffffffffa0453960>] ? gdlm_ast+0x0/0x110 [gfs2]
 [<ffffffffa03f9180>] ? dlm_astd+0x0/0x2b0 [dlm]
 [<ffffffff810909e6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81090950>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Code: ff ff c9 c3 0f 1f 80 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 8b 47 28 85 c0 74 06 f0 ff 47 28 c9 c3 48 89 fe 31 ff e8 a0 fc ff ff <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48
RIP  [<ffffffffa0435680>] gfs2_glock_hold+0x20/0x30 [gfs2]
 RSP <ffff88011a075e10>

--- Additional comment from rpeterso@redhat.com on 2010-07-08 11:13:53 EDT ---

Bug #610136 was marked a duplicate because it involved an
improperly referenced i_iopen glock, as shown by the "5/"
in the glock dump:

 G:  s:UN n:5/9d14 f:I t:UN d:EX/0 a:0 r:0

However, in this case, the glock referenced improperly is

 G:  s:UN n:6/384cf1 f:I t:UN d:EX/0 a:0 r:0

and "6/" indicates a glock for an flock: LM_TYPE_FLOCK.

The patch for this bug record affected only i_iopen glocks.
Therefore, although this symptom is nearly identical, the
problem is not with this patch.  It must be another, similar
bug somewhere in the flock code.

Please open a new bugzilla record with the symptom from
comment #12 and assign it to me.  Setting this one back to
ON_QA.
Comment 2 Robert Peterson 2010-07-09 18:03:37 EDT
I've tried some test programs and not been able to recreate
the problem.  One program took flocks and tried to unlock them
twice.  I did variations on that theme.

Another version forked hundreds of processes, each of which
did open, non-blocking flock, then exit(0).

The problem did not recreate for me.

My next step is to use accordion or a simplified version of it.
Note that I was previously able to run the entire brawl set
on my cluster without failure.
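
For context, the sketch below illustrates the second kind of reproducer described above: fork many children which each open the target file, take a non-blocking flock, and exit without unlocking, leaving the flock to be dropped implicitly at close time. The path and child count are placeholders, not the actual test program.

#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/gfs2/flocktest"; /* placeholder path */
    int i;

    for (i = 0; i < 500; i++) { /* placeholder child count */
        if (fork() == 0) {
            int fd = open(path, O_RDWR | O_CREAT, 0644);
            if (fd >= 0)
                flock(fd, LOCK_EX | LOCK_NB);
            _exit(0); /* no LOCK_UN: the flock is dropped when the fd closes */
        }
    }
    while (wait(NULL) > 0)
        ; /* reap children */
    return 0;
}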
Comment 3 Steve Whitehouse 2010-07-12 06:56:41 EDT
I suspect that you'll need to do flock(fd); close(fd); rather than explicitly unlocking in order to reproduce this issue, since that is the most likely cause of the problem. We could just drop a dq_wait into the do_unflock() function rather than the non-waiting one we have now. The only issue is that it would potentially slow down flock-using programs, but it might be a small enough delay that it won't matter.

A faster solution (better, but more complicated) would be to not drop the ref to the flock glock once it has been touched once, except at close time. That would mean that there would still be a ref to the glock at close time which could then be used to wait on the pending demote (if any). In other words we'd only wait for the demote if we needed to close the fd, and not in the (faster) path of do_unflock. I'm not sure that the extra complexity is worth it.
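
To make the suggested flock-then-close sequence concrete, here is a minimal userspace sketch (the path and iteration count are placeholders): the lock is never explicitly dropped, so the implicit unlock at close() exercises the path where we do not wait for the dlm reply.

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int i;

    for (i = 0; i < 100000; i++) { /* placeholder iteration count */
        int fd = open("/mnt/gfs2/flocktest", O_RDWR | O_CREAT, 0644); /* placeholder path */
        if (fd < 0)
            return 1;
        flock(fd, LOCK_EX);
        close(fd); /* no LOCK_UN: the flock is released implicitly here */
    }
    return 0;
}

The kernel-side change described above, waiting in do_unflock() rather than returning before the dlm has replied, appears to be the approach taken by the patch attached in comment #10 ("This is what I was thinking of in comment #3").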
Comment 4 Robert Peterson 2010-07-12 15:48:47 EDT
I tried a wide variety of programs to recreate this problem
again today, including many different parameter combinations in
accordion, and nothing seems to recreate this.  I tried Steve's
suggestions in comment #3 and nothing seems to make a difference.
Comment 5 RHEL Product and Program Management 2010-07-15 10:20:30 EDT
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release. It has
been denied for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 6 Robert Peterson 2010-07-15 15:24:34 EDT
Maybe this is a duplicate of bug #537010.  I'll ping Dave T. to
see if it could be the same thing, and if there's a fix for
RHEL6.0.
Comment 7 David Teigland 2010-07-15 16:06:31 EDT
There are no basts (blocking callbacks) for flocks, so the other bug shouldn't be a factor.
Comment 8 Steve Whitehouse 2010-07-16 06:13:31 EDT
I still think that it is simply due to a race where we are closing an fd and not waiting for the reply from the dlm at any stage. Normally the time taken for this sequence of operations is long enough that we don't see a problem; in some cases, though, the glock has vanished first.

One simple fix is just to add the waiting _dq function into the do_unflock function as per comment #3. If that doesn't affect performance too much, then that should solve the problem.
Comment 9 Steve Whitehouse 2010-07-16 06:14:40 EDT
We should try to get this one in for rhel6.
Comment 10 Steve Whitehouse 2010-07-16 06:34:16 EDT
Created attachment 432349 [details]
One possible fix (may have perf implications)

This is what I was thinking of in comment #3
Comment 15 RHEL Product and Program Management 2011-01-06 22:51:19 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 16 Suzanne Yeghiayan 2011-01-07 11:21:48 EST
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.
Comment 17 RHEL Product and Program Management 2011-02-01 00:29:32 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 18 Steve Whitehouse 2011-02-01 07:29:29 EST
Clearing needinfo, since I can't see any questions which remain to be answered. The patch is not yet upstream, but there seems to be no reason not to push it upstream and include it in rhel6. Since we have no reproducer, I'd say this isn't greatly urgent, though.
Comment 19 Ric Wheeler 2011-02-01 08:19:12 EST
Moving out to 6.2.
Comment 20 Steve Whitehouse 2011-03-09 06:39:23 EST
Patch is posted upstream for -nmw.
Comment 21 RHEL Product and Program Management 2011-05-13 11:23:43 EDT
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.
Comment 22 Steve Whitehouse 2011-06-13 11:52:41 EDT
Created attachment 504479 [details]
RHEL6 version of patch
Comment 23 Steve Whitehouse 2011-06-13 12:00:14 EDT
Notes for QE:

Since this bug apparently cannot be reproduced, the only testing we need to do is a check for regressions in flock.
Comment 24 Aristeu Rozanski 2011-06-27 15:04:50 EDT
Patch(es) available on kernel-2.6.32-160.el6
Comment 27 Nate Straz 2011-08-08 12:49:22 EDT
Verified that the patch is included in kernel-2.6.32-178.el6.

I have not hit this during regression runs thus far.
Comment 28 errata-xmlrpc 2011-12-06 07:24:56 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html
