+++ This bug was initially created as a clone of Bug #604244 +++
Description of problem:
I hit this BUG with kernel 2.6.32-42.el6.x86_64. It has the same backtrace as bug #610136, which was duped to this bz. It was hit while running brawl with a 1k file system block size. The flock glock dumped below corresponds to a file generated by the test program accordion.
3689713 -rw-rw-r--. 1 root root 27189 Jul 7 23:34 accrdfile2l
G: s:UN n:6/384cf1 f:I t:UN d:EX/0 a:0 r:0
------------[ cut here ]------------
kernel BUG at fs/gfs2/glock.c:173!
invalid opcode: 0000 [#1]
Modules linked in: sctp libcrc32c gfs2 dlm configfs sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log dcdbas k8temp hwmon serio_raw amd64_edac_mod edac_core edac_mce_amd tg3 sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom qla2xxx scsi_transport_fc scsi_tgt sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: configfs]
Pid: 6793, comm: dlm_astd Not tainted 2.6.32-42.el6.x86_64 #1 PowerEdge SC1435
RIP: 0010:[<ffffffffa0435680>] [<ffffffffa0435680>] gfs2_glock_hold+0x20/0x30 [gfs2]
RSP: 0018:ffff88011a075e10 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8801fa45ba28 RCX: 000000000000264e
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffff88011a075e10 R08: ffffffff818bb9c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000001 R15: ffff88011a12f000
FS: 00007f1497a47700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000002886000 CR3: 00000001bdbc4000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_astd (pid: 6793, threadinfo ffff88011a074000, task ffff8801186c7580)
ffff88011a075e40 ffffffffa0436141 0000000000000001 0000000000000000
<0> 0000000000000001 ffff880107b80000 ffff88011a075e60 ffffffffa0453a5d
<0> ffffffffa0416aa8 ffff8801d229a078 ffff88011a075ee0 ffffffffa03f93dd
[<ffffffffa0436141>] gfs2_glock_complete+0x31/0xd0 [gfs2]
[<ffffffffa0453a5d>] gdlm_ast+0xfd/0x110 [gfs2]
[<ffffffffa03f93dd>] dlm_astd+0x25d/0x2b0 [dlm]
[<ffffffffa0453860>] ? gdlm_bast+0x0/0x50 [gfs2]
[<ffffffffa0453960>] ? gdlm_ast+0x0/0x110 [gfs2]
[<ffffffffa03f9180>] ? dlm_astd+0x0/0x2b0 [dlm]
[<ffffffff81090950>] ? kthread+0x0/0xa0
[<ffffffff810141c0>] ? child_rip+0x0/0x20
Code: ff ff c9 c3 0f 1f 80 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 8b 47 28 85 c0 74 06 f0 ff 47 28 c9 c3 48 89 fe 31 ff e8 a0 fc ff ff <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48
RIP [<ffffffffa0435680>] gfs2_glock_hold+0x20/0x30 [gfs2]
--- Additional comment from email@example.com on 2010-07-08 11:13:53 EDT ---
Bug #610136 was a duplicate of this one because it involved an
improperly referenced iopen glock, as shown by the "5/"
in the glock dump:
G: s:UN n:5/9d14 f:I t:UN d:EX/0 a:0 r:0
In this case, however, the improperly referenced glock is
G: s:UN n:6/384cf1 f:I t:UN d:EX/0 a:0 r:0
and "6/" indicates an flock glock: LM_TYPE_FLOCK.
The patch for this bug record affected only iopen glocks.
Therefore, although the symptom is nearly identical, the
problem is not with that patch. This must be a similar
bug somewhere in the flock code.
Please open a new bugzilla record with the symptom from
comment #12 and assign it to me. Setting this one back to
I've tried some test programs and not been able to recreate
the problem. One program took flocks and tried to unlock them
twice. I did variations on that theme.
Another version forked hundreds of processes, each of which
did open, non-blocking flock, then exit(0).
The problem did not recreate for me.
My next step is to use accordion or a simplified version of it.
Note that I was previously able to run the entire brawl set
on my cluster without failure.
I suspect that you'll need to do flock(fd); close(fd); rather than explicitly unlocking in order to reproduce this issue, since that is the most likely cause of the problem. We could just drop a dq_wait into the do_unflock() function in place of the non-waiting dequeue we have now. The only issue is that it would potentially slow down flock-using programs, but the delay might be small enough that it won't matter.
A faster solution (better, but more complicated) would be to not drop the ref to the flock glock once it has been touched once, except at close time. That would mean that there would still be a ref to the glock at close time which could then be used to wait on the pending demote (if any). In other words we'd only wait for the demote if we needed to close the fd, and not in the (faster) path of do_unflock. I'm not sure that the extra complexity is worth it.
I tried a wide variety of programs to recreate this problem
again today, including many different parameter combinations in
accordion, and nothing seems to recreate this. I tried Steve's
suggestions in comment #3 and nothing seems to make a difference.
This issue was proposed at a time when only blocker issues are
being considered for the current Red Hat Enterprise Linux release.
It has been denied for the current Red Hat Enterprise Linux release.
** If you would still like this issue considered for the current
release, ask your support representative to file it as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Maybe this is a duplicate of bug #537010. I'll ping Dave T. to
see if it could be the same thing, and if there's a fix for it.
There are no basts (blocking callbacks) for flocks, so the other bug shouldn't be a factor.
I still think that it is simply due to a race where we are closing an fd and not waiting for the reply from the DLM at any stage. Normally the time taken for this sequence of operations is long enough that we don't see a problem; in some cases, though, the glock has vanished first.
One simple fix is just to add the waiting _dq function to do_unflock(), as per comment #3. If that doesn't affect performance too much, then that should solve the problem.
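Comment #3's simpler fix, sketched against do_unflock() in fs/gfs2/file.c. This is a paraphrase for illustration, not the attached patch itself; the helper names are taken from the GFS2 source of that era, and the authoritative change is the attachment:

```c
/* Sketch only: replace the non-waiting dequeue with the waiting
 * variant in do_unflock(), so the unlock does not return until the
 * DLM reply has arrived and the glock cannot vanish underneath a
 * pending demote. See the attached patch for the real change. */
static void do_unflock(struct file *file, struct file_lock *fl)
{
	struct gfs2_file *fp = file->private_data;
	struct gfs2_holder *fl_gh = &fp->f_fl_gh;

	mutex_lock(&fp->f_fl_mutex);
	flock_lock_file_wait(file, fl);
	if (fl_gh->gh_gl) {
		gfs2_glock_dq_wait(fl_gh);	/* was a non-waiting dequeue */
		gfs2_holder_uninit(fl_gh);
	}
	mutex_unlock(&fp->f_fl_mutex);
}
```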
We should try to get this one in for RHEL6.
Created attachment 432349 [details]
One possible fix (may have perf implications)
This is what I was thinking of in comment #3
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
This request was erroneously denied for the current release of Red Hat
Enterprise Linux. The error has been fixed and this request has been
re-proposed for the current release.
Clearing needinfo, since I can't see any questions that remain to be answered. The patch is not yet upstream, but there seems to be no reason not to push it upstream and include it in RHEL6. Since we have no reproducer, I'd say this isn't greatly urgent, though.
Moving out to 6.2.
Patch is posted upstream for -nmw.
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Created attachment 504479 [details]
RHEL6 version of patch
Notes for QE:
Since this bug apparently cannot be reproduced, the only testing that we need to do is a check for regressions in flock.
Patch(es) available on kernel-2.6.32-160.el6
Verified that the patch is included in kernel-2.6.32-178.el6.
I have not hit this during regression runs thus far.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.