Bug 230143 - [GFS2] getting fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed when using gfs2
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All  OS: Linux
Priority: high  Severity: medium
Assigned To: Don Zickus
GFS Bugs
Depends On:
Blocks: 204760
Reported: 2007-02-26 15:50 EST by Josef Bacik
Modified: 2007-11-30 17:07 EST
CC List: 5 users

See Also:
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-11-07 14:41:59 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Patch to fix rgrp flushing (413 bytes, patch)
2007-02-28 08:58 EST, Steve Whitehouse
New patch to fix rgrp issue (1.09 KB, patch)
2007-04-18 09:51 EDT, Steve Whitehouse

Description Josef Bacik 2007-02-26 15:50:48 EST
Filing this as a bugzilla based on my hitting this problem, plus another
report of the same problem on cluster-devel; this will help me track it.
The other reporter apparently has a way to reproduce it:

This happens when I create a file on one computer, then quickly delete
it on the other.

It doesn't happen if 1) I wait a long period of time between creating
the file and deleting it, or 2) I delete the file on the same computer as
I made it, no matter how fast I do it. I seem to be able to create files
on both computers as much as I like.

Once I figure out the umount panic I'm working on, I will attempt to reproduce
this and look into it further.
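
For reference, a minimal two-node reproducer along the lines described above;
the mount point (/mnt/gfs2), file name pattern, and iteration count are
assumptions, not taken from the report. Run "./repro create" on one node and
"./repro delete" on the other against the same GFS2 mount:

/* repro.c: hypothetical two-node reproducer sketch */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char path[256];
        int i, fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s create|delete\n", argv[0]);
                return 1;
        }
        for (i = 0; i < 10000; i++) {
                snprintf(path, sizeof(path), "/mnt/gfs2/repro.%d", i);
                if (strcmp(argv[1], "create") == 0) {
                        fd = open(path, O_CREAT | O_WRONLY, 0644);
                        if (fd >= 0)
                                close(fd);
                } else {
                        unlink(path); /* racing the creator on the other node */
                }
        }
        return 0;
}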
Comment 1 Steve Whitehouse 2007-02-28 06:03:49 EST
This seems to be related to callbacks. I can reproduce this on a single node by
use of postmark (transactions & number both set to 100000), which ends up
triggering the code to reduce glock numbers, which in turn causes this to happen
during the demotion of the glock. Note that you have to be running lock_dlm for
this to happen; lock_nolock never causes this to occur.

The odd thing is that it appears that the demotion is occurring without having
called the ->go_sync function, since the dirty flag appears to be set (correctly,
since there is obviously still data to be flushed) on the glock. This is true
even if I remove the dirty test and run the flush unconditionally (but still
clear the dirty flag).

So my best guess at the moment is that, during demotion due to a callback, the
first part of the glock demote code doesn't get run for some reason.
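
To make the suspected ordering concrete, here is a self-contained userspace
model of the two-step demote being described. This is illustrative C only, not
the kernel source; the names merely echo the real ones:

/* glock_model.c: toy model of a demote that skips the sync step */
#include <assert.h>

struct glock_model {
        int dirty;     /* stands in for the glock's dirty flag */
        int ail_count; /* stands in for gl->gl_ail_count */
};

static void go_sync(struct glock_model *gl)
{
        /* flushing the log writes back the outstanding AIL buffers */
        gl->ail_count = 0;
        gl->dirty = 0;
}

static void go_inval(struct glock_model *gl)
{
        /* mirrors the assertion in gfs2_meta_inval() */
        assert(gl->ail_count == 0);
}

static void demote(struct glock_model *gl, int from_callback)
{
        /* the suspected bug: a callback-driven demote skips step one */
        if (!from_callback && gl->dirty)
                go_sync(gl);
        go_inval(gl);
}

int main(void)
{
        struct glock_model gl = { .dirty = 1, .ail_count = 3 };
        demote(&gl, 1); /* demote via callback: the assertion fires */
        return 0;
}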
Comment 2 Steve Whitehouse 2007-02-28 08:58:54 EST
Created attachment 148914 [details]
Patch to fix rgrp flushing

It appears this bug is down to not flushing the rgrps when a callback is
received. We've not seen this before as normally the journal log flush will
result in the rgrp being flushed anyway, so it only occurs when a request is
received to flush _only_ an rgrp and that rgrp is dirty at the time of the
request.

The attached patch fixes the problem.
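
The actual change is the attached patch. Purely as an illustration, in terms
of the toy model from comment #1, the shape of the fix is to run the sync step
for a dirty rgrp even when the demote request arrives via a callback:

/* extends the glock_model sketch above; not the attached patch */
static void demote_fixed(struct glock_model *gl)
{
        if (gl->dirty)
                go_sync(gl);  /* flush the dirty rgrp first... */
        go_inval(gl);         /* ...so the AIL assertion now holds */
}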
Comment 3 David J Craigon 2007-03-02 06:50:29 EST
Works for me now, thanks!
Comment 5 RHEL Product and Program Management 2007-03-09 14:05:05 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 6 RHEL Product and Program Management 2007-03-09 18:52:25 EST
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.
Comment 7 Abhijith Das 2007-03-19 17:59:38 EDT
I'm running the latest gfs2 from steve's nmw git tree. I was running the QA
locksmith test and I tripped this assertion on node winston:

GFS2: fsid=smoke:gfs2.2: fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed
GFS2: fsid=smoke:gfs2.2:   function = gfs2_meta_inval, file = fs/gfs2/meta_io.c, line = 101
GFS2: fsid=smoke:gfs2.2: about to withdraw this file system
GFS2: fsid=smoke:gfs2.2: telling LM to withdraw
 [<e051dfae>] gfs2_assert_withdraw_i+0x42/0x4e [gfs2]
 [<e0511a39>] gfs2_writepage+0x6f/0x172 [gfs2]
 [<c013b2ac>] generic_writepages+0x17d/0x2ae
 [<e05119ca>] gfs2_writepage+0x0/0x172 [gfs2]
 [<c011447d>] __activate_task+0x1c/0x29
 [<c011632c>] try_to_wake_up+0x38c/0x396
 [<e0511b3c>] gfs2_writepages+0x0/0x3a [gfs2]
 [<e0511b74>] gfs2_writepages+0x38/0x3a [gfs2]
 [<c013b3fd>] do_writepages+0x20/0x30
 [<c0167399>] __writeback_single_inode+0x198/0x308
 [<c0116341>] default_wake_function+0xb/0xd
 [<c0116341>] default_wake_function+0xb/0xd
 [<c01677e4>] sync_sb_inodes+0x168/0x211
 [<c016790e>] sync_inodes_sb+0x81/0x8f
 [<c015212d>] __fsync_super+0xa/0x58
 [<c016b06e>] freeze_bdev+0x39/0x68
 [<c03745c5>] dm_suspend+0xf1/0x265
 [<c0116336>] default_wake_function+0x0/0xd
 [<c0376d3e>] dev_suspend+0x53/0x157
 [<c037765a>] ctl_ioctl+0x212/0x257
 [<c0158191>] __link_path_walk+0x9df/0xb23
 [<c0376ceb>] dev_suspend+0x0/0x157
 [<c0159ff0>] do_ioctl+0x4c/0x62
 [<c015a24a>] vfs_ioctl+0x244/0x256
 [<c015a28f>] sys_ioctl+0x33/0x4c
 [<c01035a0>] sysenter_past_esp+0x5d/0x81
 [<c0410033>] __sched_text_start+0x863/0x912


kdb traceback for lock_dlm1

Stack traceback for pid 3045
0xc14ea550     3045        7  0    0   D  0xc14ea700  lock_dlm1
esp        eip        Function (args)
0xd04c7df4 0xc0410033 __sched_text_start+0x863
0xd04c7e0c 0xc015feea destroy_inode+0x32
0xd04c7e10 0xc01147a4 task_rq_lock+0x31
0xd04c7e4c 0xc020f7b3 kobject_release
0xd04c7e68 0xc0410190 wait_for_completion+0x68
0xd04c7e74 0xc0116336 default_wake_function
0xd04c7e90 0xc012a4ad kthread_stop+0x4e
0xd04c7e98 0xe006218e [lock_dlm]gdlm_release_threads+0xe
0xd04c7ea0 0xe0061e50 [lock_dlm]gdlm_withdraw+0x96
0xd04c7eac 0xc012a76e autoremove_wake_function
0xd04c7ec0 0xe050f3f6 [gfs2]gfs2_withdraw_lockproto+0x16
0xd04c7ec8 0xe050c976 [gfs2]gfs2_lm_withdraw+0x6d
0xd04c7ee0 0xe051dfa7 [gfs2]gfs2_assert_withdraw_i+0x3b
0xd04c7f0c 0xe050fa7c [gfs2]gfs2_meta_inval+0x41
0xd04c7f24 0xe050a66c [gfs2]inode_go_inval+0xe
0xd04c7f2c 0xe0509eea [gfs2]drop_bh+0xb1
0xd04c7f4c 0xe0509a15 [gfs2]gfs2_glock_cb+0xb1
0xd04c7f54 0xc012a8ee remove_wait_queue+0x31
0xd04c7f64 0xe006279c [lock_dlm]gdlm_thread+0x5af

I'll try reproducing this and come up with a testcase.
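
As an aside on what both traces show: the failed assertion does not panic the
machine, it withdraws the filesystem (gfs2_assert_withdraw_i ->
gfs2_lm_withdraw -> gdlm_withdraw). A toy userspace model of that chain; the
names mirror the traces, but the bodies are stand-ins, not kernel code:

/* withdraw_model.c: toy model of GFS2's assert-and-withdraw behaviour */
#include <stdio.h>

static void gdlm_withdraw(const char *fsid)
{
        /* the lock module drops this node out of the lockspace */
        printf("GFS2: fsid=%s: telling LM to withdraw\n", fsid);
}

static void gfs2_lm_withdraw(const char *fsid, const char *msg)
{
        printf("GFS2: fsid=%s: %s\n", fsid, msg);
        printf("GFS2: fsid=%s: about to withdraw this file system\n", fsid);
        gdlm_withdraw(fsid);
}

/* a failed assertion withdraws the filesystem instead of panicking */
#define gfs2_assert_withdraw(fsid, x)                                  \
        do {                                                           \
                if (!(x))                                              \
                        gfs2_lm_withdraw(fsid,                         \
                            "fatal: assertion \"" #x "\" failed");     \
        } while (0)

int main(void)
{
        int gl_ail_count = 3; /* non-zero AIL count, as in this bug */
        gfs2_assert_withdraw("smoke:gfs2.2", gl_ail_count == 0);
        return 0;
}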
Comment 8 Steve Whitehouse 2007-03-23 05:38:37 EDT
Abhi, can you dup this bug or something so that one copy of it can be put back
into POST in order to get the patch into RHEL5.1?

There is nothing wrong with investigating this further, but I don't want to
delay the original patch if at all possible since that does fix a real bug, even
if it hasn't solved all cases of it.
Comment 9 Steve Whitehouse 2007-04-18 09:51:43 EDT
Created attachment 152906 [details]
New patch to fix rgrp issue

A new patch which applies to RHEL5.1 post bz 235349
Comment 10 Don Zickus 2007-05-01 14:08:18 EDT
in 2.6.18-17.el5
Comment 13 errata-xmlrpc 2007-11-07 14:41:59 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
