Red Hat Bugzilla – Bug 230143
[GFS2] getting fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed when using gfs2
Last modified: 2007-11-30 17:07:42 EST
Filing this as a bugzilla based on my hitting this problem, plus another
report of the same problem on cluster-devel; this will help me track it. The
other reporter apparently has a way to reproduce it:
This happens when I create a file on one computer, then quickly delete
it on the other.
It doesn't happen if 1) I wait a long period of time between creating
the file and deleting it 2) if I delete the file on the same computer as
I made it, no matter how fast I do it. I seem to be able to create files
on both computers as much as I like.
Once I figure out the umount panic I'm working on I will attempt to reproduce
this and look into it further.
This seems to be related to callbacks. I can reproduce this on a single node by
using postmark (transactions & number both set to 100000), which ends up
triggering the code that reduces glock numbers, which in turn causes this to
happen during the demotion of the glock. Note that you have to be running
lock_dlm for this to happen; lock_nolock never causes this to occur.
The odd thing is that it appears the demotion is occurring without the
->go_sync function having been called, since the dirty flag appears to be set
(correctly, since there is obviously still data to be flushed) on the glock.
This is true even if I remove the dirty test and run the flush unconditionally
(but still clear the dirty flag).
So my best guess at the moment is that during demotion due to callback the first
part of the glock demote code doesn't get run for some reason.
Created attachment 148914 [details]
Patch to fix rgrp flushing
It appears this bug is down to not flushing the rgrps when a callback is
received. We've not seen this before as normally the journal log flush will
result in the rgrp being flushed anyway, so it only occurs when a request is
received to flush _only_ an rgrp and that rgrp is dirty at the time of the
callback.
The attached patch fixes the problem.
Works for me now, thanks!
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla status
POST.
I'm running the latest gfs2 from steve's nmw git tree. I was running the QA
locksmith test and I tripped this assertion on node winston:
GFS2: fsid=smoke:gfs2.2: fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed
GFS2: fsid=smoke:gfs2.2: function = gfs2_meta_inval, file = fs/gfs2/meta_io.c, line = 101
GFS2: fsid=smoke:gfs2.2: about to withdraw this file system
GFS2: fsid=smoke:gfs2.2: telling LM to withdraw
[<e051dfae>] gfs2_assert_withdraw_i+0x42/0x4e [gfs2]
[<e0511a39>] gfs2_writepage+0x6f/0x172 [gfs2]
[<e05119ca>] gfs2_writepage+0x0/0x172 [gfs2]
[<e0511b3c>] gfs2_writepages+0x0/0x3a [gfs2]
[<e0511b74>] gfs2_writepages+0x38/0x3a [gfs2]
kdb traceback for lock_dlm1
Stack traceback for pid 3045
0xc14ea550 3045 7 0 0 D 0xc14ea700 lock_dlm1
esp eip Function (args)
0xd04c7df4 0xc0410033 __sched_text_start+0x863
0xd04c7e0c 0xc015feea destroy_inode+0x32
0xd04c7e10 0xc01147a4 task_rq_lock+0x31
0xd04c7e4c 0xc020f7b3 kobject_release
0xd04c7e68 0xc0410190 wait_for_completion+0x68
0xd04c7e74 0xc0116336 default_wake_function
0xd04c7e90 0xc012a4ad kthread_stop+0x4e
0xd04c7e98 0xe006218e [lock_dlm]gdlm_release_threads+0xe
0xd04c7ea0 0xe0061e50 [lock_dlm]gdlm_withdraw+0x96
0xd04c7eac 0xc012a76e autoremove_wake_function
0xd04c7ec0 0xe050f3f6 [gfs2]gfs2_withdraw_lockproto+0x16
0xd04c7ec8 0xe050c976 [gfs2]gfs2_lm_withdraw+0x6d
0xd04c7ee0 0xe051dfa7 [gfs2]gfs2_assert_withdraw_i+0x3b
0xd04c7f0c 0xe050fa7c [gfs2]gfs2_meta_inval+0x41
0xd04c7f24 0xe050a66c [gfs2]inode_go_inval+0xe
0xd04c7f2c 0xe0509eea [gfs2]drop_bh+0xb1
0xd04c7f4c 0xe0509a15 [gfs2]gfs2_glock_cb+0xb1
0xd04c7f54 0xc012a8ee remove_wait_queue+0x31
0xd04c7f64 0xe006279c [lock_dlm]gdlm_thread+0x5af
I'll try reproducing this and come up with a testcase.
Abhi, can you dup this bug or something so that one copy of it can be put back
into POST in order to get the patch into RHEL5.1?
There is nothing wrong with investigating this further, but I don't want to
delay the original patch if at all possible since that does fix a real bug, even
if it hasn't solved all cases of it.
Created attachment 152906 [details]
New patch to fix rgrp issue
A new patch which applies to RHEL5.1 post bz 235349
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.