Bug 230143 - [GFS2] getting fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed when using gfs2
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All  OS: Linux
Priority: high  Severity: medium
Assigned To: Don Zickus
GFS Bugs
Depends On:
Blocks: 204760
Reported: 2007-02-26 15:50 EST by Josef Bacik
Modified: 2007-11-30 17:07 EST
CC List: 5 users

See Also:
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-11-07 14:41:59 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Patch to fix rgrp flushing (413 bytes, patch)
2007-02-28 08:58 EST, Steve Whitehouse
New patch to fix rgrp issue (1.09 KB, patch)
2007-04-18 09:51 EDT, Steve Whitehouse

Description Josef Bacik 2007-02-26 15:50:48 EST
Filing this as a bugzilla based on my hitting this problem, plus another
report of the same problem on cluster-devel; this will help me track it.
The other reporter apparently has a way to reproduce it:

This happens when I create a file on one computer, then quickly delete
it on the other.

It doesn't happen if 1) I wait a long period of time between creating
the file and deleting it, or 2) I delete the file on the same computer as
I made it, no matter how fast I do it. I seem to be able to create files
on both computers as much as I like.

Once I figure out the umount panic I'm working on, I will attempt to reproduce
this and look into it further.
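
For reference, a minimal two-node reproducer along the lines described above;
the mount point (/mnt/gfs2), file name pattern, and iteration count are
assumptions, not taken from the report. Run "./repro create" on one node and
"./repro delete" on the other against the same GFS2 mount:

/* repro.c: hypothetical two-node reproducer sketch */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char path[256];
        int i, fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s create|delete\n", argv[0]);
                return 1;
        }
        for (i = 0; i < 10000; i++) {
                snprintf(path, sizeof(path), "/mnt/gfs2/repro.%d", i);
                if (strcmp(argv[1], "create") == 0) {
                        fd = open(path, O_CREAT | O_WRONLY, 0644);
                        if (fd >= 0)
                                close(fd);
                } else {
                        unlink(path); /* racing the creator on the other node */
                }
        }
        return 0;
}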
Comment 1 Steve Whitehouse 2007-02-28 06:03:49 EST
This seems to be related to callbacks. I can reproduce this on a single node by
use of postmark (transactions & number both set to 100000), which ends up
triggering the code to reduce glock numbers, which in turn causes this to happen
during the demotion of the glock. Note that you have to be running lock_dlm for
this to happen; lock_nolock never causes this to occur.

The odd thing is that it appears that the demotion is occurring without having
called the ->go_sync function, since the dirty flag appears to be set (correctly,
since there is obviously still data to be flushed) on the glock. This is true
even if I remove the dirty test and run the flush unconditionally (but still
clear the dirty flag).

So my best guess at the moment is that, during demotion due to a callback, the
first part of the glock demote code doesn't get run for some reason.
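
To make the suspected ordering concrete, here is a self-contained userspace
model of the two-step demote being described. This is illustrative C only, not
the kernel source; the names merely echo the real ones:

/* glock_model.c: toy model of a demote that skips the sync step */
#include <assert.h>

struct glock_model {
        int dirty;     /* stands in for the glock's dirty flag */
        int ail_count; /* stands in for gl->gl_ail_count */
};

static void go_sync(struct glock_model *gl)
{
        /* flushing the log writes back the outstanding AIL buffers */
        gl->ail_count = 0;
        gl->dirty = 0;
}

static void go_inval(struct glock_model *gl)
{
        /* mirrors the assertion in gfs2_meta_inval() */
        assert(gl->ail_count == 0);
}

static void demote(struct glock_model *gl, int from_callback)
{
        /* the suspected bug: a callback-driven demote skips step one */
        if (!from_callback && gl->dirty)
                go_sync(gl);
        go_inval(gl);
}

int main(void)
{
        struct glock_model gl = { .dirty = 1, .ail_count = 3 };
        demote(&gl, 1); /* demote via callback: the assertion fires */
        return 0;
}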
Comment 2 Steve Whitehouse 2007-02-28 08:58:54 EST
Created attachment 148914 [details]
Patch to fix rgrp flushing

It appears this bug is down to not flushing the rgrps when a callback is
received. We've not seen this before as normally the journal log flush will
result in the rgrp being flushed anyway, so it only occurs when a request is
received to flush _only_ an rgrp and that rgrp is dirty at the time of the
request.

The attached patch fixes the problem.
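
The actual change is the attached patch. Purely as an illustration, in terms
of the toy model from comment #1, the shape of the fix is to run the sync step
for a dirty rgrp even when the demote request arrives via a callback:

/* extends the glock_model sketch above; not the attached patch */
static void demote_fixed(struct glock_model *gl)
{
        if (gl->dirty)
                go_sync(gl);  /* flush the dirty rgrp first... */
        go_inval(gl);         /* ...so the AIL assertion now holds */
}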
Comment 3 David J Craigon 2007-03-02 06:50:29 EST
Works for me now, thanks!
Comment 5 RHEL Product and Program Management 2007-03-09 14:05:05 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 6 RHEL Product and Program Management 2007-03-09 18:52:25 EST
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.
Comment 7 Abhijith Das 2007-03-19 17:59:38 EDT
I'm running the latest gfs2 from steve's nmw git tree. I was running the QA
locksmith test and I tripped this assertion on node winston:

GFS2: fsid=smoke:gfs2.2: fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed
GFS2: fsid=smoke:gfs2.2:   function = gfs2_meta_inval, file = fs/gfs2/meta_io.c, line = 101
GFS2: fsid=smoke:gfs2.2: about to withdraw this file system
GFS2: fsid=smoke:gfs2.2: telling LM to withdraw
 [<e051dfae>] gfs2_assert_withdraw_i+0x42/0x4e [gfs2]
 [<e0511a39>] gfs2_writepage+0x6f/0x172 [gfs2]
 [<c013b2ac>] generic_writepages+0x17d/0x2ae
 [<e05119ca>] gfs2_writepage+0x0/0x172 [gfs2]
 [<c011447d>] __activate_task+0x1c/0x29
 [<c011632c>] try_to_wake_up+0x38c/0x396
 [<e0511b3c>] gfs2_writepages+0x0/0x3a [gfs2]
 [<e0511b74>] gfs2_writepages+0x38/0x3a [gfs2]
 [<c013b3fd>] do_writepages+0x20/0x30
 [<c0167399>] __writeback_single_inode+0x198/0x308
 [<c0116341>] default_wake_function+0xb/0xd
 [<c0116341>] default_wake_function+0xb/0xd
 [<c01677e4>] sync_sb_inodes+0x168/0x211
 [<c016790e>] sync_inodes_sb+0x81/0x8f
 [<c015212d>] __fsync_super+0xa/0x58
 [<c016b06e>] freeze_bdev+0x39/0x68
 [<c03745c5>] dm_suspend+0xf1/0x265
 [<c0116336>] default_wake_function+0x0/0xd
 [<c0376d3e>] dev_suspend+0x53/0x157
 [<c037765a>] ctl_ioctl+0x212/0x257
 [<c0158191>] __link_path_walk+0x9df/0xb23
 [<c0376ceb>] dev_suspend+0x0/0x157
 [<c0159ff0>] do_ioctl+0x4c/0x62
 [<c015a24a>] vfs_ioctl+0x244/0x256
 [<c015a28f>] sys_ioctl+0x33/0x4c
 [<c01035a0>] sysenter_past_esp+0x5d/0x81
 [<c0410033>] __sched_text_start+0x863/0x912


kdb traceback for lock_dlm1

Stack traceback for pid 3045
0xc14ea550     3045        7  0    0   D  0xc14ea700  lock_dlm1
esp        eip        Function (args)
0xd04c7df4 0xc0410033 __sched_text_start+0x863
0xd04c7e0c 0xc015feea destroy_inode+0x32
0xd04c7e10 0xc01147a4 task_rq_lock+0x31
0xd04c7e4c 0xc020f7b3 kobject_release
0xd04c7e68 0xc0410190 wait_for_completion+0x68
0xd04c7e74 0xc0116336 default_wake_function
0xd04c7e90 0xc012a4ad kthread_stop+0x4e
0xd04c7e98 0xe006218e [lock_dlm]gdlm_release_threads+0xe
0xd04c7ea0 0xe0061e50 [lock_dlm]gdlm_withdraw+0x96
0xd04c7eac 0xc012a76e autoremove_wake_function
0xd04c7ec0 0xe050f3f6 [gfs2]gfs2_withdraw_lockproto+0x16
0xd04c7ec8 0xe050c976 [gfs2]gfs2_lm_withdraw+0x6d
0xd04c7ee0 0xe051dfa7 [gfs2]gfs2_assert_withdraw_i+0x3b
0xd04c7f0c 0xe050fa7c [gfs2]gfs2_meta_inval+0x41
0xd04c7f24 0xe050a66c [gfs2]inode_go_inval+0xe
0xd04c7f2c 0xe0509eea [gfs2]drop_bh+0xb1
0xd04c7f4c 0xe0509a15 [gfs2]gfs2_glock_cb+0xb1
0xd04c7f54 0xc012a8ee remove_wait_queue+0x31
0xd04c7f64 0xe006279c [lock_dlm]gdlm_thread+0x5af

I'll try reproducing this and come up with a testcase.
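
As an aside on what both traces show: the failed assertion does not panic the
machine, it withdraws the filesystem (gfs2_assert_withdraw_i ->
gfs2_lm_withdraw -> gdlm_withdraw). A toy userspace model of that chain; the
names mirror the traces, but the bodies are stand-ins, not kernel code:

/* withdraw_model.c: toy model of GFS2's assert-and-withdraw behaviour */
#include <stdio.h>

static void gdlm_withdraw(const char *fsid)
{
        /* the lock module drops this node out of the lockspace */
        printf("GFS2: fsid=%s: telling LM to withdraw\n", fsid);
}

static void gfs2_lm_withdraw(const char *fsid, const char *msg)
{
        printf("GFS2: fsid=%s: %s\n", fsid, msg);
        printf("GFS2: fsid=%s: about to withdraw this file system\n", fsid);
        gdlm_withdraw(fsid);
}

/* a failed assertion withdraws the filesystem instead of panicking */
#define gfs2_assert_withdraw(fsid, x)                                  \
        do {                                                           \
                if (!(x))                                              \
                        gfs2_lm_withdraw(fsid,                         \
                            "fatal: assertion \"" #x "\" failed");     \
        } while (0)

int main(void)
{
        int gl_ail_count = 3; /* non-zero AIL count, as in this bug */
        gfs2_assert_withdraw("smoke:gfs2.2", gl_ail_count == 0);
        return 0;
}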
Comment 8 Steve Whitehouse 2007-03-23 05:38:37 EDT
Abhi, can you dup this bug or something so that one copy of it can be put back
into POST in order to get the patch into RHEL5.1?

There is nothing wrong with investigating this further, but I don't want to
delay the original patch if at all possible since that does fix a real bug, even
if it hasn't solved all cases of it.
Comment 9 Steve Whitehouse 2007-04-18 09:51:43 EDT
Created attachment 152906 [details]
New patch to fix rgrp issue

A new patch which applies to RHEL5.1 post bz 235349
Comment 10 Don Zickus 2007-05-01 14:08:18 EDT
in 2.6.18-17.el5
Comment 13 errata-xmlrpc 2007-11-07 14:41:59 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
