Red Hat Bugzilla – Bug 230143
[GFS2] getting fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed when using gfs2
Last modified: 2007-11-30 17:07:42 EST
Filing this as a bugzilla based on my hitting this problem, plus another
report of the same problem on cluster-devel; this will help me track it. The
other reporter apparently has a way to reproduce it:
This happens when I create a file on one computer, then quickly delete
it on the other.
It doesn't happen if 1) I wait a long period of time between creating
the file and deleting it 2) if I delete the file on the same computer as
I made it, no matter how fast I do it. I seem to be able to create files
on both computers as much as I like.
Once I figure out the umount panic I'm working on I will attempt to reproduce
this and look into it further.
This seems to be related to callbacks. I can reproduce this on a single node by
using postmark (transactions & number both set to 100000), which ends up
triggering the code that reduces glock numbers, which in turn causes this to
happen during the demotion of the glock. Note that you have to be running
lock_dlm for this to happen; lock_nolock never causes this to occur.
The odd thing is that it appears the demotion is occurring without the
->go_sync function having been called, since the dirty flag appears to be set
(correctly, since there is obviously still data to be flushed) on the glock.
This is true even if I remove the dirty test and run the flush unconditionally
(but still clear the dirty flag).
So my best guess at the moment is that during demotion due to callback the first
part of the glock demote code doesn't get run for some reason.
Created attachment 148914 [details]
Patch to fix rgrp flushing
It appears this bug is down to not flushing the rgrps when a callback is
received. We've not seen this before as normally the journal log flush will
result in the rgrp being flushed anyway, so it only occurs when a request is
received to flush _only_ an rgrp and that rgrp is dirty at the time of the
callback.
The attached patch fixes the problem.
Works for me now, thanks!
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla status
POST.
I'm running the latest gfs2 from steve's nmw git tree. I was running the QA
locksmith test and I tripped this assertion on node winston:
GFS2: fsid=smoke:gfs2.2: fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed
GFS2: fsid=smoke:gfs2.2: function = gfs2_meta_inval, file = fs/gfs2/meta_io.c, line = 101
GFS2: fsid=smoke:gfs2.2: about to withdraw this file system
GFS2: fsid=smoke:gfs2.2: telling LM to withdraw
[<e051dfae>] gfs2_assert_withdraw_i+0x42/0x4e [gfs2]
[<e0511a39>] gfs2_writepage+0x6f/0x172 [gfs2]
[<e05119ca>] gfs2_writepage+0x0/0x172 [gfs2]
[<e0511b3c>] gfs2_writepages+0x0/0x3a [gfs2]
[<e0511b74>] gfs2_writepages+0x38/0x3a [gfs2]
kdb traceback for lock_dlm1
Stack traceback for pid 3045
0xc14ea550 3045 7 0 0 D 0xc14ea700 lock_dlm1
esp eip Function (args)
0xd04c7df4 0xc0410033 __sched_text_start+0x863
0xd04c7e0c 0xc015feea destroy_inode+0x32
0xd04c7e10 0xc01147a4 task_rq_lock+0x31
0xd04c7e4c 0xc020f7b3 kobject_release
0xd04c7e68 0xc0410190 wait_for_completion+0x68
0xd04c7e74 0xc0116336 default_wake_function
0xd04c7e90 0xc012a4ad kthread_stop+0x4e
0xd04c7e98 0xe006218e [lock_dlm]gdlm_release_threads+0xe
0xd04c7ea0 0xe0061e50 [lock_dlm]gdlm_withdraw+0x96
0xd04c7eac 0xc012a76e autoremove_wake_function
0xd04c7ec0 0xe050f3f6 [gfs2]gfs2_withdraw_lockproto+0x16
0xd04c7ec8 0xe050c976 [gfs2]gfs2_lm_withdraw+0x6d
0xd04c7ee0 0xe051dfa7 [gfs2]gfs2_assert_withdraw_i+0x3b
0xd04c7f0c 0xe050fa7c [gfs2]gfs2_meta_inval+0x41
0xd04c7f24 0xe050a66c [gfs2]inode_go_inval+0xe
0xd04c7f2c 0xe0509eea [gfs2]drop_bh+0xb1
0xd04c7f4c 0xe0509a15 [gfs2]gfs2_glock_cb+0xb1
0xd04c7f54 0xc012a8ee remove_wait_queue+0x31
0xd04c7f64 0xe006279c [lock_dlm]gdlm_thread+0x5af
I'll try reproducing this and come up with a testcase.
Abhi, can you dup this bug or something so that one copy of it can be put back
into POST in order to get the patch into RHEL5.1?
There is nothing wrong with investigating this further, but I don't want to
delay the original patch if at all possible since that does fix a real bug, even
if it hasn't solved all cases of it.
Created attachment 152906 [details]
New patch to fix rgrp issue
A new patch which applies to RHEL5.1 post bz 235349
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.