Description of problem: While editing a file located on a GFS partition, the fs got withdrawn, in /var/log/messages I see: Apr 20 14:26:15 toad0 kernel: [335703.895508] GFS2: fsid=ske:gfs2.3: fatal: filesystem consistency error Apr 20 14:26:15 toad0 kernel: [335703.895510] GFS2: fsid=ske:gfs2.3: RG = 526024 Apr 20 14:26:15 toad0 kernel: [335703.895511] GFS2: fsid=ske:gfs2.3: function = gfs2_setbit, file = fs/gfs2/rgrp.c, line = 95 Apr 20 14:26:15 toad0 kernel: [335703.895514] GFS2: fsid=ske:gfs2.3: about to withdraw this file system Apr 20 14:26:15 toad0 kernel: [335703.895656] GFS2: fsid=ske:gfs2.3: telling LM to unmount Apr 20 14:26:15 toad0 kernel: [335703.901222] GFS2: fsid=ske:gfs2.3: withdrawn Apr 20 14:26:15 toad0 kernel: [335703.901302] Pid: 9642, comm: vim Not tainted 2.6.35.11-83.fc14.x86_64 #1 Apr 20 14:26:15 toad0 kernel: [335703.901304] Call Trace: Apr 20 14:26:15 toad0 kernel: [335703.901327] [<ffffffffa02d05d7>] gfs2_lm_withdraw+0xd7/0x10a [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901333] [<ffffffff8106662e>] ? wake_up_bit+0x25/0x2a Apr 20 14:26:15 toad0 kernel: [335703.901346] [<ffffffffa02d0832>] gfs2_consist_rgrpd_i+0x4a/0x4c [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901358] [<ffffffffa02cffa7>] ? lops_add+0x17/0x19 [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901370] [<ffffffffa02cafe7>] gfs2_setbit+0x7c/0xa9 [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901382] [<ffffffffa02cbba5>] rgblk_free+0x131/0x151 [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901387] [<ffffffff811383b4>] ? unlock_buffer+0x2b/0x2f Apr 20 14:26:15 toad0 kernel: [335703.901399] [<ffffffffa02cd114>] gfs2_unlink_di+0x3c/0x92 [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901408] [<ffffffffa02bb77d>] gfs2_change_nlink+0xe7/0xf5 [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901419] [<ffffffffa02c5d05>] gfs2_unlink+0x12f/0x1da [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901430] [<ffffffffa02c5c38>] ? gfs2_unlink+0x62/0x1da [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901441] [<ffffffffa02c5c5a>] ? gfs2_unlink+0x84/0x1da [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901452] [<ffffffffa02c5c8b>] ? gfs2_unlink+0xb5/0x1da [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901463] [<ffffffffa02caf4a>] ? gfs2_glock_nq_init+0x16/0x37 [gfs2] Apr 20 14:26:15 toad0 kernel: [335703.901467] [<ffffffff811203f0>] vfs_unlink+0x76/0xcf Apr 20 14:26:15 toad0 kernel: [335703.901471] [<ffffffff81121d15>] do_unlinkat+0xd1/0x15f Apr 20 14:26:15 toad0 kernel: [335703.901474] [<ffffffff811185c5>] ? fput+0x1de/0x1ed Apr 20 14:26:15 toad0 kernel: [335703.901477] [<ffffffff8111f9b5>] ? path_put+0x22/0x27 Apr 20 14:26:15 toad0 kernel: [335703.901481] [<ffffffff81099b75>] ? audit_syscall_entry+0x11c/0x148 Apr 20 14:26:15 toad0 kernel: [335703.901485] [<ffffffff8112301f>] sys_unlink+0x16/0x18 Apr 20 14:26:15 toad0 kernel: [335703.901488] [<ffffffff81009cf2>] system_call_fastpath+0x16/0x1b Version-Release number of selected component (if applicable): gfs2-utils-3.1.1-2.fc14.x86_64 Let me know if I can provide any more info, but I haven't found anything useful in /var/log/cluster/*.
I was just able to reproduce this on the same file, however, after the second withdrawal, that file just vanished :/ After that I tried to touch all the remaining files on that partition and it didn't cause any problems.
Hm, today I got: Apr 27 15:34:45 toad0 kernel: [432786.466952] GFS2: fsid=ske:gfs2.1: fatal: invalid metadata block Apr 27 15:34:45 toad0 kernel: [432786.466954] GFS2: fsid=ske:gfs2.1: bh = 644301 (type: exp=4, found=5) Apr 27 15:34:45 toad0 kernel: [432786.466955] GFS2: fsid=ske:gfs2.1: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 398 Apr 27 15:34:45 toad0 kernel: [432786.466959] GFS2: fsid=ske:gfs2.1: about to withdraw this file system Apr 27 15:34:45 toad0 kernel: [432786.467084] GFS2: fsid=ske:gfs2.1: telling LM to unmount Apr 27 15:34:45 toad0 kernel: [432786.471867] GFS2: fsid=ske:gfs2.1: withdrawn Apr 27 15:34:45 toad0 kernel: [432786.471935] Pid: 1682, comm: glock_workqueue Not tainted 2.6.35.12-88.fc14.x86_64 #1 Apr 27 15:34:45 toad0 kernel: [432786.471937] Call Trace: Apr 27 15:34:45 toad0 kernel: [432786.471963] [<ffffffffa02aa5d7>] gfs2_lm_withdraw+0xd7/0x10a [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.471977] [<ffffffffa02aa8d2>] gfs2_metatype_check_ii+0x50/0x59 [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.471988] [<ffffffffa0299d11>] gfs2_meta_indirect_buffer+0x120/0x149 [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.471993] [<ffffffff8103d3a2>] ? update_curr+0xd9/0xe2 Apr 27 15:34:45 toad0 kernel: [432786.472002] [<ffffffffa0294c33>] gfs2_meta_inode_buffer+0x1c/0x1e [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.472012] [<ffffffffa0295951>] gfs2_inode_refresh+0x1e/0x2ba [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.472017] [<ffffffff81008063>] ? load_TLS+0x10/0x14 Apr 27 15:34:45 toad0 kernel: [432786.472020] [<ffffffff810086f6>] ? __switch_to+0xce/0x20a Apr 27 15:34:45 toad0 kernel: [432786.472029] [<ffffffffa02948df>] inode_go_lock+0x4d/0xc8 [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.472037] [<ffffffffa0291a7d>] do_promote+0x6e/0x113 [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.472046] [<ffffffffa02936e2>] finish_xmote+0x1dd/0x1fa [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.472055] [<ffffffffa0293888>] glock_work_func+0x84/0x134 [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.472060] [<ffffffff81062bdd>] worker_thread+0x1c5/0x251 Apr 27 15:34:45 toad0 kernel: [432786.472068] [<ffffffffa0293804>] ? glock_work_func+0x0/0x134 [gfs2] Apr 27 15:34:45 toad0 kernel: [432786.472072] [<ffffffff81066a2f>] ? autoremove_wake_function+0x0/0x39 Apr 27 15:34:45 toad0 kernel: [432786.472075] [<ffffffff81062a18>] ? worker_thread+0x0/0x251 Apr 27 15:34:45 toad0 kernel: [432786.472077] [<ffffffff81066595>] kthread+0x7f/0x87 Apr 27 15:34:45 toad0 kernel: [432786.472081] [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10 Apr 27 15:34:45 toad0 kernel: [432786.472084] [<ffffffff81066516>] ? kthread+0x0/0x87 Apr 27 15:34:45 toad0 kernel: [432786.472086] [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10 and then I saw: type=AVC msg=audit(1303911285.107:10110): avc: denied { read } for pid=20394 comm="dmsetup" name="udev.conf" dev=dm-4 ino=1048718 scontext=system_u:system_r:gfs_controld_t:s0 tcontext=system_u:object_r:etc_t:s0 tclass=file type=SYSCALL msg=audit(1303911285.107:10110): arch=c000003e syscall=2 success=no exit=-13 a0=1c69090 a1=80000 a2=1b6 a3=0 items=0 ppid=1631 pid=20394 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="dmsetup" exe="/sbin/dmsetup" subj=system_u:system_r:gfs_controld_t:s0 key=(null) type=AVC msg=audit(1303911285.109:10111): avc: denied { sys_admin } for pid=20394 comm="dmsetup" capability=21 scontext=system_u:system_r:gfs_controld_t:s0 tcontext=system_u:system_r:gfs_controld_t:s0 tclass=capability (not sure whether this was cause or rather a result of the withdrawal; CC'ing selinux maintainers)
This most recent message is telling you that an inode look up found a different type of metadata instead of what it was expecting to find (an inode). Did you run fsck after you got the initial problem? If so did it pick up any issues? I'd recommend unmounting all nodes, saving the metadata (for later analysis) and then running fsck and seeing what it picks up. The selinux issue you've hit is due to the way in which withdraw works. In general it is better not to use selinux with gfs2, at least until we can support it properly, since it will not work with a clustered gfs2. That is due to the long standing problem of not being able to invalidate cached selinux data from the fs. There is a bug open regarding that, but I've not checked recent progress on it.
Steve, thank you very much for the info, that seems to be quite important to me now. I also dared to add the note to https://fedoraproject.org/wiki/Features/GFS2Stable, if you possibly find the bugreport you mentioned, please add a link there (I haven't found any). Regarding the FS failure, I have saved the metadata (using gfs2_edit savemeta) and did the fsck -- do you want to me to attach the metadata file and fsck output here? (And no, I did not run fsck previously, just to see what will happen next -- there were no mission critical data [yet]).
How large is it? If its not too huge, then attaching it here is ok. Otherwise we'll need to make other arrangements. So far as I can tell, everything you've seen above is probably the result of a single problem in the filesystem's metadata. We'd certainly be very interested to know if it is possible to reproduce the issue from a clean, newly created filesystem as that should help point us in the right direction. Often, it is not possible to do so, since by their very nature these kinds of issues tend to result from rather unlikely events and are often tricky to track down. The GFS2Stable feature is really obsolete now since it was for a version of Fedora that has long gone. The selinux bug is #437984 and it is possible that you won't be able to see it, depending on whether you have a RH subscription or not. I should also mention that Bob has been working on a very similar issue to the one you've reported (similar enough that they may well be the same). That is one reason why we are very keen to gather data, since it might well help narrow down the issue beyond what we know already.
After xz-ting its reasonable large, I'll attach -- and yes, I do see that such problems are hard to reproduce and therefore to fix as well :(
Created attachment 495538 [details] fsck output (first part, run without -y)
Created attachment 495539 [details] fsck output (second part, run with -y)
Created attachment 495540 [details] output from gfs2_edit savemeta
I suspect that this patch: http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=d192a8e5c6fec4fe8cdafebccc415db4074dee88 will fix the problem that you are seeing here. You should run fsck.gfs2 (after unmounting on all nodes) as a precaution, in case there is some on-disk inconsistency although that is relatively unlikely. The patch should prevent you hitting the same problem again. The patch in question is queued for the next merge window, but it may go in sooner than that via the -fixes tree depending on testing, and when the next merge window turns out to be (always a movable feast, but probably not too far away now).
This patch was accepted by Linus recently: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d192a8e5c6fec4fe8cdafebccc415db4074dee88