Bug 698205 - gfs2 withdrawal while browsing the filesystem
gfs2 withdrawal while browsing the filesystem
Status: CLOSED UPSTREAM
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
14
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Robert Peterson
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2011-04-20 08:40 EDT by Milos Jakubicek
Modified: 2011-05-23 07:35 EDT (History)
13 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-23 07:35:08 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
fsck output (first part, run without -y) (2.48 KB, application/x-xz)
2011-04-28 09:32 EDT, Milos Jakubicek
no flags Details
fsck output (second part, run with -y) (29.46 KB, application/x-xz)
2011-04-28 09:33 EDT, Milos Jakubicek
no flags Details
output from gfs2_edit savemeta (3.76 MB, application/x-xz)
2011-04-28 09:34 EDT, Milos Jakubicek
no flags Details

  None (edit)
Description Milos Jakubicek 2011-04-20 08:40:34 EDT
Description of problem:

While editing a file located on a GFS partition, the fs got withdrawn, in /var/log/messages I see:

Apr 20 14:26:15 toad0 kernel: [335703.895508] GFS2: fsid=ske:gfs2.3: fatal: filesystem consistency error
Apr 20 14:26:15 toad0 kernel: [335703.895510] GFS2: fsid=ske:gfs2.3:   RG = 526024
Apr 20 14:26:15 toad0 kernel: [335703.895511] GFS2: fsid=ske:gfs2.3:   function = gfs2_setbit, file = fs/gfs2/rgrp.c, line = 95
Apr 20 14:26:15 toad0 kernel: [335703.895514] GFS2: fsid=ske:gfs2.3: about to withdraw this file system
Apr 20 14:26:15 toad0 kernel: [335703.895656] GFS2: fsid=ske:gfs2.3: telling LM to unmount
Apr 20 14:26:15 toad0 kernel: [335703.901222] GFS2: fsid=ske:gfs2.3: withdrawn
Apr 20 14:26:15 toad0 kernel: [335703.901302] Pid: 9642, comm: vim Not tainted 2.6.35.11-83.fc14.x86_64 #1
Apr 20 14:26:15 toad0 kernel: [335703.901304] Call Trace:
Apr 20 14:26:15 toad0 kernel: [335703.901327]  [<ffffffffa02d05d7>] gfs2_lm_withdraw+0xd7/0x10a [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901333]  [<ffffffff8106662e>] ? wake_up_bit+0x25/0x2a
Apr 20 14:26:15 toad0 kernel: [335703.901346]  [<ffffffffa02d0832>] gfs2_consist_rgrpd_i+0x4a/0x4c [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901358]  [<ffffffffa02cffa7>] ? lops_add+0x17/0x19 [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901370]  [<ffffffffa02cafe7>] gfs2_setbit+0x7c/0xa9 [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901382]  [<ffffffffa02cbba5>] rgblk_free+0x131/0x151 [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901387]  [<ffffffff811383b4>] ? unlock_buffer+0x2b/0x2f
Apr 20 14:26:15 toad0 kernel: [335703.901399]  [<ffffffffa02cd114>] gfs2_unlink_di+0x3c/0x92 [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901408]  [<ffffffffa02bb77d>] gfs2_change_nlink+0xe7/0xf5 [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901419]  [<ffffffffa02c5d05>] gfs2_unlink+0x12f/0x1da [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901430]  [<ffffffffa02c5c38>] ? gfs2_unlink+0x62/0x1da [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901441]  [<ffffffffa02c5c5a>] ? gfs2_unlink+0x84/0x1da [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901452]  [<ffffffffa02c5c8b>] ? gfs2_unlink+0xb5/0x1da [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901463]  [<ffffffffa02caf4a>] ? gfs2_glock_nq_init+0x16/0x37 [gfs2]
Apr 20 14:26:15 toad0 kernel: [335703.901467]  [<ffffffff811203f0>] vfs_unlink+0x76/0xcf
Apr 20 14:26:15 toad0 kernel: [335703.901471]  [<ffffffff81121d15>] do_unlinkat+0xd1/0x15f
Apr 20 14:26:15 toad0 kernel: [335703.901474]  [<ffffffff811185c5>] ? fput+0x1de/0x1ed
Apr 20 14:26:15 toad0 kernel: [335703.901477]  [<ffffffff8111f9b5>] ? path_put+0x22/0x27
Apr 20 14:26:15 toad0 kernel: [335703.901481]  [<ffffffff81099b75>] ? audit_syscall_entry+0x11c/0x148
Apr 20 14:26:15 toad0 kernel: [335703.901485]  [<ffffffff8112301f>] sys_unlink+0x16/0x18
Apr 20 14:26:15 toad0 kernel: [335703.901488]  [<ffffffff81009cf2>] system_call_fastpath+0x16/0x1b

Version-Release number of selected component (if applicable):

gfs2-utils-3.1.1-2.fc14.x86_64

Let me know if I can provide any more info, but I haven't found anything useful in /var/log/cluster/*.
Comment 1 Milos Jakubicek 2011-04-21 02:07:15 EDT
I was just able to reproduce this on the same file, however, after the second withdrawal, that file just vanished :/ After that I tried to touch all the remaining files on that partition and it didn't cause any problems.
Comment 2 Milos Jakubicek 2011-04-27 09:46:46 EDT
Hm, today I got:

Apr 27 15:34:45 toad0 kernel: [432786.466952] GFS2: fsid=ske:gfs2.1: fatal: invalid metadata block
Apr 27 15:34:45 toad0 kernel: [432786.466954] GFS2: fsid=ske:gfs2.1:   bh = 644301 (type: exp=4, found=5)
Apr 27 15:34:45 toad0 kernel: [432786.466955] GFS2: fsid=ske:gfs2.1:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 398
Apr 27 15:34:45 toad0 kernel: [432786.466959] GFS2: fsid=ske:gfs2.1: about to withdraw this file system
Apr 27 15:34:45 toad0 kernel: [432786.467084] GFS2: fsid=ske:gfs2.1: telling LM to unmount
Apr 27 15:34:45 toad0 kernel: [432786.471867] GFS2: fsid=ske:gfs2.1: withdrawn
Apr 27 15:34:45 toad0 kernel: [432786.471935] Pid: 1682, comm: glock_workqueue Not tainted 2.6.35.12-88.fc14.x86_64 #1
Apr 27 15:34:45 toad0 kernel: [432786.471937] Call Trace:
Apr 27 15:34:45 toad0 kernel: [432786.471963]  [<ffffffffa02aa5d7>] gfs2_lm_withdraw+0xd7/0x10a [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.471977]  [<ffffffffa02aa8d2>] gfs2_metatype_check_ii+0x50/0x59 [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.471988]  [<ffffffffa0299d11>] gfs2_meta_indirect_buffer+0x120/0x149 [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.471993]  [<ffffffff8103d3a2>] ? update_curr+0xd9/0xe2
Apr 27 15:34:45 toad0 kernel: [432786.472002]  [<ffffffffa0294c33>] gfs2_meta_inode_buffer+0x1c/0x1e [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.472012]  [<ffffffffa0295951>] gfs2_inode_refresh+0x1e/0x2ba [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.472017]  [<ffffffff81008063>] ? load_TLS+0x10/0x14
Apr 27 15:34:45 toad0 kernel: [432786.472020]  [<ffffffff810086f6>] ? __switch_to+0xce/0x20a
Apr 27 15:34:45 toad0 kernel: [432786.472029]  [<ffffffffa02948df>] inode_go_lock+0x4d/0xc8 [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.472037]  [<ffffffffa0291a7d>] do_promote+0x6e/0x113 [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.472046]  [<ffffffffa02936e2>] finish_xmote+0x1dd/0x1fa [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.472055]  [<ffffffffa0293888>] glock_work_func+0x84/0x134 [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.472060]  [<ffffffff81062bdd>] worker_thread+0x1c5/0x251
Apr 27 15:34:45 toad0 kernel: [432786.472068]  [<ffffffffa0293804>] ? glock_work_func+0x0/0x134 [gfs2]
Apr 27 15:34:45 toad0 kernel: [432786.472072]  [<ffffffff81066a2f>] ? autoremove_wake_function+0x0/0x39
Apr 27 15:34:45 toad0 kernel: [432786.472075]  [<ffffffff81062a18>] ? worker_thread+0x0/0x251
Apr 27 15:34:45 toad0 kernel: [432786.472077]  [<ffffffff81066595>] kthread+0x7f/0x87
Apr 27 15:34:45 toad0 kernel: [432786.472081]  [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
Apr 27 15:34:45 toad0 kernel: [432786.472084]  [<ffffffff81066516>] ? kthread+0x0/0x87
Apr 27 15:34:45 toad0 kernel: [432786.472086]  [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10

and then I saw:

type=AVC msg=audit(1303911285.107:10110): avc:  denied  { read } for  pid=20394 comm="dmsetup" name="udev.conf" dev=dm-4 ino=1048718 scontext=system_u:system_r:gfs_controld_t:s0 tcontext=system_u:object_r:etc_t:s0 tclass=file
type=SYSCALL msg=audit(1303911285.107:10110): arch=c000003e syscall=2 success=no exit=-13 a0=1c69090 a1=80000 a2=1b6 a3=0 items=0 ppid=1631 pid=20394 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="dmsetup" exe="/sbin/dmsetup" subj=system_u:system_r:gfs_controld_t:s0 key=(null)
type=AVC msg=audit(1303911285.109:10111): avc:  denied  { sys_admin } for  pid=20394 comm="dmsetup" capability=21  scontext=system_u:system_r:gfs_controld_t:s0 tcontext=system_u:system_r:gfs_controld_t:s0 tclass=capability

(not sure whether this was cause or rather a result of the withdrawal; CC'ing selinux maintainers)
Comment 3 Steve Whitehouse 2011-04-27 10:00:50 EDT
This most recent message is telling you that an inode look up found a different type of metadata instead of what it was expecting to find (an inode).

Did you run fsck after you got the initial problem? If so did it pick up any issues?

I'd recommend unmounting all nodes, saving the metadata (for later analysis) and then running fsck and seeing what it picks up.

The selinux issue you've hit is due to the way in which withdraw works. In general it is better not to use selinux with gfs2, at least until we can support it properly, since it will not work with a clustered gfs2. That is due to the long standing problem of not being able to invalidate cached selinux data from the fs. There is a bug open regarding that, but I've not checked recent progress on it.
Comment 4 Milos Jakubicek 2011-04-28 03:24:16 EDT
Steve,

thank you very much for the info, that seems to be quite important to me now. I also dared to add the note to https://fedoraproject.org/wiki/Features/GFS2Stable,
if you possibly find the bugreport you mentioned, please add a link there (I haven't found any).

Regarding the FS failure, I have saved the metadata (using gfs2_edit savemeta) and did the fsck -- do you want to me to attach the metadata file and fsck output here?
(And no, I did not run fsck previously, just to see what will happen next -- there were no mission critical data [yet]).
Comment 5 Steve Whitehouse 2011-04-28 04:31:14 EDT
How large is it? If its not too huge, then attaching it here is ok. Otherwise we'll need to make other arrangements.

So far as I can tell, everything you've seen above is probably the result of a single problem in the filesystem's metadata. We'd certainly be very interested to know if it is possible to reproduce the issue from a clean, newly created filesystem as that should help point us in the right direction. Often, it is not possible to do so, since by their very nature these kinds of issues tend to result from rather unlikely events and are often tricky to track down.

The GFS2Stable feature is really obsolete now since it was for a version of Fedora that has long gone. The selinux bug is #437984 and it is possible that you won't be able to see it, depending on whether you have a RH subscription or not.

I should also mention that Bob has been working on a very similar issue to the one you've reported (similar enough that they may well be the same). That is one reason why we are very keen to gather data, since it might well help narrow down the issue beyond what we know already.
Comment 6 Milos Jakubicek 2011-04-28 09:28:10 EDT
After xz-ting its reasonable large, I'll attach -- and yes, I do see that such problems are hard to reproduce and therefore to fix as well :(
Comment 7 Milos Jakubicek 2011-04-28 09:32:41 EDT
Created attachment 495538 [details]
fsck output (first part, run without -y)
Comment 8 Milos Jakubicek 2011-04-28 09:33:20 EDT
Created attachment 495539 [details]
fsck output (second part, run with -y)
Comment 9 Milos Jakubicek 2011-04-28 09:34:24 EDT
Created attachment 495540 [details]
output from gfs2_edit savemeta
Comment 10 Steve Whitehouse 2011-05-06 10:09:43 EDT
I suspect that this patch:

http://git.kernel.org/?p=linux/kernel/git/steve/gfs2-2.6-nmw.git;a=commitdiff;h=d192a8e5c6fec4fe8cdafebccc415db4074dee88

will fix the problem that you are seeing here. You should run fsck.gfs2 (after unmounting on all nodes) as a precaution, in case there is some on-disk inconsistency although that is relatively unlikely. The patch should prevent you hitting the same problem again.

The patch in question is queued for the next merge window, but it may go in sooner than that via the -fixes tree depending on testing, and when the next merge window turns out to be (always a movable feast, but probably not too far away now).
Comment 11 Steve Whitehouse 2011-05-23 07:35:08 EDT
This patch was accepted by Linus recently:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d192a8e5c6fec4fe8cdafebccc415db4074dee88

Note You need to log in before you can comment on or make changes to this bug.