Bug 762243 (GLUSTER-511)

Summary: mount hangs with some audit configurations
Product: [Community] GlusterFS
Component: fuse
Reporter: Csaba Henk <csaba>
Assignee: Csaba Henk <csaba>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: low
Version: mainline
CC: fharshav, gluster-bugs, lakshmipathi
Hardware: All
OS: Linux
Doc Type: Bug Fix
Regression: RTNR

Description Csaba Henk 2009-12-24 11:43:50 UTC
Mounting with FUSE happens as follows:

1. The mount(2) syscall is invoked and completes. At this point the fs is not yet usable; all calls into it block.
2. An mtab entry is added via a contrived invocation of mount(8).
3. The handshake with the kernel (INIT message) proceeds. After this the fs goes live, and the calls into the fs that blocked between 1. and 3. can complete.

As mount(8) is implemented, step 2. involves a readlink(2) against the mountpoint. Usually this is not a problem, but certain configurations of the Linux audit system cause some preliminary checks to be done against the readlink'd path, in practice in the form of getxattr(2) calls. These call into the file system, and hence block at this stage. So the mount(8) process in step 2. can't complete, and we get stuck in step 2.
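
The ordering problem described above can be modelled with a small toy simulation. This is only a sketch in Python, not libfuse or glusterfs code: `ToyFuseFs`, `mtab_helper` and `mount_synchronously` are hypothetical names, the audit-triggered getxattr(2) is modelled as a wait on an event, and a timeout stands in for what would be an indefinite hang in reality.

```python
import threading


class ToyFuseFs:
    """Toy stand-in for a FUSE fs: calls into it block until INIT is done."""

    def __init__(self):
        self._alive = threading.Event()

    def complete_init_handshake(self):
        # Step 3: after the INIT handshake the fs goes live.
        self._alive.set()

    def getxattr(self, timeout):
        # Returns True if the call completed, False if it was still
        # blocked when the timeout expired (a real call would keep blocking).
        return self._alive.wait(timeout)


def mtab_helper(fs, timeout):
    # Stand-in for mount(8): its readlink on the mountpoint triggers an
    # audit check that calls into the fs (modelled here as getxattr).
    return fs.getxattr(timeout)


def mount_synchronously(timeout=0.2):
    fs = ToyFuseFs()                         # step 1: mounted, not yet live
    helper_done = mtab_helper(fs, timeout)   # step 2: waits on the fs
    if not helper_done:
        return "hung in step 2"
    fs.complete_init_handshake()             # step 3: not reached if hung
    return "mounted"
```

Because step 3 only runs after the helper in step 2 returns, and the helper only returns once step 3 has happened, the synchronous ordering cannot make progress.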

* Original Fedora bugreport:
https://bugzilla.redhat.com/show_bug.cgi?id=493565

* fuse-devel@ thread on it:
http://thread.gmane.org/gmane.linux.file-systems/36651

* libfuse fix:
http://fuse.cvs.sourceforge.net/viewvc/fuse/fuse/lib/mount_util.c?view=log#rev1.14
http://git.gluster.com/?p=users/csaba/fuse.git;a=commit;h=d6bc53b3d50776c6da4b6e221029c0d2e40f4db7

This fix makes use of a mount(8) option, crafted on demand, which makes mount(8) skip the call to readlink(2).

* The respective commit to util-linux-ng:
http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=commit;h=v2.17-rc1-12-g45fc569

Comment 1 Harshavardhana 2009-12-24 16:30:20 UTC
This is the type of bug I saw on Storage Platform at times during mount after
a reboot. When it did, I pressed the three-finger salute, which made glusterfs
crash in io-cache; that crash is fixed. But the issue was exactly what you
mentioned here.

At least someone reproduced it elsewhere :).

Comment 2 Csaba Henk 2009-12-26 11:00:47 UTC
(In reply to comment #1)
> This is the type of bug I saw on Storage Platform at times during mount after
> a reboot

How can that be? Are you sure it's the same bug? I mean, you say "at times", and "mount after a reboot"... With the kind of audit configuration which triggers it, it should occur quite deterministically.

Csaba

Comment 3 Anand Avati 2009-12-28 09:39:34 UTC
PATCH: http://patches.gluster.com/patch/2638 in master (fuse: add mtab entry asynchronously)
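
The idea named in the patch title can be illustrated with a toy model. This is a hedged sketch in Python, not the contents of patch 2638: `mount_with_async_mtab` is a hypothetical name, the mount(8) helper is modelled as a thread, and its call into the fs is modelled as a wait on an event. The point is only that the helper is started but not waited for before the INIT handshake.

```python
import threading


def mount_with_async_mtab(timeout=2.0):
    fs_alive = threading.Event()   # set once the INIT handshake is done
    result = {}

    def mtab_helper():
        # Stand-in for the mount(8) helper: its call into the fs blocks
        # until the fs is live (or the toy timeout expires).
        result["mtab_updated"] = fs_alive.wait(timeout)

    helper = threading.Thread(target=mtab_helper)
    helper.start()      # step 2 is started, but not waited for
    fs_alive.set()      # step 3: the INIT handshake completes
    helper.join()       # the helper's blocked call can now finish
    return result["mtab_updated"]
```

With this ordering the helper may still block for a moment, but it is released as soon as the handshake completes instead of holding it up.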

Comment 4 Harshavardhana 2009-12-28 19:20:57 UTC
This happened on Storage Platform. I have an audit config for Fedora 11 which
seems to be similar to what Fedora is referring to in their bug tracker.

The hang happens exactly at "waitpid()", in the fuse_mnt_add_mount() call.

Check #364, which is fixed for a segfault, but there is another problem in the
same backtrace, which to me seems quite similar to what you tried to fix in this
patch, or no?

Comment 5 Csaba Henk 2009-12-29 12:53:09 UTC
(In reply to comment #4)
> Check #364, which is fixed for a segfault, but there is another problem in the
> same backtrace, which to me seems quite similar to what you tried to fix in this
> patch, or no?

Apparently so; I was just wondering about reproducibility.

Comment 6 Harshavardhana 2009-12-29 20:49:51 UTC
Reproducing it is really tough sometimes; perhaps you can reproduce this in 1 in 5 reboots using a Fedora 11 virtual image.

But how does your patch fix this issue? Now if you don't update the mtab and just
give a failure, how do we know that the volume is mounted and in use? Does
"/proc/mounts" have this value? Should we catch this error?

Comment 7 Csaba Henk 2009-12-30 03:35:19 UTC
(In reply to comment #6)
> Reproducing it is really tough sometimes; perhaps you can reproduce this in 1
> in 5 reboots using a Fedora 11 virtual image.

Well, it's not worth the time to reproduce it (for testing I just took a modified mount(8) binary which explicitly calls into the fs at that certain point). I just don't see how it is possible that, once a system is configured so that it's affected by the bug, the bug shows up only sporadically and not all the time.

> But how does your patch fix this issue? Now if you don't update the mtab and
> just give a failure, how do we know that the volume is mounted and in use? Does
> "/proc/mounts" have this value? Should we catch this error?

It's unlikely that the mtab update would fail. The cases of a non-existing mtab, a symlink'd mtab, and an mtab on a ro mount are checked in advance, and in any of those cases the mtab update is not even attempted. So the mtab update could fail only if there is some error with the underlying fs, or the disk is full, etc. In this case:

- the appropriate info is still there in /proc/mounts
- the appropriate error msg is in the logs
- this doesn't directly affect glusterfs functionality
- however the system is quite likely pretty much f*cked up anyway

So then it's a system administration problem and not a GlusterFS issue. And wrt the other fix, Miklos + Karel Zak's mount(8) option hackery... I guess Fedora adopts that mount option and the libfuse patch to fix this on their behalf, but what if someone independently configures the system in a similar way (like you on Storage Platform... well OK, that's semi-independent)? I don't wanna give them a choice to have their whining heard :P
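
The advance checks and the /proc/mounts fallback mentioned above could look roughly like the following. This is a minimal sketch in Python, not the libfuse or glusterfs code: `should_update_mtab` and `is_mounted` are hypothetical helpers, and the writability test via `os.access` is an assumption about how a read-only mount might be detected.

```python
import os


def should_update_mtab(mtab_path):
    """Return False for the cases where no mtab update would be attempted."""
    if not os.path.lexists(mtab_path):
        return False      # non-existing mtab
    if os.path.islink(mtab_path):
        return False      # symlink'd mtab (e.g. pointing at /proc/mounts)
    if not os.access(mtab_path, os.W_OK):
        return False      # not writable, e.g. mtab on a ro mount
    return True


def is_mounted(mountpoint, mounts_text):
    """Check whether mountpoint appears as a mount target in the given text.

    mounts_text is expected in /proc/mounts format; the mount target is the
    second whitespace-separated field. Escapes such as \\040 for spaces in
    paths are not decoded in this sketch.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mountpoint:
            return True
    return False
```

On a Linux system the second helper could be fed the contents of /proc/mounts, for example `is_mounted("/mnt/glusterfs", open("/proc/mounts").read())`, where the mountpoint path is only an example.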

Comment 8 Csaba Henk 2011-04-06 08:37:39 UTC
(In reply to comment #0)

Some fixups on the urls included in the original problem description.

> * Original Fedora bugreport:
> https://bugzilla.redhat.com/0

Somehow this url is garbled, correct one is

https://bugzilla.redhat.com/show_bug.cgi?id=493565

> * libfuse fix:
> http://fuse.cvs.sourceforge.net/viewvc/fuse/fuse/lib/mount_util.c?view=log#rev1.14
> http://git.gluster.com/?p=users/csaba/fuse.git;a=commit;h=d6bc53b3d50776c6da4b6e221029c0d2e40f4db7

Both these repos are gone, the current valid url is:

http://fuse.git.sourceforge.net/git/gitweb.cgi?p=fuse/fuse;a=commitdiff;h=4c3d9b195

Comment 9 Csaba Henk 2011-04-06 08:41:39 UTC
(In reply to comment #8)
> (In reply to comment #0)
> 
> Some fixups on the urls included in the original problem description.
> 
> > * Original Fedora bugreport:
> > https://bugzilla.redhat.com/0
> 
> Somehow this url is garbled, correct one is
> 
> https://bugzilla.redhat.com/0

?? wtf...

"https://bugzilla.redhat.com/show_bug.cgi?id=493565"

ie. "https://bugzilla.redhat.com/", bug id 493565

Comment 10 Csaba Henk 2011-04-06 08:43:44 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > (In reply to comment #0)
> > 
> > Some fixups on the urls included in the original problem description.
> > 
> > > * Original Fedora bugreport:
> > > https://bugzilla.redhat.com/0
> > 
> > Somehow this url is garbled, correct one is
> > 
> > https://bugzilla.redhat.com/0
> 
> ?? wtf...
> 
> "https://bugzilla.redhat.com/0"
> 
> ie. "https://bugzilla.redhat.com/", bug id 493565

Oh I guess the poor thing tries to resolve the path in the foreign bugzilla url as a link to a local bug entry which then fails gloriously.