1421649 – When using a fuse mount for client, EC volumes do not mount.

Bug 1421649 - When using a fuse mount for client, EC volumes do not mount.

Summary: When using a fuse mount for client, EC volumes do not mount.

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	disperse
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ashish Pandey
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-02-13 10:54 UTC by Nigel Babu
Modified:	2023-01-09 13:19 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2020-03-12 12:40:46 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1421955	0	unspecified	CLOSED	Disperse: Fallback to pre-compiled code execution when dynamic code generation fails	2021-02-22 00:41:40 UTC

Internal Links: 1421955

Description Nigel Babu 2017-02-13 10:54:56 UTC

We recently fixed bug 1402661 by using the /usr/libexec/glusterfs folder for creating the executable files. This folder is created when you have the glusterfs-server package installed.

However, when you only have the client installed, it doesn't create the /usr/libexec/glusterfs folder. Therefore, disperse volumes cannot be mounted correctly. This is from the mount logs on the client.

[2017-02-13 09:53:14.583515] I [MSGID: 100030] [glusterfsd.c:2460:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.11dev (args: /usr/sbin/glusterfs --volfile-server=172.19.2.81 --volfile-id=/testvol_dispersed /mnt/testvol_dispersed_glusterfs)
[2017-02-13 09:53:14.593506] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-02-13 09:53:14.597531] I [MSGID: 122067] [ec-code.c:1025:ec_code_detect] 0-testvol_dispersed-disperse-0: Using 'sse' CPU extensions
[2017-02-13 09:53:14.597580] E [MSGID: 122074] [ec-code.c:425:ec_code_space_create] 0-testvol_dispersed-disperse-0: Unable to create a temporary file for the ec dynamic code [No such file or directory]
[2017-02-13 09:53:14.597598] E [MSGID: 122073] [ec.c:654:init] 0-testvol_dispersed-disperse-0: Failed to initialize matrix management [No such file or directory]
[2017-02-13 09:53:16.597684] E [MSGID: 101019] [xlator.c:503:xlator_init] 0-testvol_dispersed-disperse-0: Initialization of volume 'testvol_dispersed-disperse-0' failed, review your volfile again
[2017-02-13 09:53:16.597723] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-testvol_dispersed-disperse-0: initializing translator failed
[2017-02-13 09:53:16.597770] E [MSGID: 101176] [graph.c:680:glusterfs_graph_activate] 0-graph: init failed
[2017-02-13 09:53:16.598251] W [glusterfsd.c:1329:cleanup_and_exit] (-->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x3c1) [0x7fbf765006f1] -->/usr/sbin/glusterfs(glusterfs_process_volfp+0x1b1) [0x7fbf764fa8c1] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7fbf764f9dfb] ) 0-: received signum (1), shutting down
[2017-02-13 09:53:16.598281] I [fuse-bridge.c:5802:fini] 0-fuse: Unmounting '/mnt/testvol_dispersed_glusterfs'.
[2017-02-13 09:53:16.598575] W [glusterfsd.c:1329:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7fbf74e61dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7fbf764f9f85] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7fbf764f9dfb] ) 0-: received signum (15), shutting down

Comment 1 Nigel Babu 2017-02-13 10:58:51 UTC

We have a follow up issue from the mmap changes for SELinux

Comment 2 Worker Ant 2017-02-13 13:58:58 UTC

REVIEW: https://review.gluster.org/16612 (build: Add %{_libexecdir}/glusterfs to base package) posted (#1) for review on master by Anoop C S (anoopcs)

Comment 3 Niels de Vos 2017-02-13 14:40:51 UTC

Files under /usr/libexec/glusterfs should not be writable by processes after the initial installation. Anything that generates files should use /tmp or /run. For this particular case /run/gluster/lib looks most suitable.

This new /run/gluster/lib directory should be added to extras/run-gluster.tmpfiles.in so that it is created upon boot. There also needs to be a selinux-policy update (most likely) to enforce the correct context on the directory.

Comment 4 Niels de Vos 2017-02-13 14:55:42 UTC

Also, EC is client-side... How is this supposed to work for gfapi applications that do not run as root?

I think the location of the generated files needs to be detected during runtime:
 - /run/gluster/lib when running as root
 - /run/user/<UID>/gluster/lib (maybe?) when running as non-root

For non-root, the directory needs to be created on demand, and have the correct SELinux context set after that (type:bin_t)?

Comment 5 Xavi Hernandez 2017-02-16 07:23:16 UTC

Nigel, do we really need to block 3.10.0 for this problem ?

There's already a patch to avoid failures when mmap() or anything else related to the dynamic code generation fails (https://bugzilla.redhat.com/show_bug.cgi?id=1421955). It simply falls back to the precompiled code instead of causing fatal failures.

This is not the best solution but allows disperse to be used, at least until a proper solution for the selinux problem is found and implemented. It's also a good solution for any possible future change or OS problem that could cause this to fail again.

Another workaround is to manually disable dynamic code generation by using option disperse.cpu-extensions = none.

Comment 6 Anoop C S 2017-02-16 07:35:04 UTC

(In reply to Xavier Hernandez from comment #5)
> Nigel, do we really need to block 3.10.0 for this problem ?
> 
> There's already a patch to avoid failures when mmap() or anything else
> related to the dynamic code generation fails
> (https://bugzilla.redhat.com/show_bug.cgi?id=1421955). It simply falls back
> to the precompiled code instead of causing fatal failures.
> 
> This is not the best solution but allows disperse to be used, at least until
> a proper solution for the selinux problem is found and implemented. It's
> also a good solution for any possible future change or OS problem that could
> cause this to fail again.
> 
> Another workaround is to manually disable dynamic code generation by using
> option disperse.cpu-extensions = none.

Additionally we also need to make sure that these are clearly documented in the release notes for 3.10.

Comment 7 Siddharth Sharma 2017-02-16 11:31:30 UTC

1. Why does it have to generate dynamic binary at runtime?

2. Can user be added to group which has permissions to generate this dynamic binary and then drop privileges, so there is no chance to escalate privileges at any later stage.

3. Solution in comment 5 in this bug seems to be fine, but what are the effects of re-using pre-compiled code ? That brings me to first question 'why dynamic binary generation is required' ?

Comment 8 Shyamsundar 2017-02-16 14:45:35 UTC

(In reply to Anoop C S from comment #6)
> (In reply to Xavier Hernandez from comment #5)
> > Nigel, do we really need to block 3.10.0 for this problem ?
> > 
> > There's already a patch to avoid failures when mmap() or anything else
> > related to the dynamic code generation fails
> > (https://bugzilla.redhat.com/show_bug.cgi?id=1421955). It simply falls back
> > to the precompiled code instead of causing fatal failures.
> > 
> > This is not the best solution but allows disperse to be used, at least until
> > a proper solution for the selinux problem is found and implemented. It's
> > also a good solution for any possible future change or OS problem that could
> > cause this to fail again.
> > 
> > Another workaround is to manually disable dynamic code generation by using
> > option disperse.cpu-extensions = none.
> 
> Additionally we also need to make sure that these are clearly documented in
> the release notes for 3.10.

If https://bugzilla.redhat.com/show_bug.cgi?id=1421955 is taken in, what additional release notes are required? I need them ASAP so that the same can be updated.

Further, removing this from the 3.10 blocker list as per the discussion here.

Comment 9 Anoop C S 2017-02-17 16:48:07 UTC

(In reply to Shyamsundar from comment #8)
> (In reply to Anoop C S from comment #6)
> > (In reply to Xavier Hernandez from comment #5)
> > > Nigel, do we really need to block 3.10.0 for this problem ?
> > > 
> > > There's already a patch to avoid failures when mmap() or anything else
> > > related to the dynamic code generation fails
> > > (https://bugzilla.redhat.com/show_bug.cgi?id=1421955). It simply falls back
> > > to the precompiled code instead of causing fatal failures.
> > > 
> > > This is not the best solution but allows disperse to be used, at least until
> > > a proper solution for the selinux problem is found and implemented. It's
> > > also a good solution for any possible future change or OS problem that could
> > > cause this to fail again.
> > > 
> > > Another workaround is to manually disable dynamic code generation by using
> > > option disperse.cpu-extensions = none.
> > 
> > Additionally we also need to make sure that these are clearly documented in
> > the release notes for 3.10.
> 
> If https://bugzilla.redhat.com/show_bug.cgi?id=1421955 is taken in, what
> additional release notes are required? I need them ASAP so that the same can
> be updated.
> 
> Further, removing this from the 3.10 blocker list as per the discussion here.

I thought it would be better to add a note about this fallback mechanism which we do internally instead of explicitly setting the specified volume set option. As you know I am not an expert in EC. I will leave it for Xavi to have a final word.

Xavi,
What do you think of adding a note in release notes for 3.10 regarding the internal fallback we do with https://review.gluster.org/#/c/16614/?

Comment 10 Xavi Hernandez 2017-02-20 07:20:18 UTC

If patch https://review.gluster.org/16614/ is accepted in 3.10, then I wouldn't say anything about the disperse.cpu-extensions workaround but add a comment stating that in some cases (basically if /usr/libexec/glusterfs doesn't exist or if it's not writable by the process using the volume) the disperse volumes won't use the any acceleration (it will fallback to the same computation method used in 3.8 and earlier).

Not sure how to word it or how many details to tell.

Comment 11 Xavi Hernandez 2017-02-20 10:54:53 UTC

(In reply to Siddharth Sharma from comment #7)
> 1. Why does it have to generate dynamic binary at runtime?

Basically because the amount of possible combinations that may be needed to encode/decode the data is too big (in the order of tens of thousands). On normal circumstances, only a bunch of combinations are really needed, but in relatively big configurations, things can fail in multiple ways and all of them need to be supported though not at the same time (as an example, in a 16+4 configuration, there are 4845 different ways in which 4 of the 20 bricks can fail, and each combination needs a different matrix to be computed).

It can be done without dynamic code (it has been working this way till recently) using more generic functions, but the performance is significantly worse.

The addition of SSE and AVX support makes this problem even worse because it would need to build specific functions for each of the possible extensions for each combination.

> 
> 2. Can user be added to group which has permissions to generate this dynamic
> binary and then drop privileges, so there is no chance to escalate
> privileges at any later stage.

I don't think so. The code doesn't really need to be stored anywhere. In a previous patch it was created only on memory but, because of selinux, it needed to be stored into a file. The file is volatile though: the code creates the file and then immediately deletes it, keeping only an open fd. The generated code only needs to be there while the process is running.

Since creating all possible combinations would be impractical, only the needed combinations are created as they are needed on run-time, so we cannot create a file with all the code and then revoke privileges.

> 
> 3. Solution in comment 5 in this bug seems to be fine, but what are the
> effects of re-using pre-compiled code ? That brings me to first question
> 'why dynamic binary generation is required' ?

Well, solution in comment 5 basically disables dynamic code generation if something fails, so we are at the starting point again. The performance benefits are lost, though it's true that it works in all cases.

The detailed problem is this:

First solution:

An mmap() call was used to create an anomymous memory area where dynamic code was generated. It had read/write/execute privileges and it was not backed by any file. The problem here is that selinux forbids this.

Second solution:

Looking at how some internet browsers solve this problem (they generate dynamic code for javascript), I tried to use the double mmap() approach, that also seems to be the recommended solution to support selinux. With this approach I need a backing file mmapped to two distinct memory regions, one with read/write access and another one with read/execute access.

This works fine and selinux really allows this, but the problem here is that the backing file must have the bin_t selinux's type. Otherwise selinux doesn't allow the file to be mmaped to an executable region.

With the help of Anoop, we found that /usr/libexec/glusterfs has this type set, so we used it and everything worked fine with selinux activated.

The problem here (I didn't know that) is that this directory can be read-only in some cases, so it cannot be used. Also, it requires root privileges, while some gfapi applications may not run as root.

So the questions are:

1. Is there any way to improve security in this approach ?

2. Where can we store the backing file with bin_t type ? Niels proposed /run/glusterfs for root and /run/<user>/glusterfs for regular users. Is this ok ?

3. Can an selinux policy be distributed with gluster to make the previous directories of type bin_t ?

Thanks

Comment 12 Steve Grubb 2017-03-14 15:48:26 UTC

You know that selinux policy can be adjusted to allow generating code in memory for this specific process.

Comment 13 Niels de Vos 2017-03-14 19:36:12 UTC

(In reply to Steve Grubb from comment #12)
> You know that selinux policy can be adjusted to allow generating code in
> memory for this specific process.

If that is acceptable, we should make that happen. Siddarth, what is your take on this?

Comment 14 Siddharth Sharma 2017-03-15 04:29:47 UTC

There are two ways

1. As Steve Suggested
or
2. I looked into it again yesterday, there is /run/usr/$gid which gets re-created after every reboot. suppose if user has gid (1000), there will be /run/usr/1000/.. only readable and writable by user only. 

looking at other places, there doesnt seem to be a place which is recommended for such type of files. So if this file does not remain for long time I am OK with tweaking selinux policy to do so.

Comment 15 Xavi Hernandez 2017-03-15 09:36:25 UTC

After having seen some documentation, I like the dual mapping solution as it seems more robust and safe to me. This requires the creation of a file. This file is created and immediately deleted, so it only exists while the process keeps it open.

If it's ok to create an selinux policy that allows the process to create that file in an already existing directory (this directory should be writable by the owner of the process, not necessarily root. Maybe /tmp ?), that's fine to me.

If the best place to put that file is inside /run/usr/$gid (is it really gid or uid ?), I can write the necessary code. In this case, would we need to create a /run/usr/$gid/glusterfs ? or we can directly use /run/usr/$gid ?

Does /run/usr/$gid exist in all distributions ?

Comment 16 Siddharth Sharma 2017-03-15 11:05:12 UTC

(In reply to Xavier Hernandez from comment #15)
> After having seen some documentation, I like the dual mapping solution as it
> seems more robust and safe to me. This requires the creation of a file. This
> file is created and immediately deleted, so it only exists while the process
> keeps it open.
> 
> If it's ok to create an selinux policy that allows the process to create
> that file in an already existing directory (this directory should be
> writable by the owner of the process, not necessarily root. Maybe /tmp ?),
> that's fine to me.
> 
> If the best place to put that file is inside /run/usr/$gid (is it really gid
> or uid ?), I can write the necessary code. In this case, would we need to
> create a /run/usr/$gid/glusterfs ? or we can directly use /run/usr/$gid ?
> 
> Does /run/usr/$gid exist in all distributions ?

/run/usr/$gid is created by pam, and it is not persistent. so the problem with writing such file with predictable filename to /tmp is that it will become vulnerable to symlink attack. So I am not in favor of it being written to /tmp.

Comment 17 Niels de Vos 2017-03-15 11:28:51 UTC

(In reply to Xavier Hernandez from comment #15)
> Does /run/usr/$gid exist in all distributions ?

This would be /run/user/$uid. Note that it is "user", not "usr" and "$uid" not "$gid".

Modern distributions (Fedora, CentOS 7, all systemd ones?) have this directory. It is not available on CentOS 6 though, and that is also a target we aim to support as a Gluster community.

We can do some ./configure stuff to select different directories, but it still requires the correct SELinux context (bin_t) to be set too. Depending on the directory that gets picked for a distribution, the SELinux policy needs those modifications. Suggestions for a non-systemd distribution directory welcome.

Comment 18 Steve Grubb 2017-03-17 15:48:48 UTC

Another approach might be to compile into mmap() memory that is rw, and then mprotect() it to be rx. That assumes you are using a language (javascript) that does not do runtime optimizing which requires continued writing as the app runs.

Comment 19 Simon Sekidde 2017-03-17 16:38:37 UTC

(In reply to Niels de Vos from comment #17)
> 
> We can do some ./configure stuff to select different directories, but it
> still requires the correct SELinux context (bin_t) to be set too. Depending
> on the directory that gets picked for a distribution, the SELinux policy
> needs those modifications. Suggestions for a non-systemd distribution
> directory welcome.

bin_t is used to identify a file as an ordinary program type. If this is shell script then it should be shell_exec_t.

Comment 20 Xavi Hernandez 2017-03-20 07:27:54 UTC

(In reply to Steve Grubb from comment #18)
> Another approach might be to compile into mmap() memory that is rw, and then
> mprotect() it to be rx. That assumes you are using a language (javascript)
> that does not do runtime optimizing which requires continued writing as the
> app runs.

The dynamic code here is used because the number of combinations that would be needed grows easily to thousands. However in normal circumstances only a small subset is used and tends to remain stable until some event happens (brick failure or similar).

The program mmaps regions in blocks of 64KB to store multiple fragments of code. However the fragments are generated as they are needed. At the same time some other fragments might be executing, so the mprotect() solution is not practical here.

Using an independent mmap() for each fragment of code seems a waste of memory to me (normally each fragment of code requires few hundreds of bytes).

(In reply to Simon Sekidde from comment #19)
> bin_t is used to identify a file as an ordinary program type. If this is
> shell script then it should be shell_exec_t.

The dynamic code generated and stored into the temporary backend file is machine code, not shell code. The bin_t type works fine here.

Comment 21 Xavi Hernandez 2018-10-31 13:10:45 UTC

I'll try to fix this problem.

Is it ok to use /run/gluster/bin (for example) to store dynamic code for root user and /run/user/$uid/gluster for other users ?

If so, do I need to also update https://github.com/gluster/glusterfs-selinux ?

If that's the correct option, I'll implement it.

Comment 22 Niels de Vos 2018-11-12 13:26:17 UTC

(In reply to Xavi Hernandez from comment #21)
> I'll try to fix this problem.
> 
> Is it ok to use /run/gluster/bin (for example) to store dynamic code for
> root user and /run/user/$uid/gluster for other users ?
> 
> If so, do I need to also update https://github.com/gluster/glusterfs-selinux
> ?
> 
> If that's the correct option, I'll implement it.

Siddarth or Simon, could you answer this question?

Comment 23 Siddharth Sharma 2018-11-20 08:22:24 UTC

This looks OK to me, /run/gluster/bin is not writable by non-root user. From SELinux perspective this is not standard path to have binary. It would be good to get opinion from Simon as well.

Comment 24 Anoop C S 2019-11-18 07:00:38 UTC

Re-visiting..

Are we going to address this any time soon? I remember that we have the fall back mechanism implemented but wanted to fix the real problem by putting dynamic code satisfying SELinux needs.

Comment 25 Xavi Hernandez 2019-11-22 14:04:42 UTC

If the proposed change is ok (which I don't know for sure yet), I'll do it when I've time. It's not urgent and there's a fallback for the particular cases where the client is not running as root (only for gfapi) so it can get delayed.

The change is quite simple though. Only select a different path depending on the user (and maybe creating some directory), so if someone wants to take it, I'll be happy to help.

Comment 26 Worker Ant 2020-03-12 12:40:46 UTC

This bug is moved to https://github.com/gluster/glusterfs/issues/924, and will be tracked there from now on. Visit GitHub issues URL for further details

Note You need to log in before you can comment on or make changes to this bug.