Bug 1240583

Summary: [SELinux] SELinux prevents Gluster/NFS from connecting to RPC services on NFS-clients (RHEL-6)
Product: Red Hat Enterprise Linux 6
Reporter: Prasanth <pprakash>
Component: selinux-policy
Assignee: Miroslav Grepl <mgrepl>
Status: CLOSED ERRATA
QA Contact: Milos Malik <mmalik>
Severity: high
Docs Contact: Petr Bokoc <pbokoc>
Priority: high
Version: 6.7
CC: annair, asrivast, dwalsh, jherrman, lvrabec, mgrepl, mmalik, ndevos, pbokoc, plautrba, pprakash, pvrabec, rcyriac, sankarshan, skoduri, ssampat, ssekidde, storage-qa-internal, tlavigne
Target Milestone: rc
Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: selinux-policy-3.7.19-279.el6
Doc Type: Bug Fix
Doc Text:
*Gluster* can now correctly connect to RPC services on NFS clients

Prior to this update, SELinux unintentionally prevented Gluster from connecting to remote procedure call (RPC) services on NFS clients. This update modifies the relevant SELinux policies, and Gluster now connects to RPC services as expected.
Story Points: ---
Clone Of: 1238404
Clones: 1248515 (view as bug list)
Environment:
Last Closed: 2016-05-10 19:58:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1202842, 1212796, 1238404, 1248515    

Description Prasanth 2015-07-07 09:42:28 UTC
+++ This bug was initially created as a clone of Bug #1238404 +++

Description of problem:
------------------------
Running `./autogen.sh' from a git clone of glusterfs on a Gluster/NFS mount hangs for hours. The same on a FUSE mount takes only a few minutes to complete.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.7.1-6.el6rhs.x86_64

How reproducible:
------------------
100%

Steps to Reproduce:
--------------------
1. On an NFS mount of a distribute-replicate (1x2) volume, run ./autogen.sh from a git clone of the glusterfs source (a minimal sketch follows).
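
A minimal sketch of the reproduction; the server address, volume name and source path are examples taken from the environments described later in this bug, so adjust them to your setup:

  # run on the NFS client
  mount -t nfs -o vers=3 10.70.37.168:/rep /mnt
  cd /mnt
  git clone /srv/src/glusterfs/   # any local or remote glusterfs source tree works
  cd glusterfs
  ./autogen.sh                    # hangs here on the Gluster/NFS mount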

Actual results:
----------------
The command hangs on the mount point.

Expected results:
-----------------
The command should not hang; it should complete.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-07-01 15:30:37 EDT ---

This bug is automatically being proposed for Red Hat Gluster Storage 3.1.0 by setting the release flag 'rhgs-3.1.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Shruti Sampat on 2015-07-01 15:38:22 EDT ---

Find sosreports from servers at -

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1238404/

--- Additional comment from Shruti Sampat on 2015-07-01 15:39:03 EDT ---

Volume configuration -

# gluster v info rep
 
Volume Name: rep
Type: Replicate
Volume ID: 364ec34f-c989-47b7-b2e4-a07185e84b79
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.168:/rhs/brick6/b1
Brick2: 10.70.37.199:/rhs/brick6/b1
Options Reconfigured:
cluster.consistent-metadata: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: on
performance.readdir-ahead: on

--- Additional comment from Niels de Vos on 2015-07-02 10:19:15 EDT ---

Under investigation...

--- Additional comment from Niels de Vos on 2015-07-02 11:32:54 EDT ---

This is not easily reproducible; I have run the "git checkout" and "./autogen.sh" a few times now, but it continues to succeed for me. I tested both with the default volume options and with the ones given in comment #3.

How many times out of how many runs does this fail for you? Do you have the logs somewhere so that I can have a look?

A typical run on my test environment looks like this:

[root@vm016 ~]# time /tmp/clone-and-autogen.sh
+ mount -t nfs -o vers=3 vm017.example.com:/bz1238404 /mnt/
+ pushd /mnt/
/mnt ~
+ git clone /srv/src/glusterfs/
Cloning into 'glusterfs'...
done.
Checking out files: 100% (1877/1877), done.
+ pushd glusterfs/
/mnt/glusterfs /mnt ~
+ ./autogen.sh

... GlusterFS autogen ...

Running aclocal...
Running autoheader...
Running libtoolize...
Running autoconf...
Running automake...
configure.ac:249: installing './config.guess'
configure.ac:249: installing './config.sub'
configure.ac:16: installing './install-sh'
configure.ac:16: installing './missing'
api/examples/Makefile.am: installing './depcomp'
geo-replication/syncdaemon/Makefile.am:3: installing './py-compile'
parallel-tests: installing './test-driver'
Running autogen.sh in argp-standalone ...
configure.ac:10: installing './install-sh'
configure.ac:10: installing './missing'
Makefile.am: installing './depcomp'

Please proceed with configuring, compiling, and installing.
+ popd
/mnt ~
+ rm -rf glusterfs
+ popd
~
+ umount /mnt

real    4m29.031s
user    0m28.391s
sys     0m6.374s

--- Additional comment from Niels de Vos on 2015-07-02 11:47:57 EDT ---

I've taken a look at the nfs.log from the sosreports mentioned in comment #2.

There are some quite obvious messages in sosreport-dhcp37-168.1238404-20150702010409.tar.xz, and I wonder whether you missed them. I do not know if that is the NFS-server you mounted the volume from; the other sosreport does not contain them. It is also not clear which NFS-client you used, and whether you have an sosreport from it. It would be trivial to check whether a firewall and/or rpcbind is enabled and running there...

[2015-07-01 19:18:28.943961] E [MSGID: 112167] [nlm4.c:1013:nlm4_establish_callback] 0-nfs-NLM: Unable to get NLM port of the client. Is the firewall running on client? OR Are RPC services running (rpcinfo -p)?
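
The checks that message hints at can be run on the NFS-client along these lines (a sketch; the service name is the standard RHEL-6 one):

  # on the NFS client: is rpcbind answering, and is a firewall in the way?
  rpcinfo -p              # portmapper, status and nlockmgr should be registered
  service rpcbind status
  iptables -L -n          # look for REJECT/DROP rules in the INPUT chain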

--- Additional comment from Shruti Sampat on 2015-07-03 03:08:52 EDT ---

sosreports from the NFS client -

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1238404/sosreport-vm8-rhsqa13.1238404-20150703115855-5b79.tar.xz

Let me know if I can provide any other information.

--- Additional comment from Niels de Vos on 2015-07-03 04:10:47 EDT ---

Please also answer the questions I posted in the previous comments:

- How many times out of how many runs does this fail for you?
- There are some quite obvious messages in
  sosreport-dhcp37-168.1238404-20150702010409.tar.xz, and I wonder whether you
  missed them.
- Did you check if rpcbind was running and no firewall interfered?

--- Additional comment from Shruti Sampat on 2015-07-03 05:54:45 EDT ---

(In reply to Niels de Vos from comment #8)
> Please also answer the questions I posted in the previous comments:
> 
> - How many times out of how many runs does this fail for you?

I have tried about 2-3 times and it has failed every time.

> - There are some quite obvious messages in
>   sosreport-dhcp37-168.1238404-20150702010409.tar.xz, and I wonder whether
>   you missed them.

Are you referring to these messages -

[2015-07-01 19:18:28.943961] E [MSGID: 112167] [nlm4.c:1013:nlm4_establish_callback] 0-nfs-NLM: Unable to get NLM port of the client. Is the firewall running on client? OR Are RPC services running (rpcinfo -p)?

I have seen them. I checked iptables and rpcbind and they seemed to be okay.

Am I missing something here?

> - Did you check if rpcbind was running and no firewall interfered?

rpcbind was running, see below. 

[root@dhcp37-168 ~]# pgrep rpcbind
2048

Firewall does not seem to be an issue either.

[root@dhcp37-168 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

--- Additional comment from Niels de Vos on 2015-07-03 14:40:38 EDT ---

I have tried again in a clean environment, but I still cannot reproduce this.

Did you check for a firewall and rpcbind on the NFS-CLIENT too? You gave the output from one of the NFS-SERVERs (dhcp37-168), but the NFS-CLIENT's hostname is vm8-rhsqa13.lab.eng.blr.redhat.com.

Can you give me access to an environment where this problem occurs? That would surely speed things up.

--- Additional comment from Shruti Sampat on 2015-07-04 00:54:35 EDT ---

On the NFS-client (vm8-rhsqa13), the firewall was running with the following rules -

[root@vm8-rhsqa13 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED 
ACCEPT     icmp --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh 
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited 

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited 

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

[root@vm8-rhsqa13 ~]# pgrep rpcbind
1200

I flushed the iptables rules, tried running autogen.sh again, and could still easily reproduce the issue.
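
(The flush amounts to the usual iptables commands, roughly:)

  # on the NFS client: clear all packet-filter rules for the test
  iptables -F
  iptables -L -n   # all chains should now be empty, with policy ACCEPT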

Setup details for your use below. Password for root on all machines is `rhscqe'.

It is a 6-node cluster with bricks from the following servers -

dhcp37-208.lab.eng.blr.redhat.com
dhcp37-134.lab.eng.blr.redhat.com

The volume being exported is rep2 -

[root@dhcp37-134 ~]# gluster v info rep2
 
Volume Name: rep2
Type: Replicate
Volume ID: b1ab634f-3ba4-4321-b57a-90f1d33ec06f
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.208:/rhs/brick6/b1
Brick2: 10.70.37.134:/rhs/brick6/b1
Options Reconfigured:
cluster.consistent-metadata: on
performance.readdir-ahead: on

NFS-client -

vm8-rhsqa13.lab.eng.blr.redhat.com

[root@vm8-rhsqa13 ~]# mount -l -t nfs
10.70.37.168:rep2 on /mnt/rep type nfs (rw,vers=3,addr=10.70.37.168)

Let me know if I can provide any other information.

--- Additional comment from Niels de Vos on 2015-07-04 04:32:18 EDT ---

On the NFS-client:

# ps axf
...
15541 pts/0    S+     0:00  |       \_ /bin/sh ./autogen.sh
15605 pts/0    S+     0:01  |           \_ /usr/bin/perl -w /usr...
15606 pts/0    S+     0:00  |               \_ /usr/bin/perl -w ...
...


# cat /proc/15606/stack 
[<ffffffffa023e536>] nlmclnt_block+0xe6/0x130 [lockd]
[<ffffffffa023f53e>] nlmclnt_proc+0x25e/0x740 [lockd]
[<ffffffffa0274478>] nfs3_proc_lock+0x28/0x30 [nfs]
[<ffffffffa025fb68>] do_setlk+0xf8/0x110 [nfs]
[<ffffffffa025fc0f>] nfs_flock+0x8f/0xf0 [nfs]
[<ffffffff811dff5d>] sys_flock+0x10d/0x1c0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff


For some reason, NLM on the NFS-client seems to be stuck, most probably because the NFS-server could not connect to the NLM port earlier (see the log message in comment #6).

I have killed the processes that kept the mountpoint open, so that I could make a clean attempt.

  # lsof /mnt/rep/
  ... kill $PIDs
  # umount /mnt/rep/
  # rmmod nfs
  # rmmod lockd
  # service rpcbind restart


The plan was to restart the NFS-server for the volume "rep" by disabling and re-enabling it with the "nfs.disable" option. But that failed due to a problem in the Trusted Pool:

[root@dhcp37-168 ~]# gluster volume set rep nfs.disable true
volume set: failed: Commit failed on 10.70.37.60. Please check log file for details.
Commit failed on 10.70.37.115. Please check log file for details.
Commit failed on 10.70.37.134. Please check log file for details.
Commit failed on 10.70.37.208. Please check log file for details.

Fell back to killing the glusterfs/nfs process and running "gluster volume start rep force".
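
A rough sketch of that fallback (the pkill pattern is illustrative; check ps for the exact glusterfs process that serves NFS):

  # on the NFS server: kill the Gluster/NFS process and let a forced start respawn it
  pkill -f 'volfile-id gluster/nfs'
  gluster volume start rep force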

Mounting again and tailing /var/log/glusterfs/nfs.log shows these messages when ./autogen.sh starts:

[2015-07-04 08:07:22.120983] E [MSGID: 112167] [nlm4.c:1013:nlm4_establish_callback] 0-nfs-NLM: Unable to get NLM port of the client. Is the firewall running on client? OR Are RPC services running (rpcinfo -p)?
[2015-07-04 08:07:22.120991] E [MSGID: 112164] [nlm4.c:558:nsm_monitor] 0-nfs-NLM: Clnt_create(): RPC: Remote system error - Permission denied

That's weird; something prevents access to the NLM service on the client. There is no firewall any more (checked *again*), and rpcbind on the client reports that everything is fine too:

[root@dhcp37-168 ~]# rpcinfo -p 10.70.44.89
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  38417  status
    100024    1   tcp  58958  status
    100021    1   udp  47831  nlockmgr
    100021    3   udp  47831  nlockmgr
    100021    4   udp  47831  nlockmgr
    100021    1   tcp  33163  nlockmgr
    100021    3   tcp  33163  nlockmgr
    100021    4   tcp  33163  nlockmgr


This seems to be an SELinux issue. After changing the SELinux mode to "Permissive" instead of "Enforcing", things just work.
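
For reference, the mode switch and a quick check for denials look roughly like this, run on the Gluster/NFS server node where the denial below is logged:

  # switch SELinux to Permissive and look for AVC denials from glusterfs
  getenforce       # Enforcing
  setenforce 0     # Permissive; locking starts to work
  grep denied /var/log/audit/audit.log | tail -n 5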

AVC denial in the /var/log/audit/audit.log:

type=AVC msg=audit(1435997568.099:13230): avc:  denied  { name_connect } for  pid=16323 comm="glusterfs" dest=111 scontext=unconfined_u:system_r:glusterd_t:s0 tcontext=system_u:object_r:portmap_port_t:s0 tclass=tcp_socket
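
A denial like this can be turned into a temporary local policy module with audit2allow (a common workaround sketch; the module name gluster_nfs_local is arbitrary, and the proper fix is the updated selinux-policy package):

  # on the Gluster/NFS server: build and load a local module from the logged AVCs
  grep glusterfs /var/log/audit/audit.log | audit2allow -M gluster_nfs_local
  semodule -i gluster_nfs_local.pp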


The Gluster/NFS server needs to be able to connect to RPC services on the NFS-client:
 - portmapper (port 111)
 - nlockmgr (dynamically assigned)
 - status (dynamically assigned)

Depending on the NFS-client and its configuration, TCP or UDP can be required.


Processes/binaries that are involved, hopefully making it easier for the SELinux people to modify the policy:

  - glusterd: main Gluster management daemon that starts the glusterfs
              (Gluster client) binary with an NFS-server configuration
  - glusterfs: the binary that acts as an NFS-server on one side and a Gluster
               client on the other side (similar to a proxy/gateway)

--- Additional comment from Niels de Vos on 2015-07-04 05:00:40 EDT ---

Prasanth,

to have locking on Gluster/NFS work, the "glusterfs" binary acting as NFS-server needs to be allowed to connect to some of the RPC services on the NFS-client. At the moment SELinux prevents this. I am not sure (yet) whether this is only an issue on RHEL-6, or also on RHEL-7.

Should we replace this bug with a dedicated one (or two, if RHEL-7 is affected too) for correcting the selinux-policy?
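
One way to check whether the installed policy on a given release already allows this connection (assuming the setools sesearch utility is available):

  # does glusterd_t have name_connect on the portmapper port?
  sesearch --allow -s glusterd_t -t portmap_port_t -c tcp_socket -p name_connect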

--- Additional comment from Prasanth on 2015-07-06 02:33:10 EDT ---

(In reply to Niels de Vos from comment #13)
> Prasanth,
> 
> to have locking on Gluster/NFS work, the "glusterfs" binary acting as
> NFS-server needs to be allowed to connect to some of the RPC services on the
> NFS-client. At the moment SElinux prevents this. I am not sure (yet) if this
> only is an issue on RHEL6, or also on RHEL7.
> 
> Should we replace this bug with a dedicated one (or two if RHEL7 is affected
> too?) for correcting the selinux-policy?

I would recommend having separate RHGS BZs for RHEL-6 and RHEL-7 (if RHEL-7 is affected too) and cloning them against the "selinux-policy" component in RHEL-6 and RHEL-7 to get the corresponding SELinux fixes. Once the fix is available in RHEL-7.2 and has been tested and verified by QE, I'll propose a RHEL-7.1.Z clone so that we get it backported to 7.1.

Hope this helps!

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-07-06 10:25:19 EDT ---

Since this bug has been approved for the Red Hat Gluster Storage 3.1.0 release, through release flag 'rhgs-3.1.0+', the Target Release is being automatically set to 'RHGS 3.1.0'.

--- Additional comment from Milos Malik on 2015-07-07 05:19:22 EDT ---

Here is a Beaker task which provides a local policy that solves the AVC in comment #12. You can prepend it to the list of your Beaker tasks:

--task "! echo -en 'policy_module(bz1238404,1.0)\n\nrequire {\ntype glusterd_t;\n}\n\ncorenet_tcp_connect_portmap_port(glusterd_t)\n' > bz1238404.te ; make -f /usr/share/selinux/devel/Makefile ; semodule -i bz1238404.pp ; semodule -l | grep bz1238404"
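
For readability, the policy module embedded in that one-liner corresponds to the following .te source, built and loaded with the standard selinux-policy-devel Makefile exactly as the task does:

  # bz1238404.te -- local policy: allow the Gluster/NFS process (glusterd_t)
  # to connect to the portmapper port (TCP 111) on NFS clients
  policy_module(bz1238404, 1.0)

  require {
          type glusterd_t;
  }

  corenet_tcp_connect_portmap_port(glusterd_t)

  # build and load:
  #   make -f /usr/share/selinux/devel/Makefile
  #   semodule -i bz1238404.pp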

Comment 8 errata-xmlrpc 2016-05-10 19:58:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0763.html