+++ This bug was initially created as a clone of Bug #1238404 +++

Description of problem:
------------------------
Running `./autogen.sh' from a git clone of glusterfs on a gluster-nfs mount hangs for hours. The same on a FUSE mount takes only a few minutes to complete.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.7.1-6.el6rhs.x86_64

How reproducible:
------------------
100%

Steps to Reproduce:
--------------------
1. On an NFS mount of a distribute-replicate (1x2) volume, run ./autogen.sh from a git clone of the glusterfs source.

Actual results:
----------------
The command hangs on the mount point.

Expected results:
-----------------
The command is not expected to hang and should go through.

Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-07-01 15:30:37 EDT ---

This bug is automatically being proposed for Red Hat Gluster Storage 3.1.0 by setting the release flag 'rhgs-3.1.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Shruti Sampat on 2015-07-01 15:38:22 EDT ---

Find sosreports from the servers at -
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1238404/

--- Additional comment from Shruti Sampat on 2015-07-01 15:39:03 EDT ---

Volume configuration -

# gluster v info rep

Volume Name: rep
Type: Replicate
Volume ID: 364ec34f-c989-47b7-b2e4-a07185e84b79
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.168:/rhs/brick6/b1
Brick2: 10.70.37.199:/rhs/brick6/b1
Options Reconfigured:
cluster.consistent-metadata: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: on
performance.readdir-ahead: on

--- Additional comment from Niels de Vos on 2015-07-02 10:19:15 EDT ---

Under investigation...

--- Additional comment from Niels de Vos on 2015-07-02 11:32:54 EDT ---

This is not easily reproducible; I have run the "git checkout" and "./autogen.sh" a few times now, but it continues to succeed for me. I tested both with the default volume options and with the ones given in comment #3.

How many times out of how many runs does this fail for you? Do you have the logs somewhere so that I can have a look?

A typical run on my test environment looks like this:

[root@vm016 ~]# time /tmp/clone-and-autogen.sh
+ mount -t nfs -o vers=3 vm017.example.com:/bz1238404 /mnt/
+ pushd /mnt/
/mnt ~
+ git clone /srv/src/glusterfs/
Cloning into 'glusterfs'...
done.
Checking out files: 100% (1877/1877), done.
+ pushd glusterfs/
/mnt/glusterfs /mnt ~
+ ./autogen.sh

... GlusterFS autogen ...

Running aclocal...
Running autoheader...
Running libtoolize...
Running autoconf...
Running automake...
configure.ac:249: installing './config.guess'
configure.ac:249: installing './config.sub'
configure.ac:16: installing './install-sh'
configure.ac:16: installing './missing'
api/examples/Makefile.am: installing './depcomp'
geo-replication/syncdaemon/Makefile.am:3: installing './py-compile'
parallel-tests: installing './test-driver'

Running autogen.sh in argp-standalone ...
configure.ac:10: installing './install-sh'
configure.ac:10: installing './missing'
Makefile.am: installing './depcomp'

Please proceed with configuring, compiling, and installing.
+ popd
/mnt ~
+ rm -rf glusterfs
+ popd
~
+ umount /mnt

real    4m29.031s
user    0m28.391s
sys     0m6.374s

--- Additional comment from Niels de Vos on 2015-07-02 11:47:57 EDT ---

I've taken a look at the nfs.log from the sosreports mentioned in comment #2. There are quite some obvious messages in sosreport-dhcp37-168.1238404-20150702010409.tar.xz, and I wonder if you have missed those? I do not know if that is the NFS-server you mounted the volume from; the other sosreport does not have them.

It is also not clear which NFS-client you used, and whether you have an sosreport from that one. It would be trivial to check if a firewall and/or rpcbind is enabled and running there...

[2015-07-01 19:18:28.943961] E [MSGID: 112167] [nlm4.c:1013:nlm4_establish_callback] 0-nfs-NLM: Unable to get NLM port of the client. Is the firewall running on client? OR Are RPC services running (rpcinfo -p)?

--- Additional comment from Shruti Sampat on 2015-07-03 03:08:52 EDT ---

sosreports from the NFS client -
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1238404/sosreport-vm8-rhsqa13.1238404-20150703115855-5b79.tar.xz

Let me know if I can provide any other information.

--- Additional comment from Niels de Vos on 2015-07-03 04:10:47 EDT ---

Please also answer the questions I posted in the previous comments:

- How many times out of how many runs does this fail for you?
- There are quite some obvious messages in sosreport-dhcp37-168.1238404-20150702010409.tar.xz, and I wonder if you have missed those?
- Did you check if rpcbind was running and no firewall interfered?

--- Additional comment from Shruti Sampat on 2015-07-03 05:54:45 EDT ---

(In reply to Niels de Vos from comment #8)
> Please also answer the questions I posted in the previous comments:
>
> - How many times out of how many runs does this fail for you?

I have tried about 2-3 times and it has failed every time.

> - There are quite some obvious messages in
> sosreport-dhcp37-168.1238404-20150702010409.tar.xz, and I wonder if you
> have missed those?

Are you referring to these messages -

[2015-07-01 19:18:28.943961] E [MSGID: 112167] [nlm4.c:1013:nlm4_establish_callback] 0-nfs-NLM: Unable to get NLM port of the client. Is the firewall running on client? OR Are RPC services running (rpcinfo -p)?

I have seen them. I checked iptables and rpcbind and they seemed to be okay. Am I missing something here?

> - Did you check if rpcbind was running and no firewall interfered?

rpcbind was running, see below.

[root@dhcp37-168 ~]# pgrep rpcbind
2048

Firewall does not seem to be an issue either.

[root@dhcp37-168 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

--- Additional comment from Niels de Vos on 2015-07-03 14:40:38 EDT ---

I have tried again in a clean environment, but still cannot reproduce this.

Did you check for a firewall and rpcbind on the NFS-CLIENT too? You gave the output from one of the NFS-SERVERs (dhcp37-168), but the NFS-CLIENT has vm8-rhsqa13.lab.eng.blr.redhat.com as its hostname.

Can you give me access to an environment where this problem occurs? That would surely speed things up.
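For reference, the checks being asked for here come down to a few standard commands. A minimal sketch, with CLIENT_IP as a placeholder for the NFS-client's address (not a value taken from this setup):

# on the NFS-client: is rpcbind running, and are the NLM/status services registered?
pgrep rpcbind
rpcinfo -p | grep -E 'portmapper|nlockmgr|status'
# any firewall rules that could block callbacks from the NFS-server?
iptables -L -n

# from the NFS-server: can the client's portmapper be reached at all?
rpcinfo -p CLIENT_IP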
--- Additional comment from Shruti Sampat on 2015-07-04 00:54:35 EDT ---

On the NFS-client (vm8-rhsqa13), the firewall was running as follows -

[root@vm8-rhsqa13 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

[root@vm8-rhsqa13 ~]# pgrep rpcbind
1200

I flushed the iptables rules, tried running autogen.sh again, and could easily reproduce the issue.

Setup details for your use below. The password for root on all machines is `rhscqe'.

It is a 6-node cluster, with bricks from the following servers -

dhcp37-208.lab.eng.blr.redhat.com
dhcp37-134.lab.eng.blr.redhat.com

The volume being exported is rep2 -

[root@dhcp37-134 ~]# gluster v info rep2

Volume Name: rep2
Type: Replicate
Volume ID: b1ab634f-3ba4-4321-b57a-90f1d33ec06f
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.208:/rhs/brick6/b1
Brick2: 10.70.37.134:/rhs/brick6/b1
Options Reconfigured:
cluster.consistent-metadata: on
performance.readdir-ahead: on

NFS-client - vm8-rhsqa13.lab.eng.blr.redhat.com

[root@vm8-rhsqa13 ~]# mount -l -t nfs
10.70.37.168:rep2 on /mnt/rep type nfs (rw,vers=3,addr=10.70.37.168)

Let me know if I can provide any other information.

--- Additional comment from Niels de Vos on 2015-07-04 04:32:18 EDT ---

On the NFS-client:

# ps axf
...
15541 pts/0    S+     0:00  |   \_ /bin/sh ./autogen.sh
15605 pts/0    S+     0:01  |       \_ /usr/bin/perl -w /usr...
15606 pts/0    S+     0:00  |           \_ /usr/bin/perl -w ...
...

# cat /proc/15606/stack
[<ffffffffa023e536>] nlmclnt_block+0xe6/0x130 [lockd]
[<ffffffffa023f53e>] nlmclnt_proc+0x25e/0x740 [lockd]
[<ffffffffa0274478>] nfs3_proc_lock+0x28/0x30 [nfs]
[<ffffffffa025fb68>] do_setlk+0xf8/0x110 [nfs]
[<ffffffffa025fc0f>] nfs_flock+0x8f/0xf0 [nfs]
[<ffffffff811dff5d>] sys_flock+0x10d/0x1c0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

For some reason, NLM on the NFS-client seems to be stuck. Most probably because the NFS-server could not connect to the NLM port earlier (like the log message in comment #6).

I have killed the processes that keep the mountpoint open, so that I could have a clean attempt:

# lsof /mnt/rep/
...
kill $PIDs
# umount /mnt/rep/
# rmmod nfs
# rmmod lockd
# service rpcbind restart

The plan was to restart the NFS-server for the volume "rep" by disabling and re-enabling it with the "nfs.disable" option. But that failed due to a problem in the Trusted Pool:

[root@dhcp37-168 ~]# gluster volume set rep nfs.disable true
volume set: failed: Commit failed on 10.70.37.60. Please check log file for details.
Commit failed on 10.70.37.115. Please check log file for details.
Commit failed on 10.70.37.134. Please check log file for details.
Commit failed on 10.70.37.208. Please check log file for details.

Fell back to killing the glusterfs/nfs process and running "gluster volume start rep force".

Mounting again and tail'ing /var/log/glusterfs/nfs.log shows these messages when starting ./autogen.sh:

[2015-07-04 08:07:22.120983] E [MSGID: 112167] [nlm4.c:1013:nlm4_establish_callback] 0-nfs-NLM: Unable to get NLM port of the client. Is the firewall running on client?
OR Are RPC services running (rpcinfo -p)?
[2015-07-04 08:07:22.120991] E [MSGID: 112164] [nlm4.c:558:nsm_monitor] 0-nfs-NLM: Clnt_create(): RPC: Remote system error - Permission denied

That's weird, something prevents access to the NLM service on the client. No more firewalls (checked *again*), and rpcbind on the client reports that everything is fine too:

[root@dhcp37-168 ~]# rpcinfo -p 10.70.44.89
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  38417  status
    100024    1   tcp  58958  status
    100021    1   udp  47831  nlockmgr
    100021    3   udp  47831  nlockmgr
    100021    4   udp  47831  nlockmgr
    100021    1   tcp  33163  nlockmgr
    100021    3   tcp  33163  nlockmgr
    100021    4   tcp  33163  nlockmgr

This seems to be an SELinux issue. After changing the SELinux mode to "Permissive" instead of "Enforcing", things just work. AVC denial in /var/log/audit/audit.log:

type=AVC msg=audit(1435997568.099:13230): avc: denied { name_connect } for pid=16323 comm="glusterfs" dest=111 scontext=unconfined_u:system_r:glusterd_t:s0 tcontext=system_u:object_r:portmap_port_t:s0 tclass=tcp_socket

The Gluster/NFS server needs to be able to connect to RPC services on the NFS-client:
- portmapper (port 111)
- nlockmgr (dynamically assigned)
- status (dynamically assigned)

Depending on the NFS-client and its configuration, TCP or UDP can be required.

Processes/binaries that are involved, hopefully making it easier for the SELinux people to modify the policy:
- glusterd: the main Gluster management daemon that starts the glusterfs (Gluster client) binary with an NFS-server configuration
- glusterfs: the binary that acts as an NFS-server on one side and a Gluster client on the other side (similar to a proxy/gateway)

--- Additional comment from Niels de Vos on 2015-07-04 05:00:40 EDT ---

Prasanth,

to have locking on Gluster/NFS work, the "glusterfs" binary acting as NFS-server needs to be allowed to connect to some of the RPC services on the NFS-client. At the moment SELinux prevents this. I am not sure (yet) if this is only an issue on RHEL6, or also on RHEL7.

Should we replace this bug with a dedicated one (or two, if RHEL7 is affected too) for correcting the selinux-policy?

--- Additional comment from Prasanth on 2015-07-06 02:33:10 EDT ---

(In reply to Niels de Vos from comment #13)
> Prasanth,
>
> to have locking on Gluster/NFS work, the "glusterfs" binary acting as
> NFS-server needs to be allowed to connect to some of the RPC services on the
> NFS-client. At the moment SELinux prevents this. I am not sure (yet) if this
> is only an issue on RHEL6, or also on RHEL7.
>
> Should we replace this bug with a dedicated one (or two, if RHEL7 is
> affected too) for correcting the selinux-policy?

I would recommend having separate RHGS BZs for RHEL-6 and RHEL-7 (if RHEL7 is affected too) and cloning them against the "selinux-policy" component in RHEL-6 and RHEL-7 to get the corresponding SELinux fixes. Once the fix is made available in RHEL-7.2 and has been tested and verified by QE, I'll propose a RHEL-7.1.Z clone so that we get it backported to 7.1.

Hope this helps!
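As a side note for anyone reproducing this: the Permissive-mode check described in comment #12 can be repeated with standard RHEL tooling. A minimal sketch, run as root on the Gluster/NFS server and assuming auditd is logging the AVCs there; "setenforce 0" is only a temporary diagnostic, not a fix:

# confirm the current SELinux mode (expected: Enforcing)
getenforce
# temporarily switch to Permissive and re-run the locking workload
setenforce 0
# if the hang is gone, list the recent denials that were blocking it
ausearch -m AVC -ts recent
# switch back to Enforcing afterwards
setenforce 1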
--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-07-06 10:25:19 EDT ---

Since this bug has been approved for the Red Hat Gluster Storage 3.1.0 release, through release flag 'rhgs-3.1.0+', the Target Release is being automatically set to 'RHGS 3.1.0'.

--- Additional comment from Milos Malik on 2015-07-07 05:19:22 EDT ---

Here is a beaker task which provides a local policy that solves the AVC in comment #12. You can prepend it to the list of your beaker tasks:

--task "! echo -en 'policy_module(bz1238404,1.0)\n\nrequire {\ntype glusterd_t;\n}\n\ncorenet_tcp_connect_portmap_port(glusterd_t)\n' > bz1238404.te ; make -f /usr/share/selinux/devel/Makefile ; semodule -i bz1238404.pp ; semodule -l | grep bz1238404"
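Written out as a .te file, the local policy embedded in that beaker task looks like the sketch below (building it requires the SELinux policy development files that provide /usr/share/selinux/devel/Makefile):

policy_module(bz1238404, 1.0)

require {
        type glusterd_t;
}

# allow the glusterd_t domain (which, per the AVC in comment #12, also covers
# the glusterfs NFS-server process) to make TCP connections to the portmapper
# port (111) on the NFS-client
corenet_tcp_connect_portmap_port(glusterd_t)

Build and load it with the same commands the beaker task runs:

make -f /usr/share/selinux/devel/Makefile
semodule -i bz1238404.pp
semodule -l | grep bz1238404

Note that this interface only covers the portmapper denial; comment #12 also lists nlockmgr and status on dynamically assigned ports, which may need additional policy.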
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0763.html