Description of problem:
nfs-ganesha gets killed once the cthon lock test is executed; it fails while unlocking in Test #7.

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-13.el7rhgs.x86_64
nfs-ganesha-2.2.0-6.el7rhgs.x86_64

How reproducible:
Every time

Steps to Reproduce:
1. Create a volume of type 6x2.
2. Configure nfs-ganesha.
3. On the client, execute the cthon lock test while mounting with option vers=3.

Actual results:
The cthon lock suite's Test #7 waits for UNLOCK.

Server-side /var/log/messages:

Aug 28 23:31:46 nfs11 kernel: ganesha.nfsd[3837]: segfault at 18 ip 0000000000491eba sp 00007fb1377fce40 error 4 in ganesha.nfsd[400000+1f0000]
Aug 28 23:31:46 nfs11 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Aug 28 23:31:46 nfs11 systemd: Unit nfs-ganesha.service entered failed state.
Aug 28 23:31:49 nfs11 cibadmin[3887]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 28 23:31:49 nfs11 cibadmin[3887]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
Aug 28 23:31:49 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11-dead_ip-1_monitor_0: not running (node=nfs11.lab.eng.blr.redhat.com, call=599, rc=7, cib-update=282, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11-dead_ip-1_start_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=602, rc=0, cib-update=283, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11-dead_ip-1_monitor_10000: ok (node=nfs11.lab.eng.blr.redhat.com, call=603, rc=0, cib-update=284, confirmed=false)
Aug 28 23:31:56 nfs11 IPaddr(nfs11.lab.eng.blr.redhat.com-cluster_ip-1)[3921]: INFO: IP status = ok, IP_CIP=
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11.lab.eng.blr.redhat.com-cluster_ip-1_stop_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=601, rc=0, cib-update=285, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs-grace_stop_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=605, rc=0, cib-update=286, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11.lab.eng.blr.redhat.com-trigger_ip-1_stop_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=607, rc=0, cib-update=287, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs-grace_start_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=608, rc=0, cib-update=288, confirmed=true)
Aug 28 23:31:57 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs-grace_monitor_5000: ok (node=nfs11.lab.eng.blr.redhat.com, call=609, rc=0, cib-update=289, confirmed=false)
Aug 28 23:32:01 nfs11 systemd: Starting Session 7379 of user root.
Aug 28 23:32:01 nfs11 systemd: Started Session 7379 of user root.
Aug 28 23:32:06 nfs11 logger: warning: pcs resource create nfs11-dead_ip-1 ocf:heartbeat:Dummy failed
Aug 28 23:32:23 nfs11 logger: warning: pcs resource create nfs11-dead_ip-1 ocf:heartbeat:Dummy failed
Aug 28 23:32:41 nfs11 logger: warning: pcs resource create nfs11-dead_ip-1 ocf:heartbeat:Dummy failed
Aug 28 23:32:58 nfs11 logger: warning: pcs resource create nfs11-dead_ip-1 ocf:heartbeat:Dummy failed
Aug 28 23:33:01 nfs11 systemd: Starting Session 7380 of user root.
Aug 28 23:33:01 nfs11 systemd: Started Session 7380 of user root.
pcs status:

Cluster name: nozomer
Last updated: Sat Aug 29 00:12:08 2015
Last change: Fri Aug 28 23:37:49 2015
Stack: corosync
Current DC: nfs12.lab.eng.blr.redhat.com (2) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
15 Resources configured

Online: [ nfs11.lab.eng.blr.redhat.com nfs12.lab.eng.blr.redhat.com nfs13.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs11.lab.eng.blr.redhat.com nfs12.lab.eng.blr.redhat.com nfs13.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs11.lab.eng.blr.redhat.com ]
     Stopped: [ nfs12.lab.eng.blr.redhat.com nfs13.lab.eng.blr.redhat.com ]
 nfs11.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 nfs11.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs12.lab.eng.blr.redhat.com
 nfs12.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 nfs12.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs13.lab.eng.blr.redhat.com
 nfs13.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 nfs13.lab.eng.blr.redhat.com-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs12.lab.eng.blr.redhat.com
 nfs11-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs13.lab.eng.blr.redhat.com
 nfs13-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs11.lab.eng.blr.redhat.com
 nfs12-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs13.lab.eng.blr.redhat.com

PCSD Status:
  nfs11.lab.eng.blr.redhat.com: Online
  nfs12.lab.eng.blr.redhat.com: Online
  nfs13.lab.eng.blr.redhat.com: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled

Expected results:
The nfs-ganesha process is not supposed to get killed because of an UNLOCK call; the lock call should be handled by the process.

Additional info:
Created attachment 1067996 [details]
sosreport of nfs11
There was a segmentation fault that triggered the failover:

Aug 28 23:31:46 nfs11 kernel: ganesha.nfsd[3837]: segfault at 18 ip 0000000000491eba sp 00007fb1377fce40 error 4 in ganesha.nfsd[400000+1f0000]

The sosreport does not contain the coredump; could you upload it separately?
After updating some settings on the system, I was able to generate the coredump, though due to the lack of some debug symbols it is not very informative:

(gdb) bt
#0  0x0000000000491eba in nlm_send_async ()
#1  0x0000000000493577 in nlm4_send_grant_msg ()
#2  0x000000000049d630 in state_async_func_caller ()
#3  0x000000000050afeb in fridgethr_start_routine ()
#4  0x00007f477c143df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f477bc691ad in clone () from /lib64/libc.so.6
Created attachment 1069040 [details]
coredump nfs-ganesha on nfs11
*** Bug 1242756 has been marked as a duplicate of this bug. ***
Created attachment 1070023 [details]
Try again when getaddrinfo() fails

It seems that the getaddrinfo() call in src/Protocols/NLM/nlm_async.c:222 fails. This causes 'result' to be NULL, and accessing result->ai_addr crashes. The attached patch logs the error and tries again by using the existing for-loop.
Created attachment 1070071 [details]
Try again when getaddrinfo() fails, with short delay

The previous patch leads to quickly repeated errors, all within the same second, which is probably not what we want:

failed to resolve vm020.example.com to an address: Name or service not known
failed to resolve vm020.example.com to an address: Name or service not known
MAJ :GRANTED_MSG RPC call failed with return code 19. Removing the blocking lock

This patch adds a nanosleep() with a pre-defined timeout. It is rather short, but may be sufficient for a retry to succeed.
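The idea behind the patch can be sketched roughly as follows. This is a minimal standalone illustration, not the actual nlm_async.c code: the function name resolve_with_retry(), the retry count, and the 250 ms delay are made up for the example; the real patch uses the existing for-loop and a pre-defined timeout.

```c
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <time.h>
#include <sys/socket.h>

/* Hypothetical helper: resolve 'host', retrying a few times with a short
 * sleep between attempts, instead of dereferencing a NULL result. */
static int resolve_with_retry(const char *host, struct addrinfo **result)
{
	struct addrinfo hints;
	const struct timespec delay = { 0, 250 * 1000 * 1000 }; /* 250 ms */
	int rc = EAI_AGAIN;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;

	for (int attempt = 0; attempt < 4; attempt++) {
		rc = getaddrinfo(host, NULL, &hints, result);
		if (rc == 0 && *result != NULL)
			return 0; /* resolved; (*result)->ai_addr is now safe to use */
		fprintf(stderr, "failed to resolve %s to an address: %s\n",
			host, gai_strerror(rc));
		nanosleep(&delay, NULL); /* back off before the next attempt */
	}
	return rc; /* still failing; the caller must give up cleanly, not crash */
}

int main(void)
{
	struct addrinfo *res = NULL;

	/* "127.0.0.1" is numeric, so this resolves without DNS. */
	if (resolve_with_retry("127.0.0.1", &res) == 0) {
		printf("resolved\n");
		freeaddrinfo(res);
	}
	return 0;
}
```

The key point is that every code path checks the getaddrinfo() return value before touching the result, so a transient resolver failure produces a logged error and a delayed retry rather than a segfault.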
Posted patch for upstream review: https://review.gerrithub.io/245285
Thanks Niels.
cthon fails to take the lock:

# time ./server -l -o vers=3 -p /vol1 -m /mnt -N 1 10.70.44.92
sh ./runtests -l -t /mnt/dhcp35-194.test
Starting LOCKING tests: test directory /mnt/dhcp35-194.test (arg: -t)
Testing native post-LFS locking
Creating parent/child synchronization pipes.

Test #1 - Test regions of an unlocked file.
	Parent: 1.1  - F_TEST  [               0,               1] PASSED.
	Parent: 1.2  - F_TEST  [               0,          ENDING] PASSED.
	Parent: 1.3  - F_TEST  [               0,7fffffffffffffff] PASSED.
	Parent: 1.4  - F_TEST  [               1,               1] PASSED.
	Parent: 1.5  - F_TEST  [               1,          ENDING] PASSED.
	Parent: 1.6  - F_TEST  [               1,7fffffffffffffff] PASSED.
	Parent: 1.7  - F_TEST  [7fffffffffffffff,               1] PASSED.
	Parent: 1.8  - F_TEST  [7fffffffffffffff,          ENDING] PASSED.
	Parent: 1.9  - F_TEST  [7fffffffffffffff,7fffffffffffffff] PASSED.
Test #2 - Try to lock the whole file.
	Parent: 2.0  - F_TLOCK [               0,          ENDING] FAILED!
	Parent: **** Expected success, returned errno=37...
	Parent: **** Probably implementation error.

** PARENT pass 1 results: 9/9 pass, 0/0 warn, 1/1 fail (pass/total).
** CHILD pass 1 results: 0/0 pass, 0/0 warn, 0/0 fail (pass/total).
lock tests failed
Tests failed, leaving /mnt mounted

real	0m3.221s
user	0m0.029s
sys	0m0.045s

rpms in consideration:
glusterfs-3.7.1-15.el7rhgs.x86_64
nfs-ganesha-2.2.0-7.el7rhgs.x86_64
The fix for this bug is provided with the nfs-ganesha-2.2.0-8.el6 version. Though the issue seen now is not the same as the one reported in the bug description, it would help to know the cthon results after installing the rpms containing the fix.
I am on RHGS 3.1.1 (based on RHEL 7.1).

a. The firewall settings are already done:

# firewall-cmd --zone=public --list-service
dhcpv6-client glusterfs high-availability mountd nfs rpc-bind ssh

b. The client and server are able to ping each other using the hostname:

# ping nfs11
PING nfs11 (10.70.44.92) 56(84) bytes of data.
64 bytes from nfs11 (10.70.44.92): icmp_seq=1 ttl=63 time=0.455 ms

c. Still, the vers=3 lock test of cthon does not move forward:

# time ./server -l -o vers=3 -p /vol1 -m /mnt -N 3 nfs11
sh ./runtests -N 3 -l /mnt/rhsauto003.test
... Pass 1 ...
Starting LOCKING tests: test directory /mnt/rhsauto003.test (arg: /mnt/rhsauto003.test)
Testing native post-LFS locking
Creating parent/child synchronization pipes.

Test #1 - Test regions of an unlocked file.

From the client:

# ps -auxww | grep tlock
root     31944  0.0  0.0   6524   512 pts/2    D+   09:54   0:00 tlocklfs /mnt/rhsauto003.test
root     31945  0.0  0.0   6524   196 pts/2    S+   09:54   0:00 tlocklfs /mnt/rhsauto003.test
Created attachment 1082326 [details]
packet trace from nfs11
Created attachment 1091691 [details]
NFS client conversation while connecting to the NLM service

It seems that the NFS server does not allow the NFS client to connect to the NLM service. The attached text file contains the following conversation:

NFS-client: portmap/getport NLM
NFS-server: NLM is on port 58575
NFS-client: TCP connect on port 58575
NFS-server: return "Host administratively prohibited"

This suggests that the firewalld configuration is not sufficient: the NLM service needs to be opened up in the firewall too. Unfortunately there is no standard port for NLM to listen on. You might want to configure an NLM port in /etc/ganesha/ganesha.conf.

http://thread.gmane.org/gmane.linux.nfs/73398 contains the upstream discussion about this. We should probably follow up on that.
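For reference, one way to make the NLM port predictable is the NLM_Port setting in the NFS_Core_Param block of ganesha.conf. The port number 32803 below is only an example (it is a value commonly used for lockd with the kernel NFS server), not a recommendation from this bug:

```
# /etc/ganesha/ganesha.conf (fragment, example port)
NFS_Core_Param {
    # Make NLM listen on a fixed port so it can be opened in firewalld
    NLM_Port = 32803;
}
```

With a fixed port in place, the corresponding firewalld change would be along the lines of `firewall-cmd --zone=public --add-port=32803/tcp --add-port=32803/udp --permanent` followed by `firewall-cmd --reload`.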
Saurabh, could you test again with the ports opened, or with firewall disabled?
Yeah, the UNLOCK works fine. I tested it with vers=3 and cthon lock test while keeping firewalld disabled.
After opening the ports for NLM, the cthon lock tests are working.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html