Bug 1257957
Summary: | nfs-ganesha: nfs-ganesha process gets killed while executing UNLOCK with a cthon test on vers=3 | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Saurabh <saujain> |
Component: | nfs-ganesha | Assignee: | Niels de Vos <ndevos> |
Status: | CLOSED ERRATA | QA Contact: | Matt Zywusko <mzywusko> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | rhgs-3.1 | CC: | byarlaga, mzywusko, ndevos, nlevinki, rcyriac, sankarshan, skoduri |
Target Milestone: | --- | Keywords: | Regression, ZStream |
Target Release: | RHGS 3.1.2 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-03-01 05:34:50 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1244792, 1260783 | ||
Attachments: |
Description
Saurabh
2015-08-28 13:20:48 UTC
Created attachment 1067996 [details]
sosreport of nfs11
There was a segmentation fault that triggered the failover: Aug 28 23:31:46 nfs11 kernel: ganesha.nfsd[3837]: segfault at 18 ip 0000000000491eba sp 00007fb1377fce40 error 4 in ganesha.nfsd[400000+1f0000] The sosreport does not contain the coredump, could you upload it separately? After some updating settings in the system, I was able to generate the coredump, though due to lack of some debug-symbols it's non-informative, (gdb) bt #0 0x0000000000491eba in nlm_send_async () #1 0x0000000000493577 in nlm4_send_grant_msg () #2 0x000000000049d630 in state_async_func_caller () #3 0x000000000050afeb in fridgethr_start_routine () #4 0x00007f477c143df5 in start_thread () from /lib64/libpthread.so.0 #5 0x00007f477bc691ad in clone () from /lib64/libc.so.6 Created attachment 1069040 [details]
coredump nfs-ganesha on nfs11
*** Bug 1242756 has been marked as a duplicate of this bug. *** Created attachment 1070023 [details]
Try again when getaddrinfo() fails
It seems that getaddrinfo() in src/Protocols/NLM/nlm_async.c:222 fails. This causes 'result' to be NULL, and accessing result->ai_addr crashes.
The attached patch logs the error, and tries again by using the existing for-loop.
Created attachment 1070071 [details]
Try again when getaddrinfo() fails, with short delay
The previous patch leads to quickly repeated errors, all in the same second, probably not what we want:
failed to resolve vm020.example.com to an address: Name or service not known
failed to resolve vm020.example.com to an address: Name or service not known
MAJ :GRANTED_MSG RPC call failed with return code 19. Removing the blocking lock
This patch adds a nanosleep() with a timeout that has been pre-defined. It's rather short, but maybe sufficient to have a retry work correctly.
Posted patch for upstream review: https://review.gerrithub.io/245285 Thanks Niels. cthon fails to take the lock, # time ./server -l -o vers=3 -p /vol1 -m /mnt -N 1 10.70.44.92 sh ./runtests -l -t /mnt/dhcp35-194.test Starting LOCKING tests: test directory /mnt/dhcp35-194.test (arg: -t) Testing native post-LFS locking Creating parent/child synchronization pipes. Test #1 - Test regions of an unlocked file. Parent: 1.1 - F_TEST [ 0, 1] PASSED. Parent: 1.2 - F_TEST [ 0, ENDING] PASSED. Parent: 1.3 - F_TEST [ 0,7fffffffffffffff] PASSED. Parent: 1.4 - F_TEST [ 1, 1] PASSED. Parent: 1.5 - F_TEST [ 1, ENDING] PASSED. Parent: 1.6 - F_TEST [ 1,7fffffffffffffff] PASSED. Parent: 1.7 - F_TEST [7fffffffffffffff, 1] PASSED. Parent: 1.8 - F_TEST [7fffffffffffffff, ENDING] PASSED. Parent: 1.9 - F_TEST [7fffffffffffffff,7fffffffffffffff] PASSED. Test #2 - Try to lock the whole file. Parent: 2.0 - F_TLOCK [ 0, ENDING] FAILED! Parent: **** Expected success, returned errno=37... Parent: **** Probably implementation error. ** PARENT pass 1 results: 9/9 pass, 0/0 warn, 1/1 fail (pass/total). ** CHILD pass 1 results: 0/0 pass, 0/0 warn, 0/0 fail (pass/total). lock tests failed Tests failed, leaving /mnt mounted real 0m3.221s user 0m0.029s sys 0m0.045s rpms in consideration, glusterfs-3.7.1-15.el7rhgs.x86_64 nfs-ganesha-2.2.0-7.el7rhgs.x86_64 The fix for this bug is provided with nfs-ganesha-2.2.0-8.el6 version. Though the issue seen now is not same as the one reported in the bug description, it shall help to know the cthon results after installing the rpms containing the fix. I am on RHGS 3.1.1(based on RHEL 7.1), a. the firewall settings are already done, # firewall-cmd --zone=public --list-service dhcpv6-client glusterfs high-availability mountd nfs rpc-bind ssh b. the client and server are able to ping each other using the hostname, # ping nfs11 PING nfs11 (10.70.44.92) 56(84) bytes of data. 64 bytes from nfs11 (10.70.44.92): icmp_seq=1 ttl=63 time=0.455 ms c. still the vers=3 lock test of cthon does not more forward, # time ./server -l -o vers=3 -p /vol1 -m /mnt -N 3 nfs11 sh ./runtests -N 3 -l /mnt/rhsauto003.test ... Pass 1 ... Starting LOCKING tests: test directory /mnt/rhsauto003.test (arg: /mnt/rhsauto003.test) Testing native post-LFS locking Creating parent/child synchronization pipes. Test #1 - Test regions of an unlocked file. from client, # ps -auxww | grep tlock root 31944 0.0 0.0 6524 512 pts/2 D+ 09:54 0:00 tlocklfs /mnt/rhsauto003.test root 31945 0.0 0.0 6524 196 pts/2 S+ 09:54 0:00 tlocklfs /mnt/rhsauto003.test Created attachment 1082326 [details]
packet trace from nfs11
Created attachment 1091691 [details] NFS client conversation while connecting to the NLM service It seems that the NFS-server does not allow the NFS-client to connect to the NLM service. The attached text file contains the following conversation: NFS-client: portmap/getport NLM NFS-server: NLM is on port 58575 NFS-client: TCP connect on port 58575 NFS-server: return "Host administratively prohibited" This suggests that the firewalld configuration is not sufficient. The NLM service needs to be opened up in the firewall too. Unfortunately there is no standard port for NLM to listen on. You might want to configure a NLM port in /etc/ganesha/ganesha.conf... http://thread.gmane.org/gmane.linux.nfs/73398 contains the upstream discussion about this. We should probably follow up on that. Saurabh, could you test again with the ports opened, or with firewall disabled? Yeah, the UNLOCK works fine. I tested it with vers=3 and cthon lock test while keeping firewalld disabled. After opening the ports for NLM. The cthon locks test are working. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html |