Bug 1257957 - nfs-ganesha: nfs-ganesha process gets killed while executing UNLOCK with a cthon test on vers=3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.2
Assignee: Niels de Vos
QA Contact: Matt Zywusko
URL:
Whiteboard:
Duplicates: 1242756 (view as bug list)
Depends On:
Blocks: 1244792 1260783
 
Reported: 2015-08-28 13:20 UTC by Saurabh
Modified: 2016-03-01 05:34 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-01 05:34:50 UTC
Embargoed:


Attachments (Terms of Use)
sosreport of nfs11 (13.00 MB, application/x-xz)
2015-08-28 13:35 UTC, Saurabh
no flags Details
coredump nfs-ganesha on nfs11 (4.30 MB, application/x-bzip)
2015-09-01 14:34 UTC, Saurabh
no flags Details
Try again when getaddrinfo() fails (1.75 KB, patch)
2015-09-03 18:02 UTC, Niels de Vos
no flags Details | Diff
Try again when getaddrinfo() fails, with short delay (2.28 KB, patch)
2015-09-03 19:42 UTC, Niels de Vos
no flags Details | Diff
packet trace from nfs11 (1.60 MB, application/octet-stream)
2015-10-13 10:44 UTC, Saurabh
no flags Details
NFS client conversation while connecting to the NLM service (2.42 KB, text/plain)
2015-11-09 11:18 UTC, Niels de Vos
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0193 0 normal SHIPPED_LIVE Red Hat Gluster Storage 3.1 update 2 2016-03-01 10:20:36 UTC

Description Saurabh 2015-08-28 13:20:48 UTC
Description of problem:
nfs-ganesha gets killed once the cthon lock test is executed; it fails while unlocking in Test #7.

Version-Release number of selected component (if applicable):
glusterfs-3.7.1-13.el7rhgs.x86_64
nfs-ganesha-2.2.0-6.el7rhgs.x86_64

How reproducible:
every time

Steps to Reproduce:
1. create a volume of type 6x2
2. configure nfs-ganesha
3. on the client execute the cthon lock while mounting with option vers=3

Actual results:
cthon locks suite's Test#7 waits for UNLOCK

server side /var/log/messages,

Aug 28 23:31:46 nfs11 kernel: ganesha.nfsd[3837]: segfault at 18 ip 0000000000491eba sp 00007fb1377fce40 error 4 in ganesha.nfsd[400000+1f0000]
Aug 28 23:31:46 nfs11 systemd: nfs-ganesha.service: main process exited, code=killed, status=11/SEGV
Aug 28 23:31:46 nfs11 systemd: Unit nfs-ganesha.service entered failed state.
Aug 28 23:31:49 nfs11 cibadmin[3887]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Aug 28 23:31:49 nfs11 cibadmin[3887]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
Aug 28 23:31:49 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11-dead_ip-1_monitor_0: not running (node=nfs11.lab.eng.blr.redhat.com, call=599, rc=7, cib-update=282, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11-dead_ip-1_start_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=602, rc=0, cib-update=283, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11-dead_ip-1_monitor_10000: ok (node=nfs11.lab.eng.blr.redhat.com, call=603, rc=0, cib-update=284, confirmed=false)
Aug 28 23:31:56 nfs11 IPaddr(nfs11.lab.eng.blr.redhat.com-cluster_ip-1)[3921]: INFO: IP status = ok, IP_CIP=
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11.lab.eng.blr.redhat.com-cluster_ip-1_stop_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=601, rc=0, cib-update=285, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs-grace_stop_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=605, rc=0, cib-update=286, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs11.lab.eng.blr.redhat.com-trigger_ip-1_stop_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=607, rc=0, cib-update=287, confirmed=true)
Aug 28 23:31:56 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs-grace_start_0: ok (node=nfs11.lab.eng.blr.redhat.com, call=608, rc=0, cib-update=288, confirmed=true)
Aug 28 23:31:57 nfs11 crmd[22383]: notice: process_lrm_event: Operation nfs-grace_monitor_5000: ok (node=nfs11.lab.eng.blr.redhat.com, call=609, rc=0, cib-update=289, confirmed=false)
Aug 28 23:32:01 nfs11 systemd: Starting Session 7379 of user root.
Aug 28 23:32:01 nfs11 systemd: Started Session 7379 of user root.
Aug 28 23:32:06 nfs11 logger: warning: pcs resource create nfs11-dead_ip-1 ocf:heartbeat:Dummy failed
Aug 28 23:32:23 nfs11 logger: warning: pcs resource create nfs11-dead_ip-1 ocf:heartbeat:Dummy failed
Aug 28 23:32:41 nfs11 logger: warning: pcs resource create nfs11-dead_ip-1 ocf:heartbeat:Dummy failed
Aug 28 23:32:58 nfs11 logger: warning: pcs resource create nfs11-dead_ip-1 ocf:heartbeat:Dummy failed
Aug 28 23:33:01 nfs11 systemd: Starting Session 7380 of user root.
Aug 28 23:33:01 nfs11 systemd: Started Session 7380 of user root.

pcs status,
pcs status
Cluster name: nozomer
Last updated: Sat Aug 29 00:12:08 2015
Last change: Fri Aug 28 23:37:49 2015
Stack: corosync
Current DC: nfs12.lab.eng.blr.redhat.com (2) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
15 Resources configured


Online: [ nfs11.lab.eng.blr.redhat.com nfs12.lab.eng.blr.redhat.com nfs13.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs11.lab.eng.blr.redhat.com nfs12.lab.eng.blr.redhat.com nfs13.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs11.lab.eng.blr.redhat.com ]
     Stopped: [ nfs12.lab.eng.blr.redhat.com nfs13.lab.eng.blr.redhat.com ]
 nfs11.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped 
 nfs11.lab.eng.blr.redhat.com-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs12.lab.eng.blr.redhat.com 
 nfs12.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped 
 nfs12.lab.eng.blr.redhat.com-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs13.lab.eng.blr.redhat.com 
 nfs13.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped 
 nfs13.lab.eng.blr.redhat.com-trigger_ip-1	(ocf::heartbeat:Dummy):	Started nfs12.lab.eng.blr.redhat.com 
 nfs11-dead_ip-1	(ocf::heartbeat:Dummy):	Started nfs13.lab.eng.blr.redhat.com 
 nfs13-dead_ip-1	(ocf::heartbeat:Dummy):	Started nfs11.lab.eng.blr.redhat.com 
 nfs12-dead_ip-1	(ocf::heartbeat:Dummy):	Started nfs13.lab.eng.blr.redhat.com 

PCSD Status:
  nfs11.lab.eng.blr.redhat.com: Online
  nfs12.lab.eng.blr.redhat.com: Online
  nfs13.lab.eng.blr.redhat.com: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled


Expected results:
The nfs-ganesha process is not supposed to get killed because of an UNLOCK call; the lock call should be handled by the process.

Additional info:

Comment 2 Saurabh 2015-08-28 13:35:42 UTC
Created attachment 1067996 [details]
sosreport of nfs11

Comment 3 Niels de Vos 2015-09-01 08:07:36 UTC
There was a segmentation fault that triggered the failover:

Aug 28 23:31:46 nfs11 kernel: ganesha.nfsd[3837]: segfault at 18 ip 0000000000491eba sp 00007fb1377fce40 error 4 in ganesha.nfsd[400000+1f0000]


The sosreport does not contain the coredump, could you upload it separately?

Comment 4 Saurabh 2015-09-01 14:30:47 UTC
After updating some settings on the system, I was able to generate the coredump, though due to missing debug symbols it is not very informative:

(gdb) bt
#0  0x0000000000491eba in nlm_send_async ()
#1  0x0000000000493577 in nlm4_send_grant_msg ()
#2  0x000000000049d630 in state_async_func_caller ()
#3  0x000000000050afeb in fridgethr_start_routine ()
#4  0x00007f477c143df5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f477bc691ad in clone () from /lib64/libc.so.6

Comment 5 Saurabh 2015-09-01 14:34:23 UTC
Created attachment 1069040 [details]
coredump nfs-ganesha on nfs11

Comment 7 Niels de Vos 2015-09-03 15:50:12 UTC
*** Bug 1242756 has been marked as a duplicate of this bug. ***

Comment 9 Niels de Vos 2015-09-03 18:02:11 UTC
Created attachment 1070023 [details]
Try again when getaddrinfo() fails

It seems that getaddrinfo() in src/Protocols/NLM/nlm_async.c:222 fails. This causes 'result' to be NULL, and accessing result->ai_addr crashes.

The attached patch logs the error, and tries again by using the existing for-loop.

Comment 10 Niels de Vos 2015-09-03 19:42:34 UTC
Created attachment 1070071 [details]
Try again when getaddrinfo() fails, with short delay

The previous patch leads to quickly repeated errors, all in the same second, probably not what we want:

  failed to resolve vm020.example.com to an address: Name or service not known
  failed to resolve vm020.example.com to an address: Name or service not known
  MAJ :GRANTED_MSG RPC call failed with return code 19. Removing the blocking lock


This patch adds a nanosleep() with a pre-defined timeout. It is rather short, but may be sufficient for the retry to work correctly.
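For illustration, a minimal sketch of the retry approach the two patches describe: when getaddrinfo() fails, log the error and sleep briefly before retrying, instead of dereferencing the NULL result. The function name resolve_caller, the retry count, and the 500 ms delay are all assumptions for this sketch, not the values used in the actual patches.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Hypothetical sketch: retry a failed getaddrinfo() a few times with a
 * short delay, rather than crashing on a NULL result. Retry count and
 * delay are illustrative, not taken from the attached patches. */
#define RESOLVE_RETRIES  3
#define RESOLVE_DELAY_NS (500L * 1000 * 1000) /* 500 ms between attempts */

static int resolve_caller(const char *host, struct addrinfo **result)
{
    struct addrinfo hints;
    int rc = EAI_AGAIN;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    for (int attempt = 0; attempt < RESOLVE_RETRIES; attempt++) {
        rc = getaddrinfo(host, NULL, &hints, result);
        if (rc == 0)
            return 0; /* *result is now safe to dereference */

        fprintf(stderr, "failed to resolve %s to an address: %s\n",
                host, gai_strerror(rc));

        /* short delay before the next attempt, as in the second patch */
        struct timespec delay = { 0, RESOLVE_DELAY_NS };
        nanosleep(&delay, NULL);
    }
    return rc; /* caller must check rc before touching (*result)->ai_addr */
}

int main(void)
{
    struct addrinfo *res = NULL;

    if (resolve_caller("localhost", &res) == 0) {
        printf("resolved\n");
        freeaddrinfo(res);
    }
    return 0;
}
```

The key point is that the caller only touches result->ai_addr after a zero return code, which is the NULL dereference the backtrace in comment 4 points at.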

Comment 13 Niels de Vos 2015-09-04 08:31:37 UTC
Posted patch for upstream review: https://review.gerrithub.io/245285

Comment 16 Vivek Agarwal 2015-09-12 08:46:08 UTC
Thanks Niels.

Comment 17 Saurabh 2015-09-16 09:58:03 UTC
cthon fails to take the lock,

# time ./server -l -o vers=3 -p /vol1 -m /mnt -N 1 10.70.44.92
sh ./runtests  -l -t /mnt/dhcp35-194.test

Starting LOCKING tests: test directory /mnt/dhcp35-194.test (arg: -t)

Testing native post-LFS locking

Creating parent/child synchronization pipes.

Test #1 - Test regions of an unlocked file.
	Parent: 1.1  - F_TEST  [               0,               1] PASSED.
	Parent: 1.2  - F_TEST  [               0,          ENDING] PASSED.
	Parent: 1.3  - F_TEST  [               0,7fffffffffffffff] PASSED.
	Parent: 1.4  - F_TEST  [               1,               1] PASSED.
	Parent: 1.5  - F_TEST  [               1,          ENDING] PASSED.
	Parent: 1.6  - F_TEST  [               1,7fffffffffffffff] PASSED.
	Parent: 1.7  - F_TEST  [7fffffffffffffff,               1] PASSED.
	Parent: 1.8  - F_TEST  [7fffffffffffffff,          ENDING] PASSED.
	Parent: 1.9  - F_TEST  [7fffffffffffffff,7fffffffffffffff] PASSED.

Test #2 - Try to lock the whole file.
	Parent: 2.0  - F_TLOCK [               0,          ENDING] FAILED!
	Parent: **** Expected success, returned errno=37...
	Parent: **** Probably implementation error.

** PARENT pass 1 results: 9/9 pass, 0/0 warn, 1/1 fail (pass/total).

**  CHILD pass 1 results: 0/0 pass, 0/0 warn, 0/0 fail (pass/total).
lock tests failed
Tests failed, leaving /mnt mounted

real	0m3.221s
user	0m0.029s
sys	0m0.045s


rpms in consideration,
glusterfs-3.7.1-15.el7rhgs.x86_64
nfs-ganesha-2.2.0-7.el7rhgs.x86_64

Comment 21 Soumya Koduri 2015-09-16 16:57:16 UTC
The fix for this bug is provided with nfs-ganesha-2.2.0-8.el6 version.

Though the issue seen now is not the same as the one reported in the bug description, it would help to know the cthon results after installing the rpms containing the fix.

Comment 24 Saurabh 2015-10-13 10:02:54 UTC
I am on RHGS 3.1.1(based on RHEL 7.1),

a. the firewall settings are already done,
# firewall-cmd --zone=public --list-service
dhcpv6-client glusterfs high-availability mountd nfs rpc-bind ssh

b. the client and server are able to ping each other using the hostname,
# ping nfs11
PING nfs11 (10.70.44.92) 56(84) bytes of data.
64 bytes from nfs11 (10.70.44.92): icmp_seq=1 ttl=63 time=0.455 ms

c. still, the vers=3 lock test of cthon does not move forward,

# time ./server -l -o vers=3 -p /vol1 -m /mnt -N 3 nfs11
sh ./runtests -N 3 -l  /mnt/rhsauto003.test
... Pass 1 ...

Starting LOCKING tests: test directory /mnt/rhsauto003.test (arg: /mnt/rhsauto003.test)

Testing native post-LFS locking

Creating parent/child synchronization pipes.

Test #1 - Test regions of an unlocked file.


from client,
# ps -auxww | grep tlock
root     31944  0.0  0.0   6524   512 pts/2    D+   09:54   0:00 tlocklfs /mnt/rhsauto003.test
root     31945  0.0  0.0   6524   196 pts/2    S+   09:54   0:00 tlocklfs /mnt/rhsauto003.test

Comment 25 Saurabh 2015-10-13 10:44:24 UTC
Created attachment 1082326 [details]
packet trace from nfs11

Comment 26 Niels de Vos 2015-11-09 11:18:37 UTC
Created attachment 1091691 [details]
NFS client conversation while connecting to the NLM service

It seems that the NFS-server does not allow the NFS-client to connect to the NLM service.

The attached text file contains the following conversation:

   NFS-client: portmap/getport NLM
   NFS-server: NLM is on port 58575
   NFS-client: TCP connect on port 58575
   NFS-server: return "Host administratively prohibited"

This suggests that the firewalld configuration is not sufficient. The NLM service needs to be opened up in the firewall too. Unfortunately there is no standard port for NLM to listen on. You might want to configure an NLM port in /etc/ganesha/ganesha.conf...

http://thread.gmane.org/gmane.linux.nfs/73398 contains the upstream discussion about this. We should probably follow up on that.
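A sketch of what pinning the NLM port could look like, assuming the NFS_Core_Param block and its NLM_Port option are available in this nfs-ganesha version; the port number 32803 is an arbitrary choice for illustration, not a value from this bug:

```
# /etc/ganesha/ganesha.conf -- hypothetical fragment.
# Pin NLM to a fixed port so it can be opened in the firewall.
NFS_Core_Param {
        NLM_Port = 32803;
}
```

With a fixed port, the matching firewall rule would be something along the lines of `firewall-cmd --permanent --add-port=32803/tcp` (plus the udp variant), followed by a firewalld reload and an nfs-ganesha restart.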

Comment 27 Niels de Vos 2015-11-09 11:20:12 UTC
Saurabh, could you test again with the ports opened, or with firewall disabled?

Comment 29 Saurabh 2015-11-17 13:11:51 UTC
Yeah, the UNLOCK works fine.

I tested it with vers=3 and cthon lock test while keeping firewalld disabled.

Comment 31 Saurabh 2015-11-23 07:16:15 UTC
After opening the ports for NLM, the cthon lock tests are working.

Comment 35 errata-xmlrpc 2016-03-01 05:34:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html

