Bug 1332129 - nfs-ganesha might fail to come up with "Error binding to V6 interface"
Summary: nfs-ganesha might fail to come up with "Error binding to V6 interface"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: doc-Administration_Guide
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 4
Assignee: Pratik Mulay
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On:
Blocks: 1369781 1477507 1477511 1657798 1672843
 
Reported: 2016-05-02 10:11 UTC by Shashank Raj
Modified: 2023-09-14 03:21 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1477507 1477511
Environment:
Last Closed: 2019-06-03 05:11:45 UTC
Embargoed:


Attachments:

Description Shashank Raj 2016-05-02 10:11:01 UTC
Description of problem:

Sometimes (mostly seen after a reboot) the nfs-ganesha service might fail to come up, with "Error binding to V6 interface. Cannot continue" messages in /var/log/ganesha.log.

Version-Release number of selected component (if applicable):

nfs-ganesha-2.3.1-4

How reproducible:

intermittent

Steps to Reproduce:

1. Create a 4-node cluster and configure ganesha on it.
2. Create a tiered volume, enable quota, attach the tier, and enable ganesha on it (see the command sketch after step 7).
3. Mount the volume with vers=3 on a client.
4. Start creating I/O from the mount point.
5. Reboot the mounted node.
6. Once the node comes back up, the following error messages are observed in /var/log/ganesha.log:

01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use)
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] unregister_fsal :FSAL :CRIT :Unregister FSAL GLUSTER with non-zero refcount=1
01/05/2016 12:57:16 : epoch 5725afd3 : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-4839[main] glusterfs_unload :FSAL :CRIT :FSAL Gluster unable to unload.  Dying ...

7. Restarting the nfs-ganesha service on this node fails every time after that.
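
(For reference, a rough sketch of the commands behind steps 2-4; the volume name, VIP, and file names here are hypothetical, and the exact tiering syntax may differ between glusterfs versions:)

# step 2: quota and ganesha export on the (already created) tiered volume
gluster volume quota tiervolume enable
gluster volume set tiervolume ganesha.enable on
# step 3: on the client, mount over NFSv3 via one of the cluster VIPs
mount -t nfs -o vers=3 10.70.37.200:/tiervolume /mnt/tiervolume
# step 4: generate I/O from the mount point
dd if=/dev/zero of=/mnt/tiervolume/file1 bs=1M count=1024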

Actual results:

nfs-ganesha might fail to come up with "Error binding to V6 interface".

Expected results:

The nfs-ganesha service should not fail to restart.

Additional info:

Below are the comments from the Dev team on this bug:

From ganesha.log,


02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use)
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] unregister_fsal :FSAL :CRIT :Unregister FSAL GLUSTER with non-zero refcount=1
02/05/2016 03:17:11 : epoch 5726795d : dhcp37-180.lab.eng.blr.redhat.com : ganesha.nfsd-13892[main] glusterfs_unload :FSAL :CRIT :FSAL Gluster unable to unload.  Dying ...



The RQuota port (875) was already in use by another process.

[root@dhcp37-180 ~]#  netstat -ntaunlp | grep 875
tcp        0      0 10.70.37.180:875        10.70.37.127:24007      TIME_WAIT   - 

The output does not list the PID of any process using this port. It looks like the port was being used by a process that is or was connected to the glusterd port (24007), but it seems strange that its PID is not shown in the netstat output above. (A socket in TIME_WAIT no longer belongs to any process, which is why netstat shows no PID for it.)


However, after configuring a different port for RQuota in '/etc/ganesha/ganesha.conf', the nfs-ganesha process started successfully.



NFS_Core_Param {
        #Use supplied name other than IP in NSM operations
        NSM_Use_Caller_Name = true;
        #Copy lock states into "/var/lib/nfs/ganesha" dir
        Clustered = false;
        #By default port number '2049' is used for NFS service.
        #Configure ports for MNT, NLM, RQuota services.
        #The ports chosen here are from '/etc/sysconfig/nfs'
        MNT_Port = 20048;
        NLM_Port = 32803;
        Rquota_Port = 8750;
}

%include "/etc/ganesha/exports/export.tiervolume.conf"
[root@dhcp37-180 ~]# 


[root@dhcp37-180 ~]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)
[root@dhcp37-180 ~]# 
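
(A quick sanity check, assuming the Rquota_Port = 8750 value from the snippet above, to confirm ganesha is actually listening on the new port; a sketch only:)

# list listening TCP sockets owned by ganesha.nfsd (should include 2049, 20048, 32803 and 8750)
ss -tlnp | grep ganesha
# or, with netstat
netstat -ntlp | grep ganesha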

This seems like a known issue. Since the ports we use for NLM/RQuota are not registered, we can occasionally run into these issues when another process is already using them. We need to document that, in such cases, a different port should be configured, that port opened via firewalld, and then nfs-ganesha started.
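
A minimal sketch of that workaround, assuming RHEL 7 with firewalld, the default public zone, and the Rquota_Port = 8750 value shown above (adjust port and zone for your setup):

# open the newly chosen RQuota port, both at runtime and persistently
firewall-cmd --zone=public --add-port=8750/tcp
firewall-cmd --zone=public --add-port=8750/tcp --permanent
# then (re)start the service and check that it stays up
systemctl restart nfs-ganesha
systemctl status nfs-ganesha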

Comment 2 Niels de Vos 2016-05-02 12:02:07 UTC
When this happens, make sure to have "netstat" (from the net-tools package) installed before generating the sosreport. With the details from netstat we might be able to identify a pattern.
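
For reference, a sketch of what to capture (assuming a RHEL 7 node; the extra netstat dump is just a convenience in case the sosreport plugin misses it):

# make sure netstat is available, grab a socket snapshot, then generate the sosreport
yum install -y net-tools
netstat -ntaunp > /var/tmp/netstat-$(hostname)-$(date +%F).out
sosreport --batch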


from the TCP specs - https://tools.ietf.org/html/rfc793#page-22

TIME-WAIT - represents waiting for enough time to pass to be sure the remote TCP received the acknowledgment of its connection termination request.


There might have been an issue with GlusterD that prevented it from communicating/confirming back that the socket can be closed. It would be helpful to get the status and logs from the system whose IP is listed in the TIME_WAIT connection entry from "netstat".
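
A quick way to see which remote host holds the other end of such an entry (a sketch; the :875 filter matches the RQuota port from this bug):

# the foreign-address column shows the peer of the TIME_WAIT connection (no PID is expected here)
netstat -ntau | grep ':875 '
# equivalent with ss, filtering on the TIME-WAIT state and local port 875
ss -tan state time-wait '( sport = :875 )'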

Comment 3 Shashank Raj 2016-05-13 13:28:30 UTC
We are hitting this issue more frequently now, where the nfs-ganesha service fails to come up with the following messages:

13/05/2016 23:57:50 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-16417[main] Bind_sockets_V6 :DISP :WARN :Cannot bind RQUOTA tcp6 socket, error 98 (Address already in use)
13/05/2016 23:57:50 : epoch c8c00000 : dhcp42-20.lab.eng.blr.redhat.com : ganesha.nfsd-16417[main] Bind_sockets :DISP :FATAL :Error binding to V6 interface. Cannot continue.

In one scenario we observed that port 875, which we configure for RQuota, was being used by the self-heal daemon (shd) process:

[root@dhcp42-20 ~]# netstat -ntaun | grep 875
tcp        0      0 10.70.42.20:49212       10.70.43.175:875        ESTABLISHED
[root@dhcp42-20 ~]# gluster v status | grep 49212
cks/brick5/nfsvol4_brick0                   49212     0          Y       16249


[root@dhcp43-175 ~]# netstat -ntaunp | grep 875
tcp        0      0 10.70.43.175:875        10.70.42.20:49212       ESTABLISHED 7049/glusterfs      
[root@dhcp43-175 ~]# ps aux|grep glusterfs
root      7049  0.0  0.8 1115812 65480 ?       Ssl  May13   0:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/971aa92e442920e8802d63fd4bd001a5.socket --xlator-option *replicate*.node-uuid=f63c454f-b24e-49f9-b65f-dee681762100

Since we have moved this bug to 3.2.0, we need to propose this for documentation in 3.1.3.

@Niels

Please let us know your thoughts on this.

Comment 4 Niels de Vos 2016-05-19 14:22:20 UTC
I thought clients no longer bind to privileged (< 1024) ports by default. But if shd is often occupying port 875, perhaps this has been reverted?

(Older) Gluster clients start binding from port 1024 and iterate downwards. There must have been quite a few Gluster clients running if even port 875 was already in use.

Can you check what the other ports between 875 and 1024 are being used for? The option "client.bind-insecure" is expected to be enabled by default, and SHD is expected to use a much higher port number.
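
A rough way to check this (a sketch; <VOLNAME> is a placeholder, and "gluster volume get" assumes a glusterfs version that supports it):

# snapshot the socket table once, then look for anything bound to ports 875-1024
netstat -ntaunp > /tmp/ports.out
for p in $(seq 875 1024); do grep -w ":$p" /tmp/ports.out; done
# check whether clients are allowed to bind to unprivileged ports
gluster volume get <VOLNAME> client.bind-insecure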

Comment 5 Niels de Vos 2016-09-29 13:37:51 UTC
The changes that come in with glusterfs-3.8.x should prevent clients from using ports < 1024. Please re-test with the latest RHGS-3.2 builds.
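
A quick post-upgrade sanity check (a sketch): list the local source ports of glusterfs/glusterfsd TCP connections and confirm none of them fall below 1024.

# the fourth column of netstat -ntp is the local address; strip it down to the port number
netstat -ntp | grep glusterfs | awk '{print $4}' | awk -F: '{print $NF}' | sort -n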

Thanks!

Comment 6 Shashank Raj 2016-10-19 09:42:34 UTC
I haven't hit this issue so far during downstream testing of the 3.2 build below:

[root@dhcp43-110 exports]# rpm -qa|grep ganesha
nfs-ganesha-debuginfo-2.4.0-2.el7rhgs.x86_64
nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.0-2.el7rhgs.x86_64

Will keep this bug open and update if I see it again.

Comment 18 Manisha Saini 2017-08-02 09:09:38 UTC
This is applicable to all of 3.1.3, 3.2, and 3.3.

Comment 25 Jilju Joy 2019-03-08 11:59:43 UTC
The change requested in this bug is already present in 
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/administration_guide/#ganesha_Troubleshooting_1

Marking this as verified.

Comment 27 Red Hat Bugzilla 2023-09-14 03:21:52 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

