Bug 1027410
| Summary: | netfs unmount/self_fence integration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jan Kurik <jkurik> |
| Component: | resource-agents | Assignee: | David Vossel <dvossel> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.6 | CC: | agk, cluster-maint, djansa, dvossel, fdinitto, jcastillo, jharriga, jruemker, jsvarova, luvilla, michele, mnovacek, pm-eus |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | 0day | | |
| Fixed In Version: | resource-agents-3.9.2-40.el6_5.3 | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, the netfs agent could hang during a stop operation even with the self_fence option enabled. With this update, the self-fence check is executed earlier in the stop sequence, which ensures that the NFS client detects that the server has gone away when the unmount cannot succeed, and the node self-fences as expected. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-11-22 00:28:06 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1024065, 1031636, 1031641 | | |
| Bug Blocks: | | | |
Description
Jan Kurik
2013-11-06 18:47:10 UTC
Comment 5
michal novacek
It does not seem to work for me with resource-agents-3.9.2-40.el6_5.2.x86_64.
Here is what I did:
* I configured an NFS server outside of the cluster (virt-065).
* I defined an rgmanager service to mount this exported directory to
/exports/1.
virt-066# cat /etc/cluster/cluster.conf
...
<resources>
<netfs
name="le-nfs-mount1"
host="virt-065"
export="/exports"
mountpoint="/exports/1"
force_unmount="1"
self_fence="1"
/>
</resources>
<service domain="le-domain" name="nfs-mount" recovery="relocate">
<netfs ref="le-nfs-mount1"/>
</service>
...
virt-066# ccs -h localhost --lsservices
service: name=nfs-mount, domain=le-domain, recovery=relocate
netfs: ref=le-nfs-mount1
resources:
netfs: name=le-nfs-mount1, self_fence=1, force_unmount=1, \
host=virt-065, export=/exports, \
mountpoint=/exports/1
* I started the service and checked that it mounted
virt-066# mount | grep /exports
virt-065:/exports on /exports/1 type nfs \
(rw,sync,soft,noac,vers=4,addr=10.34.71.65,clientaddr=10.34.71.66)
* I blocked all the traffic from and to cluster nodes with iptables.
virt-060# date
Fri Nov 15 17:57:20 CET 2013
virt-060# for a in 66 67 68; do \
iptables -I INPUT -s 10.34.71.$a -j DROP; \
iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
done
* I issued a relocate command to see whether the resource agent correctly
unmounts the directory.
virt-066# clusvcadm -r nfs-mount
Trying to relocate service:nfs-mount...
[never finishes]
virt-066# tail -f /var/log/messages
...
Nov 15 18:01:20 virt-066 kernel: nfs: server virt-065 not responding, timed out
Nov 15 18:01:20 virt-066 rgmanager[1096]: [netfs] netfs:le-nfs-mount1: virt-065:/exports is not mounted on /exports/1
Nov 15 18:01:20 virt-066 rgmanager[2376]: status on netfs "le-nfs-mount1" returned 7 (unspecified)
Nov 15 18:01:20 virt-066 rgmanager[2376]: Stopping service service:nfs-mount
Nov 15 18:06:53 virt-066 kernel: nfs: server virt-065 not responding, timed out
virt-066# date
Fri Nov 15 18:06:52 CET 2013
virt-066# clustat
Cluster Status for STSRHTS2638 @ Fri Nov 15 18:07:24 2013
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
virt-066 1 Online, Local, rgmanager
virt-067 2 Online, rgmanager
virt-068 3 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:nfs-mount (virt-067) recoverable
No reboot occurred and the relocate command never finished.
It seems to me that the problem is that the netfs resource agent thinks that the filesystem is not mounted.
Fencing itself works: when I manually issued 'fence_node virt-067' from virt-066,
the node got fenced.
Please advise on whether this is a configuration issue on my side or FAIL_QA.
(In reply to michal novacek from comment #5)

> <netfs
>     name="le-nfs-mount1"
>     host="virt-065"
>     export="/exports"
>     mountpoint="/exports/1"
>     force_unmount="1"
>     self_fence="1"
> />

Let's make sure we are using NFSv3. There is a known kernel bug for v4 that we can't work around. Once that is fixed and zstreamed, we can test v4 too.

> virt-060# for a in 66 67 68; do \
>     iptables -I INPUT -s 10.34.71.$a -j DROP; \
>     iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
>     done

This is neither a cluster node nor the NFS server? I am not sure where virt-060 comes into the game here vs. routing.

> virt-066# clusvcadm -r nfs-mount
> Trying to relocate service:nfs-mount...
> [never finishes]

The use case we are covering here: the mount is on a cluster node and the server goes away; the netfs agent should detect that the server is gone, try to stop the mount, recognize that the server is unreachable, and self_fence the node.

> Nov 15 18:01:20 virt-066 rgmanager[2376]: Stopping service service:nfs-mount
> Nov 15 18:06:53 virt-066 kernel: nfs: server virt-065 not responding, timed out

I'll need to see more logs around the event, or give me access to the nodes and we can simulate/test together.
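To make the intended stop ordering concrete, below is a minimal sketch of that path as a standalone shell script. This is an illustration only, not the shipped netfs agent code; the server name, mountpoint, and the use of rpcinfo as the liveness probe are assumptions taken from the examples in this report.

#!/bin/sh
# Sketch of a netfs-style stop path with self_fence semantics.
# Hypothetical standalone script, not the actual resource agent.

NFS_SERVER="virt-065"     # assumed: the server used in this report
MOUNTPOINT="/exports/1"   # assumed: the mountpoint used in this report

# Probe the server's portmapper over TCP *before* trying to unmount,
# so a dead server is detected up front instead of umount hanging.
if ! rpcinfo -t "$NFS_SERVER" portmapper >/dev/null 2>&1; then
    echo "NFS server not responding - REBOOTING" >&2
    reboot -fn    # self-fence: immediate reboot, no sync, no init scripts
    exit 1
fi

# The server answered, so a normal (then forced) unmount should not block.
umount "$MOUNTPOINT" || umount -f "$MOUNTPOINT"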
It could be a combination of things; catch me Monday morning and I am happy to look at it in your setup.

I found the various issues (need to clone also to relevant components).

bug 1: mount.nfs4 adds a trailing / to the device in /proc/mounts

<netfs export="/mnt" force_unmount="1" fstype="nfs" host="192.168.3.1"
    mountpoint="/mnt/nfs" name="nfs"
    options="vers=4,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>

(using fstype="nfs4" and dropping vers=4 from options has the same effect)

nfsv4:

192.168.3.1:/mnt/ /mnt/nfs nfs4 rw,relatime,vers=4,rsize=131072,wsize=131072,namlen=255,soft,proto=tcp,port=0,timeo=100,retrans=1,sec=sys,clientaddr=192.168.2.66,minorversion=0,local_lock=none,addr=192.168.3.1 0 0

nfsv3:

<netfs export="/mnt" force_unmount="1" fstype="nfs" host="192.168.3.1"
    mountpoint="/mnt/nfs" name="nfs"
    options="vers=3,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>

192.168.3.1:/mnt /mnt/nfs nfs rw,sync,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,soft,proto=tcp,timeo=100,retrans=1,sec=sys,mountaddr=192.168.3.1,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.3.1 0 0

bug 2: findmnt is picky about the trailing / (demonstrated after this comment)

bug 3: resource-agents could add/remove the trailing / as necessary.

The workaround is simple:

<netfs export="/mnt/" force_unmount="1" fstype="nfs" host="192.168.3.1"
    mountpoint="/mnt/nfs" name="nfs"
    options="vers=4,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>

that is, adding a / at the end of the export config option. I bet that we see different results from QE and TAM because of the trailing slash in cluster.conf.

Also, the relocation test is not useful for verifying this scenario: the netfs status check will take a long time to detect that the server has gone, the relocation request will be aborted by rgmanager, and that only adds extra detection time before recovery takes place. It's best to just let rgmanager complete the status check.

Verification has been done using the kernel that fixes the EIO error from nfsv4 (Linux rhel6-node2.int.fabbione.net 2.6.32-358.23.2.sfdc00898941.1.el6.x86_64 #1 SMP Tue Nov 12 14:11:07 EST 2013 x86_64 x86_64 x86_64 GNU/Linux):

<netfs export="/mnt/" force_unmount="1" fstype="nfs4" host="192.168.3.1"
    mountpoint="/mnt/nfs" name="nfs"
    options="rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>

The mount point is nfsv4 with the workaround. Closing down access to the NFS server with iptables:

Nov 17 16:27:52 rgmanager [netfs] Checking fs "nfs", Level 0
Nov 17 16:29:10 rgmanager [netfs] netfs:nfs: is_alive: /mnt/nfs is not a directory
Nov 17 16:29:10 rgmanager [netfs] fs:nfs: Mount point is not accessible!
Nov 17 16:29:10 rgmanager status on netfs "nfs" returned 1 (generic error)
Nov 17 16:29:10 rgmanager Stopping service service:notoporcodi3lettere
Nov 17 16:29:43 rgmanager [netfs] pre unmount: checking if nfs server 192.168.3.1 is alive
Nov 17 16:29:43 rgmanager [netfs] Testing generic rpc access on server 192.168.3.1 with protocol tcp
Nov 17 16:30:46 rgmanager [netfs] RPC server on 192.168.3.1 with tcp is not responding
Nov 17 16:30:46 rgmanager [netfs] NFS server not responding - REBOOTING
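The findmnt behavior from bug 2 above is easy to reproduce by hand. A minimal illustration, assuming the nfsv4 mount from that comment and the util-linux findmnt:

# /proc/mounts records the nfsv4 source with a trailing slash,
# so only the exact string matches:
findmnt 192.168.3.1:/mnt      # no match: the entry is "192.168.3.1:/mnt/"
findmnt 192.168.3.1:/mnt/     # matches the nfs4 entry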
(In reply to Fabio Massimo Di Nitto from comment #8)

> bug 1: mount.nfs4 adds a trailing / to the device in /proc/mounts

filed as bug#1031636

> bug 2: findmnt is picky about the trailing /

filed as bug#1031641

> bug 3: resource-agents could add/remove the trailing / as necessary.

https://github.com/fabbione/resource-agents/tree/mountfixes
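For bug 3, the agent-side fix amounts to normalizing the trailing slash before comparing the configured export against what /proc/mounts reports. A sketch of that idea in shell, as a hypothetical helper rather than the actual code from the mountfixes branch:

# Strip a single trailing slash so "/mnt" and "/mnt/" compare equal;
# leave the root path "/" untouched.
normalize_export() {
    case "$1" in
        /)  printf '%s\n' "/" ;;
        */) printf '%s\n' "${1%/}" ;;
        *)  printf '%s\n' "$1" ;;
    esac
}

normalize_export "/mnt/"   # prints /mnt
normalize_export "/mnt"    # prints /mnt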
I have verified that the node self-fences when it loses the connection with the
NFS server, for both NFS version 3 and version 4 mounts.
---
virt-067$ rpm -q resource-agents
resource-agents-3.9.2-40.el6_5.3.x86_64
virt-067$ cat /etc/cluster/cluster.conf
...
<resources>
<netfs
name="le-nfs-mount1"
host="virt-065"
export="/exports"
mountpoint="/exports/1"
force_unmount="1"
self_fence="1"
/>
</resources>
<service domain="le-domain" name="nfs-mount" recovery="relocate">
<netfs ref="le-nfs-mount1"/>
</service>
...
virt-067$ ccs -h localhost --lsservices
service: name=nfs-mount, domain=le-domain, recovery=relocate
netfs: ref=le-nfs-mount
resources:
netfs: name=le-nfs-mount, self_fence=1, force_unmount=1, host=virt-065, \
export=/exports, mountpoint=/exports/1
nfs version 3:
==============
virt-067$ mount | grep /exports/1
virt-065:/exports on /exports/1 type nfs (rw,sync,vers=3,soft,retrans=1,timeo=100,addr=10.34.71.65)
virt-067$ date
Mon Nov 18 17:36:15 CET 2013
virt-067$ clustat
Cluster Status for STSRHTS2638 @ Mon Nov 18 17:38:04 2013
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
virt-066 1 Online, rgmanager
virt-067 2 Online, Local, rgmanager
virt-068 3 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:nfs-mount virt-067 started
nfs-server# for a in 66 67 68; do \
iptables -I INPUT -s 10.34.71.$a -j DROP; \
iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
done
virt-067$ tail -f /var/log/cluster/rgmanager.log
...
Nov 18 17:39:09 rgmanager [netfs] netfs:le-nfs-mount: is_alive: /exports/1 is not a directory
Nov 18 17:39:09 rgmanager [netfs] fs:le-nfs-mount: Mount point is not accessible!
Nov 18 17:39:09 rgmanager status on netfs "le-nfs-mount" returned 1 (generic error)
Nov 18 17:39:09 rgmanager Stopping service service:nfs-mount
Nov 18 17:39:39 rgmanager [netfs] pre unmount: checking if nfs server virt-065 is alive
Nov 18 17:39:39 rgmanager [netfs] Testing generic rpc access on server virt-065 with protocol tcp
Nov 18 17:40:42 rgmanager [netfs] RPC server on virt-065 with tcp is not responding
Nov 18 17:40:42 rgmanager [netfs] NFS server not responding - REBOOTING
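The "Testing generic rpc access ... with protocol tcp" step above can be approximated from a shell prompt. A rough manual equivalent, assuming rpcinfo is installed (the agent's exact invocation may differ):

virt-067$ rpcinfo -t virt-065 portmapper \
    || echo "RPC server on virt-065 with tcp is not responding"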
nfs version 4
=============
virt-067$ mount | grep /exports/1
virt-065:/exports on /exports/1 type nfs (rw,sync,vers=4,soft,retrans=1,timeo=100,addr=10.34.71.65,clientaddr=10.34.71.67)
(blocked traffic to the NFS server with iptables)
virt-067$ tail -f /var/log/cluster/rgmanager.log
...
Nov 18 17:56:21 rgmanager [netfs] Checking fs "le-nfs-mount", Level 0
Nov 18 17:57:21 rgmanager [netfs] Checking fs "le-nfs-mount", Level 0
Nov 18 17:58:39 rgmanager [netfs] netfs:le-nfs-mount: is_alive: /exports/1 is not a directory
Nov 18 17:58:39 rgmanager [netfs] fs:le-nfs-mount: Mount point is not accessible!
Nov 18 17:58:39 rgmanager status on netfs "le-nfs-mount" returned 1 (generic error)
Nov 18 17:58:39 rgmanager Stopping service service:nfs-mount
Nov 18 17:59:12 rgmanager [netfs] pre unmount: checking if nfs server virt-065 is alive
Nov 18 17:59:12 rgmanager [netfs] Testing generic rpc access on server virt-065 with protocol tcp
Nov 18 18:00:15 rgmanager [netfs] RPC server on virt-065 with tcp is not responding
Nov 18 18:00:15 rgmanager [netfs] NFS server not responding - REBOOTING
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHBA-2013-1746.html