Bug 1027410
| Summary: | netfs unmount/self_fence integration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jan Kurik <jkurik> |
| Component: | resource-agents | Assignee: | David Vossel <dvossel> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.6 | CC: | agk, cluster-maint, djansa, dvossel, fdinitto, jcastillo, jharriga, jruemker, jsvarova, luvilla, michele, mnovacek, pm-eus |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | 0day | | |
| Fixed In Version: | resource-agents-3.9.2-40.el6_5.3 | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, the netfs agent could hang during a stop operation even with the self_fence option enabled. With this update, the self-fence check is executed earlier in the stop sequence, which ensures that the NFS client detects that the server has gone away when the unmount cannot succeed, and the node self-fences as expected. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-11-22 00:28:06 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1024065, 1031636, 1031641 | | |
| Bug Blocks: | | | |
Description
Jan Kurik
2013-11-06 18:47:10 UTC
Comment 5
michal novacek
It does not seem to work for me with resource-agents-3.9.2-40.el6_5.2.x86_64.
Here is what I did:
* I configured an NFS server outside of the cluster (virt-065).
* I defined an rgmanager service to mount this exported directory to
/exports/1.
virt-066# cat /etc/cluster/cluster.conf
...
<resources>
<netfs
name="le-nfs-mount1"
host="virt-065"
export="/exports"
mountpoint="/exports/1"
force_unmount="1"
self_fence="1"
/>
</resources>
<service domain="le-domain" name="nfs-mount" recovery="relocate">
<netfs ref="le-nfs-mount1"/>
</service>
...
virt-066# ccs -h localhost --lsservices
service: name=nfs-mount, domain=le-domain, recovery=relocate
netfs: ref=le-nfs-mount1
resources:
netfs: name=le-nfs-mount1, self_fence=1, force_unmount=1, \
host=virt-065, export=/exports, \
mountpoint=/exports/1
* I started the service and checked that it mounted
virt-066# mount | grep /exports
virt-065:/exports on /exports/1 type nfs \
(rw,sync,soft,noac,vers=4,addr=10.34.71.65,clientaddr=10.34.71.66)
* I blocked all the traffic from and to cluster nodes with iptables.
virt-060# date
Fri Nov 15 17:57:20 CET 2013
virt-060# for a in 66 67 68; do \
iptables -I INPUT -s 10.34.71.$a -j DROP; \
iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
done
* I issued a relocate command to see whether the resource agent correctly
unmounts the directory.
virt-066# clusvcadm -r nfs-mount
Trying to relocate service:nfs-mount...
[never finishes]
virt-066# tail -f /var/log/messages
...
Nov 15 18:01:20 virt-066 kernel: nfs: server virt-065 not responding, timed out
Nov 15 18:01:20 virt-066 rgmanager[1096]: [netfs] netfs:le-nfs-mount1: virt-065:/exports is not mounted on /exports/1
Nov 15 18:01:20 virt-066 rgmanager[2376]: status on netfs "le-nfs-mount1" returned 7 (unspecified)
Nov 15 18:01:20 virt-066 rgmanager[2376]: Stopping service service:nfs-mount
Nov 15 18:06:53 virt-066 kernel: nfs: server virt-065 not responding, timed out
virt-066# date
Fri Nov 15 18:06:52 CET 2013
virt-066# clustat
Cluster Status for STSRHTS2638 @ Fri Nov 15 18:07:24 2013
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
virt-066 1 Online, Local, rgmanager
virt-067 2 Online, rgmanager
virt-068 3 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:nfs-mount (virt-067) recoverable
No reboot occurred and the relocate command never finished.
It seems to me that the problem is that the netfs resource agent thinks that the filesystem is not mounted.
Fencing itself works: when I manually issued 'fence_node virt-067' from virt-066,
the node got fenced.
Please advise on whether this is a configuration issue on my side or FAIL_QA.
(In reply to michal novacek from comment #5)

> <netfs
>     name="le-nfs-mount1"
>     host="virt-065"
>     export="/exports"
>     mountpoint="/exports/1"
>     force_unmount="1"
>     self_fence="1"
> />

Let's make sure we are using NFSv3. There is a known kernel bug for v4 that we can't work around. Once that is fixed and zstreamed, we can test v4 too.

> virt-060# for a in 66 67 68; do \
>     iptables -I INPUT -s 10.34.71.$a -j DROP; \
>     iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
>     done

This is neither a cluster node nor the NFS server? I am not sure where virt-060 comes into the game here vs. routing.

> virt-066# clusvcadm -r nfs-mount
> Trying to relocate service:nfs-mount...
> [never finishes]

The use case we are covering here: the mount is on a cluster node and the server goes away; the netfs agent should detect that the server is gone, try to stop the mount, recognize that the server is unreachable, and self_fence the node.

> Nov 15 18:01:20 virt-066 rgmanager[2376]: Stopping service service:nfs-mount
> Nov 15 18:06:53 virt-066 kernel: nfs: server virt-065 not responding, timed out

I'll need to see more logs around the event, or give me access to the nodes and we can simulate/test together.
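To make the intended stop ordering concrete, below is a minimal sketch of that path as a standalone shell script. This is an illustration only, not the shipped netfs agent code; the server name, mountpoint, and the use of rpcinfo as the liveness probe are assumptions taken from the examples in this report.

#!/bin/sh
# Sketch of a netfs-style stop path with self_fence semantics.
# Hypothetical standalone script, not the actual resource agent.

NFS_SERVER="virt-065"     # assumed: the server used in this report
MOUNTPOINT="/exports/1"   # assumed: the mountpoint used in this report

# Probe the server's portmapper over TCP *before* trying to unmount,
# so a dead server is detected up front instead of umount hanging.
if ! rpcinfo -t "$NFS_SERVER" portmapper >/dev/null 2>&1; then
    echo "NFS server not responding - REBOOTING" >&2
    reboot -fn    # self-fence: immediate reboot, no sync, no init scripts
    exit 1
fi

# The server answered, so a normal (then forced) unmount should not block.
umount "$MOUNTPOINT" || umount -f "$MOUNTPOINT"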
It could be a combination of things; catch me Monday morning and I am happy to look at it in your setup.

I found the various issues (need to clone also to relevant components).

bug 1: mount.nfs4 adds a trailing / to the device in /proc/mounts

<netfs export="/mnt" force_unmount="1" fstype="nfs" host="192.168.3.1"
    mountpoint="/mnt/nfs" name="nfs"
    options="vers=4,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>

(using fstype="nfs4" and dropping vers=4 from options has the same effect)

nfsv4:

192.168.3.1:/mnt/ /mnt/nfs nfs4 rw,relatime,vers=4,rsize=131072,wsize=131072,namlen=255,soft,proto=tcp,port=0,timeo=100,retrans=1,sec=sys,clientaddr=192.168.2.66,minorversion=0,local_lock=none,addr=192.168.3.1 0 0

nfsv3:

<netfs export="/mnt" force_unmount="1" fstype="nfs" host="192.168.3.1"
    mountpoint="/mnt/nfs" name="nfs"
    options="vers=3,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>

192.168.3.1:/mnt /mnt/nfs nfs rw,sync,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,soft,proto=tcp,timeo=100,retrans=1,sec=sys,mountaddr=192.168.3.1,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.3.1 0 0

bug 2: findmnt is picky about the trailing / (demonstrated after this comment)

bug 3: resource-agents could add/remove the trailing / as necessary.

The workaround is simple:

<netfs export="/mnt/" force_unmount="1" fstype="nfs" host="192.168.3.1"
    mountpoint="/mnt/nfs" name="nfs"
    options="vers=4,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>

that is, adding a / at the end of the export config option. I bet that we see different results from QE and TAM because of the trailing slash in cluster.conf.

Also, the relocation test is not useful for verifying this scenario: the netfs status check will take a long time to detect that the server has gone, the relocation request will be aborted by rgmanager, and that only adds extra detection time before recovery takes place. It's best to just let rgmanager complete the status check.

Verification has been done using the kernel that fixes the EIO error from nfsv4 (Linux rhel6-node2.int.fabbione.net 2.6.32-358.23.2.sfdc00898941.1.el6.x86_64 #1 SMP Tue Nov 12 14:11:07 EST 2013 x86_64 x86_64 x86_64 GNU/Linux):

<netfs export="/mnt/" force_unmount="1" fstype="nfs4" host="192.168.3.1"
    mountpoint="/mnt/nfs" name="nfs"
    options="rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>

The mount point is nfsv4 with the workaround. Closing down access to the NFS server with iptables:

Nov 17 16:27:52 rgmanager [netfs] Checking fs "nfs", Level 0
Nov 17 16:29:10 rgmanager [netfs] netfs:nfs: is_alive: /mnt/nfs is not a directory
Nov 17 16:29:10 rgmanager [netfs] fs:nfs: Mount point is not accessible!
Nov 17 16:29:10 rgmanager status on netfs "nfs" returned 1 (generic error)
Nov 17 16:29:10 rgmanager Stopping service service:notoporcodi3lettere
Nov 17 16:29:43 rgmanager [netfs] pre unmount: checking if nfs server 192.168.3.1 is alive
Nov 17 16:29:43 rgmanager [netfs] Testing generic rpc access on server 192.168.3.1 with protocol tcp
Nov 17 16:30:46 rgmanager [netfs] RPC server on 192.168.3.1 with tcp is not responding
Nov 17 16:30:46 rgmanager [netfs] NFS server not responding - REBOOTING
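The findmnt behavior from bug 2 above is easy to reproduce by hand. A minimal illustration, assuming the nfsv4 mount from that comment and the util-linux findmnt:

# /proc/mounts records the nfsv4 source with a trailing slash,
# so only the exact string matches:
findmnt 192.168.3.1:/mnt      # no match: the entry is "192.168.3.1:/mnt/"
findmnt 192.168.3.1:/mnt/     # matches the nfs4 entry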
(In reply to Fabio Massimo Di Nitto from comment #8)

> bug 1: mount.nfs4 adds a trailing / to the device in /proc/mounts

filed as bug#1031636

> bug 2: findmnt is picky about the trailing /

filed as bug#1031641

> bug 3: resource-agents could add/remove the trailing / as necessary.

https://github.com/fabbione/resource-agents/tree/mountfixes
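For bug 3, the agent-side fix amounts to normalizing the trailing slash before comparing the configured export against what /proc/mounts reports. A sketch of that idea in shell, as a hypothetical helper rather than the actual code from the mountfixes branch:

# Strip a single trailing slash so "/mnt" and "/mnt/" compare equal;
# leave the root path "/" untouched.
normalize_export() {
    case "$1" in
        /)  printf '%s\n' "/" ;;
        */) printf '%s\n' "${1%/}" ;;
        *)  printf '%s\n' "$1" ;;
    esac
}

normalize_export "/mnt/"   # prints /mnt
normalize_export "/mnt"    # prints /mnt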
I have verified that the node self-fences when it loses the connection with the
NFS server, for both NFS version 3 and version 4 mounts.
---
virt-067$ rpm -q resource-agents
resource-agents-3.9.2-40.el6_5.3.x86_64
virt-067$ cat /etc/cluster/cluster.conf
...
<resources>
<netfs
name="le-nfs-mount1"
host="virt-065"
export="/exports"
mountpoint="/exports/1"
force_unmount="1"
self_fence="1"
/>
</resources>
<service domain="le-domain" name="nfs-mount" recovery="relocate">
<netfs ref="le-nfs-mount1"/>
</service>
...
virt-067$ ccs -h localhost --lsservices
service: name=nfs-mount, domain=le-domain, recovery=relocate
netfs: ref=le-nfs-mount
resources:
netfs: name=le-nfs-mount, self_fence=1, force_unmount=1, host=virt-065, \
export=/exports, mountpoint=/exports/1
nfs version 3:
==============
virt-067$ mount | grep /exports/1
virt-065:/exports on /exports/1 type nfs (rw,sync,vers=3,soft,retrans=1,timeo=100,addr=10.34.71.65)
virt-067$ date
Mon Nov 18 17:36:15 CET 2013
virt-067$ clustat
Cluster Status for STSRHTS2638 @ Mon Nov 18 17:38:04 2013
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
virt-066 1 Online, rgmanager
virt-067 2 Online, Local, rgmanager
virt-068 3 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:nfs-mount virt-067 started
nfs-server# for a in 66 67 68; do \
iptables -I INPUT -s 10.34.71.$a -j DROP; \
iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
done
virt-067$ tail -f /var/log/cluster/rgmanager.log
...
Nov 18 17:39:09 rgmanager [netfs] netfs:le-nfs-mount: is_alive: /exports/1 is not a directory
Nov 18 17:39:09 rgmanager [netfs] fs:le-nfs-mount: Mount point is not accessible!
Nov 18 17:39:09 rgmanager status on netfs "le-nfs-mount" returned 1 (generic error)
Nov 18 17:39:09 rgmanager Stopping service service:nfs-mount
Nov 18 17:39:39 rgmanager [netfs] pre unmount: checking if nfs server virt-065 is alive
Nov 18 17:39:39 rgmanager [netfs] Testing generic rpc access on server virt-065 with protocol tcp
Nov 18 17:40:42 rgmanager [netfs] RPC server on virt-065 with tcp is not responding
Nov 18 17:40:42 rgmanager [netfs] NFS server not responding - REBOOTING
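The "Testing generic rpc access ... with protocol tcp" step above can be approximated from a shell prompt. A rough manual equivalent, assuming rpcinfo is installed (the agent's exact invocation may differ):

virt-067$ rpcinfo -t virt-065 portmapper \
    || echo "RPC server on virt-065 with tcp is not responding"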
nfs version 4
=============
virt-067$ mount | grep /exports/1
virt-065:/exports on /exports/1 type nfs (rw,sync,vers=4,soft,retrans=1,timeo=100,addr=10.34.71.65,clientaddr=10.34.71.67)
(blocked traffic to the NFS server with iptables)
virt-067$ tail -f /var/log/cluster/rgmanager.log
...
Nov 18 17:56:21 rgmanager [netfs] Checking fs "le-nfs-mount", Level 0
Nov 18 17:57:21 rgmanager [netfs] Checking fs "le-nfs-mount", Level 0
Nov 18 17:58:39 rgmanager [netfs] netfs:le-nfs-mount: is_alive: /exports/1 is not a directory
Nov 18 17:58:39 rgmanager [netfs] fs:le-nfs-mount: Mount point is not accessible!
Nov 18 17:58:39 rgmanager status on netfs "le-nfs-mount" returned 1 (generic error)
Nov 18 17:58:39 rgmanager Stopping service service:nfs-mount
Nov 18 17:59:12 rgmanager [netfs] pre unmount: checking if nfs server virt-065 is alive
Nov 18 17:59:12 rgmanager [netfs] Testing generic rpc access on server virt-065 with protocol tcp
Nov 18 18:00:15 rgmanager [netfs] RPC server on virt-065 with tcp is not responding
Nov 18 18:00:15 rgmanager [netfs] NFS server not responding - REBOOTING
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
http://rhn.redhat.com/errata/RHBA-2013-1746.html