Bug 1027410

| Summary: | netfs unmount/self_fence integration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Jan Kurik <jkurik> |
| Component: | resource-agents | Assignee: | David Vossel <dvossel> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | urgent | Priority: | urgent |
| Version: | 6.6 | CC: | agk, cluster-maint, djansa, dvossel, fdinitto, jcastillo, jharriga, jruemker, jsvarova, luvilla, michele, mnovacek, pm-eus |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | 0day | | |
| Fixed In Version: | resource-agents-3.9.2-40.el6_5.3 | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, the netfs agent could hang during a stop operation even with the self_fence option enabled. With this update, the self-fence operation is executed earlier in the stop process, which ensures that the NFS client detects the server leaving when the unmount cannot succeed, and self-fencing occurs. | | |
| Story Points: | --- | Type: | --- |
| Last Closed: | 2013-11-22 00:28:06 UTC | Regression: | --- |
| Bug Depends On: | 1024065, 1031636, 1031641 | | |
Description (Jan Kurik, 2013-11-06 18:47:10 UTC)
Comment 5 (michal novacek)

It does not seem to work for me with resource-agents-3.9.2-40.el6_5.2.x86_64. Here is what I did:

* I configured an NFS server outside of the cluster (virt-065).
* I defined an rgmanager service to mount the exported directory on /exports/1:

```
virt-066# cat /etc/cluster/cluster.conf
...
<resources>
    <netfs name="le-nfs-mount1" host="virt-065" export="/exports"
           mountpoint="/exports/1" force_unmount="1" self_fence="1"/>
</resources>
<service domain="le-domain" name="nfs-mount" recovery="relocate">
    <netfs ref="le-nfs-mount1"/>
</service>
...

virt-066# ccs -h localhost --lsservices
service: name=nfs-mount, domain=le-domain, recovery=relocate
  netfs: ref=le-nfs-mount1
resources:
  netfs: name=le-nfs-mount1, self_fence=1, force_unmount=1, \
         host=virt-065, export=/exports, mountpoint=/exports/1
```

* I started the service and checked that it mounted:

```
virt-066# mount | grep /exports
virt-065:/exports on /exports/1 type nfs \
    (rw,sync,soft,noac,vers=4,addr=10.34.71.65,clientaddr=10.34.71.66)
```

* I blocked all traffic from and to the cluster nodes with iptables:

```
virt-060# date
Fri Nov 15 17:57:20 CET 2013

virt-060# for a in 66 67 68; do \
    iptables -I INPUT -s 10.34.71.$a -j DROP; \
    iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
done
```

* I issued a relocate command to see whether the resource agent correctly unmounts the directory:

```
virt-066# clusvcadm -r nfs-mount
Trying to relocate service:nfs-mount...
[never finishes]

virt-066# tail -f /var/log/messages
...
Nov 15 18:01:20 virt-066 kernel: nfs: server virt-065 not responding, timed out
Nov 15 18:01:20 virt-066 rgmanager[1096]: [netfs] netfs:le-nfs-mount1: virt-065:/exports is not mounted on /exports/1
Nov 15 18:01:20 virt-066 rgmanager[2376]: status on netfs "le-nfs-mount1" returned 7 (unspecified)
Nov 15 18:01:20 virt-066 rgmanager[2376]: Stopping service service:nfs-mount
Nov 15 18:06:53 virt-066 kernel: nfs: server virt-065 not responding, timed out

virt-066# date
Fri Nov 15 18:06:52 CET 2013

virt-066# clustat
Cluster Status for STSRHTS2638 @ Fri Nov 15 18:07:24 2013
Member Status: Quorate

Member Name        ID   Status
------ ----        --   ------
virt-066            1   Online, Local, rgmanager
virt-067            2   Online, rgmanager
virt-068            3   Online, rgmanager

Service Name         Owner (Last)   State
------- ----         ------ ------  -----
service:nfs-mount    (virt-067)     recoverable
```

No reboot occurred and the relocate command never finished. It seems to me that the problem is that the netfs resource agent thinks the filesystem is not mounted.

Fencing itself works: when I manually issued 'fence_node virt-067' from virt-066, it got fenced.

Please advise on whether this is a configuration issue on my side or FAIL_QA.

Reply (Fabio Massimo Di Nitto)

(In reply to michal novacek from comment #5)

> <netfs name="le-nfs-mount1" host="virt-065" export="/exports"
>        mountpoint="/exports/1" force_unmount="1" self_fence="1"/>

Let's make sure we are using NFSv3. There is a known kernel bug for v4 that we cannot work around. Once that is fixed and zstreamed, we can test v4 too.

> virt-060# for a in 66 67 68; do \
>     iptables -I INPUT -s 10.34.71.$a -j DROP; \
>     iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
> done

This is neither a cluster node nor the NFS server? I am not sure where virt-060 comes into the game here with respect to routing.

> virt-066# clusvcadm -r nfs-mount
> Trying to relocate service:nfs-mount...
> [never finishes]

The use case we are covering here: the mount is held by a cluster node, the server goes away, and the netfs agent should detect that the server is gone, then try to stop the mount, recognize that the server is gone, and self_fence the node. A minimal sketch of this stop path follows this comment.

> Nov 15 18:01:20 virt-066 rgmanager[1096]: [netfs] netfs:le-nfs-mount1: virt-065:/exports is not mounted on /exports/1
> Nov 15 18:01:20 virt-066 rgmanager[2376]: status on netfs "le-nfs-mount1" returned 7 (unspecified)

I'll need to see more logs around the event, or give me access to the nodes and we can simulate/test together.

> Please advise on whether this is a configuration issue on my side or FAIL_QA.

It could be a combination of things; catch me Monday morning and I am happy to look at it in your setup.
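As a reading aid, here is a minimal shell sketch of the stop path described above, under stated assumptions: the variable names, the `pre_unmount_check` helper, the `rpcinfo` probe of the portmapper, and the reboot call are illustrative stand-ins, not the actual netfs.sh agent code.

```sh
#!/bin/sh
# Sketch of the intended stop path: before trying to unmount, probe the
# NFS server; if it does not respond and self_fence is set, reboot the
# node instead of hanging on the dead mount.
# NFS_HOST, MOUNTPOINT and SELF_FENCE stand in for the agent's
# configuration (host=, mountpoint=, self_fence= in cluster.conf).
NFS_HOST="virt-065"
MOUNTPOINT="/exports/1"
SELF_FENCE=1

pre_unmount_check() {
    echo "pre unmount: checking if nfs server $NFS_HOST is alive"
    # Generic RPC probe over TCP, comparable to the "Testing generic rpc
    # access" step visible in the agent's logs.
    if rpcinfo -t "$NFS_HOST" portmapper >/dev/null 2>&1; then
        return 0
    fi
    echo "RPC server on $NFS_HOST with tcp is not responding"
    if [ "$SELF_FENCE" = "1" ]; then
        echo "NFS server not responding - REBOOTING"
        reboot -fn    # force immediate reboot, no sync: a sync would block on the dead mount
    fi
    return 1
}

# Only attempt the unmount if the server still answers; otherwise the
# umount would hang and the node self-fences above.
pre_unmount_check && umount "$MOUNTPOINT"
```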
Comment 8 (Fabio Massimo Di Nitto)

I found the various issues (we also need to clone this to the relevant components).

bug 1: mount.nfs4 adds a trailing / to the device in /proc/mounts

```
<netfs export="/mnt" force_unmount="1" fstype="nfs" host="192.168.3.1"
       mountpoint="/mnt/nfs" name="nfs"
       options="vers=4,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>
```

(using fstype="nfs4" and dropping vers=4 from the options has the same effect)

nfsv4:

```
192.168.3.1:/mnt/ /mnt/nfs nfs4 rw,relatime,vers=4,rsize=131072,wsize=131072,namlen=255,soft,proto=tcp,port=0,timeo=100,retrans=1,sec=sys,clientaddr=192.168.2.66,minorversion=0,local_lock=none,addr=192.168.3.1 0 0
```

nfsv3:

```
<netfs export="/mnt" force_unmount="1" fstype="nfs" host="192.168.3.1"
       mountpoint="/mnt/nfs" name="nfs"
       options="vers=3,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>
```

```
192.168.3.1:/mnt /mnt/nfs nfs rw,sync,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,soft,proto=tcp,timeo=100,retrans=1,sec=sys,mountaddr=192.168.3.1,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.3.1 0 0
```

bug 2: findmnt is picky about the trailing /

bug 3: resource-agents could add/remove the trailing / as necessary

The workaround is simple: add a / at the end of the export config option:

```
<netfs export="/mnt/" force_unmount="1" fstype="nfs" host="192.168.3.1"
       mountpoint="/mnt/nfs" name="nfs"
       options="vers=4,rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>
```

I bet that we see different results from QE and TAM because of the trailing slash in cluster.conf.

Also, the relocation test is not useful to verify this scenario: the netfs status check takes a long time to detect that the server is gone, and the relocation request will be aborted by rgmanager, adding extra detection time before recovery takes place. It is best to just let rgmanager complete the status check.

Verification has been done using the kernel that fixes the EIO error from NFSv4 (Linux rhel6-node2.int.fabbione.net 2.6.32-358.23.2.sfdc00898941.1.el6.x86_64 #1 SMP Tue Nov 12 14:11:07 EST 2013 x86_64 x86_64 x86_64 GNU/Linux):

```
<netfs export="/mnt/" force_unmount="1" fstype="nfs4" host="192.168.3.1"
       mountpoint="/mnt/nfs" name="nfs"
       options="rw,sync,soft,retrans=1,timeo=100" self_fence="1"/>
```

The mount point is NFSv4 with the workaround. Closing down access to the NFS server with iptables:

```
Nov 17 16:27:52 rgmanager [netfs] Checking fs "nfs", Level 0
Nov 17 16:29:10 rgmanager [netfs] netfs:nfs: is_alive: /mnt/nfs is not a directory
Nov 17 16:29:10 rgmanager [netfs] fs:nfs: Mount point is not accessible!
Nov 17 16:29:10 rgmanager status on netfs "nfs" returned 1 (generic error)
Nov 17 16:29:10 rgmanager Stopping service service:notoporcodi3lettere
Nov 17 16:29:43 rgmanager [netfs] pre unmount: checking if nfs server 192.168.3.1 is alive
Nov 17 16:29:43 rgmanager [netfs] Testing generic rpc access on server 192.168.3.1 with protocol tcp
Nov 17 16:30:46 rgmanager [netfs] RPC server on 192.168.3.1 with tcp is not responding
Nov 17 16:30:46 rgmanager [netfs] NFS server not responding - REBOOTING
```

Reply (Fabio Massimo Di Nitto)

(In reply to Fabio Massimo Di Nitto from comment #8)

> bug 1: mount.nfs4 adds a trailing / to the device in /proc/mounts

Filed as bug#1031636.

> bug 2: findmnt is picky about the trailing /

Filed as bug#1031641.

> bug 3: resource-agents could add/remove the trailing / as necessary.

https://github.com/fabbione/resource-agents/tree/mountfixes
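A minimal sketch of the normalization suggested in bug 3, assuming the comparison is between the configured export and the device field from /proc/mounts; the `normalize_device` helper is hypothetical and this is not the code from the mountfixes branch. The findmnt lines at the end illustrate the trailing-slash sensitivity reported as bug 2, as an assumed reproduction.

```sh
#!/bin/sh
# Strip trailing slashes so "192.168.3.1:/mnt" and "192.168.3.1:/mnt/"
# compare equal; keep a root export ("host:/") intact.
normalize_device() {
    dev="$1"
    while [ "${dev%/}" != "$dev" ] && [ "${dev#*:}" != "/" ]; do
        dev="${dev%/}"
    done
    echo "$dev"
}

# Compare the configured source with what /proc/mounts actually records
# for the mount point (field 1 = device, field 2 = mount point).
configured="192.168.3.1:/mnt"
mounted=$(awk '$2 == "/mnt/nfs" { print $1 }' /proc/mounts)
if [ "$(normalize_device "$configured")" = "$(normalize_device "$mounted")" ]; then
    echo "mount matches despite any trailing slash"
fi

# Bug 2 in practice (assumed behavior): findmnt matches the source string
# literally, so only the exact spelling from /proc/mounts is found.
findmnt --source "192.168.3.1:/mnt"     # may miss an nfs4 mount
findmnt --source "192.168.3.1:/mnt/"    # matches what mount.nfs4 recorded
```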
Verification

I have verified that the node self-fences if it loses connection with the NFS server, for both NFS version 3 and version 4 mounts.

```
virt-067$ rpm -q resource-agents
resource-agents-3.9.2-40.el6_5.3.x86_64

virt-067$ cat /etc/cluster/cluster.conf
...
<resources>
    <netfs name="le-nfs-mount1" host="virt-065" export="/exports"
           mountpoint="/exports/1" force_unmount="1" self_fence="1"/>
</resources>
<service domain="le-domain" name="nfs-mount" recovery="relocate">
    <netfs ref="le-nfs-mount1"/>
</service>
...

virt-067$ ccs -h localhost --lsservices
service: name=nfs-mount, domain=le-domain, recovery=relocate
  netfs: ref=le-nfs-mount
resources:
  netfs: name=le-nfs-mount, self_fence=1, force_unmount=1, host=virt-065, \
         export=/exports, mountpoint=/exports/1
```

nfs version 3:

```
virt-067$ mount | grep /exports/1
virt-065:/exports on /exports/1 type nfs (rw,sync,vers=3,soft,retrans=1,timeo=100,addr=10.34.71.65)

virt-067$ date
Mon Nov 18 17:36:15 CET 2013

virt-067$ clustat
Cluster Status for STSRHTS2638 @ Mon Nov 18 17:38:04 2013
Member Status: Quorate

Member Name        ID   Status
------ ----        --   ------
virt-066            1   Online, rgmanager
virt-067            2   Online, Local, rgmanager
virt-068            3   Online, rgmanager

Service Name         Owner (Last)   State
------- ----         ------ ------  -----
service:nfs-mount    virt-067       started

nfs-server# for a in 66 67 68; do \
    iptables -I INPUT -s 10.34.71.$a -j DROP; \
    iptables -I OUTPUT -d 10.34.71.$a -j DROP; \
done

virt-067$ tail -f /var/log/cluster/rgmanager.log
...
Nov 18 17:39:09 rgmanager [netfs] netfs:le-nfs-mount: is_alive: /exports/1 is not a directory
Nov 18 17:39:09 rgmanager [netfs] fs:le-nfs-mount: Mount point is not accessible!
Nov 18 17:39:09 rgmanager status on netfs "le-nfs-mount" returned 1 (generic error)
Nov 18 17:39:09 rgmanager Stopping service service:nfs-mount
Nov 18 17:39:39 rgmanager [netfs] pre unmount: checking if nfs server virt-065 is alive
Nov 18 17:39:39 rgmanager [netfs] Testing generic rpc access on server virt-065 with protocol tcp
Nov 18 17:40:42 rgmanager [netfs] RPC server on virt-065 with tcp is not responding
Nov 18 17:40:42 rgmanager [netfs] NFS server not responding - REBOOTING
```

nfs version 4:

```
virt-067$ mount | grep /exports/1
virt-065:/exports on /exports/1 type nfs (rw,sync,vers=4,soft,retrans=1,timeo=100,addr=10.34.71.65,clientaddr=10.34.71.67)

(blocked traffic to the nfs server with iptables)

virt-067$ tail -f /var/log/cluster/rgmanager.log
...
Nov 18 17:56:21 rgmanager [netfs] Checking fs "le-nfs-mount", Level 0
Nov 18 17:57:21 rgmanager [netfs] Checking fs "le-nfs-mount", Level 0
Nov 18 17:58:39 rgmanager [netfs] netfs:le-nfs-mount: is_alive: /exports/1 is not a directory
Nov 18 17:58:39 rgmanager [netfs] fs:le-nfs-mount: Mount point is not accessible!
Nov 18 17:58:39 rgmanager status on netfs "le-nfs-mount" returned 1 (generic error)
Nov 18 17:58:39 rgmanager Stopping service service:nfs-mount
Nov 18 17:59:12 rgmanager [netfs] pre unmount: checking if nfs server virt-065 is alive
Nov 18 17:59:12 rgmanager [netfs] Testing generic rpc access on server virt-065 with protocol tcp
Nov 18 18:00:15 rgmanager [netfs] RPC server on virt-065 with tcp is not responding
Nov 18 18:00:15 rgmanager [netfs] NFS server not responding - REBOOTING
```

Closing note

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1746.html