Bug 1091102

Summary: The pacemaker nfsserver resource agent's execution of sm-notify fails during startup

Product: Red Hat Enterprise Linux 6
Component: resource-agents
Version: 6.6
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: David Vossel <dvossel>
Assignee: David Vossel <dvossel>
QA Contact: Cluster QE <mspqa-list>
CC: agk, cluster-maint, djansa, fdinitto, jherrman, mnovacek, sbradley
Target Milestone: rc
Fixed In Version: resource-agents-3.9.5-8.el6
Doc Type: Bug Fix
Doc Text:
Previously, Pacemaker's nfsserver resource agent was unable to properly perform NFSv3 network status monitor (NSM) state notifications. As a consequence, NFSv3 clients could not reclaim file locks after server relocation or recovery. This update introduces the nfsnotify resource agent, which sends NSM notifications correctly and thus allows NFSv3 clients to reclaim their file locks.
Clone Of: 1091101
Last Closed: 2014-10-14 05:00:55 UTC
Type: Bug
Bug Depends On: 1091101    

Description David Vossel 2014-04-24 22:04:45 UTC
+++ This bug was initially created as a clone of Bug #1091101 +++

Description of problem:

The nfsserver resource agent's call to sm-notify during startup fails because the agent does not properly maintain file permissions on the statd directory.

If, during a failover, the server needs to notify a client using sm-notify, sm-notify displays this warning:

Apr 24 17:52:40 rhel7-node1 lrmd[2771]: notice: operation_finished: nfs-daemon_start_0:3848:stderr [ sm-notify: Failed to delete: could not open original file /var/lib/nfs/statd/sm.ha/sm.bak/rhel7-node3: Permission denied ]
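
The failure follows from the permissions problem above: sm-notify drops root privileges while it runs, so state files copied into place as root become unreadable and undeletable for it. A minimal sketch of the direction a fix takes (the rpcuser account name is an assumption; it is the conventional statd user on RHEL but varies by distribution):

chown -R rpcuser:rpcuser /var/lib/nfs/statd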


How reproducible:
100%

Steps to Reproduce:

1. Deploy this pacemaker scenario, which sets up an NFS server and file export: https://github.com/davidvossel/phd/blob/master/scenarios/nfs-basic.scenario
2. Mount the export on a node outside of the cluster and grab a file lock on a file within the export. I did this:

flock /root/nfsshare/clientdatafile -c "sleep 10000"

3. Put the node hosting the NFS server in standby:

pcs cluster standby

4. Watch sm-notify fail during NFS startup on whichever node the NFS server moves to (one way to watch for it is sketched below).
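
One way to watch for the failure on the takeover node (assuming the default RHEL 6 syslog target; the grep pattern is only a convenience):

tail -f /var/log/messages | grep -i sm-notify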

Actual results:

sm-notify does not properly delete notify entries.


Expected results:

sm-notify properly deletes notify entries after the notify is complete.
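
A quick way to confirm this, using the state path from the warning above: after the agent finishes starting, the HA notify directory should hold no stale host entries.

ls -lR /var/lib/nfs/statd/sm.ha/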

--- Additional comment from David Vossel on 2014-04-24 18:04:12 EDT ---

There is a patch posted for this upstream.

https://github.com/ClusterLabs/resource-agents/pull/414

Comment 2 David Vossel 2014-05-08 15:43:48 UTC
There's an upstream pull request related to this issue.

https://github.com/ClusterLabs/resource-agents/pull/420

Comment 3 David Vossel 2014-05-08 15:44:40 UTC
*** Bug 1091474 has been marked as a duplicate of this bug. ***

Comment 7 michal novacek 2014-07-25 17:26:28 UTC
I have verified (using instructions from comment #6) that sm-notify works
correctly after NFS server failover with the new nfsnotify resource agent
from resource-agents-3.9.5-11.el6.x86_64.

----

nfs-client# mount | grep shared
10.34.70.136:/mnt/shared/1 on /exports/1 type nfs (rw,vers=3,addr=10.34.70.136)

nfs-client# flock /exports/1/urandom -c 'sleep 10000'
...

# tshark -i eth0 -R nlm 
Running as user "root" and group "root". This could be dangerous.
Capturing on eth0
 10.523062 10.34.71.133 -> 10.34.70.136 NLM 330 V4 LOCK Call FH:0x6c895d9c svid:137 pos:0-0
 10.523350 10.34.70.136 -> 10.34.71.133 NLM 106 V4 LOCK Reply (Call In 52)
<failover occurs>
 29.301472 10.34.71.133 -> 10.34.70.136 NLM 330 V4 LOCK Call FH:0x6c895d9c svid:137 pos:0-0
 32.303873 10.34.71.133 -> 10.34.70.136 NLM 330 V4 LOCK Call FH:0x6c895d9c svid:137 pos:0-0
 32.332312 10.34.70.136 -> 10.34.71.133 NLM 106 V4 LOCK Reply (Call In 120)

# tshark -i eth0 -R stat
Running as user "root" and group "root". This could be dangerous.
Capturing on eth0
<failover occurs>
 27.793019 10.34.70.136 -> 10.34.71.133 STAT 142 V1 NOTIFY Call
 27.793204 10.34.71.133 -> 10.34.70.136 STAT 66 V1 NOTIFY Reply (Call In 75)
 27.793440 10.34.70.136 -> 10.34.71.133 STAT 142 V1 NOTIFY Call
 27.793672 10.34.71.133 -> 10.34.70.136 STAT 66 V1 NOTIFY Reply (Call In 77)


Obtaining another lock fails, which shows the lock is still held by the
original process:
nfs-client# flock --nonblock /exports/1/urandom -c 'sleep 10'
nfs-client# echo $?
1
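
As an additional cross-check (my own suggestion, not part of the original verification), the reclaimed NLM lock should be visible in the kernel lock table on the node now hosting the server, held on lockd's behalf:

grep ADVISORY /proc/locks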


The cluster configuration is as follows:
virt-136# pcs status 
Cluster name: STSRHTS24129
Last updated: Fri Jul 25 19:03:58 2014
Last change: Fri Jul 25 18:55:14 2014
Stack: cman
Current DC: virt-137 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured
10 Resources configured


Online: [ virt-136 virt-137 ]

Full list of resources:

 fence-virt-136 (stonith:fence_xvm):    Started virt-136 
 fence-virt-137 (stonith:fence_xvm):    Started virt-137 
 fence-virt-138 (stonith:fence_xvm):    Started virt-136 
 Resource Group: hanfs
     mnt-shared (ocf::heartbeat:Filesystem):    Started virt-136 
     nfs-daemon (ocf::heartbeat:nfsserver):     Started virt-136 
     export-root        (ocf::heartbeat:exportfs):      Started virt-136 
     export0    (ocf::heartbeat:exportfs):      Started virt-136 
     export1    (ocf::heartbeat:exportfs):      Started virt-136 
     vip        (ocf::heartbeat:IPaddr2):       Started virt-136 
     nfs-notify (ocf::heartbeat:nfsnotify):     Started virt-136 

virt-136# pcs resource show hanfs
 Group: hanfs
  Resource: mnt-shared (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/shared/shared0 directory=/mnt/shared fstype=ext4 options= force_unmount=safe 
   Operations: start interval=0s timeout=60 (mnt-shared-start-timeout-60)
               stop interval=0s timeout=60 (mnt-shared-stop-timeout-60)
               monitor interval=30s (mnt-shared-monitor-interval-30s)
  Resource: nfs-daemon (class=ocf provider=heartbeat type=nfsserver)
   Attributes: nfs_ip=10.34.70.136 nfs_shared_infodir=/mnt/shared/nfs nfs_no_notify=True 
   Operations: start interval=0s timeout=40 (nfs-daemon-start-timeout-40)
               stop interval=0s timeout=20s (nfs-daemon-stop-timeout-20s)
               monitor interval=30s (nfs-daemon-monitor-interval-30s)
  Resource: export-root (class=ocf provider=heartbeat type=exportfs)
   Attributes: directory=/mnt/shared clientspec=* options=rw,sync fsid=304 
   Operations: start interval=0s timeout=40 (export-root-start-timeout-40)
               stop interval=0s timeout=120 (export-root-stop-timeout-120)
               monitor interval=10 timeout=20 (export-root-monitor-interval-10)
  Resource: export0 (class=ocf provider=heartbeat type=exportfs)
   Attributes: directory=/mnt/shared/0 clientspec=* options=rw,sync fsid=1 
   Operations: start interval=0s timeout=40 (export0-start-timeout-40)
               stop interval=0s timeout=120 (export0-stop-timeout-120)
               monitor interval=10 timeout=20 (export0-monitor-interval-10)
  Resource: export1 (class=ocf provider=heartbeat type=exportfs)
   Attributes: directory=/mnt/shared/1 clientspec=* options=rw,sync fsid=2 
   Operations: start interval=0s timeout=40 (export1-start-timeout-40)
               stop interval=0s timeout=120 (export1-stop-timeout-120)
               monitor interval=10 timeout=20 (export1-monitor-interval-10)
  Resource: vip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.34.70.136 cidr_netmask=23 
   Operations: start interval=0s timeout=20s (vip-start-timeout-20s)
               stop interval=0s timeout=20s (vip-stop-timeout-20s)
               monitor interval=30s (vip-monitor-interval-30s)
  Resource: nfs-notify (class=ocf provider=heartbeat type=nfsnotify)
   Attributes: source_host=pool-10-34-70-136.cluster-qe.lab.eng.brq.redhat.com 
   Operations: start interval=0s timeout=90 (nfs-notify-start-timeout-90)
               stop interval=0s timeout=90 (nfs-notify-stop-timeout-90)
               monitor interval=30 timeout=90 (nfs-notify-monitor-interval-30)
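
For reference, a resource shaped like nfs-notify above can be created with a command along these lines (values copied from the configuration shown; operation defaults may differ across pcs versions):

pcs resource create nfs-notify ocf:heartbeat:nfsnotify \
    source_host=pool-10-34-70-136.cluster-qe.lab.eng.brq.redhat.com \
    op monitor interval=30 timeout=90 --group hanfs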

Comment 8 errata-xmlrpc 2014-10-14 05:00:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1428.html