Created attachment 1128469 [details]
ganesha.log

Description of problem:
Crash seen on a node during nfs-ganesha upgrade from 3.1 to 3.1.2.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-19

How reproducible:
Once

Steps to Reproduce:
1. Create a 4-node cluster and install 3.1 rhgs on all the nodes.
2. Configure all the required settings and set up ganesha on the cluster.
3. Mount the volume and start some dd from the client.
4. Upgrade node1 by following the procedure below:
     service nfs-ganesha stop        (failover happened from node1 to node1)
     service glusterd stop
     pkill glusterfs
     pkill glusterfsd
     pcs cluster standby node1
     pcs cluster stop node1
     enable puddles for 3.1.2 latest
     yum update nfs-ganesha
     pcs cluster start node1
     pcs cluster unstandby node1
     service glusterd start
     service nfs-ganesha start
5. Node1 got upgraded properly without any issues.
6. Followed the same steps for the upgrade of node2:
     mounted the volume on the client with the node2 VIP
     service nfs-ganesha stop        (failover happened from node2 to node1)
     service glusterd stop
     pkill glusterfs
     pkill glusterfsd
     pcs cluster standby node2
     pcs cluster stop node2          (IO was going on)
     yum update nfs-ganesha          (all the packages got updated)
     pcs cluster start node2
     pcs cluster unstandby node2

After this, pcs status gives the output below:

Full list of resources:
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 ]
     Stopped: [ nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 ]
     Stopped: [ nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr):  Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):   Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr):  Started nfs3
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):   Started nfs3
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr):  Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):   Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr):  Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):   Started nfs3
 nfs2-dead_ip-1     (ocf::heartbeat:Dummy):   Started nfs2
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):   Started nfs1

A crash is reported in /var/log/messages, but the core dump got deleted:

Feb 18 08:00:31 nfs1 abrt[5878]: Saved core dump of pid 28950 (/usr/bin/ganesha.nfsd) to /var/spool/abrt/ccpp-2016-02-18-08:00:28-28950 (703098880 bytes)
Feb 18 08:00:31 nfs1 abrtd: Directory 'ccpp-2016-02-18-08:00:28-28950' creation detected
Feb 18 08:00:31 nfs1 abrtd: Package 'nfs-ganesha' isn't signed with proper key
Feb 18 08:00:31 nfs1 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2016-02-18-08:00:28-28950' exited with 1
Feb 18 08:00:31 nfs1 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2016-02-18-08:00:28-28950'

Attaching the ganesha and ganesha-gfapi logs along with the sosreport for node1.

Actual results:

Expected results:

Additional info:
Created attachment 1128470 [details] ganesha-gfapi.log
sosreports are placed under http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1309984
The core seems to have been deleted. Niels, any idea why?

18/02/2016 08:00:25 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-13] cache_inode_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: 37.
18/02/2016 08:00:26 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-11] file_close :FSAL :CRIT :Error : close returns with Read-only file system
18/02/2016 08:00:26 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-11] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7fb71c011f90
18/02/2016 08:00:26 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-11] cache_inode_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: 37.
18/02/2016 08:00:27 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-12] file_close :FSAL :CRIT :Error : close returns with Read-only file system
18/02/2016 08:00:27 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-12] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7fb7100128f0
18/02/2016 08:00:27 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-12] cache_inode_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: 37.
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] file_close :FSAL :CRIT :Error : close returns with Read-only file system
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7fb6c4004730
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] cache_inode_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: 37.
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] clean_mapping :RW LOCK :CRIT :Error 35, write locking 0x7fb6c4004730 (&entry->attr_lock) at /builddir/build/BUILD/nfs-ganesha-2.2.0/src/cache_inode/cache_inode_get.c:157

Many file CLOSE operations failed with a read-only file system error due to split-brain or quorum issues.
This can be confirmed from the log messages below in gfapi.log:

[2016-02-18 02:27:50.699668] W [MSGID: 108001] [afr-transaction.c:686:afr_handle_quorum] 0-testvolume-replicate-1: 92989e49-afca-4555-89f9-8718dc8880fb: Failing WRITE as quorum is not met
[2016-02-18 02:27:50.699696] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 0-testvolume-client-3: remote operation failed [Transport endpoint is not connected]

But the nfs-ganesha process got aborted due to the error below:

18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] clean_mapping :RW LOCK :CRIT :Error 35, write locking 0x7fb6c4004730 (&entry->attr_lock) at /builddir/build/BUILD/nfs-ganesha-2.2.0/src/cache_inode/cache_inode_get.c:157

We need to check in the code why pthread_rwlock_wrlock resulted in EDEADLK (error 35).
The core file fails to be kept due to the GPG key check failure in abrt. You can solve this with the steps mentioned in https://support.zend.com/hc/en-us/articles/203782516-ABRT-logs-messages-with-Package-packagename-isn-t-signed-with-proper-key-

From the link:
"There are two ways to remedy this.

Method 1: Alter the default abrt behavior. This may be best if you have multiple third-party packages installed and want to ensure all associated application cores are caught.
Edit the file /etc/abrt/abrt-action-save-package-data.conf
Set OpenGPGCheck = no
Reload abrtd with the command: service abrtd reload

Method 2: Add Zend's GPG key to the rpm and abrtd key caches.
# wget -O /etc/pki/rpm-gpg/zend.key http://repos.zend.com/zend.key
# rpm --import /etc/pki/rpm-gpg/zend.key
# echo '/etc/pki/rpm-gpg/zend.key' >> /etc/abrt/gpg_keys
# service abrtd reload"

Method 1 worked for me.
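Method 1 boils down to a one-line sed on the abrt config. The sketch below runs against a temporary copy so it is safe to try anywhere; on a real node the file is /etc/abrt/abrt-action-save-package-data.conf and abrtd must be reloaded afterwards:

```shell
# Work on a throwaway copy of the config for demonstration purposes.
conf=$(mktemp)
printf 'OpenGPGCheck = yes\n' > "$conf"

# Flip the setting so abrt keeps cores from packages it cannot verify.
sed -i 's/^OpenGPGCheck *= *yes/OpenGPGCheck = no/' "$conf"
cat "$conf"

# On the real node, point $conf at /etc/abrt/abrt-action-save-package-data.conf
# and follow up with:  service abrtd reload
```

Note that with OpenGPGCheck disabled, abrt will also keep cores from any unsigned third-party package, which is exactly what we want here.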
Thanks Jiffin. From code inspection, I suspect the code flow below for now:

cache_inode_rdwr (plus) or similar
    |
    v
cache_inode_refresh_attrs
    |
    v
cache_inode_kill_entry
    |
    v
cih_remove_checked
    |
    v
cache_inode_lru_unref
    |
    v
cache_inode_lru_cleanup
    |
    v
clean_mapping

We may be able to confirm this from the core.

Shashank, could you please apply the steps provided by Jiffin, reproduce this issue, and provide us the core?
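Once abrt keeps the problem directory, the suspected flow above could be checked against the backtrace with something like the following (the directory name is a placeholder; abrt names it ccpp-<timestamp>-<pid>):

```
gdb /usr/bin/ganesha.nfsd /var/spool/abrt/ccpp-<timestamp>-<pid>/coredump
(gdb) thread apply all bt full
```

If the crashing thread shows cache_inode_lru_cleanup and clean_mapping above a frame that already took entry->attr_lock, that would confirm the self-deadlock theory.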
Will be testing this scenario during 3.1.3 ganesha upgrade testing and will update the results.
In-service upgrade is not supported.