Bug 1309984 - Crash seen on a node during in-service nfs-ganesha upgrade from 3.1 to 3.1.2
Summary: Crash seen on a node during in-service nfs-ganesha upgrade from 3.1 to 3.1.2
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Soumya Koduri
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-19 07:02 UTC by Shashank Raj
Modified: 2016-11-08 03:53 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-20 12:18:29 UTC
Embargoed:


Attachments
ganesha.log (41.88 KB, text/plain), attached 2016-02-19 07:02 UTC by Shashank Raj
ganesha-gfapi.log (54.98 KB, text/plain), attached 2016-02-19 07:05 UTC by Shashank Raj

Description Shashank Raj 2016-02-19 07:02:51 UTC
Created attachment 1128469 [details]
ganesha.log

Description of problem:
Crash seen on a node during nfs-ganesha upgrade from 3.1 to 3.1.2

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-19

How reproducible:
Once

Steps to Reproduce:
1. Create a 4-node cluster and install RHGS 3.1 on all the nodes.
2. Configure all the required settings and set up ganesha on the cluster.
3. Mount the volume and start some dd from the client.
4. Do an upgrade of node1 by following the procedure mentioned below:

service nfs-ganesha stop
failover happened from node1 to another node
service glusterd stop
pkill glusterfs
pkill glusterfsd
pcs cluster standby node1
pcs cluster stop node1 
enable puddles for 3.1.2 latest
yum update nfs-ganesha
pcs cluster start node1
pcs cluster unstandby node1
service glusterd start
service nfs-ganesha start

5. Node1 got upgraded properly without any issues.
6. Followed the same steps for the upgrade of node2 as below

mounted the volume on client with node2 VIP
service nfs-ganesha stop
failover happened from node2 to node1
service glusterd stop
pkill glusterfs
pkill glusterfsd
pcs cluster standby node2
pcs cluster stop node2
IO was going on
yum update nfs-ganesha - all the packages got updated.
pcs cluster start node2
pcs cluster unstandby node2

After this pcs status gives below output:

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 ]
     Stopped: [ nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 ]
     Stopped: [ nfs4 ]
 nfs1-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3
 nfs1-trigger_ip-1      (ocf::heartbeat:Dummy): Started nfs3
 nfs2-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3
 nfs2-trigger_ip-1      (ocf::heartbeat:Dummy): Started nfs3
 nfs3-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3
 nfs3-trigger_ip-1      (ocf::heartbeat:Dummy): Started nfs3
 nfs4-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3
 nfs4-trigger_ip-1      (ocf::heartbeat:Dummy): Started nfs3
 nfs2-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs2
 nfs1-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs1

A crash is reported in /var/log/messages, but the core dump got deleted by abrt:

Feb 18 08:00:31 nfs1 abrt[5878]: Saved core dump of pid 28950 (/usr/bin/ganesha.nfsd) to /var/spool/abrt/ccpp-2016-02-18-08:00:28-28950 (703098880 bytes)
Feb 18 08:00:31 nfs1 abrtd: Directory 'ccpp-2016-02-18-08:00:28-28950' creation detected
Feb 18 08:00:31 nfs1 abrtd: Package 'nfs-ganesha' isn't signed with proper key
Feb 18 08:00:31 nfs1 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2016-02-18-08:00:28-28950' exited with 1
Feb 18 08:00:31 nfs1 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2016-02-18-08:00:28-28950'

Attaching the ganesha and ganesha-gfapi logs along with the sosreport for node1.

Actual results:


Expected results:


Additional info:

Comment 2 Shashank Raj 2016-02-19 07:05:26 UTC
Created attachment 1128470 [details]
ganesha-gfapi.log

Comment 3 Shashank Raj 2016-02-19 07:13:08 UTC
sosreports are placed under http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1309984

Comment 4 Soumya Koduri 2016-02-19 08:50:38 UTC
The core seems to have been deleted. Niels, any idea why?

18/02/2016 08:00:25 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-13] cache_inode_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: 37.
18/02/2016 08:00:26 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-11] file_close :FSAL :CRIT :Error : close returns with Read-only file system
18/02/2016 08:00:26 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-11] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7fb71c011f90
18/02/2016 08:00:26 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-11] cache_inode_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: 37.
18/02/2016 08:00:27 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-12] file_close :FSAL :CRIT :Error : close returns with Read-only file system
18/02/2016 08:00:27 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-12] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7fb7100128f0
18/02/2016 08:00:27 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-12] cache_inode_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: 37.
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] file_close :FSAL :CRIT :Error : close returns with Read-only file system
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 37(CACHE_INODE_SERVERFAULT) for entry 0x7fb6c4004730
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] cache_inode_lru_clean :INODE LRU :CRIT :Error closing file in cleanup: 37.
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] clean_mapping :RW LOCK :CRIT :Error 35, write locking 0x7fb6c4004730 (&entry->attr_lock) at /builddir/build/BUILD/nfs-ganesha-2.2.0/src/cache_inode/cache_inode_get.c:157

Many file CLOSE operations failed with a read-only file system error due to split-brain or quorum issues. This can be confirmed from the log messages below in ganesha-gfapi.log:

[2016-02-18 02:27:50.699668] W [MSGID: 108001] [afr-transaction.c:686:afr_handle_quorum] 0-testvolume-replicate-1: 92989e49-afca-4555-89f9-8718dc8880fb: Failing WRITE as quorum is not met
[2016-02-18 02:27:50.699696] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 0-testvolume-client-3: remote operation failed [Transport endpoint is not connected]


But the nfs-ganesha process got aborted due to the error below:
18/02/2016 08:00:28 : epoch 56c52890 : nfs1 : ganesha.nfsd-28950[work-2] clean_mapping :RW LOCK :CRIT :Error 35, write locking 0x7fb6c4004730 (&entry->attr_lock) at /builddir/build/BUILD/nfs-ganesha-2.2.0/src/cache_inode/cache_inode_get.c:157

Need to check from the code why taking the write lock resulted in EDEADLK (error 35).
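
For context, a minimal standalone sketch (not taken from the nfs-ganesha sources) of the POSIX behaviour in question: on glibc, pthread_rwlock_wrlock() returns EDEADLK (35 on Linux) when the calling thread already holds that lock for writing, which matches the "Error 35, write locking ... (&entry->attr_lock)" message above.

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_rwlock_t lock;
    pthread_rwlock_init(&lock, NULL);

    /* first acquisition succeeds (returns 0) */
    int rc = pthread_rwlock_wrlock(&lock);
    printf("first wrlock:  %d (%s)\n", rc, strerror(rc));

    /* taking the same write lock again from the same thread is detected
     * and fails with EDEADLK, i.e. error 35 as seen in the ganesha log */
    rc = pthread_rwlock_wrlock(&lock);
    printf("second wrlock: %d (%s)\n", rc, strerror(rc));

    pthread_rwlock_unlock(&lock);
    pthread_rwlock_destroy(&lock);
    return 0;
}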

Comment 5 Jiffin 2016-02-19 09:21:13 UTC
The core file gets deleted due to the GPG key check failure in abrt. You can solve this by following the steps mentioned in https://support.zend.com/hc/en-us/articles/203782516-ABRT-logs-messages-with-Package-packagename-isn-t-signed-with-proper-key-

From the link:

"There are two ways to remedy this.

Method 1: Alter the default abrt behavior.  This may be best if you have multiple third-party packages installed and want to ensure all associated application cores are caught.

    Edit the file /etc/abrt/abrt-action-save-package-data.conf
    Set OpenGPGCheck = no
    Reload abrtd with the command: service abrtd reload.

Method 2: Add Zend's GPG key to the rpm and abrtd key caches.

# wget -O /etc/pki/rpm-gpg/zend.key http://repos.zend.com/zend.key
# rpm --import /etc/pki/rpm-gpg/zend.key
# echo '/etc/pki/rpm-gpg/zend.key' >> /etc/abrt/gpg_keys
# service abrtd reload "

Method 1 worked for me.

Comment 6 Soumya Koduri 2016-02-19 14:04:04 UTC
Thanks Jiffin.

From code inspection, I suspect the below code flow for now:

cache_inode_rdwr(plus) or similar 
     |
     v
cache_inode_refresh_attrs
     |
     v
cache_inode_kill_entry
     |
     v
cih_remove_checked
     |
     v
cache_inode_lru_unref
     |
     v
cache_inode_lru_cleanup
     |
     v
check_mapping.

We may be able to confirm this from core. 
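
Building on the minimal reproducer above, a hypothetical sketch of that suspicion (the struct and function names below are placeholders, not the actual ganesha code): if entry->attr_lock is already held for write higher up the chain, the nested cleanup re-locking it on the same thread would return exactly the EDEADLK reported from clean_mapping.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* placeholder for the cache entry; only the lock matters for this sketch */
struct entry {
    pthread_rwlock_t attr_lock;
};

/* stands in for the cleanup step (clean_mapping at cache_inode_get.c:157) */
static int cleanup_step(struct entry *e)
{
    /* the same thread already owns the write lock -> returns EDEADLK (35) */
    return pthread_rwlock_wrlock(&e->attr_lock);
}

/* stands in for the refresh_attrs -> kill_entry -> lru_unref chain */
static int refresh_and_kill(struct entry *e)
{
    pthread_rwlock_wrlock(&e->attr_lock);  /* write lock taken by the caller */
    int rc = cleanup_step(e);              /* re-entry on the same thread */
    pthread_rwlock_unlock(&e->attr_lock);
    return rc;
}

int main(void)
{
    struct entry e;
    pthread_rwlock_init(&e.attr_lock, NULL);
    int rc = refresh_and_kill(&e);
    printf("cleanup_step returned %d (%s)\n", rc, strerror(rc));
    pthread_rwlock_destroy(&e.attr_lock);
    return 0;
}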

Shashank,
Could you please apply the steps provided by Jiffin, reproduce this issue, and provide us the core?

Comment 7 Shashank Raj 2016-04-04 12:06:40 UTC
Will be testing this scenario during 3.1.3 ganesha upgrade testing and will update the results.

Comment 8 Kaleb KEITHLEY 2016-06-20 12:18:29 UTC
in-service upgrade is not supported

