Bug 1255470
| Summary: | crash when NFS Ganesha Volume is 100% full | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Harold Miller <hamiller> |
| Component: | nfs-ganesha | Assignee: | Jiffin <jthottan> |
| Status: | CLOSED ERRATA | QA Contact: | Shashank Raj <sraj> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.1 | CC: | jthottan, nlevinki, rcyriac, rhinduja, sashinde, skoduri |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.1.3 | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | nfs-ganesha-2.3.1-1 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-06-23 05:35:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1299184 | ||
Verified this bug with the latest build; the observations and steps followed are as below:
1. Create a cluster of 4 nodes.
2. Configure nfs-ganesha on the cluster.
3. Create a 4 GB volume and enable ganesha on it.
4. Mount the volume on the client:
[root@dhcp37-206 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp37--206-root 27G 2.3G 25G 9% /
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 8.4M 3.9G 1% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/vda1 497M 124M 373M 25% /boot
tmpfs 783M 0 783M 0% /run/user/0
rhsqe-repo.lab.eng.blr.redhat.com:/opt 1.9T 353G 1.4T 20% /opt
10.70.36.217:/test 4.0G 130M 3.9G 4% /mnt
5. Create a 5 GB file on the mount point (larger than the 4 GB volume).
After some time, check the status of the nfs-ganesha service on all nodes. The node hosting the mount logs the messages below in the service status; the other nodes show no such messages, and no crash is observed on any node:
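The steps above can be sketched as shell commands. The cluster-setup commands are kept as comments since they need a live Gluster cluster; the volume name (`test`), export, and sizes follow the report, while the server/brick names are placeholders. The `/dev/full` check at the end is an added local illustration of the ENOSPC error the client should receive instead of a server crash:

```shell
# Reproduction sketch (comments are illustrative; server1/server2 and brick
# paths are placeholders, volume name and export follow the report):
#   gluster volume create test server1:/bricks/brick0 server2:/bricks/brick0
#   gluster volume start test
#   gluster volume set test ganesha.enable on
#   mount -t nfs 10.70.36.217:/test /mnt
#   dd if=/dev/zero of=/mnt/bigfile bs=1M count=5120   # 5 GB into a 4 GB volume

# The expected client-side failure (ENOSPC, not a crash) can be previewed
# locally: every write to /dev/full fails with "No space left on device".
if ! dd if=/dev/zero of=/dev/full bs=1M count=1 2> dd_err.txt; then
    echo "dd failed as expected"
fi
grep -o 'No space left on device' dd_err.txt | head -n1
rm -f dd_err.txt
```

With the fix, the writer sees the error above while `ganesha.nfsd` stays up on all nodes.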
[root@dhcp37-180 brick0]# service nfs-ganesha status -l
Redirecting to /bin/systemctl status -l nfs-ganesha.service
● nfs-ganesha.service - NFS-Ganesha file server
Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2016-04-06 07:20:16 IST; 7min ago
Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki
Process: 13197 ExecStop=/bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.shutdown (code=exited, status=0/SUCCESS)
Process: 13706 ExecStartPost=/bin/bash -c prlimit --pid $MAINPID --nofile=$NOFILE:$NOFILE (code=exited, status=0/SUCCESS)
Process: 13704 ExecStart=/bin/bash -c ${NUMACTL} ${NUMAOPTS} /usr/bin/ganesha.nfsd ${OPTIONS} ${EPOCH} (code=exited, status=0/SUCCESS)
Main PID: 13705 (ganesha.nfsd)
CGroup: /system.slice/nfs-ganesha.service
└─13705 /usr/bin/ganesha.nfsd
Apr 06 07:21:55 dhcp37-180.lab.eng.blr.redhat.com nfs-ganesha[13705]: [dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume test exported at : '/'
Apr 06 07:25:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-13] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:25:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-8] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:25:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-9] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:25:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-12] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:25:53 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-7] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:26:51 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
Apr 06 07:26:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
Apr 06 07:26:53 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
Apr 06 07:26:54 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
in /var/log/messages:
Apr 6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-13] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-8] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-9] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-12] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 6 07:25:53 dhcp37-141 ganesha.nfsd[13705]: [work-7] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 6 07:26:51 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
Apr 6 07:26:52 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
Apr 6 07:26:53 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
Apr 6 07:26:54 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
in ganesha-gfapi.log:
The message "W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-test-client-2: remote operation failed [No space left on device]" repeated 57 times between [2016-04-06 01:55:53.061325] and [2016-04-06 01:55:54.203684]
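The CRIT and WARN patterns above are what a quick log check would look for when verifying this behavior. A minimal sketch, with sample lines copied from the report and embedded so it is self-contained (in practice, point `LOG` at /var/log/messages on the affected node):

```shell
# Count nfs3_write I/O errors and missed dbus heartbeats in a ganesha log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Apr  6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-13] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr  6 07:26:51 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
Apr  6 07:26:52 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy. Not sending heartbeat
EOF
io_errors=$(grep -c 'CACHE_INODE_IO_ERROR' "$LOG")
missed_hb=$(grep -c 'Not sending heartbeat' "$LOG")
echo "io_errors=$io_errors missed_hb=$missed_hb"   # prints: io_errors=1 missed_hb=2
rm -f "$LOG"
```

After the fix, these messages may still appear while the volume is full, but the ganesha.nfsd process itself keeps running.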
As per discussion with Jiffin, these messages are expected.
Based on the above observations, marking this bug as Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1288
Description of problem: Writing more data than the NFS volume will hold crashes all nodes in a Gluster volume.

Version-Release number of selected component (if applicable):

How reproducible: consistent

Steps to Reproduce:
1. Set up two VMs with Red Hat Gluster Storage.
2. Configure NFS-Ganesha.
3. Set up a test volume.
4. Fill the volume with a file that is too big for it. The test volume was 2 GB and a 2.5 GB file was written to the mounted directory using dd.

Actual results:
5. NFS-Ganesha crashes on the node from which the volume was initially mounted:
19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 22(CACHE_INODE_IO_ERROR) for entry 0x2ab57a0
19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_rdwr_plus :INODE :CRIT :Error closing file in cache_inode_rdwr: 22.
19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_rdwr_plus :INODE :CRIT :Error closing file in cache_inode_rdwr: 22.
6. Even worse, after a few seconds the NFS-Ganesha process on the second node also crashes:
19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 22(CACHE_INODE_IO_ERROR) for entry 0x2ab57a0
19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_rdwr_plus :INODE :CRIT :Error closing file in cache_inode_rdwr: 22.

Expected results:
An error message, but no crash.

Additional info:
From customer - "Conclusion: By writing a file to a too small volume it's possible to crash the ENTIRE HA cluster. Quite bad :-("