Bug 1255470 - crash when NFS Ganesha Volume is 100% full
Summary: crash when NFS Ganesha Volume is 100% full
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: nfs-ganesha
Version: rhgs-3.1
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Jiffin
QA Contact: Shashank Raj
URL:
Whiteboard:
Depends On:
Blocks: 1299184
Reported: 2015-08-20 16:31 UTC by Harold Miller
Modified: 2016-11-08 03:52 UTC
CC List: 6 users

Fixed In Version: nfs-ganesha-2.3.1-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-23 05:35:52 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2016:1288 | 0 | normal | SHIPPED_LIVE | nfs-ganesha update for Red Hat Gluster Storage 3.1 update 3 | 2016-06-23 09:12:51 UTC

Description Harold Miller 2015-08-20 16:31:48 UTC
Description of problem: Writing more data than an NFS-Ganesha-exported volume can hold crashes NFS-Ganesha on all nodes serving the Gluster volume.


Version-Release number of selected component (if applicable):


How reproducible: consistent


Steps to Reproduce:

1. Set up two VMs with Red Hat Gluster Storage.
2. Configure NFS-Ganesha.
3. Set up a test volume.
4. Fill the volume with a file that is too big for it:
The test volume was 2 GB, and I wrote a 2.5 GB file to the mounted directory using dd.
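For reference, the dd step can be approximated in Python. This is a minimal sketch, not the reporter's script; the function name `fill_file` and any target path are hypothetical. It writes zeros in 1 MiB chunks, calls fsync, and returns any OSError instead of raising it, which is the client-side half of the report's "error message, but no crash" expectation. Because the Linux NFS client caches writes, an out-of-space failure (typically ENOSPC, or EIO if the server answers with NFS3ERR_IO) may only surface at fsync() or close() rather than at write().

```python
import os

CHUNK = 1024 * 1024  # 1 MiB per write, similar to dd bs=1M


def fill_file(path, total_bytes, chunk=CHUNK):
    """Write total_bytes of zeros to path (hypothetical helper).

    Returns (bytes_written, error): error is None on success, otherwise the
    OSError raised by open/write/fsync/close (for example ENOSPC or EIO once
    the volume is full). Errors are returned, never re-raised.
    """
    written = 0
    buf = b"\0" * chunk
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as exc:
        return written, exc  # e.g. ENOSPC on a completely full volume, or a bad path
    err = None
    try:
        while written < total_bytes:
            # os.write may accept fewer bytes than requested; count what it took.
            written += os.write(fd, buf[: min(chunk, total_bytes - written)])
        # The NFS client caches writes, so a full volume is often reported here.
        os.fsync(fd)
    except OSError as exc:
        err = exc
    finally:
        try:
            os.close(fd)
        except OSError as exc:
            # close() can also report a deferred write error on NFS.
            if err is None:
                err = exc
    return written, err
```

For example, `fill_file("/mnt/bigfile", int(2.5 * 1024**3))` against a 2 GB volume mounted at the hypothetical path /mnt returns the number of bytes accepted by write() together with the OSError, if any, so the client reports the failure instead of hanging or aborting.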


Actual results:
5. NFS-Ganesha crashes on the node through which the volume was initially mounted:

19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 22(CACHE_INODE_IO_ERROR) for entry 0x2ab57a0
19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_rdwr_plus :INODE :CRIT :Error closing file in cache_inode_rdwr: 22.
19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_rdwr_plus :INODE :CRIT :Error closing file in cache_inode_rdwr: 22.

6. Even worse, after a few seconds the NFS-Ganesha process on the second node also crashes:

19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_close :INODE :CRIT :FSAL_close failed, returning 22(CACHE_INODE_IO_ERROR) for entry 0x2ab57a0
19/08/2015 20:41:10 : epoch 55d4ca38 : gluster2.local : ganesha.nfsd-5969[work-13] cache_inode_rdwr_plus :INODE :CRIT :Error closing file in cache_inode_rdwr: 22.

Expected results: Error message, but no crash


Additional info: From customer - "Conclusion: By writing a file to a too small volume it's possible to crash the ENTIRE HA cluster. Quite bad :-("

Comment 5 Shashank Raj 2016-04-06 13:25:54 UTC
Verified this bug with the latest build; the observations and steps followed are as below:

1. Create a cluster of 4 nodes.
2. Configure nfs-ganesha on the cluster.
3. Create a 4 GB volume and enable ganesha on it.
4. Mount the volume on the client:

[root@dhcp37-206 ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp37--206-root        27G  2.3G   25G   9% /
devtmpfs                                3.9G     0  3.9G   0% /dev
tmpfs                                   3.9G     0  3.9G   0% /dev/shm
tmpfs                                   3.9G  8.4M  3.9G   1% /run
tmpfs                                   3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/vda1                               497M  124M  373M  25% /boot
tmpfs                                   783M     0  783M   0% /run/user/0
rhsqe-repo.lab.eng.blr.redhat.com:/opt  1.9T  353G  1.4T  20% /opt
10.70.36.217:/test                      4.0G  130M  3.9G   4% /mnt

5. Create a 5 GB file on the mount point.
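Step 5 only exercises the full-volume path if the file is bigger than what the mount reports as available: the df line for 10.70.36.217:/test shows 3.9G available, so a 5 GB file overflows it. A small pre-check sketch in Python makes that comparison explicit before the write starts. The helper names `parse_df_size` and `will_overflow` are hypothetical; `shutil.disk_usage` is the standard-library call, and its `free` field (f_bavail * f_frsize on POSIX) is the same quantity df shows as Avail.

```python
import shutil

# df -h prints sizes in powers of 1024 with a one-letter suffix.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_df_size(text):
    """Convert a `df -h` size such as '3.9G' or '130M' to an approximate byte count."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)  # plain numbers such as '0'


def will_overflow(mount_point, file_size_bytes):
    """True if file_size_bytes exceeds the space currently available on mount_point.

    shutil.disk_usage().free is the space available to unprivileged users,
    the same figure df reports in its Avail column.
    """
    return file_size_bytes > shutil.disk_usage(mount_point).free
```

Because df -h rounds its output, `parse_df_size` is only approximate; `will_overflow` queries the live filesystem and is the check to rely on.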

After some time, check the status of the nfs-ganesha service on all nodes. On the node the client mounted from, the messages below appear in the service status; no such messages appear on the other nodes, and no crash is observed on any node:

[root@dhcp37-180 brick0]# service nfs-ganesha status -l
Redirecting to /bin/systemctl status  -l nfs-ganesha.service
● nfs-ganesha.service - NFS-Ganesha file server
   Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-04-06 07:20:16 IST; 7min ago
     Docs: http://github.com/nfs-ganesha/nfs-ganesha/wiki
  Process: 13197 ExecStop=/bin/dbus-send --system --dest=org.ganesha.nfsd --type=method_call /org/ganesha/nfsd/admin org.ganesha.nfsd.admin.shutdown (code=exited, status=0/SUCCESS)
  Process: 13706 ExecStartPost=/bin/bash -c prlimit --pid $MAINPID --nofile=$NOFILE:$NOFILE (code=exited, status=0/SUCCESS)
  Process: 13704 ExecStart=/bin/bash -c ${NUMACTL} ${NUMAOPTS} /usr/bin/ganesha.nfsd ${OPTIONS} ${EPOCH} (code=exited, status=0/SUCCESS)
 Main PID: 13705 (ganesha.nfsd)
   CGroup: /system.slice/nfs-ganesha.service
           └─13705 /usr/bin/ganesha.nfsd

Apr 06 07:21:55 dhcp37-180.lab.eng.blr.redhat.com nfs-ganesha[13705]: [dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume test exported at : '/'
Apr 06 07:25:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-13] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:25:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-8] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:25:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-9] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:25:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-12] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:25:53 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [work-7] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr 06 07:26:51 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
Apr 06 07:26:52 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
Apr 06 07:26:53 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
Apr 06 07:26:54 dhcp37-180.lab.eng.blr.redhat.com ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat


In /var/log/messages:


Apr  6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-13] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr  6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-8] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr  6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-9] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr  6 07:25:52 dhcp37-141 ganesha.nfsd[13705]: [work-12] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable
Apr  6 07:25:53 dhcp37-141 ganesha.nfsd[13705]: [work-7] nfs3_Errno_verbose :NFS3 :CRIT :Error CACHE_INODE_IO_ERROR in nfs3_write converted to NFS3ERR_IO but was set non-retryable


Apr  6 07:26:51 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
Apr  6 07:26:52 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
Apr  6 07:26:53 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
Apr  6 07:26:54 dhcp37-141 ganesha.nfsd[13705]: [dbus_heartbeat] dbus_heartbeat_cb :DBUS :WARN :Health status is unhealthy.  Not sending heartbeat
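When checking several nodes, it can help to tally which Ganesha log functions fired rather than reading each line. The sketch below is a minimal illustration; the regular expression is an assumption fitted to the log lines quoted in this report (`[thread] function :COMPONENT :LEVEL :message`), not a documented Ganesha log grammar, and the name `tally` is hypothetical. It matches both the 2015 crash lines and the 2016 syslog lines above.

```python
import re
from collections import Counter

# Matches "[thread] function :COMPONENT :LEVEL :" as it appears in the
# Ganesha log lines quoted in this report (an assumed format).
_GANESHA_RE = re.compile(r"\[([^\]]+)\] (\w+) :(\w+) :(\w+) :")


def tally(lines):
    """Count Ganesha log messages per (function, component, level)."""
    counts = Counter()
    for line in lines:
        m = _GANESHA_RE.search(line)
        if m:
            _thread, func, component, level = m.groups()
            counts[(func, component, level)] += 1
    return counts
```

Run over the /var/log/messages excerpt above, this yields five `nfs3_Errno_verbose` CRIT entries and four `dbus_heartbeat_cb` WARN entries: write errors and an unhealthy heartbeat, but no crash.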

In ganesha-gfapi.log:


The message "W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-test-client-2: remote operation failed [No space left on device]" repeated 57 times between [2016-04-06 01:55:53.061325] and [2016-04-06 01:55:54.203684]
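The log line above does not print each failure; it folds 57 identical "No space left on device" write failures into one "repeated N times" summary. When comparing test runs, it can be convenient to pull the count and time span out of such a line. A minimal sketch, assuming the summary format is exactly as quoted here (the regular expression and the name `parse_repeat_summary` are hypothetical, fitted to this one line):

```python
import re
from datetime import datetime

# Shape of the "repeated N times" summary line quoted in this report.
_REPEAT_RE = re.compile(
    r'The message "(?P<msg>.*)" repeated (?P<count>\d+) times '
    r"between \[(?P<first>[^\]]+)\] and \[(?P<last>[^\]]+)\]"
)
_TS_FMT = "%Y-%m-%d %H:%M:%S.%f"


def parse_repeat_summary(line):
    """Return (message, count, seconds_spanned), or None if the line doesn't match."""
    m = _REPEAT_RE.search(line)
    if m is None:
        return None
    first = datetime.strptime(m.group("first"), _TS_FMT)
    last = datetime.strptime(m.group("last"), _TS_FMT)
    return m.group("msg"), int(m.group("count")), (last - first).total_seconds()
```

For the line above this gives a count of 57 over about 1.14 seconds (01:55:53.061325 to 01:55:54.203684).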


Per discussion with Jiffin, these messages are expected.

Based on the above observations, marking this bug as Verified.

Comment 7 errata-xmlrpc 2016-06-23 05:35:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1288

