Bug 853685

Summary: All clients are marked fools due to "No space left on device"
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vidya Sakar <vinaraya>
Component: glusterfs
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WONTFIX
QA Contact: spandura
Severity: medium
Priority: medium
Docs Contact:
Version: 2.0
CC: gluster-bugs, jdarcy, nsathyan, pkarampu, rhs-bugs, rwheeler, sdharane, vagarwal, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0qa8
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 800803
Environment:
Last Closed: 2015-03-23 07:38:15 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 800803
Bug Blocks:
Attachments: SOS Reports (flags: none)

Description Vidya Sakar 2012-09-02 06:40:44 UTC
+++ This bug was initially created as a clone of Bug #800803 +++

Description of problem:
If the size of the files created on the volume exceeds the space available on the volume, all the clients are marked as fools.

Version-Release number of selected component (if applicable):
mainline

How reproducible:


Steps to Reproduce:
1. Create a distributed-replicate volume and start it (each brick has 50G space available); see the setup sketch after this list.
2. Create glusterfs and NFS mounts from client1.
3. Perform "dd if=/dev/zero of=gfsf1 bs=1M count=102400" from mount1.
4. Perform "dd if=/dev/zero of=nfsf1 bs=1M count=102400" from mount2.
5. Perform "dd if=/dev/urandom of=gfsf2 bs=1M count=102400" from mount3.
6. Perform "dd if=/dev/urandom of=nfsf2 bs=1M count=102400" from mount4.
7. The total size of the files created should exceed the space available on the volume.
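
One way to set this up for steps 1-2 (a sketch only; the brick paths, volume name, and log-level options are taken from the volume info under "Actual results" below, while the /mnt/mount1-4 mount points are placeholders):

  # 2 x 2 distributed-replicate volume; bricks 1+2 and 3+4 form the replica pairs
  gluster volume create datastore replica 2 \
      192.168.2.35:/export1 192.168.2.36:/export1 \
      192.168.2.35:/export2 192.168.2.36:/export2
  gluster volume set datastore diagnostics.brick-log-level DEBUG
  gluster volume set datastore diagnostics.client-log-level DEBUG
  gluster volume start datastore

  # placeholder mount points: mount1/mount3 via the fuse client, mount2/mount4 via gluster NFS (v3)
  mkdir -p /mnt/mount{1..4}
  mount -t glusterfs 192.168.2.35:/datastore /mnt/mount1
  mount -t nfs -o vers=3 192.168.2.35:/datastore /mnt/mount2
  mount -t glusterfs 192.168.2.36:/datastore /mnt/mount3
  mount -t nfs -o vers=3 192.168.2.36:/datastore /mnt/mount4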

Actual results:
gluster volume info
 
Volume Name: datastore
Type: Distributed-Replicate
Volume ID: bc4bb820-400f-493e-bef7-ed09b87c8c91
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 192.168.2.35:/export1
Brick2: 192.168.2.36:/export1
Brick3: 192.168.2.35:/export2
Brick4: 192.168.2.36:/export2
Options Reconfigured:
diagnostics.brick-log-level: DEBUG
diagnostics.client-log-level: DEBUG

Brick1:-
---------
[03/07/12 - 20:33:10 root@APP-SERVER1 ~]# getfattr -R -m . -d -e hex /export1/*
getfattr: Removing leading '/' from absolute path names
# file: export1/nfsf1
trusted.afr.datastore-client-0=0x000001d10000000000000000
trusted.afr.datastore-client-1=0x000001d10000000000000000
trusted.gfid=0x4505225ade9d470290588082a5260ccb

Brick2:-
--------
[03/07/12 - 20:22:43 root@APP-SERVER2 glusterfs]# getfattr -m . -d -e hex /export1/*
getfattr: Removing leading '/' from absolute path names
# file: export1/nfsf1
trusted.afr.datastore-client-0=0x000001d10000000000000000
trusted.afr.datastore-client-1=0x000001d10000000000000000
trusted.gfid=0x4505225ade9d470290588082a5260ccb

Brick3:-
----------
[03/07/12 - 20:33:07 root@APP-SERVER1 ~]# getfattr -R -m . -d -e hex /export2/*
getfattr: Removing leading '/' from absolute path names
# file: export2/gfsf1
trusted.afr.datastore-client-2=0x0000000b0000000000000000
trusted.afr.datastore-client-3=0x0000000a0000000000000000
trusted.gfid=0x4ddef0724f4346d9b486a4a83ac649c6

# file: export2/gfsf2
trusted.afr.datastore-client-2=0x000000120000000000000000
trusted.afr.datastore-client-3=0x000000100000000000000000
trusted.gfid=0xfd5014406f67407f9abae7d3b97d7206

# file: export2/nfsf2
trusted.afr.datastore-client-2=0x00000d3e0000000000000000
trusted.afr.datastore-client-3=0x00000d3f0000000000000000
trusted.gfid=0x2f2a9de774d947e1830c8777ee4bbadf


Brick4:-
-------
[03/07/12 - 20:34:00 root@APP-SERVER2 glusterfs]# getfattr -m . -d -e hex /export2/*
getfattr: Removing leading '/' from absolute path names
# file: export2/gfsf1
trusted.afr.datastore-client-2=0x0000000b0000000000000000
trusted.afr.datastore-client-3=0x0000000a0000000000000000
trusted.gfid=0x4ddef0724f4346d9b486a4a83ac649c6

# file: export2/gfsf2
trusted.afr.datastore-client-2=0x000000120000000000000000
trusted.afr.datastore-client-3=0x000000100000000000000000
trusted.gfid=0xfd5014406f67407f9abae7d3b97d7206

# file: export2/nfsf2
trusted.afr.datastore-client-2=0x00000d3f0000000000000000
trusted.afr.datastore-client-3=0x00000d400000000000000000
trusted.gfid=0x2f2a9de774d947e1830c8777ee4bbadf


Expected results:

Additional info:
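For reference when reading the getfattr output above: each trusted.afr.<volume>-client-<N> value packs three big-endian 32-bit counters, in order the data, metadata, and entry pending counts that the brick holding the xattr records against brick N. They can be decoded with plain bash arithmetic, e.g. (sketch; value copied from the Brick1 dump above):

  val=000001d10000000000000000                   # trusted.afr.datastore-client-0, without the leading 0x
  echo "data pending:     $((16#${val:0:8}))"    # 0x000001d1 = 465
  echo "metadata pending: $((16#${val:8:8}))"    # 0
  echo "entry pending:    $((16#${val:16:8}))"   # 0

Because every brick in a replica pair ends up with non-zero data counters both for itself and for its partner (for example, Brick3 and Brick4 both carry non-zero client-2 and client-3 counters for gfsf1), AFR has no brick left that it can trust as a self-heal source; this is the "all fools" state referred to in the summary.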

--- Additional comment from shwetha.h.panduranga on 2012-03-08 01:11:51 EST ---

After the clients are marked all-fools, lookups on the files behave inconsistently:

1) cat gfsf2/nfsf2 : Successful
2) ls -l gfsf1 : No such file or directory

[03/08/12 - 11:45:49 root@Shwetha-Laptop nfsc1]# ls -lh
ls: cannot access file10: Invalid argument
ls: cannot access gfsf1: Invalid argument
ls: cannot access gfsf2: Invalid argument
total 46G
-?????????? ? ?    ?       ?            ? file10
-?????????? ? ?    ?       ?            ? gfsf1
-?????????? ? ?    ?       ?            ? gfsf2
-rw-r--r--. 1 root root  41G Mar  8  2012 nfsf1
-rw-r--r--. 1 root root 4.9G Mar  7 23:33 nfsf2

[03/08/12 - 11:46:09 root@Shwetha-Laptop nfsc1]# ls -lh gfsf1
-rw-r--r--. 1 root root 41G Mar  8  2012 gfsf1
[03/08/12 - 11:46:24 root@Shwetha-Laptop nfsc1]# ls -lh
ls: cannot access file10: Invalid argument
ls: cannot access gfsf2: Invalid argument
total 86G
-?????????? ? ?    ?       ?            ? file10
-rw-r--r--. 1 root root  41G Mar  8  2012 gfsf1
-?????????? ? ?    ?       ?            ? gfsf2
-rw-r--r--. 1 root root  41G Mar  8  2012 nfsf1
-rw-r--r--. 1 root root 4.9G Mar  7 23:33 nfsf2

Comment 2 Jeff Darcy 2013-01-21 18:03:19 UTC
Setting NEEDINFO to match upstream.

Comment 5 spandura 2013-08-19 06:20:12 UTC
I am able to recreate this issue on the build:
=============================================
glusterfs 3.4.0.19rhs built on Aug 14 2013 00:11:42

Steps to recreate the issue:-
==========================
1. Create 1 x 2 replicate volume. Start the volume

root@king [Aug-19-2013-10:43:37] >gluster  v info
 
Volume Name: vol_rep
Type: Replicate
Volume ID: 197798f7-8261-4013-8c35-e7d3a433a6e2
Status: Created
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: king:/rhs/bricks/b0
Brick2: hicks:/rhs/bricks/b1
Options Reconfigured:
cluster.eager-lock: on

root@king [Aug-19-2013-10:43:53] >gluster v status
Status of volume: vol_rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick king:/rhs/bricks/b0				49152	Y	24402
Brick hicks:/rhs/bricks/b1				49152	Y	6531
NFS Server on localhost					2049	Y	24415
Self-heal Daemon on localhost				N/A	Y	24420
NFS Server on hicks					2049	Y	6543
Self-heal Daemon on hicks				N/A	Y	6549


2. Create 2 fuse and 2 nfs mounts. 

3. Start dd on files from each mount point:

Mount point command execution output:
+++++++++++++++++++++++++++++++++++++

fuse 1 :
~~~~~~~~~~~
root@darrel [Aug-19-2013-10:46:04] >dd if=/dev/urandom of=./testfile1 bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 2803.42 s, 3.8 MB/s

fuse 2 :
~~~~~~~~~~~
root@darrel [Aug-19-2013-10:46:04] >dd if=/dev/urandom of=./testfile2 bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 2774.8 s, 3.9 MB/s

nfs 1 :
~~~~~~~~~~~
root@darrel [Aug-19-2013-10:46:04] >dd if=/dev/urandom of=./testfile3 bs=1M count=10240
dd: writing `./testfile3': Input/output error
9412+0 records in
9411+0 records out
9868148736 bytes (9.9 GB) copied, 2654.3 s, 3.7 MB/s

root@darrel [Aug-19-2013-11:30:29] >cat testfile3 > /dev/null
cat: testfile3: Input/output error

nfs 2 :
~~~~~~~~~~~
root@darrel [Aug-19-2013-10:46:04] >dd if=/dev/urandom of=./testfile4 bs=1M count=10240
dd: writing `./testfile4': Input/output error
9288+0 records in
9287+0 records out
9738125312 bytes (9.7 GB) copied, 2667.62 s, 3.7 MB/s


Extended attributes of the files on both the bricks:
+++++++++++++++++++++++++++++++++++++++++++++++++++

brick-0:
~~~~~~~~
root@king [Aug-19-2013-11:38:46] >getfattr -d -e hex -m . /rhs/bricks/b0/*
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b0/testfile1
trusted.afr.vol_rep-client-0=0x000000000000000000000000
trusted.afr.vol_rep-client-1=0x000000000000000000000000
trusted.gfid=0xc694b71e5a1548648df9e5a446430ca3

# file: rhs/bricks/b0/testfile2
trusted.afr.vol_rep-client-0=0x000000000000000000000000
trusted.afr.vol_rep-client-1=0x000000000000000000000000
trusted.gfid=0xcc249ca2945e44d19be9357ae2e617fd

# file: rhs/bricks/b0/testfile3
trusted.afr.vol_rep-client-0=0x0000011d0000000000000000
trusted.afr.vol_rep-client-1=0x000001110000000000000000
trusted.gfid=0x6240a081730244939b84fb45ccb088ba

# file: rhs/bricks/b0/testfile4
trusted.afr.vol_rep-client-0=0x000000130000000000000000
trusted.afr.vol_rep-client-1=0x000000130000000000000000
trusted.gfid=0x6bebc7921ee84f75971cebaf5abb0af8

root@king [Aug-19-2013-11:45:57] >ls -lh /rhs/bricks/b0/*
-rw-r--r-- 2 root root  10G Aug 19 11:32 /rhs/bricks/b0/testfile1
-rw-r--r-- 2 root root  10G Aug 19 11:32 /rhs/bricks/b0/testfile2
-rw-r--r-- 2 root root 9.2G Aug 19 11:30 /rhs/bricks/b0/testfile3
-rw-r--r-- 2 root root 9.0G Aug 19 11:30 /rhs/bricks/b0/testfile4
root@king [Aug-19-2013-11:45:59] >

brick-1:
~~~~~~~~
root@hicks [Aug-19-2013-11:38:45] >getfattr -d -e hex -m . /rhs/bricks/b1/*
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/testfile1
trusted.afr.vol_rep-client-0=0x000000000000000000000000
trusted.afr.vol_rep-client-1=0x000000000000000000000000
trusted.gfid=0xc694b71e5a1548648df9e5a446430ca3

# file: rhs/bricks/b1/testfile2
trusted.afr.vol_rep-client-0=0x000000000000000000000000
trusted.afr.vol_rep-client-1=0x000000000000000000000000
trusted.gfid=0xcc249ca2945e44d19be9357ae2e617fd

# file: rhs/bricks/b1/testfile3
trusted.afr.vol_rep-client-0=0x0000011d0000000000000000
trusted.afr.vol_rep-client-1=0x000001110000000000000000
trusted.gfid=0x6240a081730244939b84fb45ccb088ba

# file: rhs/bricks/b1/testfile4
trusted.afr.vol_rep-client-0=0x000000130000000000000000
trusted.afr.vol_rep-client-1=0x000000130000000000000000
trusted.gfid=0x6bebc7921ee84f75971cebaf5abb0af8

root@hicks [Aug-19-2013-11:45:56] >ls -lh /rhs/bricks/b1/*
-rw-r--r-- 2 root root  10G Aug 19 11:32 /rhs/bricks/b1/testfile1
-rw-r--r-- 2 root root  10G Aug 19 11:32 /rhs/bricks/b1/testfile2
-rw-r--r-- 2 root root 9.2G Aug 19 11:30 /rhs/bricks/b1/testfile3
-rw-r--r-- 2 root root 9.0G Aug 19 11:30 /rhs/bricks/b1/testfile4

Comment 6 spandura 2013-08-19 06:51:38 UTC
Created attachment 787910 [details]
SOS Reports

Comment 7 spandura 2013-08-19 07:24:28 UTC
glustershd.log reports the following in each crawl:
===================================================

[2013-08-19 07:13:54.737429] I [afr-self-heal-data.c:817:afr_sh_data_fix] 0-vol_rep-replicate-0: no active sinks for performing self-heal on file <gfid:6bebc792-1ee8-4f75-971c-ebaf5abb0af8>

[2013-08-19 07:13:54.739011] I [afr-self-heal-common.c:2741:afr_log_self_heal_completion_status] 0-vol_rep-replicate-0:  foreground data self heal  is successfully completed,  from vol_rep-client-0 with 9594470400 9594470400  sizes - Pending matrix:  [ [ 19 19 ] [ 19 19 ] ] on <gfid:6bebc792-1ee8-4f75-971c-ebaf5abb0af8>

But the self-heal is not actually successful.

root@king [Aug-19-2013-12:52:55] >getfattr -d -e hex -m . /rhs/bricks/b0/testfile4
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b0/testfile4
trusted.afr.vol_rep-client-0=0x000000130000000000000000
trusted.afr.vol_rep-client-1=0x000000130000000000000000
trusted.gfid=0x6bebc7921ee84f75971cebaf5abb0af8


root@hicks [Aug-19-2013-12:52:54] >getfattr -d -e hex -m . /rhs/bricks/b1/testfile4
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/testfile4
trusted.afr.vol_rep-client-0=0x000000130000000000000000
trusted.afr.vol_rep-client-1=0x000000130000000000000000
trusted.gfid=0x6bebc7921ee84f75971cebaf5abb0af8
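
The data-pending counter is the first 8 hex digits of each changelog value, and 0x13 is 19 decimal, so these unchanged xattrs correspond exactly to the pending matrix [ [ 19 19 ] [ 19 19 ] ] that glustershd logs for this gfid: both bricks still accuse brick 0 and brick 1 for testfile4, so self-heal cannot pick a clean source, which is consistent with the "no active sinks" message even though the completion log claims success. A quick way to dump the decoded counters on either brick (sketch; paths from the output above, strtonum assumes gawk):

  getfattr -d -e hex -m trusted.afr /rhs/bricks/b0/testfile4 2>/dev/null |
    awk -F= '/trusted.afr/ { printf "%s data-pending=%d\n", $1, strtonum("0x" substr($2, 3, 8)) }'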

Comment 8 Vivek Agarwal 2015-03-23 07:38:15 UTC
The product version of Red Hat Storage on which this issue was reported has reached End Of Life (EOL) [1], hence this bug report is being closed. If the issue is still observed on a current version of Red Hat Storage, please file a new bug report on the current version.

[1] https://rhn.redhat.com/errata/RHSA-2014-0821.html
