Bug 1332199
| Summary: | Self Heal fails on a replica3 volume with 'disk quota exceeded' | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Sweta Anandpara <sanandpa> |
| Component: | quota | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED ERRATA | QA Contact: | Nag Pavan Chilakam <nchilaka> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.1 | CC: | amukherj, asrivast, pkarampu, rgowdapp, rhinduja, rhs-bugs, storage-qa-internal |
| Target Milestone: | --- | Keywords: | Regression, ZStream |
| Target Release: | RHGS 3.1.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.7.9-4 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 1332994 | Environment: | |
| Last Closed: | 2016-06-23 05:20:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1311817, 1332994, 1335283, 1335686 | | |
Description
Sweta Anandpara
2016-05-02 13:30:07 UTC
```
[qe@rhsqe-repo 1332199]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1332199]$ pwd
/home/repo/sosreports/1332199
[qe@rhsqe-repo 1332199]$ ls -l
total 136516
-rwxr-xr-x. 1 qe qe 35482436 May 2 19:00 sosreport-sysreg-prod-20160502173626.tar.xz
-rwxr-xr-x. 1 qe qe 36357356 May 2 19:02 sosreport-sysreg-prod-20160502173627.tar.xz
-rwxr-xr-x. 1 qe qe 37035712 May 2 19:05 sosreport-sysreg-prod-20160502173628.tar.xz
-rwxr-xr-x. 1 qe qe 30910008 May 2 19:04 sosreport-sysreg-prod-20160502173629.tar.xz
```

The package glusterfs-debuginfo is now installed on the setup; that should help with further debugging.

Quota is receiving write requests (issued as part of self-heal) with a pid of 0. For quota to skip its enforcement checks, the pid has to be a negative number. The gdb transcript below shows the zero pid on the incoming request; a sketch of how to reproduce such a session is given after the closing comment.

```
(gdb) p req->pid
$3 = 0
(gdb) p *req
$4 = {trans = 0x7f781c00ebc0, svc = 0x7f782402f840, prog = 0x7f7824031200, xid = 7048,
  prognum = 1298437, progver = 330, procnum = 13, type = 0, uid = 0, gid = 0, pid = 0,
  lk_owner = {len = 8, data = "ĝ\256\324%\177", '\000' <repeats 1017 times>}, gfs_id = 0,
  auxgids = 0x7f78280370ec, auxgidsmall = {0 <repeats 128 times>}, auxgidlarge = 0x0,
  auxgidcount = 0, msg = {{iov_base = 0x7f7837473a44, iov_len = 44},
  {iov_base = 0x7f7837493d00, iov_len = 35}, {iov_base = 0x0, iov_len = 0} <repeats 14 times>},
  count = 2, iobref = 0x7f781c0185d0, rpc_status = 0, rpc_err = 0, auth_err = 0,
  txlist = {next = 0x7f782803741c, prev = 0x7f782803741c}, payloadsize = 0,
  cred = {flavour = 390039, datalen = 28, authdata = '\000' <repeats 19 times>,
  "\bĝ\256\324%\177", '\000' <repeats 373 times>},
  verf = {flavour = 0, datalen = 0, authdata = '\000' <repeats 399 times>},
  synctask = _gf_false, private = 0x0, trans_private = 0x0, hdr_iobuf = 0x0, reply = 0x0}
(gdb) c
Continuing.
[Thread 0x7f7821a02700 (LWP 19940) exited]

Breakpoint 1, quota_writev (frame=0x7f7834b9d3e8, this=0x7f7824019dc0, fd=0x7f78240cdebc,
    vector=0x7f781c018c38, count=1, off=0, flags=0, iobref=0x7f781c0185d0, xdata=0x0)
    at quota.c:1810
1810    {
(gdb) p frame->root->pid
$5 = 0
```

After debugging this issue, we found that the multi-threaded self-heal feature introduced this regression. Please mark this as a blocker.

Upstream patch http://review.gluster.org/14211 posted for review.

QATP:
===
(all tests run with x3 volumes)
TC#1: ran the case mentioned while raising the bug ==> passed
TC#2: failed ==> raised bug 1341190, "conservative merge happening on a x3 volume for a deleted file"
TC#3: same as TC#1, but checked against a data-size limit instead of an inode limit ==> passed

Since TC#1 passed, moving to VERIFIED. Retried TC#1 and TC#3 with multi-threaded self-heal set to 16 threads (cluster.shd-max-threads: 16) ==> both passed; a sketch of the corresponding CLI steps is given after the closing comment.

Tested version: 3.7.9-6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
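For anyone retracing the analysis above: a minimal sketch of how such a gdb session can be reproduced, assuming glusterfs-debuginfo is installed on the brick node. The process-selection command is illustrative, and the reserved internal-client pid named in the comments (GF_CLIENT_PID_SELF_HEALD) is the negative pid the self-heal frames are expected to carry.

```sh
# Attach to a brick process of the affected volume (selection here is illustrative;
# quota enforcement runs in the brick-side translator stack, so glusterfsd is the target).
gdb -p "$(pgrep -f glusterfsd | head -n1)"

# Inside gdb, break on the quota translator's writev fop:
#   (gdb) break quota_writev
#   (gdb) continue
# Then trigger a self-heal (for example, bring a downed brick back up) and,
# when the breakpoint fires:
#   (gdb) print frame->root->pid
# With this bug, the pid prints as 0 for heal traffic, whereas internal clients
# are expected to carry a reserved negative pid (such as GF_CLIENT_PID_SELF_HEALD),
# which is what lets quota skip enforcement for self-heal writes.
```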
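A sketch of the verification flow implied by the QATP, assuming a replica-3 volume named testvol; the directory path and limit values are illustrative, while the volume option and quota subcommands are standard gluster CLI.

```sh
# Multi-threaded self-heal daemon, as exercised in the retest (QATP above).
gluster volume set testvol cluster.shd-max-threads 16

# Quota setup: TC#1 exercised an inode-count limit, TC#3 a data-size limit.
gluster volume quota testvol enable
gluster volume quota testvol limit-objects /dir 100    # inode limit (TC#1)
gluster volume quota testvol limit-usage /dir 10GB     # size limit (TC#3)

# After a brick goes down and comes back, heal should complete without
# 'disk quota exceeded' errors in the shd or brick logs.
gluster volume heal testvol info
```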