Description of problem:
------------------------
Had a 2x3 volume (on a 4-node cluster) with limit-objects and limit-usage set. Killed one of the bricks on node3, created 8-10 files, rebooted node4, and ran 'gluster v start <volname> force' to restart the brick on node3. Expected the files to be healed, but 'gluster v heal <volname> info' continued to show the unhealed files. The glustershd log on the node from which the heal is attempted shows 'Disk quota exceeded' warnings.

The setup is still in the same state in case it has to be looked at. The IP is 10.70.46.231 (with the password that has already been shared).

Sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.7.9-2.el6rhs.x86_64

How reproducible:
-----------------
1:1

Steps to Reproduce:
-------------------
1. Have a 4-node cluster, and create a 2x3 volume 'dist-rep3' (with bricks on node1, node3 and node4)
2. Set quota on a directory path (say /dir1), and limit the object count, say, to 5
3. Create a few files (say 8) in the root directory, and 3 files under dir1
4. Kill one of the bricks (say, on node3) and copy the 8 files from '/' to '/dir1/'. Verify that only two files get copied before the copy fails with 'Disk quota exceeded'
5. Create another directory 'dir2' and copy all 8 files from '/' to '/dir2/'
6. Reboot node4 and wait for it to come up
7. Restart the killed brick process with 'gluster v start <volname> force'
8. Verify that healing completes successfully and all the newly created files are seen on the brick that was killed

Actual results:
---------------
Only 3 files are seen on the brick that was killed.

Expected results:
-----------------
The data should be consistent across all the replica bricks after a successful heal.

Additional info:
----------------

[root@dhcp47-116 ~]# gluster v list
dist
dist-rep2
dist-rep3
rep2

[root@dhcp47-116 ~]# gluster v info dist-rep3

Volume Name: dist-rep3
Type: Distributed-Replicate
Volume ID: 8f152c8b-9fba-4cc2-9e07-a6dd1ee02c94
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.116:/brick/brick2/dist-rep3
Brick2: 10.70.47.131:/brick/brick3/dist-rep3
Brick3: 10.70.46.231:/brick/brick3/dist-rep3
Brick4: 10.70.47.116:/brick/brick1/dist-rep3
Brick5: 10.70.47.131:/brick/brick4/dist-rep3
Brick6: 10.70.46.231:/brick/brick4/dist-rep3
Options Reconfigured:
performance.readdir-ahead: on
cluster.server-quorum-type: server
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.soft-timeout: 0
cluster.self-heal-daemon: enable

[root@dhcp47-116 ~]# gluster v status dist-rep3
Status of volume: dist-rep3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.116:/brick/brick2/dist-rep3  49152     0          Y       860
Brick 10.70.47.131:/brick/brick3/dist-rep3  49154     0          Y       14348
Brick 10.70.46.231:/brick/brick3/dist-rep3  49154     0          Y       29845
Brick 10.70.47.116:/brick/brick1/dist-rep3  49153     0          Y       29637
Brick 10.70.47.131:/brick/brick4/dist-rep3  49155     0          Y       14367
Brick 10.70.46.231:/brick/brick4/dist-rep3  49155     0          Y       29864
NFS Server on localhost                     2049      0          Y       8255
Self-heal Daemon on localhost               N/A       N/A        Y       1965
Quota Daemon on localhost                   N/A       N/A        Y       1973
NFS Server on 10.70.47.134                  2049      0          Y       20278
Self-heal Daemon on 10.70.47.134            N/A       N/A        Y       28415
Quota Daemon on 10.70.47.134                N/A       N/A        Y       28423
NFS Server on 10.70.47.131                  2049      0          Y       21982
Self-heal Daemon on 10.70.47.131            N/A       N/A        Y       15186
Quota Daemon on 10.70.47.131                N/A       N/A        Y       15194
NFS Server on 10.70.46.231                  2049      0          Y       31382
Self-heal Daemon on 10.70.46.231            N/A       N/A        Y       27058
Quota Daemon on 10.70.46.231                N/A       N/A        Y       27066

Task Status of Volume dist-rep3
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp47-116 ~]# kill -9 29864
-bash: kill: (29864) - No such process

==================================================== NODE3 =====================================================

[root@dhcp46-231 ~]# gluster pool list
UUID					Hostname				State
b085a38a-cb9d-4022-aad7-9ad654ea310b	dhcp47-116.lab.eng.blr.redhat.com	Connected
fbc2256b-de25-49b2-a46a-b8d3c821b558	10.70.47.134				Connected
27399a0b-06fa-4e3e-b270-9fc0884d126c	10.70.47.131				Connected
e5cd7626-c7fa-4afe-a0d9-db38bc9b506e	localhost				Connected

[root@dhcp46-231 ~]# kill -9 29864

[root@dhcp46-231 ~]# gluster v status dist-rep3
Status of volume: dist-rep3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.116:/brick/brick2/dist-rep3  49152     0          Y       860
Brick 10.70.47.131:/brick/brick3/dist-rep3  49154     0          Y       14348
Brick 10.70.46.231:/brick/brick3/dist-rep3  49154     0          Y       29845
Brick 10.70.47.116:/brick/brick1/dist-rep3  49153     0          Y       29637
Brick 10.70.47.131:/brick/brick4/dist-rep3  49155     0          Y       14367
Brick 10.70.46.231:/brick/brick4/dist-rep3  N/A       N/A        N       N/A
NFS Server on localhost                     2049      0          Y       31382
Self-heal Daemon on localhost               N/A       N/A        Y       27058
Quota Daemon on localhost                   N/A       N/A        Y       27066
NFS Server on 10.70.47.134                  2049      0          Y       20278
Self-heal Daemon on 10.70.47.134            N/A       N/A        Y       28415
Quota Daemon on 10.70.47.134                N/A       N/A        Y       28423
NFS Server on 10.70.47.131                  2049      0          Y       21982
Self-heal Daemon on 10.70.47.131            N/A       N/A        Y       15186
Quota Daemon on 10.70.47.131                N/A       N/A        Y       15194
NFS Server on dhcp47-116.lab.eng.blr.redhat.com        2049  0    Y  8255
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com  N/A   N/A  Y  1965
Quota Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A   N/A  Y  1973

Task Status of Volume dist-rep3
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp46-231 ~]# gluster v start dist-rep3 force
volume start: dist-rep3: success

[root@dhcp46-231 ~]# gluster v status dist-rep3
Status of volume: dist-rep3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.116:/brick/brick2/dist-rep3  49152     0          Y       860
Brick 10.70.47.131:/brick/brick3/dist-rep3  49154     0          Y       1833
Brick 10.70.46.231:/brick/brick3/dist-rep3  49154     0          Y       29845
Brick 10.70.47.116:/brick/brick1/dist-rep3  49153     0          Y       29637
Brick 10.70.47.131:/brick/brick4/dist-rep3  49155     0          Y       1838
Brick 10.70.46.231:/brick/brick4/dist-rep3  49155     0          Y       13418
NFS Server on localhost                     2049      0          Y       13438
Self-heal Daemon on localhost               N/A       N/A        Y       13446
Quota Daemon on localhost                   N/A       N/A        Y       13454
NFS Server on 10.70.47.134                  2049      0          Y       32524
Self-heal Daemon on 10.70.47.134            N/A       N/A        Y       32532
Quota Daemon on 10.70.47.134                N/A       N/A        Y       32540
NFS Server on 10.70.47.131                  2049      0          Y       5005
Self-heal Daemon on 10.70.47.131            N/A       N/A        Y       5013
Quota Daemon on 10.70.47.131                N/A       N/A        Y       5021
NFS Server on dhcp47-116.lab.eng.blr.redhat.com        2049  0    Y  4408
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com  N/A   N/A  Y  4416
Quota Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A   N/A  Y  4424

Task Status of Volume dist-rep3
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp46-231 ~]# gluster v heal dist-rep3 info
Brick 10.70.47.116:/brick/brick2/dist-rep3
Number of entries: 0

Brick 10.70.47.131:/brick/brick3/dist-rep3
Number of entries: 0

Brick 10.70.46.231:/brick/brick3/dist-rep3
Number of entries: 0

Brick 10.70.47.116:/brick/brick1/dist-rep3
/dir1/file3
/dir1/file4
/dir1/file7
Number of entries: 3

Brick 10.70.47.131:/brick/brick4/dist-rep3
/dir1/file3
/dir1/file4
/dir1/file7
Number of entries: 3

Brick 10.70.46.231:/brick/brick4/dist-rep3
Number of entries: 0

[root@dhcp46-231 ~]# gluster v heal
Usage: volume heal <VOLNAME> [enable | disable | full
       |statistics [heal-count [replica <HOSTNAME:BRICKNAME>]]
       |info [healed | heal-failed | split-brain]
       |split-brain {bigger-file <FILE> | latest-mtime <FILE>
       |source-brick <HOSTNAME:BRICKNAME> [<FILE>]}]

[root@dhcp46-231 ~]# gluster v heal dist-rep3 info healed
Gathering list of healed entries on volume dist-rep3 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
[root@dhcp46-231 ~]# gluster v heal dist-rep3 info
Brick 10.70.47.116:/brick/brick2/dist-rep3
Number of entries: 0

Brick 10.70.47.131:/brick/brick3/dist-rep3
Number of entries: 0

Brick 10.70.46.231:/brick/brick3/dist-rep3
Number of entries: 0

Brick 10.70.47.116:/brick/brick1/dist-rep3
/dir1/file3
/dir1/file4
/dir1/file7
Number of entries: 3

Brick 10.70.47.131:/brick/brick4/dist-rep3
/dir1/file3
/dir1/file4
/dir1/file7
Number of entries: 3

Brick 10.70.46.231:/brick/brick4/dist-rep3
Number of entries: 0

[root@dhcp46-231 ~]# gluster v status dist-rep3^C
[root@dhcp46-231 ~]# gluster v status dist-rep3
Status of volume: dist-rep3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.116:/brick/brick2/dist-rep3  49152     0          Y       860
Brick 10.70.47.131:/brick/brick3/dist-rep3  49154     0          Y       1833
Brick 10.70.46.231:/brick/brick3/dist-rep3  49154     0          Y       29845
Brick 10.70.47.116:/brick/brick1/dist-rep3  49153     0          Y       29637
Brick 10.70.47.131:/brick/brick4/dist-rep3  49155     0          Y       1838
Brick 10.70.46.231:/brick/brick4/dist-rep3  49155     0          Y       13418
NFS Server on localhost                     2049      0          Y       13438
Self-heal Daemon on localhost               N/A       N/A        Y       13446
Quota Daemon on localhost                   N/A       N/A        Y       13454
NFS Server on 10.70.47.134                  2049      0          Y       32524
Self-heal Daemon on 10.70.47.134            N/A       N/A        Y       32532
Quota Daemon on 10.70.47.134                N/A       N/A        Y       32540
NFS Server on dhcp47-116.lab.eng.blr.redhat.com        2049  0    Y  4408
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com  N/A   N/A  Y  4416
Quota Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A   N/A  Y  4424
NFS Server on 10.70.47.131                  2049      0          Y       5005
Self-heal Daemon on 10.70.47.131            N/A       N/A        Y       5013
Quota Daemon on 10.70.47.131                N/A       N/A        Y       5021

Task Status of Volume dist-rep3
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp46-231 ~]# gluster v heal dist-rep3 info healed
Gathering list of healed entries on volume dist-rep3 has been unsuccessful on bricks that are down. Please check if all brick processes are running.

[root@dhcp46-231 ~]# gluster v heal
Usage: volume heal <VOLNAME> [enable | disable | full
       |statistics [heal-count [replica <HOSTNAME:BRICKNAME>]]
       |info [healed | heal-failed | split-brain]
       |split-brain {bigger-file <FILE> | latest-mtime <FILE>
       |source-brick <HOSTNAME:BRICKNAME> [<FILE>]}]

[root@dhcp46-231 ~]# gluster v heal dist-rep3 info heal-failed
Gathering list of heal failed entries on volume dist-rep3 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
[root@dhcp46-231 ~]# gluster v heal dist-rep3 info split-brain
Brick 10.70.47.116:/brick/brick2/dist-rep3
Number of entries in split-brain: 0

Brick 10.70.47.131:/brick/brick3/dist-rep3
Number of entries in split-brain: 0

Brick 10.70.46.231:/brick/brick3/dist-rep3
Number of entries in split-brain: 0

Brick 10.70.47.116:/brick/brick1/dist-rep3
Number of entries in split-brain: 0

Brick 10.70.47.131:/brick/brick4/dist-rep3
Number of entries in split-brain: 0

Brick 10.70.46.231:/brick/brick4/dist-rep3
Number of entries in split-brain: 0

[root@dhcp46-231 ~]# gluster v heal dist-rep3 info
Brick 10.70.47.116:/brick/brick2/dist-rep3
Number of entries: 0

Brick 10.70.47.131:/brick/brick3/dist-rep3
Number of entries: 0

Brick 10.70.46.231:/brick/brick3/dist-rep3
Number of entries: 0

Brick 10.70.47.116:/brick/brick1/dist-rep3
/dir1/file3
/dir1/file4
/dir1/file7
Number of entries: 3

Brick 10.70.47.131:/brick/brick4/dist-rep3
/dir1/file3
/dir1/file4
/dir1/file7
Number of entries: 3

Brick 10.70.46.231:/brick/brick4/dist-rep3
Number of entries: 0

[root@dhcp46-231 ~]# gluster v status dist-rep3
Status of volume: dist-rep3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.116:/brick/brick2/dist-rep3  49152     0          Y       860
Brick 10.70.47.131:/brick/brick3/dist-rep3  49154     0          Y       1833
Brick 10.70.46.231:/brick/brick3/dist-rep3  49154     0          Y       29845
Brick 10.70.47.116:/brick/brick1/dist-rep3  49153     0          Y       29637
Brick 10.70.47.131:/brick/brick4/dist-rep3  49155     0          Y       1838
Brick 10.70.46.231:/brick/brick4/dist-rep3  49155     0          Y       13418
NFS Server on localhost                     2049      0          Y       13438
Self-heal Daemon on localhost               N/A       N/A        Y       13446
Quota Daemon on localhost                   N/A       N/A        Y       13454
NFS Server on 10.70.47.134                  2049      0          Y       32524
Self-heal Daemon on 10.70.47.134            N/A       N/A        Y       32532
Quota Daemon on 10.70.47.134                N/A       N/A        Y       32540
NFS Server on 10.70.47.131                  2049      0          Y       5005
Self-heal Daemon on 10.70.47.131            N/A       N/A        Y       5013
Quota Daemon on 10.70.47.131                N/A       N/A        Y       5021
NFS Server on dhcp47-116.lab.eng.blr.redhat.com        2049  0    Y  4408
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com  N/A   N/A  Y  4416
Quota Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A   N/A  Y  4424

Task Status of Volume dist-rep3
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp46-231 ~]# cd /brick/brick4/dist-rep3/
dir1/    file8
dir2/    .glusterfs/
file1    The Shawshank Redemption2.avi
file2    The Shawshank Redemption.avi
file5    .trashcan/
file6
[root@dhcp46-231 ~]# cd /brick/brick4/dist-rep3/^C

[root@dhcp46-231 ~]# vim /var/log/glusterfs/glustershd.log

[root@dhcp46-231 ~]# gluster v quota dist-rep3 list-objects
                  Path                   Hard-limit   Soft-limit      Files       Dirs     Available  Soft-limit exceeded? Hard-limit exceeded?
-----------------------------------------------------------------------------------------------------------------------------------------------
/dir1                                          5       80%(4)            5          1          0             Yes                  Yes

[root@dhcp46-231 dir1]# gluster v quota dist-rep3 list
                  Path                   Hard-limit  Soft-limit      Used  Available  Soft-limit exceeded? Hard-limit exceeded?
-------------------------------------------------------------------------------------------------------------------------------
/                                          4.0GB     80%(3.2GB)    2.4GB      1.6GB              No                   No

[root@dhcp46-231 dir1]# gluster v quota dist-rep3 list-objects
                  Path                   Hard-limit   Soft-limit      Files       Dirs     Available  Soft-limit exceeded? Hard-limit exceeded?
-----------------------------------------------------------------------------------------------------------------------------------------------
/dir1                                          5       80%(4)            5          1          0             Yes                  Yes

[root@dhcp46-231 dir1]# rpm -qa | grep gluster
glusterfs-client-xlators-3.7.9-2.el6rhs.x86_64
glusterfs-cli-3.7.9-2.el6rhs.x86_64
glusterfs-libs-3.7.9-2.el6rhs.x86_64
glusterfs-3.7.9-2.el6rhs.x86_64
glusterfs-fuse-3.7.9-2.el6rhs.x86_64
glusterfs-server-3.7.9-2.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
glusterfs-api-3.7.9-2.el6rhs.x86_64
gluster-nagios-addons-0.2.6-1.el6rhs.x86_64

======================================= NODE4 =======================================

Glustershd logs on node4:

The message "W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]" repeated 2 times between [2016-05-02 12:02:47.036004] and [2016-05-02 12:02:47.086110]
[2016-05-02 12:09:50.031061] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]
The message "W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]" repeated 3 times between [2016-05-02 12:09:50.031061] and [2016-05-02 12:10:08.546343]
[2016-05-02 12:12:09.121456] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]
[2016-05-02 12:13:29.149764] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-dist-rep3-client-5: server 10.70.46.231:49155 has not responded in the last 42 seconds, disconnecting.
[2016-05-02 12:13:29.156385] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f5871551902] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f587131c497] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f587131c5ae] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88)[0x7f587131c658] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1c2)[0x7f587131c852] ))))) 0-dist-rep3-client-5: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2016-05-02 12:12:39.201038 (xid=0x245)
[2016-05-02 12:13:29.156460] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Transport endpoint is not connected]
[2016-05-02 12:13:29.157254] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f5871551902] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f587131c497] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f587131c5ae] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x88)[0x7f587131c658] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1c2)[0x7f587131c852] ))))) 0-dist-rep3-client-5: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2016-05-02 12:12:47.139516 (xid=0x246)
[2016-05-02 12:13:29.168806] I [socket.c:3309:socket_submit_request] 0-dist-rep3-client-5: not connected (priv->connected = 0)
[2016-05-02 12:13:29.169101] W [rpc-clnt.c:1586:rpc_clnt_submit] 0-dist-rep3-client-5: failed to submit rpc-request (XID: 0x247 Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (dist-rep3-client-5)
[2016-05-02 12:13:29.169125] W [rpc-clnt-ping.c:208:rpc_clnt_ping_cbk] 0-dist-rep3-client-5: socket disconnected
[2016-05-02 12:13:29.169151] I [MSGID: 114018] [client.c:2030:client_rpc_notify] 0-dist-rep3-client-5: disconnected from dist-rep3-client-5. Client process will keep trying to connect to glusterd until brick's port is available
[2016-05-02 12:13:29.169199] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-dist-rep3-client-5: remote operation failed [Transport endpoint is not connected]
[2016-05-02 12:13:29.174262] I [rpc-clnt.c:1847:rpc_clnt_reconfig] 0-dist-rep3-client-5: changing port to 49155 (from 0)
[2016-05-02 12:14:12.196537] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-dist-rep3-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2016-05-02 12:14:12.209019] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-dist-rep3-client-5: Connected to dist-rep3-client-5, attached to remote volume '/brick/brick4/dist-rep3'.
[2016-05-02 12:14:12.209079] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-dist-rep3-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2016-05-02 12:14:12.209115] I [MSGID: 114042] [client-handshake.c:1056:client_post_handshake] 0-dist-rep3-client-5: 1 fds open - Delaying child_up until they are re-opened
[2016-05-02 12:14:12.213599] I [MSGID: 114041] [client-handshake.c:678:client_child_up_reopen_done] 0-dist-rep3-client-5: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2016-05-02 12:14:12.215991] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-dist-rep3-client-5: Server lk version = 1
[2016-05-02 12:14:12.266042] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]
[2016-05-02 12:13:29.169518] E [MSGID: 114031] [client-rpc-fops.c:1624:client3_3_inodelk_cbk] 0-dist-rep3-client-5: remote operation failed [Transport endpoint is not connected]
[2016-05-02 12:24:13.030693] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]
The message "W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]" repeated 2 times between [2016-05-02 12:24:13.030693] and [2016-05-02 12:24:13.081613]
[2016-05-02 12:34:14.044392] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]
The message "W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]" repeated 2 times between [2016-05-02 12:34:14.044392] and [2016-05-02 12:34:14.089210]
[2016-05-02 12:44:15.033554] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-dist-rep3-client-5: remote operation failed [Disk quota exceeded]
[qe@rhsqe-repo 1332199]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com

[qe@rhsqe-repo 1332199]$ pwd
/home/repo/sosreports/1332199

[qe@rhsqe-repo 1332199]$ ls -l
total 136516
-rwxr-xr-x. 1 qe qe 35482436 May  2 19:00 sosreport-sysreg-prod-20160502173626.tar.xz
-rwxr-xr-x. 1 qe qe 36357356 May  2 19:02 sosreport-sysreg-prod-20160502173627.tar.xz
-rwxr-xr-x. 1 qe qe 37035712 May  2 19:05 sosreport-sysreg-prod-20160502173628.tar.xz
-rwxr-xr-x. 1 qe qe 30910008 May  2 19:04 sosreport-sysreg-prod-20160502173629.tar.xz
The package glusterfs-debuginfo is installed on the setup now. That should help in further debugging.
Quota is receiving write requests (issued as part of self-heal) with a pid of 0. For quota to skip its enforcement checks, the pid has to be a negative number.

(gdb) p req->pid
$3 = 0
(gdb) p *req
$4 = {trans = 0x7f781c00ebc0, svc = 0x7f782402f840, prog = 0x7f7824031200, xid = 7048, prognum = 1298437, progver = 330, procnum = 13, type = 0, uid = 0, gid = 0, pid = 0, lk_owner = {len = 8, data = "ĝ\256\324%\177", '\000' <repeats 1017 times>}, gfs_id = 0, auxgids = 0x7f78280370ec, auxgidsmall = {0 <repeats 128 times>}, auxgidlarge = 0x0, auxgidcount = 0, msg = {{iov_base = 0x7f7837473a44, iov_len = 44}, {iov_base = 0x7f7837493d00, iov_len = 35}, {iov_base = 0x0, iov_len = 0} <repeats 14 times>}, count = 2, iobref = 0x7f781c0185d0, rpc_status = 0, rpc_err = 0, auth_err = 0, txlist = {next = 0x7f782803741c, prev = 0x7f782803741c}, payloadsize = 0, cred = {flavour = 390039, datalen = 28, authdata = '\000' <repeats 19 times>, "\bĝ\256\324%\177", '\000' <repeats 373 times>}, verf = {flavour = 0, datalen = 0, authdata = '\000' <repeats 399 times>}, synctask = _gf_false, private = 0x0, trans_private = 0x0, hdr_iobuf = 0x0, reply = 0x0}
(gdb) c
Continuing.
[Thread 0x7f7821a02700 (LWP 19940) exited]

Breakpoint 1, quota_writev (frame=0x7f7834b9d3e8, this=0x7f7824019dc0, fd=0x7f78240cdebc, vector=0x7f781c018c38, count=1, off=0, flags=0, iobref=0x7f781c0185d0, xdata=0x0) at quota.c:1810
1810    {
(gdb) p frame->root->pid
$5 = 0
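To make the failure mode concrete, below is a minimal, self-contained C model of the pid-based bypass described above. It is an illustration only, not the actual quota xlator code: quota_check_write() and INTERNAL_CLIENT_PID are hypothetical stand-ins for the enforcement path in quota.c and for the reserved negative pids GlusterFS assigns to internal clients.

/* Toy model of the pid-based quota bypass (not GlusterFS source).
 * The real check operates on frame->root->pid inside the quota
 * xlator; a pid of 0, as seen in the gdb session above, is not
 * negative and therefore falls into full quota enforcement. */
#include <stdio.h>
#include <errno.h>

/* Hypothetical stand-in for the reserved negative pid an internal
 * client such as the self-heal daemon stamps on its call frames. */
#define INTERNAL_CLIENT_PID (-6)

/* Returns 0 if the write may proceed, -EDQUOT if quota rejects it. */
static int
quota_check_write (int frame_pid, long used_objects, long hard_limit)
{
        if (frame_pid < 0)
                return 0;            /* internal client: skip enforcement */

        if (used_objects >= hard_limit)
                return -EDQUOT;      /* normal client over the hard limit */

        return 0;
}

int
main (void)
{
        /* /dir1 in this bug: 5 files against an object hard limit of 5. */
        long used = 5, limit = 5;

        /* Self-heal frames arrived with pid == 0, so they were treated
         * as normal client writes and rejected: */
        printf ("pid = 0  -> %d  (heal write rejected, EDQUOT)\n",
                quota_check_write (0, used, limit));

        /* With a proper negative internal pid the same write passes: */
        printf ("pid = %d -> %d  (heal write allowed)\n",
                INTERNAL_CLIENT_PID,
                quota_check_write (INTERNAL_CLIENT_PID, used, limit));

        return 0;
}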
After debugging this issue, we found that the multi-threaded self-heal feature introduced this regression. Please mark this as a blocker.
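For illustration, a minimal sketch of the suspected mechanism, assuming the multi-threaded self-heal paths synthesize fresh call frames without stamping the daemon's negative client pid on them. The struct and function names below are toy stand-ins (not GlusterFS's call_frame_t/call_stack_t), and SELF_HEAL_PID is an assumed value:

/* Toy model of the regression: a frame created for a heal worker must
 * inherit the daemon's negative client pid; a zero-initialized frame
 * reproduces the pid == 0 that quota saw in the gdb session above. */
#include <stdio.h>
#include <string.h>

#define SELF_HEAL_PID (-6)   /* assumed value for illustration */

struct call_root  { int pid; };
struct call_frame { struct call_root *root; };

/* Buggy path: the new frame's root is zeroed, so pid defaults to 0
 * and quota treats the heal write as a normal client write. */
static void
new_frame_buggy (struct call_frame *f, struct call_root *r)
{
        memset (r, 0, sizeof (*r));
        f->root = r;
}

/* Fixed path: the daemon's identity is stamped on the new frame, so
 * downstream xlators such as quota can recognize and exempt it. */
static void
new_frame_fixed (struct call_frame *f, struct call_root *r)
{
        memset (r, 0, sizeof (*r));
        r->pid  = SELF_HEAL_PID;
        f->root = r;
}

int
main (void)
{
        struct call_root  r1, r2;
        struct call_frame f1, f2;

        new_frame_buggy (&f1, &r1);
        new_frame_fixed (&f2, &r2);

        printf ("buggy frame pid = %d  (quota enforces limits)\n", f1.root->pid);
        printf ("fixed frame pid = %d  (quota skips enforcement)\n", f2.root->pid);
        return 0;
}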
The upstream patch http://review.gluster.org/14211 has been posted for review.
QATP:
=====
(All TCs were run with an x3 volume.)

TC#1: Ran the case described when raising this bug ==> passed
TC#2: Failed ==> raised bug 1341190 - conservative merge happening on a x3 volume for a deleted file
TC#3: Same as TC#1, but with a data-size limit (limit-usage) instead of an inode limit ==> passed

Since TC#1 passed, moving to VERIFIED. Retried TC#1 and TC#3 with multi-threaded self-heal set to 16 (cluster.shd-max-threads: 16) ==> both passed.

Tested version: 3.7.9-6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240