Description of problem:
DHT: not able to unlink a file (rm -f, unlink) if the cached sub-volume is up and the hashed sub-volume is down.

Version-Release number of selected component (if applicable):
3.4.0.8rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Had a cluster of 4 servers and a volume with bricks on 3 of them.
2. Mount the DHT volume and create a few files and directories.
3. Bring one sub-volume down:

[root@mia ~]# gluster v status test1
Status of volume: test1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick fred.lab.eng.blr.redhat.com:/rhs/brick1/t1        49154   Y       32380
Brick mia.lab.eng.blr.redhat.com:/rhs/brick1/t1         N/A     N       11173
Brick cutlass.lab.eng.blr.redhat.com:/rhs/brick1/t1     49154   Y       8989
NFS Server on localhost                                 2049    Y       11183
NFS Server on c5154da1-be15-40e2-b5f3-9be6dadafd43      2049    Y       8999
NFS Server on a37ff566-da82-4ae4-90c6-17763466fd36      2049    Y       15188
NFS Server on 292b158a-7650-4e09-9bc0-71e392f0d0c1      2049    Y       32390

In our case that sub-volume is the one on mia.

4. Try to delete files from the mount point that are hashed and cached on different sub-volumes:

[root@rhsauto037 test1nfs]# ls
d100  d113  d126  d139  d52  d65  d78  d91  f108  f125  f145  file119  file140  newf68  newf86  newfile56  newfile72  newfile87
d101  d114  d127  d140  d53  d66  d79  d92  f110  f127  f148  file120  file141  newf69  newf87  newfile57  newfile73  newfile88
d102  d115  d128  d141  d54  d67  d80  d93  f111  f128  f150  file121  file142  newf70  newf88  newfile59  newfile74  newfile89
d103  d116  d129  d142  d55  d68  d81  d94  f112  f130  file102  file123  file143  newf71  newf89  newfile60  newfile76  newfile90
d104  d117  d130  d143  d56  d69  d82  d95  f113  f131  file106  file125  file145  newf73  newf90  newfile61  newfile77  newfile92
d105  d118  d131  d144  d57  d70  d83  d96  f114  f132  file108  file126  file146  newf75  newf91  newfile62  newfile78  newfile93
d106  d119  d132  d145  d58  d71  d84  d97  f115  f134  file109  file127  newf51  newf78  newf92  newfile63  newfile79  newfile94
d107  d120  d133  d146  d59  d72  d85  d98  f116  f135  file111  file128  newf54  newf79  newf93  newfile64  newfile80  newfile95
d108  d121  d134  d147  d60  d73  d86  d99  f117  f137  file112  file129  newf56  newf80  newf94  newfile65  newfile81  newfile96
d109  d122  d135  d148  d61  d74  d87  f101  f118  f138  file113  file130  newf59  newf81  newf96  newfile67  newfile82  newfile97
d110  d123  d136  d149  d62  d75  d88  f103  f120  f139  file115  file131  newf63  newf82  newfile51  newfile68  newfile83  newfile98
d111  d124  d137  d150  d63  d76  d89  f106  f121  f143  file117  file133  newf64  newf83  newfile52  newfile69  newfile84  newfile99
d112  d125  d138  d51  d64  d77  d90  f107  f123  f144  file118  file139  newf66  newf85  newfile53  newfile70  newfile85

[root@rhsauto037 test1nfs]# rm -rf newf*
rm: cannot remove `newf70': Input/output error
rm: cannot remove `newf73': Input/output error
rm: cannot remove `newf81': Input/output error
rm: cannot remove `newf86': Input/output error
rm: cannot remove `newf89': Input/output error
rm: cannot remove `newf96': Input/output error
rm: cannot remove `newfile51': Input/output error
rm: cannot remove `newfile59': Input/output error
rm: cannot remove `newfile65': Input/output error
rm: cannot remove `newfile68': Input/output error
rm: cannot remove `newfile69': Input/output error
rm: cannot remove `newfile72': Input/output error
rm: cannot remove `newfile74': Input/output error
rm: cannot remove `newfile76': Input/output error
rm: cannot remove `newfile78': Input/output error
rm: cannot remove `newfile81': Input/output error
rm: cannot remove `newfile84': Input/output error
rm: cannot remove `newfile88': Input/output error
rm: cannot remove `newfile92': Input/output error
rm: cannot remove `newfile93': Input/output error
rm: cannot remove `newfile97': Input/output error
rm: cannot remove `newfile98': Input/output error
rm: cannot remove `newfile99': Input/output error

[root@rhsauto037 test1nfs]# ls
d100  d110  d120  d130  d140  d150  d60  d70  d80  d90  f101  f115  f130  f145  file115  file128  file145  newfile65  newfile92
d101  d111  d121  d131  d141  d51  d61  d71  d81  d91  f103  f116  f131  f148  file117  file129  file146  newfile68  newfile93
d102  d112  d122  d132  d142  d52  d62  d72  d82  d92  f106  f117  f132  f150  file118  file130  newf70  newfile69  newfile97
d103  d113  d123  d133  d143  d53  d63  d73  d83  d93  f107  f118  f134  file102  file119  file131  newf73  newfile72  newfile98
d104  d114  d124  d134  d144  d54  d64  d74  d84  d94  f108  f120  f135  file106  file120  file133  newf81  newfile74  newfile99
d105  d115  d125  d135  d145  d55  d65  d75  d85  d95  f110  f121  f137  file108  file121  file139  newf86  newfile76
d106  d116  d126  d136  d146  d56  d66  d76  d86  d96  f111  f123  f138  file109  file123  file140  newf89  newfile78
d107  d117  d127  d137  d147  d57  d67  d77  d87  d97  f112  f125  f139  file111  file125  file141  newf96  newfile81
d108  d118  d128  d138  d148  d58  d68  d78  d88  d98  f113  f127  f143  file112  file126  file142  newfile51  newfile84
d109  d119  d129  d139  d149  d59  d69  d79  d89  d99  f114  f128  f144  file113  file127  file143  newfile59  newfile88

5. All of the files that are not getting deleted are cached on the up sub-volume but hashed on the down sub-volume:

[root@mia ~]# ls -l /rhs/brick1/t1/ | grep T
---------T 2 root root 0 Jun  4 05:59 newf70
---------T 2 root root 0 Jun  4 05:59 newf73
---------T 2 root root 0 Jun  4 05:59 newf81
---------T 2 root root 0 Jun  4 05:59 newf86
---------T 2 root root 0 Jun  4 05:59 newf89
---------T 2 root root 0 Jun  4 05:59 newf96
---------T 2 root root 0 Jun  4 05:59 newfile51
---------T 2 root root 0 Jun  4 05:59 newfile59
---------T 2 root root 0 Jun  4 05:59 newfile65
---------T 2 root root 0 Jun  4 05:59 newfile68
---------T 2 root root 0 Jun  4 05:59 newfile69
---------T 2 root root 0 Jun  4 05:59 newfile72
---------T 2 root root 0 Jun  4 05:59 newfile74
---------T 2 root root 0 Jun  4 05:59 newfile76
---------T 2 root root 0 Jun  4 05:59 newfile78
---------T 2 root root 0 Jun  4 05:59 newfile81
---------T 2 root root 0 Jun  4 05:59 newfile84
---------T 2 root root 0 Jun  4 05:59 newfile88
---------T 2 root root 0 Jun  4 05:59 newfile92
---------T 2 root root 0 Jun  4 05:59 newfile93
---------T 2 root root 0 Jun  4 05:59 newfile97
---------T 2 root root 0 Jun  4 05:59 newfile98
---------T 2 root root 0 Jun  4 05:59 newfile99

Actual results:
Not able to delete a file if its cached sub-volume is up and its hashed sub-volume is down.

Expected results:
User should be able to delete a file if its cached sub-volume is up and its hashed sub-volume is down.

Additional info:
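Note for anyone triaging a similar report: the `---------T` (sticky-bit) entries above are DHT linkto (pointer) files, not data files. A quick, hedged way to confirm this on the brick, using a path taken from this report, is to dump the file's extended attributes; a linkto file carries the trusted.glusterfs.dht.linkto xattr, whose value names the cached sub-volume holding the actual data:

# Run as root on the brick server; dumps all xattrs of the sticky-bit file.
# The trusted.glusterfs.dht.linkto value is the client-side name of the
# cached sub-volume where the real data file lives.
getfattr -m . -d -e text /rhs/brick1/t1/newf70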
Verified with 3.4.0.17rhs-1.el6rhs.x86_64.
Able to reproduce with 3.4.0.20rhs-2.el6_4.x86_64, hence reopening.

Steps:
1) Had a Distribute volume mounted over FUSE; created a few files and performed rename operations:

[root@DVM1 nufa]# gluster v info tnufa
Volume Name: tnufa
Type: Distribute
Volume ID: aa1999bc-4df9-4a79-a845-b8e42b06599b
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.37.128:/rhs/brick3/tn1
Brick2: 10.70.37.128:/rhs/brick3/tn2
Brick3: 10.70.37.192:/rhs/brick3/tn2
Options Reconfigured:
cluster.nufa: on

[root@rhs-client22 nufa]# mount | grep tnufa
10.70.37.192:/tnufa on /mnt/nufa type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
[root@rhs-client22 nufa]# touch f{1..10}
[root@rhs-client22 nufa]# for i in {1..20}; do mv f$i fnew$i; done

2) Bring one brick down by killing its process:

[root@rhs-client22 nufa]# ls
fnew1  fnew10  fnew2  fnew4  fnew5  fnew8

[root@DVM1 nufa]# gluster v status tnufa
Status of volume: tnufa
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.128:/rhs/brick3/tn1                      49160   Y       27140
Brick 10.70.37.128:/rhs/brick3/tn2                      49161   Y       8743
Brick 10.70.37.192:/rhs/brick3/tn2                      N/A     N       4709
NFS Server on localhost                                 2049    Y       8756
NFS Server on 10.70.37.81                               2049    Y       4737
NFS Server on 10.70.37.110                              2049    Y       3846
NFS Server on 10.70.37.192                              2049    Y       5008
NFS Server on 10.70.37.88                               2049    Y       4786

There are no active volume tasks

3) Try to delete a file whose cached sub-volume is up and whose hashed sub-volume is down; it fails with an error.

up bricks:
[root@DVM1 nfs]# ls -l /rhs/brick3/tn1
total 0
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew1
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew10
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew2
---------T 2 root root 0 Aug 22 04:41 fnew3
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew5
---------T 2 root root 0 Aug 22 04:41 fnew7

[root@DVM1 nfs]# ls -l /rhs/brick3/tn2
total 0
---------T 2 root root 0 Aug 22 04:41 fnew10
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew4
---------T 2 root root 0 Aug 22 04:41 fnew6
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew8
---------T 2 root root 0 Aug 22 04:41 fnew9

down brick:
[root@DVM4 ~]# ls -l /rhs/brick3/tn2
total 0
---------T 2 root root 0 Aug 22 04:41 fnew1
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew3
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew6
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew7
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew9

delete:
[root@rhs-client22 nufa]# rm -rf *
rm: cannot remove `fnew1': Invalid argument
[root@rhs-client22 nufa]# ls
fnew1

Actual result: not able to unlink the file (rm -f, unlink) if the cached sub-volume is up and the hashed sub-volume is down.
log snippet:

[2013-08-22 06:33:33.535409] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 1-tnufa-client-2: changing port to 49156 (from 0)
[2013-08-22 06:33:33.539888] E [socket.c:2158:socket_connect_finish] 1-tnufa-client-2: connection to 10.70.37.192:49156 failed (Connection refused)
[2013-08-22 06:33:36.270780] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-tnufa-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 06:33:36.271837] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 1-tnufa-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 06:33:36.276413] W [client-rpc-fops.c:2316:client3_3_readdirp_cbk] 1-tnufa-client-2: remote operation failed: Transport endpoint is not connected
[2013-08-22 06:33:36.278228] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-tnufa-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 06:33:36.279046] W [dht-layout.c:179:dht_layout_search] 1-tnufa-dht: no subvolume for hash (value) = 1563054481
[2013-08-22 06:33:36.279113] W [fuse-bridge.c:1688:fuse_unlink_cbk] 0-glusterfs-fuse: 1048: UNLINK() /fnew1 => -1 (Invalid argument)
[2013-08-22 06:33:37.539996] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 1-tnufa-client-2: changing port to 49156 (from 0)
[2013-08-22 06:33:37.546119] E [socket.c:2158:socket_connect_finish] 1-tnufa-client-2: connection to 10.70.37.192:49156 failed (Connection refused)
[2013-08-22 06:33:41.546534] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 1-tnufa-client-2: changing port to 49156 (from 0)
[2013-08-22 06:33:41.552727] E [socket.c:2158:socket_connect_finish] 1-tnufa-client-2: connection to 10.70.37.192:49156 failed (Connection refused)
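Note on the log: the "no subvolume for hash" warning from dht_layout_search shows DHT failing to resolve the file's hashed sub-volume (the layout entry for the dead brick is missing), after which the unlink appears to be failed with EINVAL rather than falling back to the cached sub-volume. A hedged way to see from the client which brick(s) a given file maps to, using the pathinfo virtual xattr on the FUSE mount point from this report:

# On the FUSE client; prints the brick path(s) backing the file,
# including any linkto location.
getfattr -n trusted.glusterfs.pathinfo /mnt/nufa/fnew1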
As suggested by Shishir, reproducing the defect on a plain DHT volume where cluster.nufa was not set/reset.

Able to reproduce with 3.4.0.20rhs-2.el6_4.x86_64, hence reopening.

Steps:
1) Had a Distribute volume mounted over FUSE; created a few files and performed rename operations:

[root@DVM1 nufa]# gluster v info dht
Volume Name: dht
Type: Distribute
Volume ID: cbf5f4d7-1d59-449b-8084-838801b51622
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.37.128:/rhs/brick3/d1
Brick2: 10.70.37.110:/rhs/brick3/d1
Brick3: 10.70.37.192:/rhs/brick3/d1

[root@rhs-client22 dht]# mount | grep dht
10.70.37.110:/dht on /mnt/dht type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
[root@rhs-client22 dht]# cd /mnt/dht
[root@rhs-client22 dht]# touch f{1..10}
[root@rhs-client22 dht]# for i in {1..20}; do mv f$i fnew$i; done

2) Bring one brick down by killing its process:

[root@DVM1 nufa]# gluster v status dht
Status of volume: dht
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.128:/rhs/brick3/d1                       49162   Y       13295
Brick 10.70.37.110:/rhs/brick3/d1                       49153   Y       3650
Brick 10.70.37.192:/rhs/brick3/d1                       49157   Y       8377
NFS Server on localhost                                 N/A     N       N/A
NFS Server on 10.70.37.88                               N/A     N       N/A
NFS Server on 10.70.37.110                              N/A     N       N/A
NFS Server on 10.70.37.81                               N/A     N       N/A
NFS Server on 10.70.37.192                              N/A     N       N/A

There are no active volume tasks

3) Try to delete a file whose cached sub-volume is up and whose hashed sub-volume is down; it fails with an error:

[root@rhs-client22 dht]# ls
fnew1  fnew10  fnew2  fnew4  fnew5  fnew8
[root@rhs-client22 dht]# rm -rf *
rm: cannot remove `fnew1': Invalid argument
[root@rhs-client22 dht]# rm -rf *
rm: cannot remove `fnew1': Invalid argument

up brick:
[root@DVM1 nufa]# ls -l /rhs/brick3/d1
total 0
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew1
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew10
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew2
---------T 2 root root 0 Aug 22 08:10 fnew3
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew5
---------T 2 root root 0 Aug 22 08:10 fnew7

[root@DVM2 ~]# ls -l /rhs/brick3/d1
total 0
---------T 2 root root 0 Aug 22 08:10 fnew10
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew4
---------T 2 root root 0 Aug 22 08:10 fnew6
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew8
---------T 2 root root 0 Aug 22 08:10 fnew9

down brick:
[root@DVM4 ~]# ls -l /rhs/brick3/d1
total 0
---------T 2 root root 0 Aug 22 08:10 fnew1
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew3
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew6
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew7
-rw-r--r-- 2 root root 0 Aug 22  2013 fnew9

log snippet:

[2013-08-22 08:48:07.863767] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 0-dht-client-2: changing port to 49157 (from 0)
[2013-08-22 08:48:07.869048] E [socket.c:2158:socket_connect_finish] 0-dht-client-2: connection to 10.70.37.192:49157 failed (Connection refused)
[2013-08-22 08:48:08.061044] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-dht-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 08:48:08.062325] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 08:48:08.070770] W [client-rpc-fops.c:2316:client3_3_readdirp_cbk] 0-dht-client-2: remote operation failed: Transport endpoint is not connected
[2013-08-22 08:48:08.072553] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-dht-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 08:48:08.073435] W [dht-layout.c:179:dht_layout_search] 0-dht-dht: no subvolume for hash (value) = 1563054481
[2013-08-22 08:48:08.073495] W [fuse-bridge.c:1688:fuse_unlink_cbk] 0-glusterfs-fuse: 955: UNLINK() /fnew1 => -1 (Invalid argument)
[2013-08-22 08:48:11.869538] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 0-dht-client-2: changing port to 49157 (from 0)
[2013-08-22 08:48:11.875800] E [socket.c:2158:socket_connect_finish] 0-dht-client-2: connection to 10.70.37.192:49157 failed (Connection refused)
Able to reproduce with 3.4.0.30rhs-2.el6_4.x86_64, hence reopening.

Steps:
1) Had a Distribute volume mounted over FUSE; created a few files and performed rename operations:

[root@DHT1 bricks]# gluster v info test1
Volume Name: test1
Type: Distribute
Volume ID: 20aa042f-302a-4f25-9382-79eaff30d0a5
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.37.195:/rhs/brick1/t1
Brick2: 10.70.37.195:/rhs/brick1/t2
Brick3: 10.70.37.66:/rhs/brick1/t1

[root@rhs-client22 test1]# mount | grep test1
10.70.37.66:/test1 on /mnt/test1 type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
[root@rhs-client22 ~]# cd /mnt/test1; touch f{1..10}
[root@rhs-client22 test1]# for i in {1..20}; do mv f$i fnew$i; done

2) Bring one brick down by killing its process:

[root@DHT1 bricks]# kill -9 16136
[root@DHT1 bricks]# gluster v status test1
Status of volume: test1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.195:/rhs/brick1/t1                       49159   Y       16081
Brick 10.70.37.195:/rhs/brick1/t2                       N/A     N       16136
Brick 10.70.37.66:/rhs/brick1/t1                        49156   Y       20337
NFS Server on localhost                                 2049    Y       16148
NFS Server on 10.70.37.66                               2049    Y       20491

There are no active volume tasks

3) Try to delete a file whose cached sub-volume is up and whose hashed sub-volume is down; it fails with an error:

[root@rhs-client22 test1]# ls
fnew1  fnew10  fnew2  fnew3  fnew5  fnew6  fnew7  fnew9
[root@rhs-client22 test1]# rm -f fnew10
rm: cannot remove `fnew10': Invalid argument

down brick:
[root@DHT1 bricks]# ls -l /rhs/brick1/t2
total 0
---------T 2 root root 0 Sep  4 09:59 fnew10
-rw-r--r-- 2 root root 0 Sep  4  2013 fnew4
---------T 2 root root 0 Sep  4 09:59 fnew6
-rw-r--r-- 2 root root 0 Sep  4  2013 fnew8
---------T 2 root root 0 Sep  4 09:59 fnew9

up brick:
[root@DHT1 bricks]# ls -l /rhs/brick1/t1
total 0
-rw-r--r-- 2 root root 0 Sep  4  2013 fnew1
-rw-r--r-- 2 root root 0 Sep  4  2013 fnew10
-rw-r--r-- 2 root root 0 Sep  4  2013 fnew2
---------T 2 root root 0 Sep  4 09:59 fnew3
-rw-r--r-- 2 root root 0 Sep  4  2013 fnew5
---------T 2 root root 0 Sep  4 09:59 fnew7

log snippet:

[2013-09-04 06:54:45.563177] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-test1-client-1: changing port to 49160 (from 0)
[2013-09-04 06:54:45.568442] E [socket.c:2158:socket_connect_finish] 0-test1-client-1: connection to 10.70.37.195:49160 failed (Connection refused)
[2013-09-04 06:54:49.390481] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-test1-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-09-04 06:54:49.391305] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-test1-client-1: remote operation failed: Transport endpoint is not connected. Path: /fnew10 (94b9ac49-b31f-4d47-b51c-2fc721d8c15d)
[2013-09-04 06:54:49.392439] W [dht-layout.c:179:dht_layout_search] 0-test1-dht: no subvolume for hash (value) = 1022996023
[2013-09-04 06:54:49.392507] W [fuse-bridge.c:1688:fuse_unlink_cbk] 0-glusterfs-fuse: 213: UNLINK() /fnew10 => -1 (Invalid argument)
[2013-09-04 06:54:49.568947] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-test1-client-1: changing port to 49160 (from 0)
[2013-09-04 06:54:49.572535] E [socket.c:2158:socket_connect_finish] 0-test1-client-1: connection to 10.70.37.195:49160 failed (Connection refused)
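For convenience, a condensed reproducer sketch of the steps above (volume name, mount path, and the brick PID are taken from this report and would differ on another setup; the PID must be read from 'gluster v status'):

# Client side: create files, then rename them so that for some files the
# hashed sub-volume (by the new name) differs from the cached one.
cd /mnt/test1
touch f{1..10}
for i in {1..10}; do mv "f$i" "fnew$i"; done

# Server side: kill one brick process (PID from 'gluster v status test1').
kill -9 16136

# Client side: unlinking a file whose hashed sub-volume is the dead brick
# fails on affected builds:
rm -f fnew10    # rm: cannot remove `fnew10': Invalid argument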
Removing the 'blocker' flag as per the discussion in the Big Bend Readout call last night IST, and reducing the bug's priority. We still need to decide whether we should support this behavior at all: since no high-availability guarantee is promised with the plain Distribute volume type, this could be closed as NOTABUG. A little more time is needed to frame the RCA.
Targeting the 3.0.0 (Denali) release.
Fixed in version: glusterfs-3.5qa2-0.323.git6567d14
Verified with version 3.6.0.18-1.el6rhs.x86_64; working as expected, hence moving to VERIFIED.
Hi Susant, please review the edited doc text for technical accuracy and sign off.
Minor change: "Previously, a file could not be unlinked if its hashed subvolume was down and its cached subvolume was up. With this fix, the data file is unlinked, and the stale linkto file is deleted on a subsequent lookup of the same name once the hashed subvolume is back up."
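For illustration, a hedged sketch of the expected post-fix behavior on the test1 setup from this report (the commands are standard; the exact outcome is as described in the doc text above, not a captured transcript):

# Hashed sub-volume still down: the unlink now succeeds via the cached
# sub-volume instead of failing with EINVAL.
rm -f /mnt/test1/fnew10

# Bring the killed brick back (restarts any offline brick processes):
gluster volume start test1 force

# A named lookup from the client (expected to return ENOENT, since the
# file is gone) lets DHT notice and delete the stale linkto file left
# on the revived brick:
stat /mnt/test1/fnew10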
Thank you, Susant. I changed "down" to "offline" and "up" to "online".
Incorporated Susant's comments in the doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html