Bug 970686
| Summary: | DHT: not able to unlink file (rm -f, unlink) if cached sub-volume is up and hashed sub-volume is down |||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rachana Patel <racpatel> |
| Component: | glusterfs | Assignee: | Susant Kumar Palai <spalai> |
| Status: | CLOSED ERRATA | QA Contact: | amainkar |
| Severity: | high | Priority: | medium |
| Version: | 2.1 | Target Release: | RHGS 3.0.0 |
| Hardware: | x86_64 | OS: | Linux |
| Fixed In Version: | glusterfs-3.6.0.9 | Doc Type: | Bug Fix |
| CC: | nsathyan, psriniva, rhs-bugs, rwheeler, sdharane, spalai, vbellur | Type: | Bug |
| Blocks: | 983416 (view as bug list) | Last Closed: | 2014-09-22 19:28:26 UTC |
| Doc Text: | Previously, a file could not be unlinked if the hashed subvolume was offline and the cached subvolume was online. With this fix, upon unlinking the file, the file on the cached subvolume is deleted, and the stale link file on the hashed subvolume is deleted upon lookup with the same name. |||
Description (Rachana Patel, 2013-06-04 15:19:35 UTC)

Verified with 3.4.0.17rhs-1.el6rhs.x86_64; able to reproduce with 3.4.0.20rhs-2.el6_4.x86_64, hence reopening.

Steps:
1) Create a distributed volume, mount it over FUSE, create a few files, and perform rename operations.
[root@DVM1 nufa]# gluster v info tnufa
Volume Name: tnufa
Type: Distribute
Volume ID: aa1999bc-4df9-4a79-a845-b8e42b06599b
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.37.128:/rhs/brick3/tn1
Brick2: 10.70.37.128:/rhs/brick3/tn2
Brick3: 10.70.37.192:/rhs/brick3/tn2
Options Reconfigured:
cluster.nufa: on
[root@rhs-client22 nufa]# mount | grep tnufa
10.70.37.192:/tnufa on /mnt/nufa type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
[root@rhs-client22 nufa]# touch f{1..10}
[root@rhs-client22 nufa]# for i in {1..20}; do mv f$i fnew$i; done
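For reference, the setup above condenses into a short script. This is a sketch only: the addresses, brick paths, and volume name are copied from the transcript, and it assumes the hosts are already probed into one trusted storage pool.

```bash
#!/bin/bash
# Sketch of the reproduction setup (paths/addresses from the transcript;
# assumes the peers are already in one trusted storage pool).
gluster volume create tnufa \
    10.70.37.128:/rhs/brick3/tn1 \
    10.70.37.128:/rhs/brick3/tn2 \
    10.70.37.192:/rhs/brick3/tn2 force
gluster volume set tnufa cluster.nufa on
gluster volume start tnufa

mount -t glusterfs 10.70.37.192:/tnufa /mnt/nufa
cd /mnt/nufa

touch f{1..10}
# Renaming changes the name hash: where the new name hashes to a different
# subvolume, DHT leaves a zero-byte ---------T link-to file there that
# points back at the cached subvolume still holding the data.
for i in {1..10}; do mv "f$i" "fnew$i"; done
```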
2) Bring one brick down by killing its brick process.
[root@rhs-client22 nufa]# ls
fnew1 fnew10 fnew2 fnew4 fnew5 fnew8
[root@DVM1 nufa]# gluster v status tnufa
Status of volume: tnufa
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.37.128:/rhs/brick3/tn1 49160 Y 27140
Brick 10.70.37.128:/rhs/brick3/tn2 49161 Y 8743
Brick 10.70.37.192:/rhs/brick3/tn2 N/A N 4709
NFS Server on localhost 2049 Y 8756
NFS Server on 10.70.37.81 2049 Y 4737
NFS Server on 10.70.37.110 2049 Y 3846
NFS Server on 10.70.37.192 2049 Y 5008
NFS Server on 10.70.37.88 2049 Y 4786
There are no active volume tasks
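To take a brick offline as in step 2, its glusterfsd process can be killed directly; the Pid column of `gluster v status` identifies the process. A sketch, using the PID from the transcript above:

```bash
# On the brick host: kill the brick's glusterfsd process.
# 4709 is the PID reported above for Brick 10.70.37.192:/rhs/brick3/tn2.
kill -9 4709
```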
3) Try to delete a file whose cached sub-volume is up and hashed sub-volume is down; the operation fails with an error.
Up bricks:
[root@DVM1 nfs]# ls -l /rhs/brick3/tn1
total 0
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew1
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew10
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew2
---------T 2 root root 0 Aug 22 04:41 fnew3
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew5
---------T 2 root root 0 Aug 22 04:41 fnew7
[root@DVM1 nfs]# ls -l /rhs/brick3/tn2
total 0
---------T 2 root root 0 Aug 22 04:41 fnew10
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew4
---------T 2 root root 0 Aug 22 04:41 fnew6
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew8
---------T 2 root root 0 Aug 22 04:41 fnew9
Down brick:
[root@DVM4 ~]# ls -l /rhs/brick3/tn2
total 0
---------T 2 root root 0 Aug 22 04:41 fnew1
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew3
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew6
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew7
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew9
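The zero-byte `---------T` entries above (mode 01000, sticky bit only) are DHT link-to files: placeholders on the subvolume a name hashes to, pointing at the cached subvolume that holds the data. On the brick they can be inspected via the `trusted.glusterfs.dht.linkto` extended attribute, for example:

```bash
# Run on the brick host, against the brick path (not the FUSE mount).
# The xattr value names the client subvolume holding the actual data.
getfattr -n trusted.glusterfs.dht.linkto -e text /rhs/brick3/tn1/fnew3
```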
Delete:
[root@rhs-client22 nufa]# rm -rf *
rm: cannot remove `fnew1': Invalid argument
[root@rhs-client22 nufa]# ls
fnew1
Actual result:
Not able to unlink a file (rm -f, unlink) if the cached sub-volume is up and the hashed sub-volume is down.
Log snippet:
[2013-08-22 06:33:33.535409] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 1-tnufa-client-2: changing port to 49156 (from 0)
[2013-08-22 06:33:33.539888] E [socket.c:2158:socket_connect_finish] 1-tnufa-client-2: connection to 10.70.37.192:49156 failed (Connection refused)
[2013-08-22 06:33:36.270780] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-tnufa-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 06:33:36.271837] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 1-tnufa-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 06:33:36.276413] W [client-rpc-fops.c:2316:client3_3_readdirp_cbk] 1-tnufa-client-2: remote operation failed: Transport endpoint is not connected
[2013-08-22 06:33:36.278228] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-tnufa-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 06:33:36.279046] W [dht-layout.c:179:dht_layout_search] 1-tnufa-dht: no subvolume for hash (value) = 1563054481
[2013-08-22 06:33:36.279113] W [fuse-bridge.c:1688:fuse_unlink_cbk] 0-glusterfs-fuse: 1048: UNLINK() /fnew1 => -1 (Invalid argument)
[2013-08-22 06:33:37.539996] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 1-tnufa-client-2: changing port to 49156 (from 0)
[2013-08-22 06:33:37.546119] E [socket.c:2158:socket_connect_finish] 1-tnufa-client-2: connection to 10.70.37.192:49156 failed (Connection refused)
[2013-08-22 06:33:41.546534] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 1-tnufa-client-2: changing port to 49156 (from 0)
[2013-08-22 06:33:41.552727] E [socket.c:2158:socket_connect_finish] 1-tnufa-client-2: connection to 10.70.37.192:49156 failed (Connection refused)

The dht_layout_search warning is the failure point: the file name hashes to the downed subvolume, which is absent from the directory layout, so DHT cannot resolve a hashed subvolume for the name and fails the unlink with EINVAL instead of sending it to the cached subvolume.

As suggested by Shishir, reproducing the defect on a plain DHT volume where cluster.nufa was not set/reset.

Able to reproduce with 3.4.0.20rhs-2.el6_4.x86_64, hence reopening.

Steps:
1) Create a distributed volume, mount it over FUSE, create a few files, and perform rename operations.
[root@DVM1 nufa]# gluster v info dht
Volume Name: dht
Type: Distribute
Volume ID: cbf5f4d7-1d59-449b-8084-838801b51622
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.37.128:/rhs/brick3/d1
Brick2: 10.70.37.110:/rhs/brick3/d1
Brick3: 10.70.37.192:/rhs/brick3/d1
[root@rhs-client22 dht]# mount | grep dht
10.70.37.110:/dht on /mnt/dht type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
[root@rhs-client22 dht]# cd /mnt/dht
[root@rhs-client22 dht]# touch f{1..10}
[root@rhs-client22 dht]# for i in {1..20}; do mv f$i fnew$i; done
2) Bring one brick down by killing its brick process.
[root@DVM1 nufa]# gluster v status dht
Status of volume: dht
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.37.128:/rhs/brick3/d1 49162 Y 13295
Brick 10.70.37.110:/rhs/brick3/d1 49153 Y 3650
Brick 10.70.37.192:/rhs/brick3/d1 49157 Y 8377
NFS Server on localhost N/A N N/A
NFS Server on 10.70.37.88 N/A N N/A
NFS Server on 10.70.37.110 N/A N N/A
NFS Server on 10.70.37.81 N/A N N/A
NFS Server on 10.70.37.192 N/A N N/A
There are no active volume tasks
3) Try to delete a file whose cached sub-volume is up and hashed sub-volume is down; the operation fails with an error.
[root@rhs-client22 dht]# ls
fnew1 fnew10 fnew2 fnew4 fnew5 fnew8
[root@rhs-client22 dht]# rm -rf *
rm: cannot remove `fnew1': Invalid argument
[root@rhs-client22 dht]# rm -rf *
rm: cannot remove `fnew1': Invalid argument
Up bricks:
[root@DVM1 nufa]# ls -l /rhs/brick3/d1
total 0
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew1
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew10
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew2
---------T 2 root root 0 Aug 22 08:10 fnew3
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew5
---------T 2 root root 0 Aug 22 08:10 fnew7
[root@DVM2 ~]# ls -l /rhs/brick3/d1
total 0
---------T 2 root root 0 Aug 22 08:10 fnew10
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew4
---------T 2 root root 0 Aug 22 08:10 fnew6
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew8
---------T 2 root root 0 Aug 22 08:10 fnew9
Down brick:
[root@DVM4 ~]# ls -l /rhs/brick3/d1
total 0
---------T 2 root root 0 Aug 22 08:10 fnew1
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew3
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew6
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew7
-rw-r--r-- 2 root root 0 Aug 22 2013 fnew9
Log snippet:
[2013-08-22 08:48:07.863767] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 0-dht-client-2: changing port to 49157 (from 0)
[2013-08-22 08:48:07.869048] E [socket.c:2158:socket_connect_finish] 0-dht-client-2: connection to 10.70.37.192:49157 failed (Connection refused)
[2013-08-22 08:48:08.061044] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-dht-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 08:48:08.062325] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 08:48:08.070770] W [client-rpc-fops.c:2316:client3_3_readdirp_cbk] 0-dht-client-2: remote operation failed: Transport endpoint is not connected
[2013-08-22 08:48:08.072553] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-dht-client-2: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-08-22 08:48:08.073435] W [dht-layout.c:179:dht_layout_search] 0-dht-dht: no subvolume for hash (value) = 1563054481
[2013-08-22 08:48:08.073495] W [fuse-bridge.c:1688:fuse_unlink_cbk] 0-glusterfs-fuse: 955: UNLINK() /fnew1 => -1 (Invalid argument)
[2013-08-22 08:48:11.869538] I [rpc-clnt.c:1680:rpc_clnt_reconfig] 0-dht-client-2: changing port to 49157 (from 0)
[2013-08-22 08:48:11.875800] E [socket.c:2158:socket_connect_finish] 0-dht-client-2: connection to 10.70.37.192:49157 failed (Connection refused)
Able to reproduce with 3.4.0.30rhs-2.el6_4.x86_64, hence reopening.

Steps:
1) Create a distributed volume, mount it over FUSE, create a few files, and perform rename operations.
[root@DHT1 bricks]# gluster v info test1
Volume Name: test1
Type: Distribute
Volume ID: 20aa042f-302a-4f25-9382-79eaff30d0a5
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.37.195:/rhs/brick1/t1
Brick2: 10.70.37.195:/rhs/brick1/t2
Brick3: 10.70.37.66:/rhs/brick1/t1
[root@rhs-client22 test1]# mount | grep test1
10.70.37.66:/test1 on /mnt/test1 type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
[root@rhs-client22 ~]# cd /mnt/test1; touch f{1..10}
[root@rhs-client22 test1]# for i in {1..20}; do mv f$i fnew$i; done
2) Bring one brick down by killing its brick process.
[root@DHT1 bricks]# kill -9 16136
[root@DHT1 bricks]# gluster v status test1
Status of volume: test1
Gluster process Port Online Pid
------------------------------------------------------------------------------
Brick 10.70.37.195:/rhs/brick1/t1 49159 Y 16081
Brick 10.70.37.195:/rhs/brick1/t2 N/A N 16136
Brick 10.70.37.66:/rhs/brick1/t1 49156 Y 20337
NFS Server on localhost 2049 Y 16148
NFS Server on 10.70.37.66 2049 Y 20491
There are no active volume tasks
3) Try to delete a file whose cached sub-volume is up and hashed sub-volume is down; the operation fails with an error.
[root@rhs-client22 test1]# ls
fnew1 fnew10 fnew2 fnew3 fnew5 fnew6 fnew7 fnew9
[root@rhs-client22 test1]# rm -f fnew10
rm: cannot remove `fnew10': Invalid argument
Down brick:
[root@DHT1 bricks]# ls -l /rhs/brick1/t2
total 0
---------T 2 root root 0 Sep 4 09:59 fnew10
-rw-r--r-- 2 root root 0 Sep 4 2013 fnew4
---------T 2 root root 0 Sep 4 09:59 fnew6
-rw-r--r-- 2 root root 0 Sep 4 2013 fnew8
---------T 2 root root 0 Sep 4 09:59 fnew9
Up brick:
[root@DHT1 bricks]# ls -l /rhs/brick1/t1
total 0
-rw-r--r-- 2 root root 0 Sep 4 2013 fnew1
-rw-r--r-- 2 root root 0 Sep 4 2013 fnew10
-rw-r--r-- 2 root root 0 Sep 4 2013 fnew2
---------T 2 root root 0 Sep 4 09:59 fnew3
-rw-r--r-- 2 root root 0 Sep 4 2013 fnew5
---------T 2 root root 0 Sep 4 09:59 fnew7
Log snippet:
[2013-09-04 06:54:45.563177] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-test1-client-1: changing port to 49160 (from 0)
[2013-09-04 06:54:45.568442] E [socket.c:2158:socket_connect_finish] 0-test1-client-1: connection to 10.70.37.195:49160 failed (Connection refused)
[2013-09-04 06:54:49.390481] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-test1-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2013-09-04 06:54:49.391305] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 0-test1-client-1: remote operation failed: Transport endpoint is not connected. Path: /fnew10 (94b9ac49-b31f-4d47-b51c-2fc721d8c15d)
[2013-09-04 06:54:49.392439] W [dht-layout.c:179:dht_layout_search] 0-test1-dht: no subvolume for hash (value) = 1022996023
[2013-09-04 06:54:49.392507] W [fuse-bridge.c:1688:fuse_unlink_cbk] 0-glusterfs-fuse: 213: UNLINK() /fnew10 => -1 (Invalid argument)
[2013-09-04 06:54:49.568947] I [rpc-clnt.c:1687:rpc_clnt_reconfig] 0-test1-client-1: changing port to 49160 (from 0)
[2013-09-04 06:54:49.572535] E [socket.c:2158:socket_connect_finish] 0-test1-client-1: connection to 10.70.37.195:49160 failed (Connection refused)
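With the fix (glusterfs-3.6.0.9 and later), the expected behaviour can be checked along these lines. This is a sketch against the test1 volume above; `gluster volume start ... force` is used here only to restart the downed brick process.

```bash
# Hashed subvolume still down: unlink should now succeed by removing the
# data file from the cached subvolume.
rm -f /mnt/test1/fnew10

# Bring the downed brick back and trigger a named lookup; the stale
# ---------T link-to file on the hashed subvolume should be cleaned up.
gluster volume start test1 force
stat /mnt/test1/fnew10   # expected: No such file or directory
ls -l /rhs/brick1/t2     # the fnew10 link-to entry should be gone
```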
Removing the 'blocker' flag as per the discussion in the Big Bend Readout call last night (IST); also reducing the priority of the bug.

Still to make a call on whether we should support this behavior at all. Considering that no high-availability guarantee is promised with the plain Distribute volume type, this could be closed as NOTABUG. Need a little more time to frame the RCA for this.

Targeting the 3.0.0 (Denali) release.

Fixed in version: glusterfs-3.5qa2-0.323.git6567d14

Verified with version 3.6.0.18-1.el6rhs.x86_64; working as expected, hence moving to VERIFIED.

Hi Susant, please review the edited doc text for technical accuracy and sign off. Minor change: "Previously, a file could not be unlinked if the hashed subvolume was down and cached subvolume was up. With this fix, the data file will be unlinked and the linkto file gets deleted upon lookup with the same name after the hashed subvolume is up."

Thank you Susant; I changed "down" to offline and "up" to online. Incorporated Susant's comments in the doc text.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html