Bug 1136714 - DHT + rebalance :- DATA LOSS - while file is in migration, creation of Hard-link and unlink of original file ends in data loss (both files are missing from mount and backend)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.4
Assignee: Nithya Balachandran
QA Contact: amainkar
URL:
Whiteboard:
Depends On:
Blocks: 1087818 1182947
 
Reported: 2014-09-03 07:04 UTC by Rachana Patel
Modified: 2015-05-13 17:53 UTC
12 users

Fixed In Version: glusterfs-3.6.0.46-1
Doc Type: Bug Fix
Doc Text:
Previously, any hard links to a file that were created while the file was being migrated were lost once the migration was completed. With this fix, the hard links are retained.
Clone Of:
: 1161311
Environment:
Last Closed: 2015-03-26 06:34:32 UTC
Embargoed:


Attachments
attempted test case (2.52 KB, application/x-shellscript)
2015-03-02 20:31 UTC, Shyamsundar


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:0682 0 normal SHIPPED_LIVE Red Hat Storage 3.0 enhancement and bug fix update #4 2015-03-26 10:32:55 UTC

Description Rachana Patel 2014-09-03 07:04:11 UTC
Description of problem:
=======================
Hard links are missing after rebalance if a hard link is created while file migration is in progress.

Version-Release number of selected component (if applicable):
=============================================================
3.6.0.27-6.el6rhs.x86_64

How reproducible:
=================
always

Steps to Reproduce:
==================

1. Created a 1GB file on the mount point.
2. Started a forced rebalance after adding a brick [so that the file would be migrated].
3. Created multiple hard links while the file was migrating.
[root@vm100 mnt]# ll -h
total 954M
-rw-r--r--. 1 root root 954M Sep  1 23:39 file
[root@vm100 mnt]# ln file link
[root@vm100 mnt]# ls
file  link
[root@vm100 mnt]# ln file link2
[root@vm100 mnt]# ls
file  link  link2
[root@vm100 mnt]# ln file link3
[root@vm100 mnt]#


4. Waited for rebalance to complete.
[root@vm100 ~]# gluster v rebalance test1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                1       953.7MB             1             0             0            completed              44.00
5. Checked file status on the mount point.
[root@vm100 mnt]# ll -h
total 954M
-rw-r--r--. 1 root root 954M Sep  1 23:39 file
[root@vm100 mnt]#


Actual results:
===============
The hard links are gone as mentioned earlier. 


Expected results:
=================
The hard links should be present.

Additional info:
================
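For reference, a minimal shell sketch of the reproduction flow described above (hypothetical volume, brick, and mount paths; it assumes a plain distribute volume, a FUSE mount, and that the hard links land inside the migration window of the large file):

#!/bin/bash
# Hypothetical paths/names -- adjust to the actual setup.
VOL=test1
HOST=$(hostname)
MNT=/mnt/glusterfs

mkdir -p /bricks/b1 /bricks/b2 "$MNT"

# 1. Single-brick distribute volume with a large file on it.
gluster volume create $VOL $HOST:/bricks/b1 force
gluster volume start $VOL
mount -t glusterfs $HOST:/$VOL "$MNT"
dd if=/dev/urandom of="$MNT/file" bs=1M count=1024

# 2. Add a brick and start a forced rebalance so the file is migrated.
gluster volume add-brick $VOL $HOST:/bricks/b2
gluster volume rebalance $VOL start force

# 3. Create hard links while the file is still migrating.
ln "$MNT/file" "$MNT/link"
ln "$MNT/file" "$MNT/link2"
ln "$MNT/file" "$MNT/link3"

# 4. Wait for rebalance to complete, then check the mount:
#    with the bug, only "file" remains and the links are gone.
while gluster volume rebalance $VOL status | grep -q "in progress"; do
    sleep 5
done
ls -l "$MNT"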

Comment 3 Rachana Patel 2014-09-04 09:59:50 UTC
As mentioned in the description above, if we create a hard link while the file is in migration, the hard link is deleted after migration.

With the same steps as above, if we unlink the original file after creating the hard links (while the file is still in migration), we lose both files: after migration there is no data on the backend or on the mount.

Steps to Reproduce:
==================

1. Created a file on the mount point.
2. Started a forced rebalance after adding a brick [so that the file would be migrated].
3. While the file was in migration, created multiple hard links:
[root@vm100 mnt]# ll -h
total 954M
-rw-r--r--. 1 root root 954M Sep  1 23:39 file
[root@vm100 mnt]# ln file link
[root@vm100 mnt]# ls
file  link
[root@vm100 mnt]# ln file link2
[root@vm100 mnt]# ls
file  link  link2
[root@vm100 mnt]# ln file link3
[root@vm100 mnt]# ls 
file  link  link2 link3

4. While the file was still in migration, unlinked the original file, since multiple hard links to it exist.

[root@vm100 mnt]# unlink file

5. Waited for rebalance to complete, then checked data on the mount and the bricks.
[root@vm100 mnt]# ls
[root@vm100 mnt]# 


Actual results:
===============
No files are present on the mount or the bricks.


Expected results:
=================
There should not be any data/file loss if the user has deleted the original file after creating hard links to it.
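A small sketch of the delta for this variant, reusing the hypothetical names from the sketch in the description: while the migration is still running, drop the original name after the links are created.

# While the file is still migrating (same hypothetical $MNT as above):
ln "$MNT/file" "$MNT/link"
ln "$MNT/file" "$MNT/link2"
unlink "$MNT/file"   # remove the original name; the hard links should keep the data

# After rebalance completes, with the bug neither the links nor the data
# survive on the mount or on the bricks.
ls -l "$MNT"
ls -l /bricks/b1 /bricks/b2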

Comment 6 Shalaka 2014-09-20 09:06:45 UTC
Please review and sign off on the edited doc text.

Comment 8 Shalaka 2014-09-22 11:29:13 UTC
Updated the doc text as suggested by Nithya.

Comment 10 Nithya Balachandran 2014-09-22 12:08:58 UTC
The doc text seems fine.

Comment 11 Shyamsundar 2014-11-06 21:07:01 UTC
Reasons for this happening:
---------------------------

1) dht_link creates a hard link to the cached file and a linkto at the hashed location.

2) When the file is under migration and (1) happens, we create a hard link to the cached file (which is under migration) and a linkto on the subvol that the new name hashes to.

3) If the new name hashes to the same subvol as the old name, the file survives, because the linkto file for the new name on the hashed subvol is a hard link to the linkto file for the old name.

4) If the new name hashes to a different subvol, the file does not survive: the cached file is on the subvol being migrated away from, so when migration is over that file is truncated and left as the old P2-state file (since the file has been migrated), carrying only the linkto information and the sticky bit. In all, we lose the file.

The resolution is to redirect the link to the new cached subvol for a file under migration. When we get a dht_link, on the post-op we need to send a link to the real cached subvol that the file is being migrated to; in other words, follow the linkto and link the file there as well.

The above is a first-cut RCA and a proposed resolution.
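As an aside, the migration state of a file can be checked directly on the bricks: a file under migration carries the setgid and sticky bits (mode like -rw-r-Sr-T), a plain linkto file typically shows only the sticky bit (---------T), and linkto files carry the trusted.glusterfs.dht.linkto xattr naming the subvolume DHT should follow. A small sketch, assuming root access to the brick paths used in the listings below:

# Run on the brick host as root; hypothetical glob over the test bricks.
for f in /d/backends/patchy*/FILE*; do
    # Mode, link count and size distinguish data files, files under
    # migration (S+T bits) and plain linkto files (---------T).
    stat -c '%A %h %s %n' "$f"
    # The linkto xattr, if present, names the target subvolume.
    getfattr -n trusted.glusterfs.dht.linkto -e text --absolute-names "$f" 2>/dev/null
done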

Here is some FS data on the same, to help correlate with the comments above:

< State of the brick before rebalance starts on FILE1>
# ls -l /d/backends/patchy*
/d/backends/patchy1:
total 5242880
-rw-r--r--. 2 root root 5368709120 Nov  6 14:49 FILE1

/d/backends/patchy2:
total 0

/d/backends/patchy3:
total 0

< State of the brick as soon as rebalance starts on FILE1 >
[root@marvin ~]# ls -l /d/backends/patchy*
/d/backends/patchy1:
total 5242884
-rw-r-Sr-T. 2 root root 5368709120 Nov  6 14:49 FILE1

/d/backends/patchy2:
total 0

/d/backends/patchy3:
total 52868
---------T. 2 root root 5368709120 Nov  6 14:50 FILE1

<Create hard link FILE2 which hashes to second subvol
 and hard link FILE5 which hashes to third subvol >
[root@marvin ~]# ls -l /d/backends/patchy*
/d/backends/patchy1:
total 20971536
-rw-r-Sr-T. 5 root root 5368709120 Nov  6 14:49 FILE1
-rw-r-Sr-T. 5 root root 5368709120 Nov  6 14:49 FILE2 (this is the hard link, since the cached subvol for FILE1 is subvol1)
-rw-r-Sr-T. 5 root root 5368709120 Nov  6 14:49 FILE5

/d/backends/patchy2:
total 4
---------T. 2 root root 0 Nov  6 14:50 FILE2 (this is the linkto for FILE2)

/d/backends/patchy3:
total 5164812
---------T. 4 root root 5368709120 Nov  6 14:50 FILE1
---------T. 4 root root 5368709120 Nov  6 14:50 FILE5 (this is the linkto for FILE5, but it is a hard link to FILE1 as the GFID is the same; check the stat information)

< End of rebalance of FILE1, so on subvol1 we have the P2 file left, which is a linkto file >
[root@marvin ~]# ls -l /d/backends/patchy*
/d/backends/patchy:
total 0

/d/backends/patchy1:
total 12
---------T. 4 root root 0 Nov  6 14:50 FILE2 (bad FILE2: post migration, FILE1 was truncated, turned into a linkto file and then unlinked, so the hard links survive with the P2 file, whose linkto points at subvol3, where FILE2 does not exist; if it existed, it would be a double linkto, which a lookup-everywhere may again clean up)
---------T. 4 root root 0 Nov  6 14:50 FILE5 (good FILE5, but sort of useless as it is a stale linkto; it will get cleaned up later)

/d/backends/patchy2:
total 4
---------T. 2 root root 0 Nov  6 14:50 FILE2 (Hashed subvol of FILE2 pointing to NULL file in subvol1)

/d/backends/patchy3:
total 15728640
-rw-r--r--. 4 root root 5368709120 Nov  6 14:49 FILE1 (FILE1 is now hashed/cached here and good)
-rw-r--r--. 4 root root 5368709120 Nov  6 14:49 FILE5 (FILE5 was created as a hard link to the linkto for FILE1 during rebalance, so now automatically becomes a good file)

So the bottom line is: to resolve the issue, dht_link should follow a file under migration and create the hard link on its _new_ destination as well.

Even if the original file was unlinked during migration, I assume the target would survive, as there are hard links to it; this is a test case that needs repetition once we fix the issue in dht_link.

Comment 12 Triveni Rao 2015-02-26 06:42:42 UTC
This bug can be reproduced with build 46:


[root@rhsauto034 b0]# rpm -qa | grep gluster
gluster-nagios-common-0.1.4-1.el6rhs.noarch
glusterfs-3.6.0.46-1.el6rhs.x86_64
glusterfs-server-3.6.0.46-1.el6rhs.x86_64
gluster-nagios-addons-0.1.14-1.el6rhs.x86_64
samba-glusterfs-3.6.509-169.4.el6rhs.x86_64
glusterfs-libs-3.6.0.46-1.el6rhs.x86_64
glusterfs-api-3.6.0.46-1.el6rhs.x86_64
glusterfs-cli-3.6.0.46-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.46-1.el6rhs.x86_64
vdsm-gluster-4.14.7.3-1.el6rhs.noarch
glusterfs-fuse-3.6.0.46-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.46-1.el6rhs.x86_64
[root@rhsauto034 b0]
[root@rhsauto034 b0]# glusterfsd --version
glusterfs 3.6.0.46 built on Feb 20 2015 12:32:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. <http://www.redhat.com/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
[root@rhsauto034 b0]#

Comment 13 Triveni Rao 2015-02-26 10:12:02 UTC
1. Volume creation:

[root@rhsauto032 ~]# gluster v info bug_test

Volume Name: bug_test
Type: Distribute
Volume ID: c8d9f0b4-33d4-492b-84c5-ebddfd7bfc78
Status: Started
Snap Volume: no
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhsauto032.lab.eng.blr.redhat.com:/rhs/brick4/b0
Brick2: rhsauto034.lab.eng.blr.redhat.com:/rhs/brick4/b0
Options Reconfigured:
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
[root@rhsauto032 ~]# cd /rhs/brick4/b0
[root@rhsauto032 b0]# ls
[root@rhsauto032 b0]# ls -la
total 0
drwxr-xr-x 3 root root 23 Feb 25 22:44 .
drwxr-xr-x 3 root root 15 Feb 25 22:44 ..
drw------- 6 root root 61 Feb 25 22:44 .glusterfs
[root@rhsauto032 b0]# 

2. On the mount point, dd a 5 GB file.

[root@rhsauto031 mnt2]# dd if=/dev/urandom of=rebal_test bs=5000M count=1

[root@rhsauto031 mnt2]# ls -la
total 5120004
drwxr-xr-x   3 root root         55 Feb 25 23:06 .
dr-xr-xr-x. 26 root root       4096 Feb 25 22:45 ..
-rw-r--r--   1 root root 5242880000 Feb 25 23:04 rebal_test
[root@rhsauto031 mnt2]# 


3. Add brick to the volume:

[root@rhsauto032 b0]# gluster v add-brick bug_test `hostname`:/rhs/brick5/b0 rhsauto034.lab.eng.blr.redhat.com:/rhs/brick5/b0
volume add-brick: success
[root@rhsauto032 b0]# gluster v info bug_test
 
Volume Name: bug_test
Type: Distribute
Volume ID: c8d9f0b4-33d4-492b-84c5-ebddfd7bfc78
Status: Started
Snap Volume: no
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhsauto032.lab.eng.blr.redhat.com:/rhs/brick4/b0
Brick2: rhsauto034.lab.eng.blr.redhat.com:/rhs/brick4/b0
Brick3: rhsauto032.lab.eng.blr.redhat.com:/rhs/brick5/b0
Brick4: rhsauto034.lab.eng.blr.redhat.com:/rhs/brick5/b0
Options Reconfigured:
performance.readdir-ahead: on
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable
[root@rhsauto032 b0]# 


4. Rename rebal_test to testing on the mount point:

[root@rhsauto031 mnt2]# ls -la
total 5120004
drwxr-xr-x   3 root root        109 Feb 26 00:22 .
dr-xr-xr-x. 26 root root       4096 Feb 25 22:45 ..
-rw-r--r--   1 root root 5242880000 Feb 25 23:04 rebal_test
[root@rhsauto031 mnt2]#
[root@rhsauto031 mnt2]# 


[root@rhsauto031 mnt2]# mv rebal_test testing
[root@rhsauto031 mnt2]# ls -la
total 5120004
drwxr-xr-x   3 root root        120 Feb 26 00:58 .
dr-xr-xr-x. 26 root root       4096 Feb 25 22:45 ..
-rw-r--r--   1 root root 5242880000 Feb 25 23:04 testing
[root@rhsauto031 mnt2]#
[root@rhsauto031 mnt2]#


5. The testing file is hashed to brick5 of the 032 host (see also the pathinfo check after the listing).

[root@rhsauto032 ~]# ls -la /rhs/brick4/b0
total 0
drwxr-xr-x 3 root root 23 Feb 25 22:44 .
drwxr-xr-x 3 root root 15 Feb 25 22:44 ..
drw------- 6 root root 61 Feb 25 22:44 .glusterfs
[root@rhsauto032 ~]# ls -la /rhs/brick5/b0
total 0
drwxr-xr-x 3 root root 37 Feb 26 00:58 .
drwxr-xr-x 3 root root 15 Feb 26 00:16 ..
drw------- 7 root root 70 Feb 26 00:58 .glusterfs
---------T 2 root root  0 Feb 26 00:58 testing
[root@rhsauto032 ~]# 
[root@rhsauto032 ~]# 
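Which brick actually backs a file can also be confirmed from the client through the virtual trusted.glusterfs.pathinfo xattr instead of listing every brick. A sketch, assuming the FUSE mount from the prompts above is /mnt2:

# Run on the client (rhsauto031); prints the backend brick path(s) for the file.
getfattr -n trusted.glusterfs.pathinfo -e text /mnt2/testing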


6. Kicked off the rebalance process.

[root@rhsauto032 ~]# gluster v rebalance bug_test start force
volume rebalance: bug_test: success: Rebalance on bug_test has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 69a7c505-fce1-4071-a31b-fe1e0bfc6ada

[root@rhsauto032 ~]# gluster v rebalance bug_test status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             1             0             0            completed               0.00
       rhsauto034.lab.eng.blr.redhat.com                0        0Bytes             1             0             0          in progress               6.00
volume rebalance: bug_test: success:
[root@rhsauto032 ~]#

[root@rhsauto032 ~]# gluster v rebalance bug_test status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             1             0             0            completed               0.00
       rhsauto034.lab.eng.blr.redhat.com                0        0Bytes             1             0             0          in progress              42.00
volume rebalance: bug_test: success:
[root@rhsauto032 ~]# gluster v rebalance bug_test status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             1             0             0            completed               0.00
       rhsauto034.lab.eng.blr.redhat.com                0        0Bytes             1             0             0          in progress              45.00
volume rebalance: bug_test: success:
[root@rhsauto032 ~]# gluster v rebalance bug_test status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             1             0             0            completed               0.00
       rhsauto034.lab.eng.blr.redhat.com                0        0Bytes             1             0             0          in progress              79.00
volume rebalance: bug_test: success:
[root@rhsauto032 ~]# 



7. While rebalance is going on, hard links are created:

[root@rhsauto031 mnt2]# ln testing min
[root@rhsauto031 mnt2]# ln testing max
[root@rhsauto031 mnt2]# ln testing normal
[root@rhsauto031 mnt2]# ln testing t1
[root@rhsauto031 mnt2]# ln testing t2
[root@rhsauto031 mnt2]# ln testing t3
[root@rhsauto031 mnt2]# ln testing t4
[root@rhsauto031 mnt2]# ln testing t5
[root@rhsauto031 mnt2]# ln testing t6
[root@rhsauto031 mnt2]# 

8. After rebalance completed:

[root@rhsauto032 ~]# gluster v rebalance bug_test status 
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             1             0             0            completed               0.00
       rhsauto034.lab.eng.blr.redhat.com                1         4.9GB             4             0             0            completed             345.00
volume rebalance: bug_test: success: 
[root@rhsauto032 ~]# 

9. On the backend, linkto (---------T) files are seen but the actual files are missing.

[root@rhsauto034 t2]# ls -la /rhs/brick4/b0
total 51200000
drwxr-xr-x  3 root root        124 Feb 26 02:24 .
drwxr-xr-x  3 root root         15 Feb 25 22:44 ..
drw-------  9 root root         88 Feb 25 23:04 .glusterfs
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 max
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 min
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 normal
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 t1
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 t2
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 t3
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 t4
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 t5
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 t6
-rw-r-Sr-T 11 root root 5242880000 Feb 25 23:04 testing

[root@rhsauto034 t2]# ls -la /rhs/brick5/b0
total 0
drwxr-xr-x 3 root root 60 Feb 26 02:24 .
drwxr-xr-x 3 root root 15 Feb 26 00:16 ..
drw------- 7 root root 70 Feb 26 02:23 .glusterfs
---------T 5 root root  0 Feb 26 02:23 min
---------T 5 root root  0 Feb 26 02:23 t3
---------T 5 root root  0 Feb 26 02:23 t5
---------T 5 root root  0 Feb 26 02:23 t6
[root@rhsauto034 t2]# 
[root@rhsauto034 t2]# 


[root@rhsauto032 ~]# ls -la /rhs/brick4/b0
total 0
drwxr-xr-x 3 root root 54 Feb 26 02:24 .
drwxr-xr-x 3 root root 15 Feb 25 22:44 ..
drw------- 7 root root 70 Feb 26 02:24 .glusterfs
---------T 4 root root  0 Feb 26 02:24 normal
---------T 4 root root  0 Feb 26 02:24 t1
---------T 4 root root  0 Feb 26 02:24 t4
[root@rhsauto032 ~]# 
[root@rhsauto032 ~]

[root@rhsauto032 ~]# ls -la /rhs/brick5/b0
total 14185344
drwxr-xr-x 3 root root         56 Feb 26 02:24 .
drwxr-xr-x 3 root root         15 Feb 26 00:16 ..
drw------- 7 root root         70 Feb 26 00:58 .glusterfs
---------T 4 root root 5242880000 Feb 26 02:28 max
---------T 4 root root 5242880000 Feb 26 02:28 t2
---------T 4 root root 5242880000 Feb 26 02:28 testing
[root@rhsauto032 ~]#



10. On the mount point, I don't see the hard links created during rebalance.

[root@rhsauto031 mnt2]# ls -la
total 15360004
drwxr-xr-x   3 root root        280 Feb 26 02:29 .
dr-xr-xr-x. 26 root root       4096 Feb 25 22:45 ..
-rw-r--r--   3 root root 5242880000 Feb 25 23:04 max
-rw-r--r--   3 root root 5242880000 Feb 25 23:04 t2
-rw-r--r--   3 root root 5242880000 Feb 25 23:04 testing
[root@rhsauto031 mnt2]# 


11. Rebalance log messages on the 034 host:

[2015-02-25 20:53:31.145198] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-bug_test-client-3: changing port to 49158 (from 0)
[2015-02-25 20:53:31.148069] I [client-handshake.c:1412:select_server_supported_programs] 0-bug_test-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-02-25 20:53:31.148752] I [client-handshake.c:1200:client_setvolume_cbk] 0-bug_test-client-2: Connected to bug_test-client-2, attached to remote volume '/rhs/brick5/b0'.
[2015-02-25 20:53:31.148777] I [client-handshake.c:1210:client_setvolume_cbk] 0-bug_test-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2015-02-25 20:53:31.149304] I [client-handshake.c:187:client_set_lk_version_cbk] 0-bug_test-client-2: Server lk version = 1
[2015-02-25 20:53:31.151116] I [client-handshake.c:1412:select_server_supported_programs] 0-bug_test-client-3: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2015-02-25 20:53:31.151679] I [client-handshake.c:1200:client_setvolume_cbk] 0-bug_test-client-3: Connected to bug_test-client-3, attached to remote volume '/rhs/brick5/b0'.
[2015-02-25 20:53:31.151701] I [client-handshake.c:1210:client_setvolume_cbk] 0-bug_test-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2015-02-25 20:53:31.152056] I [client-handshake.c:187:client_set_lk_version_cbk] 0-bug_test-client-3: Server lk version = 1
[2015-02-25 20:53:31.168839] I [dht-common.c:3252:dht_setxattr] 0-bug_test-dht: fixing the layout of /
[2015-02-25 20:53:31.171765] I [dict.c:371:dict_get] (-->/usr/lib64/glusterfs/3.6.0.46/xlator/cluster/distribute.so(dht_inodelk_done+0x4e) [0x7f2d402102ce] (-->/usr/lib64/glusterfs/3.6.0.46/xlator/cluster/distribute.so(dht_selfheal_layout_lock_cbk+0x10) [0x7f2d4021b110] (-->/usr/lib64/glusterfs/3.6.0.46/xlator/cluster/distribute.so(dht_refresh_layout+0xb3) [0x7f2d4021ae33]))) 0-dict: !this || key=trusted.glusterfs.dht
[2015-02-25 20:53:31.171829] W [dict.c:329:dict_set] (-->/usr/lib64/glusterfs/3.6.0.46/xlator/cluster/distribute.so(dht_selfheal_layout_lock_cbk+0x10) [0x7f2d4021b110] (-->/usr/lib64/glusterfs/3.6.0.46/xlator/cluster/distribute.so(dht_refresh_layout+0x30e) [0x7f2d4021b08e] (-->/usr/lib64/libglusterfs.so.0(dict_set_uint32+0x3b) [0x3ff481cceb]))) 0-dict: !this || !value for key=trusted.glusterfs.dht
[2015-02-25 20:53:31.171844] W [MSGID: 109003] [dht-selfheal.c:270:dht_refresh_layout] 0-bug_test-dht: /: Failed to set dictionary value:key = trusted.glusterfs.dht
[2015-02-25 20:53:31.175914] I [dht-rebalance.c:1430:gf_defrag_migrate_data] 0-bug_test-dht: migrate data called on /
[2015-02-25 20:53:31.180037] I [dht-rebalance.c:902:dht_migrate_file] 0-bug_test-dht: /testing: attempting to move from bug_test-client-1 to bug_test-client-2
[2015-02-25 20:53:37.514610] I [MSGID: 109028] [dht-rebalance.c:2135:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 6.00 secs
[2015-02-25 20:53:37.514654] I [MSGID: 109028] [dht-rebalance.c:2139:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 1, failures: 0, skipped: 0

[2015-02-25 20:54:50.227018] I [MSGID: 109028] [dht-rebalance.c:2135:gf_defrag_status_get] 0-glusterfs: Rebalance is in progress. Time taken is 79.00 secs
[2015-02-25 20:54:50.227069] I [MSGID: 109028] [dht-rebalance.c:2139:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 1, failures: 0, skipped: 0
[2015-02-25 20:59:16.523523] I [MSGID: 109022] [dht-rebalance.c:1180:dht_migrate_file] 0-bug_test-dht: completed migration of /testing from subvolume bug_test-client-1 to bug_test-client-2
[2015-02-25 20:59:16.538494] I [dht-rebalance.c:1673:gf_defrag_migrate_data] 0-bug_test-dht: Migration operation on dir / took 345.36 secs
[2015-02-25 20:59:16.551257] I [MSGID: 109028] [dht-rebalance.c:2135:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 345.00 secs
[2015-02-25 20:59:16.551307] I [MSGID: 109028] [dht-rebalance.c:2139:gf_defrag_status_get] 0-glusterfs: Files migrated: 1, size: 5242880000, lookups: 4, failures: 0, skipped: 0
[2015-02-25 20:59:16.552294] W [glusterfsd.c:1183:cleanup_and_exit] (--> 0-: received signum (15), shutting down

Comment 14 Shyamsundar 2015-03-02 20:31:06 UTC
Created attachment 997223 [details]
attempted test case

Not a cause of the bug, but an observation on the results:

Between steps 4 and 5 there is no layout fix, so the root of the volume retains the older layout, which has the range 0-0 for the newly added brick; hence a rename cannot land on the new brick, as the root layout does not include it. Please verify the statement to this effect against the above comment.
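The layout on the volume root can be checked directly on the bricks to confirm this; trusted.glusterfs.dht on a brick's directory holds the hash range assigned to that brick for that directory. A sketch, assuming root access on the servers and the brick paths from comment 13:

# Run on each server; a newly added brick with no layout fix will have
# no range (or a 0-0 range) for the root directory.
for b in /rhs/brick4/b0 /rhs/brick5/b0; do
    echo "$b:"
    getfattr -n trusted.glusterfs.dht -e hex --absolute-names "$b" 2>/dev/null
done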

I tried reproducing this upstream (master) with the attached script, which mostly follows the steps given here, but was unable to do so. Testing against the RHS 3.0.x code base now.

Comment 15 Shyamsundar 2015-03-02 21:07:18 UTC
Tried with the RHS 3.0 branch downstream as well; no luck reproducing it with a single host (tried adding 2 bricks instead of 1, as in the script attached in comment #14).

@Triveni, could you repro it on your setup and give me access to the same to take a look at what is happening?

Another thing: it looks like rebalance has not cleared the sticky and SGID bits on the source file. In my opinion there is another issue here, beyond the links, that is causing this.
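Leftover source files that still carry these bits can be spotted with a quick scan on each brick; a sketch, assuming the brick paths from comment 13:

# Lists regular files on the bricks that still have any of the
# setuid/setgid/sticky bits set (linkto files and any not-cleaned-up
# migration source files will show up here).
find /rhs/brick*/b0 -maxdepth 1 -type f -perm /7000 -exec stat -c '%A %h %s %n' {} +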

Comment 18 Triveni Rao 2015-03-04 07:13:01 UTC
Could not reproduce this issue.

Comment 19 Bhavana 2015-03-22 18:10:52 UTC
Hi Nithya,

The doc text is modified. Please review it and sign off if it looks OK.

Comment 22 errata-xmlrpc 2015-03-26 06:34:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0682.html

