Bug 969298 - Renaming file while rebalance is in progress causes data loss
Summary: Renaming file while rebalance is in progress causes data loss
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: vsomyaju
QA Contact: shylesh
URL:
Whiteboard:
Duplicates: 1136838 (view as bug list)
Depends On:
Blocks: 987422 1127748 1130888 1131044 1138395 1139998 1140348 1146895 1286059
 
Reported: 2013-05-31 07:06 UTC by shylesh
Modified: 2015-11-27 10:26 UTC (History)
7 users

Fixed In Version: glusterfs-3.6.0.28-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1130888 1146895 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:28:15 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description shylesh 2013-05-31 07:06:04 UTC
Description of problem:
Renaming files while a rebalance is in progress causes data loss.

Version-Release number of selected component (if applicable):
3.3.0.10rhs-1.el6.x86_64

How reproducible:


Steps to Reproduce:
1. Created a 2-brick distribute volume.
2. Mounted the volume and created 10000 files.
3. Added a brick and initiated rebalance; while the rebalance was in progress, renamed all the files:
for i in {1..10000}
do
mv $i new$i
done
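The setup in steps 1-3 can be sketched with the gluster CLI; the hostnames and brick paths below are illustrative, borrowed from the volume info in comment 1, and would need to match a real cluster:

```shell
# Step 1: create and start a 2-brick distribute volume.
gluster volume create dist1 10.70.35.213:/brick2/dist1 10.70.35.230:/brick2/dist2
gluster volume start dist1

# Step 2: mount the volume and create 10000 files.
mount -t glusterfs 10.70.35.213:/dist1 /mnt/mymount
cd /mnt/mymount && touch {1..10000}

# Step 3: add a brick and start rebalance, then rename while it runs.
gluster volume add-brick dist1 10.70.35.213:/brick2/dist3
gluster volume rebalance dist1 start
for i in {1..10000}; do mv $i new$i; done
```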

Actual results:

on the mount point we can see the messages like
[root@localhost mymount]# for i in {1..10000}; do mv $i new$i ; done
mv: cannot move `1747' to `new1747': File exists
mv: cannot move `2485' to `new2485': File exists
mv: cannot move `2577' to `new2577': File exists
mv: cannot move `2586' to `new2586': File exists
mv: cannot move `3455' to `new3455': Structure needs cleaning
mv: cannot move `3626' to `new3626': No such file or directory
mv: cannot move `3829' to `new3829': No such file or directory
mv: cannot move `3830' to `new3830': No such file or directory
mv: cannot move `3944' to `new3944': Structure needs cleaning
mv: cannot move `3963' to `new3963': Structure needs cleaning
mv: cannot move `4180' to `new4180': Structure needs cleaning
mv: cannot move `4506' to `new4506': Structure needs cleaning
mv: cannot move `4591' to `new4591': Structure needs cleaning
mv: cannot move `4601' to `new4601': Structure needs cleaning
mv: cannot move `4602' to `new4602': Structure needs cleaning
mv: cannot move `4611' to `new4611': Structure needs cleaning
mv: cannot move `4644' to `new4644': No such file or directory
mv: cannot move `4709' to `new4709': No such file or directory
mv: cannot move `4814' to `new4814': No such file or directory
mv: cannot move `4835' to `new4835': Structure needs cleaning
mv: cannot move `4852' to `new4852': No such file or directory
mv: cannot move `5009' to `new5009': No such file or directory


The number of files has decreased:
[root@localhost mymount]# ls | wc -l
9922


The mount log shows:
================
[2013-05-31 12:02:06.250800] W [dht-rename.c:482:dht_rename_cbk] 1-dist1-dht: /4852: rename on dist1-client-0 failed (No such file or directory)
[2013-05-31 12:02:06.251172] W [fuse-bridge.c:1528:fuse_rename_cbk] 0-glusterfs-fuse: 103658: /4852 -> /new4852 => -1 (No such file or directory)
[2013-05-31 12:02:06.314825] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4858 on dist1-client-2 (following linkfile) reached link
[2013-05-31 12:02:06.316650] W [client3_1-fops.c:258:client3_1_mknod_cbk] 1-dist1-client-2: remote operation failed: File exists. Path: /4858 (00000000-0000-0000-0000-000000000000)
[2013-05-31 12:02:06.930730] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4909 on dist1-client-0 (following linkfile) reached link
[2013-05-31 12:02:06.931329] W [dht-common.c:983:dht_lookup_everywhere_cbk] 1-dist1-dht: multiple subvolumes (dist1-client-1 and dist1-client-0) have file /4909 (preferably rename the file in the backend, and do a fresh lookup)
[2013-05-31 12:02:06.933437] W [client3_1-fops.c:258:client3_1_mknod_cbk] 1-dist1-client-0: remote operation failed: File exists. Path: /4909 (00000000-0000-0000-0000-000000000000)
[2013-05-31 12:02:06.977724] W [dht-rename.c:334:dht_rename_unlink_cbk] 1-dist1-dht: /4912: unlink on dist1-client-1 failed (No such file or directory)
[2013-05-31 12:02:07.022107] I [dht-common.c:997:dht_lookup_everywhere_cbk] 1-dist1-dht: deleting stale linkfile /4916 on dist1-client-2
[2013-05-31 12:02:07.282241] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4939 on dist1-client-0 (following linkfile) reached link
[2013-05-31 12:02:07.283273] W [client3_1-fops.c:258:client3_1_mknod_cbk] 1-dist1-client-0: remote operation failed: File exists. Path: /4939 (00000000-0000-0000-0000-000000000000)
[2013-05-31 12:02:07.284610] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4939 on dist1-client-0 (following linkfile) reached link
[2013-05-31 12:02:07.288314] W [client3_1-fops.c:258:client3_1_mknod_cbk] 1-dist1-client-0: remote operation failed: File exists. Path: /4939 (00000000-0000-0000-0000-000000000000)
[2013-05-31 12:02:07.325669] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4942 on dist1-client-0 (following linkfile) reached link

Comment 1 shylesh 2013-05-31 07:06:30 UTC
[root@anshi1 ~]# gluster v info dist1
 
Volume Name: dist1
Type: Distribute
Volume ID: ed55b825-0805-49c8-873c-8447681e687c
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.213:/brick2/dist1
Brick2: 10.70.35.230:/brick2/dist2
Brick3: 10.70.35.213:/brick2/dist3

Comment 4 Amar Tumballi 2013-07-11 10:47:19 UTC
moving the target to rhs-2.1.0

Comment 5 Nagaprasad Sathyanarayana 2014-05-06 11:43:37 UTC
Dev ack to 3.0 RHS BZs

Comment 8 vsomyaju 2014-08-11 11:59:23 UTC
In the rebalance logs, we saw many files migrated from the cached to the hashed subvolume. After migration, rebalance should unlink the source data file, but the unlink fails. This means a rename came in between the migration and the unlink, and caused the data loss.


The race could be something like this:

                            Src (cached)    Dst (hashed)
Initial state:              A (data)        A (linkto)

An `mv A B` arrives: A should get renamed to B at src-cached, and the
A (linkto) at dst-hashed should get deleted.

Meanwhile, migration moves
the file from src to dst:   A (linkto)      A (data)

The rename then operates
on the swapped layout:      rename A->B     unlink A

So we are left with only a renamed linkto file; the migrated data file was unlinked.
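The leftover this race produces can be spotted directly on the bricks: a DHT linkto file is a zero-byte regular file whose permission bits are just the sticky bit (mode 01000), with the real marker being a `trusted.glusterfs.dht.linkto` xattr. A minimal sketch of scanning a brick for such leftovers, simulated here with a temporary directory and a hypothetical file name (`new4852` echoes the log above), not a real brick:

```shell
# Simulate a brick directory containing one stale linkto-style file:
# zero bytes, mode exactly 01000 (sticky bit only).
BRICK=$(mktemp -d)
touch "$BRICK/new4852"
chmod 1000 "$BRICK/new4852"

# List candidate stale linkto files on the "brick".
find "$BRICK" -type f -perm 1000 -size 0
```

On a real deployment the xattr should be checked too (e.g. with `getfattr -m .`), since an ordinary file could in principle carry the same mode.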

Comment 10 ssamanta 2014-09-02 05:02:31 UTC
As discussed with Engineering Leads, marked as a blocker because dependent BZ 1127748 is a blocker.

Comment 12 Kaushal 2014-09-03 12:18:32 UTC
*** Bug 1136838 has been marked as a duplicate of this bug. ***

Comment 13 shylesh 2014-09-19 07:49:06 UTC
Verified on glusterfs-3.6.0.28-1.

Comment 14 shylesh 2014-09-19 07:51:24 UTC
Verified by constantly renaming 100 files in a loop while simultaneously doing add-brick + rebalance.

Result : No data loss



for i in {1..1000}; do for j in {1..100}; do mv f$j-$i f$j-`expr $i + 1`; done; done

Comment 16 errata-xmlrpc 2014-09-22 19:28:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

