Description of problem:
While rebalance is in progress, renaming the files causes data loss.

Version-Release number of selected component (if applicable):
3.3.0.10rhs-1.el6.x86_64

How reproducible:

Steps to Reproduce:
1. Created a 2-brick distribute volume.
2. Mounted the volume and created 10000 files.
3. Ran add-brick and initiated rebalance; while rebalance was in progress, renamed all the files:

   for i in {1..10000}; do mv $i new$i; done

Actual results:
On the mount point we see messages like:

[root@localhost mymount]# for i in {1..10000}; do mv $i new$i ; done
mv: cannot move `1747' to `new1747': File exists
mv: cannot move `2485' to `new2485': File exists
mv: cannot move `2577' to `new2577': File exists
mv: cannot move `2586' to `new2586': File exists
mv: cannot move `3455' to `new3455': Structure needs cleaning
mv: cannot move `3626' to `new3626': No such file or directory
mv: cannot move `3829' to `new3829': No such file or directory
mv: cannot move `3830' to `new3830': No such file or directory
mv: cannot move `3944' to `new3944': Structure needs cleaning
mv: cannot move `3963' to `new3963': Structure needs cleaning
mv: cannot move `4180' to `new4180': Structure needs cleaning
mv: cannot move `4506' to `new4506': Structure needs cleaning
mv: cannot move `4591' to `new4591': Structure needs cleaning
mv: cannot move `4601' to `new4601': Structure needs cleaning
mv: cannot move `4602' to `new4602': Structure needs cleaning
mv: cannot move `4611' to `new4611': Structure needs cleaning
mv: cannot move `4644' to `new4644': No such file or directory
mv: cannot move `4709' to `new4709': No such file or directory
mv: cannot move `4814' to `new4814': No such file or directory
mv: cannot move `4835' to `new4835': Structure needs cleaning
mv: cannot move `4852' to `new4852': No such file or directory
mv: cannot move `5009' to `new5009': No such file or directory

The number of files has decreased:

[root@localhost mymount]# ls | wc -l
9922

Mount log says:
================
02:06.250800] W [dht-rename.c:482:dht_rename_cbk] 1-dist1-dht: /4852: rename on dist1-client-0 failed (No such file or directory)
[2013-05-31 12:02:06.251172] W [fuse-bridge.c:1528:fuse_rename_cbk] 0-glusterfs-fuse: 103658: /4852 -> /new4852 => -1 (No such file or directory)
[2013-05-31 12:02:06.314825] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4858 on dist1-client-2 (following linkfile) reached link
[2013-05-31 12:02:06.316650] W [client3_1-fops.c:258:client3_1_mknod_cbk] 1-dist1-client-2: remote operation failed: File exists. Path: /4858 (00000000-0000-0000-0000-000000000000)
[2013-05-31 12:02:06.930730] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4909 on dist1-client-0 (following linkfile) reached link
[2013-05-31 12:02:06.931329] W [dht-common.c:983:dht_lookup_everywhere_cbk] 1-dist1-dht: multiple subvolumes (dist1-client-1 and dist1-client-0) have file /4909 (preferably rename the file in the backend, and do a fresh lookup)
[2013-05-31 12:02:06.933437] W [client3_1-fops.c:258:client3_1_mknod_cbk] 1-dist1-client-0: remote operation failed: File exists. Path: /4909 (00000000-0000-0000-0000-000000000000)
[2013-05-31 12:02:06.977724] W [dht-rename.c:334:dht_rename_unlink_cbk] 1-dist1-dht: /4912: unlink on dist1-client-1 failed (No such file or directory)
[2013-05-31 12:02:07.022107] I [dht-common.c:997:dht_lookup_everywhere_cbk] 1-dist1-dht: deleting stale linkfile /4916 on dist1-client-2
[2013-05-31 12:02:07.282241] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4939 on dist1-client-0 (following linkfile) reached link
[2013-05-31 12:02:07.283273] W [client3_1-fops.c:258:client3_1_mknod_cbk] 1-dist1-client-0: remote operation failed: File exists. Path: /4939 (00000000-0000-0000-0000-000000000000)
[2013-05-31 12:02:07.284610] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4939 on dist1-client-0 (following linkfile) reached link
[2013-05-31 12:02:07.288314] W [client3_1-fops.c:258:client3_1_mknod_cbk] 1-dist1-client-0: remote operation failed: File exists. Path: /4939 (00000000-0000-0000-0000-000000000000)
[2013-05-31 12:02:07.325669] I [dht-common.c:1103:dht_lookup_linkfile_cbk] 1-dist1-dht: lookup of /4942 on dist1-client-0 (following linkfile) reached link
[root@anshi1 ~]# gluster v info dist1

Volume Name: dist1
Type: Distribute
Volume ID: ed55b825-0805-49c8-873c-8447681e687c
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.213:/brick2/dist1
Brick2: 10.70.35.230:/brick2/dist2
Brick3: 10.70.35.213:/brick2/dist3
Moving the target to rhs-2.1.0.
Dev ack to 3.0 RHS BZs
In the rebalance logs we saw many files migrated from the cached to the hashed subvolume. After migration, rebalance should unlink the source file, but the unlink fails. That means a rename came in between and caused the data loss. The race could be something like this:

  Src-cached    Dst-hashed
  A (data)      A (linkto)

mv A B arrives: A should get renamed to B on src-cached, and the A linkto should get deleted on dst-hashed. But migration moved the file in between, so dst-hashed now holds the data for A and src-cached is left with only the linkto. The in-flight rename then still does:

  rename A -> B   (renames what is now the linkto on src-cached)
  unlink A        (removes what is now the data file on dst-hashed)

So we are left with a renamed linkto file, and the data is lost.
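The sequence above can be replayed as a toy simulation in plain shell, with two local directories standing in for the cached and hashed bricks and a zero-byte file standing in for the linkto. This is purely illustrative under made-up paths; none of it is GlusterFS code:

```shell
#!/bin/sh
set -e
rm -rf /tmp/dht-race
mkdir -p /tmp/dht-race/src_cached /tmp/dht-race/dst_hashed
cd /tmp/dht-race

echo "payload" > src_cached/A   # data file on the cached subvol
: > dst_hashed/A                # zero-byte "linkto" on the hashed subvol

# Migration completes first and swaps the roles of the two copies:
# the data now lives on dst_hashed, src_cached keeps only a linkto.
mv src_cached/A dst_hashed/A.mig
mv dst_hashed/A src_cached/A
mv dst_hashed/A.mig dst_hashed/A

# The in-flight "mv A B" still acts on the old layout:
mv src_cached/A src_cached/B    # renames what is now the linkto
rm dst_hashed/A                 # unlinks what is now the real data

# Only the renamed zero-byte linkto survives; the payload is gone.
ls -l src_cached/B
```

Running it leaves an empty src_cached/B and nothing on dst_hashed, matching the "renamed linkto file" end state described above.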
As discussed with Engineering Leads, marked as a blocker because the dependent BZ 1127748 is a blocker.
*** Bug 1136838 has been marked as a duplicate of this bug. ***
Verified on glusterfs-3.6.0.28-1.
Verified by renaming 100 files constantly in a loop while simultaneously doing add-brick + rebalance.

Result: no data loss.

for i in {1..1000}; do for j in {1..100}; do mv f$j-$i f$j-`expr $i + 1`; done; done
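As a sanity check of the loop itself, the same rename pattern can be run against a plain local directory (hypothetical f$j-$i names as above, fewer outer rounds for brevity). Every round bumps each file's numeric suffix by one, so the file count must stay constant; the verification run checks that this invariant also holds on the gluster mount during rebalance:

```shell
#!/bin/sh
set -e
rm -rf /tmp/rename-check
mkdir -p /tmp/rename-check
cd /tmp/rename-check

# 100 starting files: f1-1 .. f100-1
for j in $(seq 1 100); do : > f$j-1; done

# Same shape as the verification loop, 10 outer rounds instead of 1000
for i in $(seq 1 10); do
  for j in $(seq 1 100); do
    mv f$j-$i f$j-$(expr $i + 1)
  done
done

ls | wc -l   # still 100 files, now named f1-11 .. f100-11
```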
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html