Bug 988419 - Rebalance : Files missing on mount point after stopping and starting rebalance while rebalance process was running
Summary: Rebalance : Files missing on mount point after stopping and starting rebalance while rebalance process was running
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: shishir gowda
QA Contact: senaik
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-07-25 14:49 UTC by senaik
Modified: 2015-09-01 12:23 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-08-08 03:26:18 UTC
Embargoed:



Description senaik 2013-07-25 14:49:26 UTC
Description of problem:
=========================
Files are missing on the mount point after stopping and then restarting rebalance while the rebalance process was running.


Version-Release number of selected component (if applicable):
===========================================================
3.4.0.12rhs.beta6-1.el6rhs.x86_64

How reproducible:


Steps to Reproduce:
=================== 
1.Create a distribute volume and start it 

2.Fuse mount the volume and create some files 
for i in {1..500} ; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done

3. Calculate the arequal checksum of the mount point
[root@RHEL6 Volume1]# /opt/qa/tools/arequal-checksum /mnt/Volume1

Entry counts
Regular files   : 500
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 501

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : eb1faeaf1dc9dcfc68a6cb49ce3231f0
Directories     : 30312a00
Symbolic links  : 0
Other           : 0
Total           : 83b965e6e3cac70c

4) add 2 bricks and start rebalance 

5) While rebalance is running, stop the rebalance process

gluster v rebalance Volume1 status

 Node Rebalanced-files    size   scanned    failures   status run time in secs
localhost    29       290.0MB      30         0    in progress          6.00
10.70.34.86  28       280.0MB     232         0    in progress          6.00
10.70.34.87  0        0Bytes      517        80      completed          2.00
10.70.34.88  6        60.0MB      529        26      completed          3.00
volume rebalance: Volume1: success: 


gluster v rebalance Volume1 stop

 Node Rebalanced-files    size   scanned    failures   status run time in secs
localhost    63       630.0MB      72         8       stopped          12.00
10.70.34.86  28       420.0MB     588         0      completed         10.00
10.70.34.87  0        0Bytes      517        80      completed          2.00
10.70.34.88  6        60.0MB      529        26      completed          3.00
volume rebalance: Volume1: success: rebalance process may be in the middle of a file migration.
The process will be fully stopped once the migration of the file is complete.
Please check rebalance process for completion before doing any further brick related tasks on the volume.

6) executed the rebalance stop command 3-4 times
7) started rebalance again
8) checked rebalance status
9) after rebalance completed, checked the arequal checksum

[root@RHEL6 Volume1]# /opt/qa/tools/arequal-checksum /mnt/Volume1

Entry counts
Regular files   : 499
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 500

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : d2975638243ab0d13de03c5cfa867cb0
Directories     : 30021b66
Symbolic links  : 0
Other           : 0
Total           : ef776a64eebed707

The regular file count has changed from 500 to 499.

Note: Tried unmounting and remounting the volume; the file was still missing.
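
For convenience, the steps above can be consolidated into a rough shell sketch. The volume name, brick hosts/paths, mount point and the arequal tool path are taken from this report; the polling loop used to wait for rebalance completion is an assumption, not part of the original run.

# Rough reproduction sketch based on the steps above (not a verified script)
VOL=Volume1
MNT=/mnt/Volume1

# 1-2) create and start a distribute volume, fuse mount it, create files
gluster volume create $VOL 10.70.34.85:/rhs/brick1/d1 10.70.34.86:/rhs/brick1/d2 \
    10.70.34.87:/rhs/brick1/d3 10.70.34.88:/rhs/brick1/d4
gluster volume start $VOL
mount -t glusterfs 10.70.34.85:/$VOL $MNT
cd $MNT
for i in {1..500}; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done

# 3) record the arequal checksum before rebalance
/opt/qa/tools/arequal-checksum $MNT

# 4) add 2 bricks and start rebalance
gluster volume add-brick $VOL 10.70.34.85:/rhs/brick1/d5 10.70.34.86:/rhs/brick1/d6
gluster volume rebalance $VOL start

# 5-6) stop rebalance while it is still in progress (repeated 3-4 times in the report)
gluster volume rebalance $VOL status
gluster volume rebalance $VOL stop

# 7-8) start rebalance again and wait until no node reports "in progress"
gluster volume rebalance $VOL start
while gluster volume rebalance $VOL status | grep -q "in progress"; do sleep 5; done

# 9) compare the arequal checksum after rebalance
/opt/qa/tools/arequal-checksum $MNT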


Actual results:
=================
Files are missing on the mount point.


Expected results:
=================
There should be no files missing on the mount point.



Additional info:
==================


Missing file info:
------------------
f13

[root@jay brick1]# ls -l */f13

---------T. 2 root root       0 Jul 25 18:58 d6/f13


[root@jay brick1]# getfattr -m . -d -e hex /rhs/brick1/d6/f13
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/d6/f13
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0xaab1adad2103416eb1e54b598264fd55
trusted.glusterfs.dht.linkto=0x566f6c756d65312d636c69656e742d3500


[root@jay brick1]# getfattr -m . -d -e text /rhs/brick1/d6/f13
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/d6/f13
security.selinux="unconfined_u:object_r:file_t:s0"
trusted.gfid="����!An��KY�d�U"
trusted.glusterfs.dht.linkto="Volume1-client-5"
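
The linkto xattr above names the DHT subvolume that is supposed to hold the data for f13. For reference, DHT linkfiles are zero-byte files with only the sticky bit set (mode ---------T) carrying a trusted.glusterfs.dht.linkto xattr; a quick sketch for listing them on a brick and seeing where they point (brick path taken from this report, run on the node hosting the brick):

# List DHT linkfiles on the brick and show which subvolume each one points to
find /rhs/brick1/d6 -type f -perm 1000 -size 0 \
    -exec getfattr -n trusted.glusterfs.dht.linkto -e text --absolute-names {} \;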



[root@boost brick1]# gluster v i Volume1
 
Volume Name: Volume1
Type: Distribute
Volume ID: ca804585-b8bc-4804-8484-928442bbc698
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/d1
Brick2: 10.70.34.86:/rhs/brick1/d2
Brick3: 10.70.34.87:/rhs/brick1/d3
Brick4: 10.70.34.88:/rhs/brick1/d4
Brick5: 10.70.34.85:/rhs/brick1/d5
Brick6: 10.70.34.86:/rhs/brick1/d6

Comment 4 shishir gowda 2013-08-07 08:08:35 UTC
There are no errors reported in the logs for the missing file.

[shishirng@sgowda new]$ grep "f13 " */var/log/glusterfs/Volume1-rebalance.log
jay-2013072520201374763856/var/log/glusterfs/Volume1-rebalance.log:[2013-07-25 13:45:31.918180] I [dht-rebalance.c:872:dht_migrate_file] 0-Volume1-dht: completed migration of /f13 from subvolume Volume1-client-1 to Volume1-client-5
junior-2013072517581374755297/var/log/glusterfs/Volume1-rebalance.log:[2013-07-25 11:22:05.295182] I [dht-common.c:1051:dht_lookup_everywhere_cbk] 0-Volume1-dht: deleting stale linkfile /f13 on Volume1-client-5
kori-2013072517591374755347/var/log/glusterfs/Volume1-rebalance.log:[2013-07-25 11:22:05.778611] I [dht-common.c:1051:dht_lookup_everywhere_cbk] 0-Volume1-dht: deleting stale linkfile /f13 on Volume1-client-5

The only error seen seems to be the linkfile pointing to itself.

volume Volume1-client-5
      type protocol/client
      option remote-host 10.70.34.86
      option remote-subvolume /rhs/brick1/d6
      option transport-type socket
      option username 77bc7b46-d99e-4695-82be-3ec1251d3904
      option password fa57f711-4283-4d13-865b-7e2be9a0f6d9
end-volume


[root@jay brick1]# getfattr -m . -d -e text /rhs/brick1/d6/f13
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/d6/f13
security.selinux="unconfined_u:object_r:file_t:s0" <=== selinux is on
trusted.gfid=0xaab1adad2103416eb1e54b598264fd55
trusted.glusterfs.dht.linkto="Volume1-client-5" <=== pointing to itself
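
A rough way to confirm the self-reference is to compare the linkfile's linkto value with the client subvolume that backs the brick it lives on (per the volfile snippet above, Volume1-client-5 is 10.70.34.86:/rhs/brick1/d6). The volfile path below is an assumption; glusterd normally keeps generated volfiles under /var/lib/glusterd/vols/<volname>/.

# Which subvolume does the linkfile point to?
getfattr -n trusted.glusterfs.dht.linkto -e text --absolute-names /rhs/brick1/d6/f13
# => "Volume1-client-5"

# Which brick backs Volume1-client-5? (volfile location is an assumption)
grep -A 4 "volume Volume1-client-5" /var/lib/glusterd/vols/Volume1/trusted-Volume1-fuse.vol
# If remote-subvolume is /rhs/brick1/d6 on this same host, the linkfile points to itself.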

Additionally, the following errors are reported in the logs:
[2013-07-25 13:51:02.779050] I [dht-layout.c:749:dht_layout_dir_mismatch] 0-Volume1-dht: subvol: Volume1-client-0; inode layout - 0 - 1073741822; disk layout - 2147483646 - 2863311527
[2013-07-25 13:51:02.779095] I [dht-common.c:654:dht_revalidate_cbk] 0-Volume1-dht: mismatching layouts for /
[2013-07-25 13:51:02.779393] I [dht-layout.c:749:dht_layout_dir_mismatch] 0-Volume1-dht: subvol: Volume1-client-1; inode layout - 1073741823 - 2147483645; disk layout - 1431655764 - 2147483645
[2013-07-25 13:51:02.779413] I [dht-common.c:654:dht_revalidate_cbk] 0-Volume1-dht: mismatching layouts for /
[2013-07-25 13:51:02.781766] I [dht-layout.c:749:dht_layout_dir_mismatch] 0-Volume1-dht: subvol: Volume1-client-2; inode layout - 2147483646 - 3221225468; disk layout - 2863311528 - 3579139409
[2013-07-25 13:51:02.781867] I [dht-common.c:654:dht_revalidate_cbk] 0-Volume1-dht: mismatching layouts for /
[2013-07-25 13:51:02.781945] I [dht-layout.c:749:dht_layout_dir_mismatch] 0-Volume1-dht: subvol: Volume1-client-3; inode layout - 3221225469 - 4294967295; disk layout - 3579139410 - 4294967295
[2013-07-25 13:51:02.781961] I [dht-common.c:654:dht_revalidate_cbk] 0-Volume1-dht: mismatching layouts for /
[2013-07-25 13:51:02.783531] I [dht-layout.c:636:dht_layout_normalize] 0-Volume1-dht: found anomalies in /. holes=1 overlaps=0 missing=0 down=0 misc=0
[2013-07-25 13:51:02.902111] I [dht-layout.c:749:dht_layout_dir_mismatch] 1-Volume1-dht: subvol: Volume1-client-1; inode layout - 1431655764 - 2147483645; disk layout - 1073741823 - 2147483645
[2013-07-25 13:51:02.902150] I [dht-common.c:654:dht_revalidate_cbk] 1-Volume1-dht: mismatching layouts for /
[2013-07-25 13:51:02.902196] I [dht-layout.c:749:dht_layout_dir_mismatch] 1-Volume1-dht: subvol: Volume1-client-0; inode layout - 2147483646 - 2863311527; disk layout - 0 - 1073741822
[2013-07-25 13:51:02.902212] I [dht-common.c:654:dht_revalidate_cbk] 1-Volume1-dht: mismatching layouts for /
[2013-07-25 13:51:02.902323] I [dht-layout.c:749:dht_layout_dir_mismatch] 1-Volume1-dht: subvol: Volume1-client-2; inode layout - 2863311528 - 3579139409; disk layout - 2147483646 - 3221225468
[2013-07-25 13:51:02.902341] I [dht-common.c:654:dht_revalidate_cbk] 1-Volume1-dht: mismatching layouts for /
[2013-07-25 13:51:02.902390] I [dht-layout.c:749:dht_layout_dir_mismatch] 1-Volume1-dht: subvol: Volume1-client-3; inode layout - 3579139410 - 4294967295; disk layout - 3221225469 - 4294967295
[2013-07-25 13:51:02.902403] I [dht-common.c:654:dht_revalidate_cbk] 1-Volume1-dht: mismatching layouts for /
[2013-07-25 13:51:02.905684] I [dht-layout.c:636:dht_layout_normalize] 1-Volume1-dht: found anomalies in /. holes=0 overlaps=2 missing=0 down=0 misc=0
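
The "disk layout" ranges in these messages come from the trusted.glusterfs.dht xattr stored on each brick's copy of the directory. A sketch for dumping the on-disk layout of / for comparison (brick paths taken from the volume info above; repeat on each node for its own bricks):

# Dump the on-disk DHT layout ranges of the root directory on this node's bricks
for b in /rhs/brick1/d1 /rhs/brick1/d5; do
    getfattr -n trusted.glusterfs.dht -e hex --absolute-names "$b"
done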

Comment 5 senaik 2013-08-07 12:00:36 UTC
Re-tested the scenario on version 3.4.0.17rhs-1.el6rhs.x86_64; could not reproduce the issue.

Comment 6 Sudhir D 2013-08-08 03:26:18 UTC
(In reply to senaik from comment #5)
> Re-tested the scenario on version 3.4.0.17rhs-1.el6rhs.x86_64; could not
> reproduce the issue.

The issue seems to have been magically fixed. Moving this to CLOSED WORKSFORME. Reopen if this regresses with future builds.

