Bug 1138383 - DHT + rebalance : rebalance process crashed + data loss + few Directories are present on sub-volumes but not visible on mount point + lookup is not healing directories
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Shyamsundar
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-09-04 17:11 UTC by Shyamsundar
Modified: 2014-09-05 15:32 UTC (History)

Fixed In Version:
Clone Of: 1104653
Environment:
Last Closed: 2014-09-05 15:32:25 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Shyamsundar 2014-09-04 17:11:24 UTC
+++ This bug was initially created as a clone of Bug #1104653 +++

Description of problem:
======================
The rebalance process crashed (on all nodes) after running for 44+ hours. No I/O was in progress on the mount point when it crashed.


pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-11-23 03:17:58
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.44rhs
/lib64/libc.so.6[0x3b4d032960]
/lib64/libc.so.6[0x3b4d07f92a]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(dht_layout_entry_cmp_volname+0x2a)[0x7f9288a31dca]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(dht_layout_sort_volname+0x3d)[0x7f9288a31e1d]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(dht_fix_layout_of_directory+0xeb)[0x7f9288a3ad0b]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(dht_fix_directory_layout+0x49)[0x7f9288a3c5b9]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(dht_setxattr+0x9d6)[0x7f9288a4e726]
/usr/lib64/libglusterfs.so.0(syncop_setxattr+0x1a1)[0x3ab304fbd1]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(gf_defrag_fix_layout+0x4b4)[0x7f9288a374b4]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(gf_defrag_fix_layout+0x4d8)[0x7f9288a374d8]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(gf_defrag_start_crawl+0x286)[0x7f9288a379d6]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x3ab3049ad2]
/lib64/libc.so.6[0x3b4d043bb0]
---------
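
For triage, the call chain can be pulled out of the frames above mechanically. A minimal sketch, assuming the `object(symbol+offset)[address]` frame format seen in this report; `parse_frames` and `SAMPLE_FRAMES` are hypothetical names used only for illustration and are not part of GlusterFS:

```python
import re

# Frame format assumed from the backtrace in this report:
#   /path/to/object.so(symbol+0xOFFSET)[0xADDRESS]
# Frames without symbol information omit the parenthesised part.
FRAME_RE = re.compile(
    r"^(?P<obj>[^\s(\[]+)(?:\((?P<sym>\w+)\+0x[0-9a-f]+\))?\[0x[0-9a-f]+\]$"
)

# A few frames copied from the crash log above (innermost first).
SAMPLE_FRAMES = """\
/lib64/libc.so.6[0x3b4d07f92a]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(dht_layout_entry_cmp_volname+0x2a)[0x7f9288a31dca]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/cluster/distribute.so(dht_layout_sort_volname+0x3d)[0x7f9288a31e1d]
/usr/lib64/libglusterfs.so.0(syncop_setxattr+0x1a1)[0x3ab304fbd1]
"""


def parse_frames(text):
    """Return a list of (object file name, symbol or None), one per frame line."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # not a backtrace frame (e.g. "pending frames:" lines)
        obj = match.group("obj").rsplit("/", 1)[-1]
        frames.append((obj, match.group("sym")))
    return frames


if __name__ == "__main__":
    for obj, sym in parse_frames(SAMPLE_FRAMES):
        print(f"{obj:20} {sym or '<no symbol>'}")
```

Frames are listed innermost first, so the first named symbol (`dht_layout_entry_cmp_volname`) is the innermost GlusterFS function on the stack when signal 11 was received.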



core was generated
[root@7-VM1 core]# ls -l
total 1018116
-rw------- 1 root root 1042546688 Nov 23 09:14 core.20128.1385176679.dump.1

[root@7-VM1 core]# file core.20128.1385176679.dump.1 
core.20128.1385176679.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfs -s localhost --volfile-id flat --xlator-option *dht.use-rea'



How reproducible:
==================
Hit twice (different volumes, same cluster). For the other volume, the process had been running for more than 7 days.

Steps to Reproduce:
====================
1. Create and mount a DHT volume. Create data from the mount point (directory depth was 10).
2. Add a brick to the volume and start rebalance.
3. After 44+ hours the rebalance process crashed on all nodes and the rebalance status was 'failed'.

[root@7-VM1 core]#  gluster volume rebalance flat status
                                    Node Rebalanced-files          size       scanned      failures       skipped         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
                               localhost           832000        13.7GB       5344344             1           228         failed        159836.00
                            10.70.36.133          1009405        15.7GB       5362837             2           206         failed        159836.00
                            10.70.36.132           823206        12.9GB       5416604             1           233         failed        159836.00
                            10.70.36.131                0        0Bytes       5227829             0             0         failed        159836.00
volume rebalance: flat: success: 

[root@7-VM1 core]# ps auxw | grep rebalance
root     31760  0.0  0.0 103244   804 pts/1    R+   10:49   0:00 grep rebalance
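
The run time in the status table (159836 seconds) is where the "44+ hours" figure comes from. A minimal sketch for summarising such a table, assuming the eight-column row layout pasted above; `parse_rebalance_status` is a hypothetical helper, not a gluster tool, and the column layout is taken from this report rather than from any specification:

```python
# Rows copied from the `gluster volume rebalance flat status` output above.
STATUS_TEXT = """\
localhost           832000        13.7GB       5344344             1           228         failed        159836.00
10.70.36.133          1009405        15.7GB       5362837             2           206         failed        159836.00
10.70.36.132           823206        12.9GB       5416604             1           233         failed        159836.00
10.70.36.131                0        0Bytes       5227829             0             0         failed        159836.00
volume rebalance: flat: success:
"""


def parse_rebalance_status(text):
    """Parse per-node rows; header, separator and trailer lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # assumed layout: exactly eight whitespace-separated columns
        node, files, size, scanned, failures, skipped, status, secs = fields
        try:
            rows.append({
                "node": node,
                "rebalanced_files": int(files),
                "size": size,
                "scanned": int(scanned),
                "failures": int(failures),
                "skipped": int(skipped),
                "status": status,
                "run_time_secs": float(secs),
            })
        except ValueError:
            continue  # e.g. the dashed separator line
    return rows


if __name__ == "__main__":
    for row in parse_rebalance_status(STATUS_TEXT):
        hours = row["run_time_secs"] / 3600
        print(f'{row["node"]:15} {row["status"]:8} {hours:.1f} h')
```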



Actual results:
================
rebalance process crashed

Expected results:


Additional info:

--- Additional comment from Rachana Patel on 2013-11-25 04:32:54 EST ---

volume info:-
Volume Name: flat
Type: Distribute
Volume ID: e305c335-0859-46e3-acf9-ee245f205a99
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: 10.70.36.130:/rhs/brick1/f
Brick2: 10.70.36.132:/rhs/brick1/f
Brick3: 10.70.36.133:/rhs/brick1/f
Brick4: 10.70.36.133:/rhs/brick2/f
Brick5: 10.70.36.133:/rhs/brick4/f

mount info - 
[root@rhs-client22 ~]# mount | grep flat
10.70.36.130:/flat on /mnt/flat-nfs type nfs (rw,addr=10.70.36.130)
10.70.36.130:/flat on /mnt/flat type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

will upload sosreport soon

--- Additional comment from Rachana Patel on 2013-11-26 05:27:50 EST ---

Updating the same bug, as this behaviour was noticed on the volume where rebalance crashed.

From the mount point, many directories that are present on the backend are not visible (present on 3 sub-vols and not present on 2 sub-vols, etc.; one of them is the newly added sub-vol).

The directories contain files and other directories, so this is data loss.

+

lookup is not healing Directory on 

from mount point

--- Additional comment from Rachana Patel on 2013-11-26 05:40:32 EST ---

Updating the same bug, as this behaviour was noticed on the volume where rebalance crashed.

1) From the mount point, many directories that are present on the backend are not visible (the directories are present on some sub-vols and not present on others).

The directories contain files and other directories, so this is data loss.

2) lookup is not healing the directories on the sub-volumes


e.g.

etc65 is present on 3 sub-vols and not present on 2 sub-vols.
It is not shown on the mount point. It has data, and lookup is not healing that directory.



from mount point:-
[root@7-VM2 mvs1]# pwd
/mnt/flat/mvs1
[root@7-VM2 mvs1]# ls
etc100  mvetc10  mvetc12  mvetc14  mvetc17  mvetc19  mvetc20  mvetc22  mvetc25  mvetc4  mvetc63  mvetc9
etc100  mvetc10  mvetc12  mvetc14  mvetc17  mvetc19  mvetc20  mvetc22  mvetc25  mvetc4  mvetc63  mvetc9
mvetc1  mvetc11  mvetc13  mvetc15  mvetc18  mvetc2   mvetc21  mvetc24  mvetc3   mvetc5  mvetc8
mvetc1  mvetc11  mvetc13  mvetc15  mvetc18  mvetc2   mvetc21  mvetc24  mvetc3   mvetc5  mvetc8


Backend:-
[root@7-VM1 ~]# ls  /rhs/brick1/f/mvs1/
ddf79   f112  f268  f432  f560  f698  f845  f994     mvetc24
ddf84   f116  f269  f435  f561  f70   f846  f996     mvetc25
ddf86   f118  f273  f441  f563  f703  f848  f999     mvetc27
ddf89   f125  f274  f442  f564  f704  f849  mvddf1   mvetc28
ddf92   f130  f278  f443  f570  f711  f85   mvddf10  mvetc29
ddf96   f134  f280  f449  f573  f712  f851  mvddf11  mvetc3
ddf98   f135  f282  f456  f578  f714  f853  mvddf15  mvetc30
etc100  f145  f284  f46   f589  f716  f858  mvddf17  mvetc31
etc65   f148  f287  f463  f59   f722  f864  mvddf2   mvetc35
etc67   f155  f295  f47   f591  f724  f868  mvddf20  mvetc36
etc68   f162  f297  f470  f596  f728  f871  mvddf21  mvetc37
etc70   f163  f299  f474  f597  f737  f874  mvddf3   mvetc39
etc71   f170  f301  f481  f606  f739  f880  mvddf30  mvetc4
etc72   f171  f307  f484  f608  f741  f882  mvddf32  mvetc40
etc73   f176  f309  f487  f610  f747  f884  mvddf34  mvetc41
etc74   f18   f310  f489  f611  f756  f885  mvddf36  mvetc42
etc75   f186  f314  f491  f612  f759  f890  mvddf37  mvetc43
etc76   f189  f320  f494  f614  f761  f891  mvddf38  mvetc45
etc78   f19   f323  f499  f618  f769  f893  mvddf42  mvetc46
etc79   f2    f327  f500  f621  f770  f894  mvddf44  mvetc49
etc80   f20   f328  f501  f624  f773  f896  mvddf47  mvetc5
etc81   f206  f329  f502  f625  f778  f897  mvddf48  mvetc50
etc82   f212  f332  f507  f63   f783  f903  mvddf49  mvetc54
etc83   f217  f345  f508  f631  f784  f911  mvddf53  mvetc57
etc84   f218  f346  f51   f645  f786  f917  mvddf56  mvetc58
etc85   f219  f357  f511  f648  f789  f922  mvddf57  mvetc59
etc86   f22   f361  f514  f657  f791  f925  mvddf6   mvetc60
etc89   f225  f362  f516  f664  f792  f93   mvddf60  mvetc61
etc90   f226  f363  f518  f666  f801  f933  mvetc1   mvetc62
etc91   f227  f364  f520  f669  f803  f936  mvetc10  mvetc63
etc92   f23   f373  f522  f674  f805  f951  mvetc11  mvetc64
etc95   f231  f375  f524  f678  f807  f955  mvetc12  mvetc8
etc96   f237  f379  f525  f679  f808  f957  mvetc13  mvetc9
etc97   f239  f380  f526  f68   f811  f96   mvetc14  s2
etc98   f240  f393  f531  f680  f813  f962  mvetc15
etc99   f245  f402  f538  f681  f82   f967  mvetc17
f1      f260  f403  f542  f682  f823  f970  mvetc18
f10     f261  f409  f543  f685  f825  f977  mvetc19
f103    f262  f410  f544  f688  f827  f978  mvetc2
f106    f263  f419  f554  f69   f83   f980  mvetc20
f107    f264  f422  f555  f691  f830  f987  mvetc21
f110    f266  f426  f56   f692  f839  f988  mvetc22

[root@7-VM4 ~]# ls  /rhs/brick4/f/mvs1/
mvetc1   mvetc13  mvetc18  mvetc21  mvetc3   mvetc8
mvetc10  mvetc14  mvetc19  mvetc22  mvetc4   mvetc9
mvetc11  mvetc15  mvetc2   mvetc24  mvetc5
mvetc12  mvetc17  mvetc20  mvetc25  mvetc54
[root@7-VM4 ~]# ls  /rhs/brick2/f/mvs1/
ddf100  etc97  f243  f383  f576  f735  f898     mvddf19  mvetc28
ddf68   etc98  f249  f396  f583  f743  f90      mvddf20  mvetc29
ddf70   etc99  f254  f399  f592  f75   f900     mvddf27  mvetc3
ddf73   f11    f272  f401  f598  f754  f902     mvddf28  mvetc30
ddf76   f111   f277  f408  f603  f757  f904     mvddf29  mvetc31
ddf77   f114   f279  f411  f605  f758  f908     mvddf33  mvetc35
ddf80   f115   f283  f413  f607  f762  f91      mvddf34  mvetc36
ddf82   f117   f285  f415  f615  f765  f912     mvddf4   mvetc37
ddf83   f120   f286  f417  f620  f766  f913     mvddf40  mvetc39
ddf93   f123   f29   f42   f623  f768  f916     mvddf44  mvetc4
ddf95   f128   f3    f423  f627  f771  f919     mvddf45  mvetc40
ddf97   f132   f302  f427  f628  f775  f92      mvddf48  mvetc41
etc100  f137   f304  f434  f629  f777  f921     mvddf49  mvetc42
etc65   f14    f306  f445  f636  f779  f932     mvddf5   mvetc43
etc67   f15    f31   f45   f637  f78   f935     mvddf53  mvetc45
etc68   f150   f311  f452  f64   f806  f937     mvddf54  mvetc46
etc70   f157   f312  f453  f641  f810  f939     mvddf59  mvetc49
etc71   f161   f313  f455  f642  f815  f94      mvddf6   mvetc5
etc72   f173   f315  f464  f644  f816  f940     mvddf60  mvetc50
etc73   f175   f319  f465  f647  f820  f941     mvddf63  mvetc54
etc74   f181   f321  f467  f649  f824  f947     mvddf8   mvetc57
etc75   f183   f324  f472  f651  f836  f948     mvetc1   mvetc58
etc76   f185   f33   f473  f653  f837  f950     mvetc10  mvetc59
etc78   f187   f338  f497  f66   f840  f956     mvetc11  mvetc60
etc79   f196   f340  f498  f662  f841  f958     mvetc12  mvetc61
etc80   f200   f347  f5    f67   f842  f959     mvetc13  mvetc62
etc81   f202   f349  f521  f670  f847  f960     mvetc14  mvetc63
etc82   f207   f351  f523  f672  f857  f965     mvetc15  mvetc64
etc83   f208   f352  f535  f673  f859  f969     mvetc17  mvetc8
etc84   f209   f354  f545  f676  f87   f971     mvetc18  mvetc9
etc85   f211   f355  f547  f686  f872  f972     mvetc19  s2
etc86   f216   f368  f548  f690  f873  f976     mvetc2
etc89   f220   f37   f549  f71   f879  f979     mvetc20
etc90   f222   f372  f556  f717  f88   f992     mvetc21
etc91   f223   f374  f567  f727  f888  f995     mvetc22
etc92   f224   f378  f568  f73   f89   f997     mvetc24
etc95   f238   f38   f569  f732  f892  mvddf12  mvetc25
etc96   f241   f382  f57   f734  f895  mvddf18  mvetc27
[root@7-VM4 ~]# ls  /rhs/brick1/f/mvs1/
etc65  f201  f330  f418  f54   f660  f76   f875  mvddf17  mvetc29
f1000  f205  f331  f421  f541  f663  f760  f876  mvddf18  mvetc3
f101   f21   f334  f424  f550  f665  f764  f883  mvddf19  mvetc30
f102   f213  f337  f425  f558  f667  f767  f887  mvddf2   mvetc31
f108   f214  f34   f429  f559  f668  f772  f9    mvddf3   mvetc35
f113   f221  f342  f43   f565  f675  f774  f909  mvddf32  mvetc36
f121   f232  f343  f431  f566  f684  f776  f910  mvddf33  mvetc37
f127   f234  f344  f436  f571  f687  f782  f914  mvddf35  mvetc39
f129   f235  f348  f437  f575  f694  f79   f918  mvddf45  mvetc4
f13    f247  f35   f440  f577  f696  f790  f923  mvddf51  mvetc40
f131   f25   f356  f444  f579  f699  f795  f924  mvddf56  mvetc41
f133   f256  f358  f446  f584  f7    f796  f928  mvddf8   mvetc42
f136   f257  f36   f457  f586  f707  f797  f938  mvddf9   mvetc43
f138   f259  f360  f458  f590  f709  f80   f942  mvetc1   mvetc45
f139   f26   f367  f479  f595  f710  f802  f943  mvetc10  mvetc46
f143   f267  f371  f480  f599  f718  f81   f945  mvetc11  mvetc49
f144   f27   f376  f482  f6    f719  f814  f946  mvetc12  mvetc5
f153   f271  f377  f485  f600  f720  f817  f952  mvetc13  mvetc50
f158   f275  f386  f49   f609  f723  f818  f954  mvetc14  mvetc54
f16    f276  f388  f490  f613  f725  f826  f961  mvetc15  mvetc57
f165   f288  f389  f496  f616  f729  f828  f963  mvetc17  mvetc58
f168   f291  f391  f50   f617  f733  f833  f966  mvetc18  mvetc59
f169   f292  f392  f504  f622  f738  f844  f968  mvetc19  mvetc60
f178   f294  f395  f505  f630  f740  f850  f974  mvetc2   mvetc61
f180   f298  f398  f506  f633  f744  f852  f975  mvetc20  mvetc62
f182   f30   f40   f515  f639  f746  f854  f98   mvetc21  mvetc63
f191   f303  f404  f527  f643  f748  f855  f981  mvetc22  mvetc64
f194   f308  f405  f529  f65   f751  f86   f984  mvetc24  mvetc8
f197   f316  f407  f530  f650  f752  f860  f985  mvetc25  mvetc9
f198   f32   f412  f532  f654  f753  f867  f990  mvetc27
f199   f325  f416  f537  f655  f755  f869  f993  mvetc28



[root@7-VM3 ~]# ls  /rhs/brick1/f/mvs1/
ddf65   f12   f248  f406  f53   f671  f821  f983     mvetc22
ddf67   f122  f250  f41   f533  f677  f822  f986     mvetc24
ddf72   f124  f251  f414  f534  f683  f829  f989     mvetc25
ddf75   f126  f252  f420  f536  f689  f831  f99      mvetc27
ddf78   f140  f253  f428  f539  f693  f832  f991     mvetc28
ddf85   f141  f255  f430  f540  f695  f834  f998     mvetc29
ddf87   f142  f258  f433  f546  f697  f835  mvddf10  mvetc3
ddf88   f146  f265  f438  f55   f700  f838  mvddf12  mvetc30
ddf91   f147  f270  f439  f551  f701  f84   mvddf16  mvetc31
etc100  f149  f28   f44   f552  f702  f843  mvddf21  mvetc35
etc65   f151  f281  f447  f553  f705  f856  mvddf22  mvetc36
etc67   f152  f289  f448  f557  f706  f861  mvddf23  mvetc37
etc68   f154  f290  f450  f562  f708  f862  mvddf25  mvetc39
etc70   f156  f293  f451  f572  f713  f863  mvddf27  mvetc4
etc71   f159  f296  f454  f574  f715  f865  mvddf29  mvetc40
etc72   f160  f300  f459  f58   f72   f866  mvddf30  mvetc41
etc73   f164  f305  f460  f580  f721  f870  mvddf35  mvetc42
etc74   f166  f317  f461  f581  f726  f877  mvddf36  mvetc43
etc75   f167  f318  f462  f582  f730  f878  mvddf37  mvetc45
etc76   f17   f322  f466  f585  f731  f881  mvddf39  mvetc46
etc78   f172  f326  f468  f587  f736  f886  mvddf4   mvetc49
etc79   f174  f333  f469  f588  f74   f889  mvddf40  mvetc5
etc80   f177  f335  f471  f593  f742  f899  mvddf42  mvetc50
etc81   f179  f336  f475  f594  f745  f901  mvddf47  mvetc54
etc82   f184  f339  f476  f60   f749  f905  mvddf5   mvetc57
etc83   f188  f341  f477  f601  f750  f906  mvddf51  mvetc58
etc84   f190  f350  f478  f602  f763  f907  mvddf54  mvetc59
etc85   f192  f353  f48   f604  f77   f915  mvddf58  mvetc60
etc86   f193  f359  f483  f61   f780  f920  mvddf63  mvetc61
etc89   f195  f365  f486  f619  f781  f926  mvddf9   mvetc62
etc90   f203  f366  f488  f62   f785  f927  mvetc1   mvetc63
etc91   f204  f369  f492  f626  f787  f929  mvetc10  mvetc64
etc92   f210  f370  f493  f632  f788  f930  mvetc11  mvetc8
etc95   f215  f381  f495  f634  f793  f931  mvetc12  mvetc9
etc96   f228  f384  f503  f635  f794  f934  mvetc13  s2
etc97   f229  f385  f509  f638  f798  f944  mvetc14
etc98   f230  f387  f510  f640  f799  f949  mvetc15
etc99   f233  f39   f512  f646  f8    f95   mvetc17
f100    f236  f390  f513  f652  f800  f953  mvetc18
f104    f24   f394  f517  f656  f804  f964  mvetc19
f105    f242  f397  f519  f658  f809  f97   mvetc2
f109    f244  f4    f52   f659  f812  f973  mvetc20
f119    f246  f400  f528  f661  f819  f982  mvetc21

data from one of the bricks:-
[root@7-VM3 ~]# ls  /rhs/brick1/f/mvs1/etc65 | wc
    149     149    1294


###########################################################

3) Now try to create the same directory from the mount point: mkdir succeeds, and after that lookup shows 2 directories with the same name

mount:-

[root@7-VM2 mvs1]# mkdir etc65
[root@7-VM2 mvs1]# ls
etc100  mvetc1   mvetc11  mvetc13  mvetc15  mvetc18  mvetc2   mvetc21  mvetc24  mvetc3  mvetc5   mvetc8
etc100  mvetc1   mvetc11  mvetc13  mvetc15  mvetc18  mvetc2   mvetc21  mvetc24  mvetc3  mvetc5   mvetc8
etc65   mvetc10  mvetc12  mvetc14  mvetc17  mvetc19  mvetc20  mvetc22  mvetc25  mvetc4  mvetc63  mvetc9
etc65   mvetc10  mvetc12  mvetc14  mvetc17  mvetc19  mvetc20  mvetc22  mvetc25  mvetc4  mvetc63  mvetc9

Backend:-
[root@7-VM4 ~]# ls  /rhs/brick1/f/mvs1/etc65 | wc
      0       0       0
[root@7-VM4 ~]# ls  /rhs/brick4/f/mvs1/etc65 | wc
ls: cannot access /rhs/brick4/f/mvs1/etc65: No such file or directory
      0       0       0
[root@7-VM4 ~]# ls  /rhs/brick2/f/mvs1/etc65 | wc
    141     141    1203
[root@7-VM1 ~]# ls  /rhs/brick1/f/mvs1/etc65 | wc
    155     155    1426
[root@7-VM3 ~]# ls  /rhs/brick1/f/mvs1/etc65 | wc
    149     149    1294



--> It has not healed the directory on all sub-volumes, and mkdir did not return any error even though the directory already exists.
--> It also shows the directory twice.
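
The per-brick comparison above can be automated. In a distribute volume a directory is expected on every sub-volume (files live on only one), so any directory name missing from some brick is a candidate for this problem. A minimal sketch, assuming the directory listings have already been collected per brick; `find_partial_entries` and `SAMPLE` are hypothetical, and `SAMPLE` is a hand-reduced subset of the listings in this comment (state after the mkdir):

```python
def find_partial_entries(listings):
    """Map each entry name to the bricks it is missing from.

    listings: dict of brick label -> iterable of directory names found under
    the same parent directory on that brick. Names present on every brick
    are omitted from the result.
    """
    per_brick = {brick: set(names) for brick, names in listings.items()}
    all_names = set().union(*per_brick.values()) if per_brick else set()
    partial = {}
    for name in sorted(all_names):
        missing = sorted(b for b, names in per_brick.items() if name not in names)
        if missing:
            partial[name] = missing
    return partial


# Reduced sample: only two directory names per brick, taken from the report.
SAMPLE = {
    "7-VM1:/rhs/brick1/f/mvs1": ["etc65", "mvetc1"],
    "7-VM3:/rhs/brick1/f/mvs1": ["etc65", "mvetc1"],
    "7-VM4:/rhs/brick1/f/mvs1": ["etc65", "mvetc1"],
    "7-VM4:/rhs/brick2/f/mvs1": ["etc65", "mvetc1"],
    "7-VM4:/rhs/brick4/f/mvs1": ["mvetc1"],
}

if __name__ == "__main__":
    for name, bricks in find_partial_entries(SAMPLE).items():
        print(f"{name}: missing on {', '.join(bricks)}")
```

Only directory names should be passed in; regular files would be reported as missing on every brick except the one that holds them.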

Comment 1 Shyamsundar 2014-09-05 15:32:25 UTC
This was already fixed in the release 3.6 branch, before branching, as part of bug #1104653.

Closing this bug as NOTABUG.

