Bug 1151384 - Rebalance fails to complete - stale file handles
Summary: Rebalance fails to complete - stale file handles
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.5.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Sakshi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-10-10 09:38 UTC by alex.smith
Modified: 2016-08-01 01:22 UTC
CC List: 6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2016-06-17 16:23:41 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Worklog and rebalance logs (5.54 MB, application/x-gzip)
2014-10-10 09:38 UTC, alex.smith

Description alex.smith 2014-10-10 09:38:07 UTC
Created attachment 945537 [details]
Worklog and rebalance logs

Description of problem:
After starting a rebalance it runs up to a certain point (usually after scanning 202,908 files), at which point I see stale file handle warnings in the logs. From then on it is unable to find any files (despite their existence) and the rebalance fails.

Please see attached worklog.txt (in tarball) for full steps.


Version-Release number of selected component (if applicable):
GlusterFS 3.5.2

How reproducible:
Always

Steps to Reproduce:
See worklog.txt

Additional info:
Rebalance logs in tarball

Comment 1 Raghavendra G 2014-10-17 05:02:12 UTC
[2014-10-09 10:55:52.723798] E [dht-rebalance.c:1402:gf_defrag_fix_layout] 0-videostore-las-dht: Lookup failed on /las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713
[2014-10-09 10:55:52.723860] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713
[2014-10-09 10:55:52.724217] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78
[2014-10-09 10:55:52.724629] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos/cf/f6
[2014-10-09 10:55:52.725084] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos/cf
[2014-10-09 10:55:52.725573] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos
[2014-10-09 10:55:52.725979] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2
[2014-10-09 10:55:52.726385] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las
[2014-10-09 10:55:52.726817] I [dht-rebalance.c:1800:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 4357.00 secs
[2014-10-09 10:55:52.726859] I [dht-rebalance.c:1803:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 7, skipped: 0
[2014-10-09 10:55:52.727287] W [glusterfsd.c:1095:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x2abb0e5c5fcd] (-->/lib64/libpthread.so.0 [0x2abb0df8983d] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0x138) [0x4052c8]))) 0-: received signum (15), shutting down

@Alex,

Is the following a valid path on your setup? Do you have files/directories named cff65c574d724d373af7f6d90b265b78? The path in question is:
/las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713

When rebalance tries to look up the above path, the lookup seems to fail. There are not enough logs to help us identify why that lookup failed (maybe the file got removed?).

On the other hand, stopping the entire rebalance when a lookup on one file fails seems like an extreme measure. It is possible that the file got removed in the window after readdir but before the migration attempt.
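For illustration only (this is plain POSIX C, not code from dht-rebalance.c, and the directory path is a placeholder), the window being described looks roughly like this: an entry returned by readdir() can be unlinked by another client before the follow-up lookup, so the later stat fails even though the listing just saw it.

/* Illustration of the readdir-then-lookup race described above;
 * not GlusterFS code. "/tmp" is a placeholder directory. */
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void) {
    const char *dir = "/tmp";
    DIR *dp = opendir(dir);
    if (!dp) {
        perror("opendir");
        return 1;
    }
    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        char path[4096];
        struct stat st;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        /* Between readdir() above and lstat() below, another process may
         * remove the entry; the lookup then fails with ENOENT (or ESTALE
         * on some network filesystems) even though readdir() listed it. */
        if (lstat(path, &st) == -1)
            fprintf(stderr, "%s vanished after readdir: %s\n",
                    path, strerror(errno));
    }
    closedir(dp);
    return 0;
}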

I need information on the following:
1. Is the above path a valid one?
2. Is there a possibility that the file was removed while rebalance was running? To broaden the question, what operations were being performed from the mount-points/clients while rebalance was going on?

regards,
Raghavendra.

Comment 2 Raghavendra G 2014-10-17 07:34:19 UTC
One possible reason for ESTALE errors is that the files/directories are deleted after readdir is done. Brick processes return ESTALE in the following scenarios:

1. during named-lookup when parent cannot be found.
2. during nameless lookup when gfid cannot be resolved (the error might be ENOENT/ESTALE - need to confirm by reading code).
3. during any other fop when gfid passed with fop cannot be resolved.

If parallel I/O operations are happening while rebalance is running, the occurrence of these logs is not an error condition. However, we can fine-tune the behaviour of rebalance when it sees these errors (for example, skipping that directory/file instead of aborting the rebalance altogether).
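As a hedged sketch of that fine-tuning (a hypothetical standalone program, not the actual gf_defrag_fix_layout logic): the crawl could count ESTALE/ENOENT on an individual entry as "skipped" and keep going, while any other error still counts as a failure.

/* Hedged sketch, not dht-rebalance.c code: decide per entry whether a
 * lookup error should abort the crawl or just be counted as skipped. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* ESTALE/ENOENT mean the entry vanished (or its gfid could no longer be
 * resolved) after readdir; treat those as skippable rather than fatal. */
static bool
lookup_error_is_skippable(int op_errno)
{
    return op_errno == ESTALE || op_errno == ENOENT;
}

int main(void) {
    int skipped = 0, failures = 0;
    /* Simulated per-entry lookup results: two transient errors, one real. */
    int results[] = { ESTALE, 0, ENOENT, EIO };
    unsigned i;

    for (i = 0; i < sizeof(results) / sizeof(results[0]); i++) {
        if (results[i] == 0)
            continue;                     /* lookup succeeded, migrate it */
        if (lookup_error_is_skippable(results[i]))
            skipped++;                    /* skip the entry, keep crawling */
        else
            failures++;                   /* a genuine failure */
    }
    printf("skipped=%d failures=%d\n", skipped, failures);
    return 0;
}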

regards,
Raghavendra.

Comment 3 alex.smith 2014-10-17 09:20:06 UTC
Hi Raghavendra, thanks for the response. The file always exists and I can always see it during/after the rebalance.

$ ll /data1/gluster/las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713
total 4372
-rw-r--r-- 2 2131121 2131121 396680 Apr 23 21:26 video_ep_240k-13.ts
-rw-r--r-- 2 2131121 2131121 383144 Apr 23 21:26 video_ep_240k-27.ts
-rw-r--r-- 2 2131121 2131121 374872 Apr 23 21:26 video_ep_240k-29.ts
-rw-r--r-- 2 2131121 2131121 334640 Apr 23 21:26 video_ep_240k-37.ts
-rw-r--r-- 2 2131121 2131121 602352 Apr 23 21:26 video_ep_400k-01.ts
-rw-r--r-- 2 2131121 2131121 652360 Apr 23 21:26 video_ep_400k-03.ts
-rw-r--r-- 2 2131121 2131121 589568 Apr 23 21:26 video_ep_400k-24.ts
-rw-r--r-- 2 2131121 2131121 540500 Apr 23 21:26 video_ep_400k-29.ts
-rw-r--r-- 2 2131121 2131121 586748 Apr 23 21:26 video_ep_400k-32.ts

This happens on all bricks in this volume; once it has scanned 202,908 files it says it can't find the 202,909th, despite its existence.

As you can see from the logs, it always happens immediately after the stale file handle warnings, so I presumed they were related.

It would be good to have the rebalance not exit (it reports itself as completed anyway) after seeing these errors, particularly because, as far as I can see, they should not be errors!

Alex.

Comment 4 alex.smith 2014-11-12 17:28:52 UTC
The bug is still present in 3.6.1, and on top of the rebalance stopping, I am now unable to run gluster commands whilst it is attempting to rebalance.

# gluster v rebalance videostore-las start
# gluster v rebalance videostore-las status
volume rebalance: videostore-las: failed: Another transaction is in progress. Please try again after sometime.

There are some new log lines that fill the rebalance logs (example below):
[2014-11-12 17:15:10.515871] I [MSGID: 109036] [dht-common.c:6221:dht_log_new_layout_for_dir_selfheal] 0-videostore-las-dht: Setting layout of /las/data1/videos/08/b1/08b12f4cdf482c1583aa4d83af2574cd with [Subvol_name: videostore-las-client-0, Err: -1 , Start: 3264110328 , Stop: 3468098551 ], [Subvol_name: videostore-las-client-1, Err: -1 , Start: 3468098552 , Stop: 3672086775 ], [Subvol_name: videostore-las-client-10, Err: -1 , Start: 3672086776 , Stop: 3876074999 ], [Subvol_name: videostore-las-client-11, Err: 116 , Start: 0 , Stop: 0 ], [Subvol_name: videostore-las-client-12, Err: -1 , Start: 3876075000 , Stop: 4080063223 ], [Subvol_name: videostore-las-client-13, Err: -1 , Start: 4080063224 , Stop: 4294967295 ], [Subvol_name: videostore-las-client-14, Err: -1 , Start: 0 , Stop: 204100787 ], [Subvol_name: videostore-las-client-15, Err: -1 , Start: 204100788 , Stop: 408201575 ], [Subvol_name: videostore-las-client-16, Err: -1 , Start: 408201576 , Stop: 612302363 ], [Subvol_name: videostore-las-client-17, Err: -1 , Start: 612302364 , Stop: 816403151 ], [Subvol_name: videostore-las-client-18, Err: -1 , Start: 816403152 , Stop: 1020503939 ], [Subvol_name: videostore-las-client-19, Err: -1 , Start: 1020503940 , Stop: 1224604727 ], [Subvol_name: videostore-las-client-2, Err: -1 , Start: 1224604728 , Stop: 1428592951 ], [Subvol_name: videostore-las-client-20, Err: -1 , Start: 1428592952 , Stop: 1632693739 ], [Subvol_name: videostore-las-client-21, Err: -1 , Start: 1632693740 , Stop: 1836794527 ], [Subvol_name: videostore-las-client-22, Err: -1 , Start: 1836794528 , Stop: 2040180983 ], [Subvol_name: videostore-las-client-3, Err: -1 , Start: 2040180984 , Stop: 2244169207 ], [Subvol_name: videostore-las-client-4, Err: -1 , Start: 2244169208 , Stop: 2448157431 ], [Subvol_name: videostore-las-client-5, Err: 116 , Start: 0 , Stop: 0 ], [Subvol_name: videostore-las-client-6, Err: -1 , Start: 2448157432 , Stop: 2652145655 ], [Subvol_name: videostore-las-client-7, Err: -1 , Start: 2652145656 , Stop: 2856133879 ], [Subvol_name: videostore-las-client-8, Err: -1 , Start: 2856133880 , Stop: 3060122103 ], [Subvol_name: videostore-las-client-9, Err: -1 , Start: 3060122104 , Stop: 3264110327 ],

Not sure if "Err: -1" is good or bad.
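For what it's worth, errno 116 on Linux/x86_64 is ESTALE ("Stale file handle"), so the Err: 116 entries above line up with the stale-file-handle warnings; what Err: -1 denotes in this log format would need confirming against dht-common.c. A quick local check (not GlusterFS code):

/* Prints what errno 116 maps to on this platform; on Linux it is ESTALE,
 * "Stale file handle". */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    printf("errno 116: %s\n", strerror(116));
    printf("ESTALE is defined as %d here\n", ESTALE);
    return 0;
}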

Alex.

Comment 5 alex.smith 2014-11-14 13:41:32 UTC
Possibly related to the gluster commands not working:

[2014-11-14 13:39:35.534600] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:38.537415] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:41.540273] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:44.542984] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:47.545857] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:50.551688] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:53.554345] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:56.557185] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:59.560002] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:40:00.224150] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on 10.35.13.85. Please check log file for details.
[2014-11-14 13:40:00.224980] E [glusterd-syncop.c:1602:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2014-11-14 13:40:00.239014] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on 10.35.13.85. Please check log file for details.
[2014-11-14 13:40:00.239975] E [glusterd-syncop.c:1602:gd_sync_task_begin] 0-management: Locking Peers Failed.

Comment 6 alex.smith 2014-11-27 10:54:45 UTC
The locking issue was resolved by backporting commit https://github.com/gluster/glusterfs/commit/97ccd45fb66a63c0b2436a0245dfb9490e2941b7 to 3.6.1, so it was not related; the original issue is still present.

Title changed because the number of files is not always 202,908.

# gluster v rebalance videostore-las status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost             1118         1.0GB        247513             6             0            completed            7228.00
                             10.35.13.85             5245         4.0GB        247519             6             0            completed            7228.00
                             10.35.13.80             2954         2.8GB        247530             6             0            completed            7228.00
                             10.35.13.93            12020         9.7GB        247506             6             0            completed            7228.00
                             10.35.13.71             2763         2.3GB        247516             6             0            completed            7228.00
                             10.35.13.94              240       321.9MB        247507             6             0            completed            7228.00
                             10.35.13.88                0        0Bytes        247508             6             0            completed            7228.00
                            10.35.13.191             4644         3.8GB        247520             6             0            completed            7228.00
                            10.35.13.192             2706         2.6GB        247520             6             0            completed            7228.00
                            10.35.13.193             3815         2.9GB        247582             6             0            completed            7228.00
                             10.35.13.86              779       670.1MB        247514             6             0            completed            7228.00
                             10.35.13.77             1212         1.0GB        247509             6             0            completed            7228.00
                            10.35.13.171            13472        10.4GB        247641             6             0            completed            7228.00

Comment 7 Niels de Vos 2016-06-17 16:23:41 UTC
This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.

