Created attachment 945537 [details]
Worklog and rebalance logs

Description of problem:
After starting a rebalance it gets to a certain point (usually after scanning 202,908 files) and I see stale file handle warnings in the logs. It then becomes unable to find any files (despite their existence) and the rebalance fails. Please see the attached worklog.txt (in the tarball) for full steps.

Version-Release number of selected component (if applicable):
GlusterFS 3.5.2

How reproducible:
Always

Steps to Reproduce:
See worklog.txt

Additional info:
Rebalance logs are in the tarball.
[2014-10-09 10:55:52.723798] E [dht-rebalance.c:1402:gf_defrag_fix_layout] 0-videostore-las-dht: Lookup failed on /las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713
[2014-10-09 10:55:52.723860] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713
[2014-10-09 10:55:52.724217] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78
[2014-10-09 10:55:52.724629] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos/cf/f6
[2014-10-09 10:55:52.725084] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos/cf
[2014-10-09 10:55:52.725573] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2/videos
[2014-10-09 10:55:52.725979] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las/data2
[2014-10-09 10:55:52.726385] E [dht-rebalance.c:1515:gf_defrag_fix_layout] 0-videostore-las-dht: Fix layout failed for /las
[2014-10-09 10:55:52.726817] I [dht-rebalance.c:1800:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 4357.00 secs
[2014-10-09 10:55:52.726859] I [dht-rebalance.c:1803:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 7, skipped: 0
[2014-10-09 10:55:52.727287] W [glusterfsd.c:1095:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x2abb0e5c5fcd] (-->/lib64/libpthread.so.0 [0x2abb0df8983d] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0x138) [0x4052c8]))) 0-: received signum (15), shutting down

@Alex,

Is the following a valid path on your setup? Do you have files/directories named cff65c574d724d373af7f6d90b265b78?

The path in question is:
/las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713

When rebalance tries to look up the above path, the lookup seems to be failing. There are not enough logs to identify why it failed (maybe the file was removed, etc.). On the other hand, stopping the entire rebalance because a lookup on one file failed seems like an extreme measure. It is possible that the file was removed in the window after readdir but before the migration attempt.

I need information on the following:
1. Is the above path a valid one?
2. Is there a possibility that the file was removed while the rebalance was running? To broaden the scope of the question, what operations were being performed from mount-points/clients while the rebalance was going on?

regards,
Raghavendra.
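[Editorial note: a minimal way to answer question 1 directly on a brick. The brick mount point /data1/gluster is an assumption taken from the listing later in this report; substitute your actual brick path.]

# Brick path is an assumption; adjust to your layout.
BRICK=/data1/gluster
P="$BRICK/las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713"

# Does the directory exist on this brick?
ls -ld "$P"

# Dump the gfid and DHT layout xattrs stored on the brick (run as root).
getfattr -d -m . -e hex "$P"

If the trusted.gfid value differs between bricks for the same directory, lookups can fail; that is one of the things worth ruling out here.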
One possible reason for the ESTALE errors is that the files/directories are deleted after readdir is done. Brick processes return ESTALE in three scenarios:

1. During a named lookup, when the parent cannot be found.
2. During a nameless lookup, when the gfid cannot be resolved (the error might be ENOENT/ESTALE - this needs to be confirmed by reading the code).
3. During any other fop, when the gfid passed with the fop cannot be resolved.

If parallel I/O operations are in progress while the rebalance is running, the occurrence of these logs is not an error condition. However, we can fine-tune the behaviour of rebalance when it sees these errors (for example, skipping that directory/file instead of aborting the rebalance altogether).

regards,
Raghavendra.
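[Editorial note: as an illustration of scenario 2, a gfid can be resolved by hand on a brick through the .glusterfs directory. This is a sketch only; the gfid value below is hypothetical and the brick path is an assumption.]

# Hypothetical gfid; brick assumed at /data1/gluster.
GFID=0afd1221-3344-6677-9900-aabbccddeeff
BRICK=/data1/gluster

# Bricks keep a handle for every gfid under .glusterfs/<first 2 hex>/<next 2 hex>/<gfid>:
# a symlink for directories, a hard link for regular files. If this handle is
# missing or dangling, gfid-based (nameless) lookups on that brick fail.
ls -l "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"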
Hi Raghavendra, thanks for the response. The file always exists and I can always see it during and after the rebalance.

$ ll /data1/gluster/las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713
total 4372
-rw-r--r-- 2 2131121 2131121 396680 Apr 23 21:26 video_ep_240k-13.ts
-rw-r--r-- 2 2131121 2131121 383144 Apr 23 21:26 video_ep_240k-27.ts
-rw-r--r-- 2 2131121 2131121 374872 Apr 23 21:26 video_ep_240k-29.ts
-rw-r--r-- 2 2131121 2131121 334640 Apr 23 21:26 video_ep_240k-37.ts
-rw-r--r-- 2 2131121 2131121 602352 Apr 23 21:26 video_ep_400k-01.ts
-rw-r--r-- 2 2131121 2131121 652360 Apr 23 21:26 video_ep_400k-03.ts
-rw-r--r-- 2 2131121 2131121 589568 Apr 23 21:26 video_ep_400k-24.ts
-rw-r--r-- 2 2131121 2131121 540500 Apr 23 21:26 video_ep_400k-29.ts
-rw-r--r-- 2 2131121 2131121 586748 Apr 23 21:26 video_ep_400k-32.ts

This happens on all bricks in this volume: once the rebalance reaches 202,908 files scanned it says it cannot find the 202,909th, despite its existence. As you can see from the logs, it always happens immediately after the stale file handle warnings, so I presumed they were related.

It would be good for the rebalance not to exit (and report itself as completed anyway) after seeing these errors, particularly because, as far as I can see, they should not be errors at all!

Alex.
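[Editorial note: since the directory is visible on the bricks, a gfid mismatch between bricks for that directory is worth ruling out, as it can also make lookups fail. A rough check along these lines; the brick mount points are assumptions, so list the bricks each node actually hosts.]

# Compare the directory's gfid on every local brick of this volume.
REL=las/data2/videos/cf/f6/cff65c574d724d373af7f6d90b265b78/1-7d3ec80abe0a304cb6c30d65590f0713
for BRICK in /data1/gluster /data2/gluster; do
    echo "== $BRICK"
    getfattr -n trusted.gfid -e hex "$BRICK/$REL" 2>&1
done

The same loop can be run on the other servers; every brick that has the directory should report an identical trusted.gfid value.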
The bug is still present in 3.6.1, and on top of the rebalance stopping, I am now unable to run gluster commands while it is attempting to rebalance.

# gluster v rebalance videostore-las start
# gluster v rebalance videostore-las status
volume rebalance: videostore-las: failed: Another transaction is in progress. Please try again after sometime.

There are some new log lines that fill the rebalance logs (example below):

[2014-11-12 17:15:10.515871] I [MSGID: 109036] [dht-common.c:6221:dht_log_new_layout_for_dir_selfheal] 0-videostore-las-dht: Setting layout of /las/data1/videos/08/b1/08b12f4cdf482c1583aa4d83af2574cd with [Subvol_name: videostore-las-client-0, Err: -1 , Start: 3264110328 , Stop: 3468098551 ], [Subvol_name: videostore-las-client-1, Err: -1 , Start: 3468098552 , Stop: 3672086775 ], [Subvol_name: videostore-las-client-10, Err: -1 , Start: 3672086776 , Stop: 3876074999 ], [Subvol_name: videostore-las-client-11, Err: 116 , Start: 0 , Stop: 0 ], [Subvol_name: videostore-las-client-12, Err: -1 , Start: 3876075000 , Stop: 4080063223 ], [Subvol_name: videostore-las-client-13, Err: -1 , Start: 4080063224 , Stop: 4294967295 ], [Subvol_name: videostore-las-client-14, Err: -1 , Start: 0 , Stop: 204100787 ], [Subvol_name: videostore-las-client-15, Err: -1 , Start: 204100788 , Stop: 408201575 ], [Subvol_name: videostore-las-client-16, Err: -1 , Start: 408201576 , Stop: 612302363 ], [Subvol_name: videostore-las-client-17, Err: -1 , Start: 612302364 , Stop: 816403151 ], [Subvol_name: videostore-las-client-18, Err: -1 , Start: 816403152 , Stop: 1020503939 ], [Subvol_name: videostore-las-client-19, Err: -1 , Start: 1020503940 , Stop: 1224604727 ], [Subvol_name: videostore-las-client-2, Err: -1 , Start: 1224604728 , Stop: 1428592951 ], [Subvol_name: videostore-las-client-20, Err: -1 , Start: 1428592952 , Stop: 1632693739 ], [Subvol_name: videostore-las-client-21, Err: -1 , Start: 1632693740 , Stop: 1836794527 ], [Subvol_name: videostore-las-client-22, Err: -1 , Start: 1836794528 , Stop: 2040180983 ], [Subvol_name: videostore-las-client-3, Err: -1 , Start: 2040180984 , Stop: 2244169207 ], [Subvol_name: videostore-las-client-4, Err: -1 , Start: 2244169208 , Stop: 2448157431 ], [Subvol_name: videostore-las-client-5, Err: 116 , Start: 0 , Stop: 0 ], [Subvol_name: videostore-las-client-6, Err: -1 , Start: 2448157432 , Stop: 2652145655 ], [Subvol_name: videostore-las-client-7, Err: -1 , Start: 2652145656 , Stop: 2856133879 ], [Subvol_name: videostore-las-client-8, Err: -1 , Start: 2856133880 , Stop: 3060122103 ], [Subvol_name: videostore-las-client-9, Err: -1 , Start: 3060122104 , Stop: 3264110327 ],

Not sure if "Err: -1" is good or bad.

Alex.
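[Editorial note: errno 116 is ESTALE on Linux, which ties these layout entries back to the stale file handle warnings; whether "Err: -1" simply means "no error recorded" is an assumption. A quick sketch to pull the non "-1" errors out of the rebalance log; the log path follows the usual <volname>-rebalance.log naming, which is itself an assumption.]

# Count per-subvol layout errors in the rebalance log (116 == ESTALE on Linux).
grep -o 'Subvol_name: [^,]*, Err: [^,]*' /var/log/glusterfs/videostore-las-rebalance.log \
    | grep -v 'Err: -1' | sort | uniq -c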
Possibly related to the gluster commands not working:

[2014-11-14 13:39:35.534600] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:38.537415] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:41.540273] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:44.542984] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:47.545857] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:50.551688] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:53.554345] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:56.557185] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:39:59.560002] W [socket.c:611:__socket_rwv] 0-management: readv on /var/run/c17c278617345958e75a372b836b1e5d.socket failed (Invalid argument)
[2014-11-14 13:40:00.224150] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on 10.35.13.85. Please check log file for details.
[2014-11-14 13:40:00.224980] E [glusterd-syncop.c:1602:gd_sync_task_begin] 0-management: Locking Peers Failed.
[2014-11-14 13:40:00.239014] E [glusterd-syncop.c:105:gd_collate_errors] 0-: Locking failed on 10.35.13.85. Please check log file for details.
[2014-11-14 13:40:00.239975] E [glusterd-syncop.c:1602:gd_sync_task_begin] 0-management: Locking Peers Failed.
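[Editorial note: a sketch of basic checks for the failing gluster commands; nothing here is specific to this bug, and the interpretation of the socket is an assumption.]

# Is glusterd healthy and is the peer named in the locking errors connected?
gluster peer status
gluster volume status videostore-las

# The /var/run/<hash>.socket in the readv warnings appears to be a local
# glusterd<->daemon socket; check whether anything is listening on it.
ss -xl | grep c17c278617345958e75a372b836b1e5d || true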
The locking issue was resolved by backporting commit https://github.com/gluster/glusterfs/commit/97ccd45fb66a63c0b2436a0245dfb9490e2941b7 to 3.6.1, so it was a separate problem; the rebalance issue is still present.

Title changed because the number of files is not always 202,908.

# gluster v rebalance videostore-las status
         Node  Rebalanced-files        size      scanned     failures      skipped       status   run time in secs
    ---------       -----------  ----------   ----------   ----------   ----------   ----------     --------------
    localhost              1118       1.0GB       247513            6            0    completed            7228.00
  10.35.13.85              5245       4.0GB       247519            6            0    completed            7228.00
  10.35.13.80              2954       2.8GB       247530            6            0    completed            7228.00
  10.35.13.93             12020       9.7GB       247506            6            0    completed            7228.00
  10.35.13.71              2763       2.3GB       247516            6            0    completed            7228.00
  10.35.13.94               240     321.9MB       247507            6            0    completed            7228.00
  10.35.13.88                 0      0Bytes       247508            6            0    completed            7228.00
 10.35.13.191              4644       3.8GB       247520            6            0    completed            7228.00
 10.35.13.192              2706       2.6GB       247520            6            0    completed            7228.00
 10.35.13.193              3815       2.9GB       247582            6            0    completed            7228.00
  10.35.13.86               779     670.1MB       247514            6            0    completed            7228.00
  10.35.13.77              1212       1.0GB       247509            6            0    completed            7228.00
 10.35.13.171             13472      10.4GB       247641            6            0    completed            7228.00
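[Editorial note: for reference, such a backport can be done roughly as below. This is a sketch; the branch name is arbitrary and the rebuild/packaging step depends on how GlusterFS was originally installed.]

# Cherry-pick the glusterd locking fix onto a 3.6.1 source tree.
git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git checkout -b v3.6.1-locking-fix v3.6.1
git cherry-pick 97ccd45fb66a63c0b2436a0245dfb9490e2941b7
# Rebuild and reinstall the patched packages as appropriate for the environment.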
This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. If you are still facing this issue in a more current release, please open a new bug against a version that still receives bugfixes.