I tried a simple creation of small files from the mount and removed the bricks. I am still seeing the same errors that were seen during rmdir. This bug needs to be fixed.

[root@dhcp47-197 ~]# gluster volume remove-brick testvol 10.70.46.142:/bricks/brick3/testvol_brick9 10.70.47.197:/bricks/brick3/testvol_brick10 10.70.47.175:/bricks/brick3/testvol_brick11 start
volume remove-brick start: success
ID: a767153b-2fc4-4228-9c1d-6ec51a6288c0

[root@dhcp47-197 ~]# gluster volume remove-brick testvol 10.70.46.142:/bricks/brick3/testvol_brick9 10.70.47.197:/bricks/brick3/testvol_brick10 10.70.47.175:/bricks/brick3/testvol_brick11 status
Node                                Rebalanced-files   size     scanned   failures   skipped   status        run time in h:m:s
---------                           -----------        -----------   -----------   -----------   -----------   ------------   --------------
localhost                           0                  0Bytes   0         1          0         failed        0:0:3
dhcp46-142.lab.eng.blr.redhat.com   0                  0Bytes   0         1          0         failed        0:0:2
10.70.47.175                        0                  0Bytes   0         0          0         in progress   0:13:4

From one of the rebalance logs:

<snip>
[2016-12-06 08:43:41.345522] I [MSGID: 109081] [dht-common.c:4000:dht_setxattr] 0-testvol-dht: fixing the layout of /
[2016-12-06 08:43:41.353765] E [MSGID: 109026] [dht-rebalance.c:3756:gf_defrag_start_crawl] 0-testvol-dht: fix layout on / failed
[2016-12-06 08:43:41.354343] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-testvol-dht: Rebalance is failed. Time taken is 3.00 secs
[2016-12-06 08:43:41.354364] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-testvol-dht: Files migrated: 0, size: 0, lookups: 0, failures: 1, skipped: 0
[2016-12-06 08:43:41.354703] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f2130a5ddc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f21320f1c45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f21320f1abb] ) 0-: received signum (15), shutting down
</snip>

#####################################################################

My application just errors out with the message below:

starting all threads by creating starting gate file /mnt/point/network_shared/starting_gate.tmp
Traceback (most recent call last):
  File "/mnt/point/smallfile/smallfile_remote.py", line 48, in <module>
    run_workload()
  File "/mnt/point/smallfile/smallfile_remote.py", line 39, in run_workload
    return multi_thread_workload.run_multi_thread_workload(params)
  File "/mnt/point/smallfile/multi_thread_workload.py", line 189, in run_multi_thread_workload
    sync_files.write_pickle(result_filename, invok_list)
  File "/mnt/point/smallfile/sync_files.py", line 19, in write_pickle
    with open(fpath, 'wb') as result_file:
IOError: [Errno 107] Transport endpoint is not connected: '/mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle'
ERROR: ssh thread for host dhcp47-116.lab.eng.blr.redhat.com completed with status 256
host dhcp47-116.lab.eng.blr.redhat.com filename /mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle: [Errno 107] Transport endpoint is not connected: '/mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle'
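
For anyone scripting this reproduction, the remove-brick status table in this comment can be checked mechanically. The following is only a rough sketch: the column order is assumed from the output pasted here and may differ in other GlusterFS versions, and the helper names (parse_status_row, failed_nodes) are made up for illustration.

```python
# Sketch: pick out nodes whose remove-brick/rebalance status is "failed".
# Assumption: each data row has the columns shown in this report, in order:
# node, rebalanced-files, size, scanned, failures, skipped, status, run time.

def parse_status_row(line):
    """Split one data row of the status table into named fields."""
    tokens = line.split()
    if len(tokens) < 8:
        raise ValueError(f"unexpected status row: {line!r}")
    return {
        "node": tokens[0],
        "rebalanced_files": int(tokens[1]),
        "size": tokens[2],
        "scanned": int(tokens[3]),
        "failures": int(tokens[4]),
        "skipped": int(tokens[5]),
        # The status may be more than one word ("in progress"), so it is
        # everything between the skipped column and the trailing run time.
        "status": " ".join(tokens[6:-1]),
        "run_time": tokens[-1],
    }


def failed_nodes(rows):
    """Return the node names whose status column reads 'failed'."""
    parsed = (parse_status_row(row) for row in rows)
    return [row["node"] for row in parsed if row["status"] == "failed"]


# Data rows copied from the status output in this comment.
SAMPLE_ROWS = [
    "localhost 0 0Bytes 0 1 0 failed 0:0:3",
    "dhcp46-142.lab.eng.blr.redhat.com 0 0Bytes 0 1 0 failed 0:0:2",
    "10.70.47.175 0 0Bytes 0 0 0 in progress 0:13:4",
]

if __name__ == "__main__":
    print(failed_nodes(SAMPLE_ROWS))
```

Header and separator rows are expected to be filtered out by the caller before the rows are passed in.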
There is one more bug [1] filed today which also complains about setxattr failure.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1401869

On Tue, Dec 6, 2016 at 4:24 PM, Tirumala Satya Prasad Desala <tdesala> wrote:

I have tried the same scenario now with distrep on 3.8.4-6 and I observed the Setxattr failure, but it is failing on other directories.

Output snippet below:

[2016-12-06 10:42:51.400525] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir/10.70.42.156/thrd_02
[2016-12-06 10:42:51.401220] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir/10.70.42.156
[2016-12-06 10:42:51.401850] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir

Regards,
Prasad

----- Original Message -----
From: "Susant Palai" <spalai>
To: "Atin Mukherjee" <amukherj>
Cc: "Nag Pavan Chilakam" <nchilaka>, "Karan Sandha" <ksandha>, "Tirumala Satya Prasad Desala" <tdesala>, "Rahul Hinduja" <rhinduja>, "Nithya Balachandran" <nbalacha>, "Raghavendra Gowdappa" <rgowdapp>, "Pranith Kumar Karampuri" <pkarampu>
Sent: Tuesday, December 6, 2016 4:13:49 PM
Subject: Re: Need info on https://bugzilla.redhat.com/show_bug.cgi?id=1400037

In one of the logs I found EAGAIN returned by blocking inodelk. I think that should be fixed by http://review.gluster.org/#/c/15984/. This is one issue where AFR unwinds EAGAIN for a blocking lk request.

I have sent a patch for gathering helpful logs for fix-layout failures here: http://review.gluster.org/#/c/16040/1.

I guess we would be needing both of them in the new build. In the meantime, I am planning to reproduce the fix-layout failures in house, and if I can reproduce them I will cherry-pick the patches and test it out.

-Susant

----- Original Message -----
> From: "Atin Mukherjee" <amukherj>
> To: "Nag Pavan Chilakam" <nchilaka>
> Cc: "Karan Sandha" <ksandha>, "Susant Palai" <spalai>, "Tirumala Satya Prasad Desala" <tdesala>, "Rahul Hinduja" <rhinduja>, "Nithya Balachandran" <nbalacha>, "Raghavendra Gowdappa" <rgowdapp>
> Sent: Tuesday, 6 December, 2016 3:45:39 PM
> Subject: Re: Need info on https://bugzilla.redhat.com/show_bug.cgi?id=1400037
>
> My question was more from a dist rep side only, I should have been precise about it.
>
> On Tue, Dec 6, 2016 at 3:44 PM, Nag Pavan Chilakam <nchilaka> wrote:
>
> > Arbiter is a new feature in 3.2, hence wouldn't have existed in 3.1.3
> > Prasad, Do you see this on a regular distrep volume?
> >
> > ----- Original Message -----
> > From: "Atin Mukherjee" <amukherj>
> > To: "Karan Sandha" <ksandha>, "Susant Palai" <spalai>
> > Cc: "Tirumala Satya Prasad Desala" <tdesala>, "Rahul Hinduja" <rhinduja>, "Nithya Balachandran" <nbalacha>, "Raghavendra Gowdappa" <rgowdapp>, "Nag Pavan Chilakam" <nchilaka>
> > Sent: Tuesday, 6 December, 2016 2:47:09 PM
> > Subject: Re: Need info on https://bugzilla.redhat.com/show_bug.cgi?id=1400037
> >
> > I guess this issue is there in 3.1.3 too, right?
> >
> > Susant - can you please clarify?
> >
> > On Tue, Dec 6, 2016 at 2:37 PM, Karan Sandha <ksandha> wrote:
> >
> > > Hi all,
> > >
> > > I tried a simple test of creating small files with the application residing on the volume, and while creation is in progress I remove the bricks and my application errored out and I saw the same errors in the rebalance logs.
> > >
> > > Thanks & regards
> > >
> > > Karan Sandha
> > >
> > > On 12/06/2016 12:28 PM, Tirumala Satya Prasad Desala wrote:
> > >
> > >> I did not see any testcase covering RMDIR +remove-brick scenario but I feel it is a valid use case.
> > >>
> > >> Regards,
> > >> Prasad
> > >>
> > >> ----- Original Message -----
> > >> From: "Rahul Hinduja" <rhinduja>
> > >> To: "Nithya Balachandran" <nbalacha>, "Raghavendra Gowdappa" <rgowdapp>
> > >> Cc: "Atin Mukherjee" <amukherj>, "Karan Sandha" <ksandha>, "Tirumala Satya Prasad Desala" <tdesala>
> > >> Sent: Tuesday, December 6, 2016 11:25:54 AM
> > >> Subject: Need info on https://bugzilla.redhat.com/show_bug.cgi?id=1400037
> > >>
> > >> Hi Nithya, Du,
> > >>
> > >> Can you please provide your thoughts on the BZ [1]. I see it is a valid use case where cu can have rmdir and remove brick at the same time, not sure if the dht covers it. @Prasad to add his thoughts on test case.
> > >>
> > >> [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1400037
> > >>
> > >> Thanks,
> > >> Rahul
> > >>
> >
> > --
> > ~ Atin (atinm)
>
> --
> ~ Atin (atinm)

--
~ Atin (atinm)
(In reply to nchilaka from comment #5)
> There is one more bug [1] filed today which also complains about setxattr
> failure.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1401869

The issue looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1401869. Hence marking this as a duplicate. For more info, follow this link [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1401869#c4

Please reopen this bug if it is seen even after 1401869 is fixed.
Oops, the duplicate flag got removed. Marking this as a duplicate of 1401869.

*** This bug has been marked as a duplicate of bug 1401869 ***
The symptom is the same as in 1401869, but the test case here is slightly different. Marking this as dependent on 1401869 and leaving this bug open so that there is no confusion.
While verifying this issue on 3.8.4-8, I made the following observations:

1) Continuous, repeated errors for failed lookups:

<SNIP>
[2016-12-13 11:22:11.661154] E [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_500_.d lookup failed with 2
[2016-12-13 11:22:11.663923] E [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_501_.d lookup failed with 2
</SNIP>

2) Setxattr failed error:

[2016-12-13 11:28:05.935641] E [dht-rebalance.c:3348:gf_defrag_fix_layout] 0-samsung-dht: Setxattr failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007/_05_9732_.d

3) Fix layout failing with errors:

[2016-12-13 11:28:05.936034] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007
[2016-12-13 11:28:05.936398] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009
[2016-12-13 11:28:05.936739] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05
[2016-12-13 11:28:05.937916] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com
[2016-12-13 11:28:05.938438] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir

4) Rebalance failure:

[2016-12-13 11:28:05.939820] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-samsung-dht: Rebalance is failed. Time taken is 378.00 secs
[2016-12-13 11:28:05.939848] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-samsung-dht: Files migrated: 0, size: 0, lookups: 0, failures: 7, skipped: 0

The original observations are resolved. The new findings are similar to Bug 1368437. Hence marking this bug as verified and updating the new findings on the other bug.
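
In the "Fix layout failed for" messages in this comment, the same failure appears once per ancestor directory, so the deepest path is usually the most useful place to start looking. The sketch below pulls those paths out of a rebalance log and returns the deepest one. It assumes the message text is exactly as quoted in this report, and the function names are hypothetical.

```python
import re

# Assumption: the message wording matches the log lines quoted in this bug;
# other GlusterFS versions may phrase it differently.
FIX_LAYOUT_RE = re.compile(r"Fix layout failed for\s+(\S+)")


def failed_dirs(log_text):
    """Return every path named in a 'Fix layout failed for' message."""
    return FIX_LAYOUT_RE.findall(log_text)


def deepest_failed_dir(log_text):
    """Return the failed path with the most components, or None if none match."""
    paths = failed_dirs(log_text)
    if not paths:
        return None
    # Depth is approximated by counting "/" separators in the path.
    return max(paths, key=lambda p: p.rstrip("/").count("/"))


# Three of the log lines quoted in this comment.
SAMPLE_LOG = """\
[2016-12-13 11:28:05.936034] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007
[2016-12-13 11:28:05.936398] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009
[2016-12-13 11:28:05.938438] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir
"""

if __name__ == "__main__":
    print(deepest_failed_dir(SAMPLE_LOG))
```

With these logs, the deepest directory is the parent of the entry named in the "Setxattr failed" message, which is consistent with the error propagating upward from there.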
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html
> With original observation are resolved. With new findings is similar to Bug
> 1368437. Hence marking this bug as verified and updating the new finding on
> the other bug.

Lookup failures are common and expected if the file/dir in question has been removed or renamed. This is not something we can fix.
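
For reference, the numeric codes in these logs can be decoded with Python's standard errno module. This is a small sketch that assumes the trailing 2 in "lookup failed with 2" is an errno value, which fits the explanation that the entry was removed or renamed; describe_errno is a made-up helper name.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME (message)' for a numeric errno using this platform's tables."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # 2 is the value in "lookup failed with 2".
    print(describe_errno(2))
    # 107 is the value in the application's IOError earlier in this bug.
    # Errno numbering above the basic set is platform-specific, so this
    # is only meaningful on Linux.
    print(describe_errno(107))
```

On Linux, 2 maps to ENOENT and 107 maps to ENOTCONN, the latter matching the "Transport endpoint is not connected" text in the application traceback.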