Bug 1400037 - [Arbiter] Fixed layout failed on the volume after remove-brick while rmdir is in progress
Summary: [Arbiter] Fixed layout failed on the volume after remove-brick while rmdir is...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Pranith Kumar K
QA Contact: Karan Sandha
URL:
Whiteboard:
Depends On: 1401869
Blocks: 1351528
 
Reported: 2016-11-30 10:30 UTC by Karan Sandha
Modified: 2017-08-22 07:29 UTC (History)

Fixed In Version: glusterfs-3.8.4-8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 05:52:58 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Comment 3 Karan Sandha 2016-12-06 09:05:11 UTC
I tried simply creating small files from the mount and then removing the brick. I am still seeing the same errors that I saw during rmdir. This bug needs to be fixed.


[root@dhcp47-197 ~]# gluster volume remove-brick testvol 10.70.46.142:/bricks/brick3/testvol_brick9 10.70.47.197:/bricks/brick3/testvol_brick10 10.70.47.175:/bricks/brick3/testvol_brick11 start
volume remove-brick start: success
ID: a767153b-2fc4-4228-9c1d-6ec51a6288c0
[root@dhcp47-197 ~]# gluster volume remove-brick testvol 10.70.46.142:/bricks/brick3/testvol_brick9 10.70.47.197:/bricks/brick3/testvol_brick10 10.70.47.175:/bricks/brick3/testvol_brick11 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               failed        0:0:3
       dhcp46-142.lab.eng.blr.redhat.com                0        0Bytes             0             1             0               failed        0:0:2
                            10.70.47.175                0        0Bytes             0             0             0          in progress        0:13:4
 
<snip> from one of the rebalance logs:- 
[2016-12-06 08:43:41.345522] I [MSGID: 109081] [dht-common.c:4000:dht_setxattr] 0-testvol-dht: fixing the layout of /
[2016-12-06 08:43:41.353765] E [MSGID: 109026] [dht-rebalance.c:3756:gf_defrag_start_crawl] 0-testvol-dht: fix layout on / failed
[2016-12-06 08:43:41.354343] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-testvol-dht: Rebalance is failed. Time taken is 3.00 secs
[2016-12-06 08:43:41.354364] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-testvol-dht: Files migrated: 0, size: 0, lookups: 0, failures: 1, skipped: 0
[2016-12-06 08:43:41.354703] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f2130a5ddc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f21320f1c45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f21320f1abb] ) 0-: received signum (15), shutting down

</snip>
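
For triaging rebalance logs like the snippet above, a small parser can help count error-level entries per function. The following is a minimal sketch, not part of the original report: the regular expression is inferred only from the log lines quoted in this bug, and the names `LOG_LINE`, `parse_line` and `count_errors` are hypothetical. Lines that do not fit the pattern are skipped.

```python
import re
from collections import Counter

# Assumed layout, inferred from the lines quoted in this bug:
# [timestamp] LEVEL [MSGID: n] [file:line:function] xlator: message
# The MSGID part is optional because some quoted lines omit it.
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<xlator>\S+):\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_errors(lines):
    """Count error-level ('E') entries, keyed by the reporting function."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "E":
            counts[record["func"]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[2016-12-06 08:43:41.345522] I [MSGID: 109081] "
        "[dht-common.c:4000:dht_setxattr] 0-testvol-dht: fixing the layout of /",
        "[2016-12-06 08:43:41.353765] E [MSGID: 109026] "
        "[dht-rebalance.c:3756:gf_defrag_start_crawl] 0-testvol-dht: "
        "fix layout on / failed",
    ]
    for func, n in count_errors(sample).items():
        print(func, n)
```

Feeding a whole rebalance log through `count_errors` gives a quick view of which function reported the failures before reading individual entries.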

#####################################################################
My application simply errored out with the message below:

starting all threads by creating starting gate file /mnt/point/network_shared/starting_gate.tmp
Traceback (most recent call last):
  File "/mnt/point/smallfile/smallfile_remote.py", line 48, in <module>
    run_workload()
  File "/mnt/point/smallfile/smallfile_remote.py", line 39, in run_workload
    return multi_thread_workload.run_multi_thread_workload(params)
  File "/mnt/point/smallfile/multi_thread_workload.py", line 189, in run_multi_thread_workload
    sync_files.write_pickle(result_filename, invok_list)
  File "/mnt/point/smallfile/sync_files.py", line 19, in write_pickle
    with open(fpath, 'wb') as result_file:
IOError: [Errno 107] Transport endpoint is not connected: '/mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle'
ERROR: ssh thread for host dhcp47-116.lab.eng.blr.redhat.com completed with status 256
host dhcp47-116.lab.eng.blr.redhat.com filename /mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle: [Errno 107] Transport endpoint is not connected: '/mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle'
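
On Linux, errno 107 in the traceback above is ENOTCONN ("Transport endpoint is not connected"), which a client sees when the mount has lost its connection. As a hedged illustration only, a workload script could detect this case explicitly instead of dying with a bare traceback. The helper names `is_endpoint_disconnected` and `write_result` are hypothetical and are not part of smallfile.

```python
import errno


def is_endpoint_disconnected(exc):
    """Return True if exc is an OSError (IOError in Python 3) carrying ENOTCONN."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOTCONN


def write_result(path, payload):
    """Write bytes to path; return False if the mount is disconnected.

    Any other I/O error is re-raised unchanged.
    """
    try:
        with open(path, "wb") as result_file:
            result_file.write(payload)
        return True
    except OSError as exc:
        if is_endpoint_disconnected(exc):
            return False
        raise
```

This only changes how the failure is reported on the client; it does not address the underlying disconnect discussed in this bug.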

Comment 5 Nag Pavan Chilakam 2016-12-06 11:24:41 UTC
There is one more bug [1] filed today which also complains about setxattr failure.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1401869

On Tue, Dec 6, 2016 at 4:24 PM, Tirumala Satya Prasad Desala <tdesala> wrote:

    I have now tried the same scenario with distrep on 3.8.4-6 and I observed the Setxattr failure, but it is failing on other directories.

    Output snippet below:

    [2016-12-06 10:42:51.400525] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir/10.70.42.156/thrd_02
    [2016-12-06 10:42:51.401220] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir/10.70.42.156
    [2016-12-06 10:42:51.401850] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir

    Regards,
    Prasad

    ----- Original Message -----
    From: "Susant Palai" <spalai>
    To: "Atin Mukherjee" <amukherj>
    Cc: "Nag Pavan Chilakam" <nchilaka>, "Karan Sandha" <ksandha>, "Tirumala Satya Prasad Desala" <tdesala>, "Rahul Hinduja" <rhinduja>, "Nithya Balachandran" <nbalacha>, "Raghavendra Gowdappa" <rgowdapp>, "Pranith Kumar Karampuri" <pkarampu>
    Sent: Tuesday, December 6, 2016 4:13:49 PM
    Subject: Re: Need info on https://bugzilla.redhat.com/show_bug.cgi?id=1400037

    In one of the logs I found EAGAIN returned by blocking inodelk. I think that should be fixed by http://review.gluster.org/#/c/15984/.
    This is one issue where AFR unwinds EAGAIN for a blocking lk request.

    I have sent a patch for gathering helpful logs for fix-layout failures here: http://review.gluster.org/#/c/16040/1.

    I guess we will need both of them in the new build. In the meantime, I plan to reproduce the fix-layout failures in house, and
    if I can reproduce them, I will cherry-pick the patches and test them.

    -Susant

    ----- Original Message -----
    > From: "Atin Mukherjee" <amukherj>
    > To: "Nag Pavan Chilakam" <nchilaka>
    > Cc: "Karan Sandha" <ksandha>, "Susant Palai" <spalai>, "Tirumala Satya Prasad Desala"
    > <tdesala>, "Rahul Hinduja" <rhinduja>, "Nithya Balachandran" <nbalacha>,
    > "Raghavendra Gowdappa" <rgowdapp>
    > Sent: Tuesday, 6 December, 2016 3:45:39 PM
    > Subject: Re: Need info on https://bugzilla.redhat.com/show_bug.cgi?id=1400037
    >
    > My question was more from a dist rep side only, I should have been precise
    > about it.
    >
    > On Tue, Dec 6, 2016 at 3:44 PM, Nag Pavan Chilakam <nchilaka>
    > wrote:
    >
    > > Arbiter is a new feature in 3.2 , hence wouldn't have existed in 3.1.3
    > > Prasad, Do you see this on a regular distrep volume?
    > >
    > > ----- Original Message -----
    > > From: "Atin Mukherjee" <amukherj>
    > > To: "Karan Sandha" <ksandha>, "Susant Palai" <spalai
    > > >
    > > Cc: "Tirumala Satya Prasad Desala" <tdesala>, "Rahul Hinduja" <
    > > rhinduja>, "Nithya Balachandran" <nbalacha>,
    > > "Raghavendra Gowdappa" <rgowdapp>, "Nag Pavan Chilakam" <
    > > nchilaka>
    > > Sent: Tuesday, 6 December, 2016 2:47:09 PM
    > > Subject: Re: Need info on https://bugzilla.redhat.com/
    > > show_bug.cgi?id=1400037
    > >
    > > I guess this issue is there in 3.1.3 too, right?
    > >
    > > Susant - can you please clarify?
    > >
    > > On Tue, Dec 6, 2016 at 2:37 PM, Karan Sandha <ksandha> wrote:
    > >
    > > > Hi all,
    > > >
    > > > I tried a simple test of creating small files with an application residing
    > > on
    > > > the volume, and while creation was in progress I removed the bricks. My
    > > > application errored out and I saw the same errors in the rebalance logs.
    > > >
    > > > Thanks & regards
    > > >
    > > > Karan Sandha
    > > >
    > > >
    > > >
    > > > On 12/06/2016 12:28 PM, Tirumala Satya Prasad Desala wrote:
    > > >
    > > >> I did not see any testcase covering RMDIR +remove-brick scenario but I
    > > >> feel it is a valid use case.
    > > >>
    > > >> Regards,
    > > >> Prasad
    > > >>
    > > >> ----- Original Message -----
    > > >> From: "Rahul Hinduja" <rhinduja>
    > > >> To: "Nithya Balachandran" <nbalacha>, "Raghavendra
    > > Gowdappa" <
    > > >> rgowdapp>
    > > >> Cc: "Atin Mukherjee" <amukherj>, "Karan Sandha" <
    > > >> ksandha>, "Tirumala Satya Prasad Desala" <tdesala
    > > >
    > > >> Sent: Tuesday, December 6, 2016 11:25:54 AM
    > > >> Subject: Need info on https://bugzilla.redhat.com/
    > > show_bug.cgi?id=1400037
    > > >>
    > > >> Hi Nithya, Du,
    > > >>
    > > >> Can you please provide your thoughts on the BZ [1]. I see it is a valid
    > > >> use case where cu can have rmdir and remove brick at the same time, not
    > > >> sure if the dht covers it. @Prasad to add his thoughts on test case.
    > > >>
    > > >> [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1400037
    > > >>
    > > >> Thanks,
    > > >> Rahul
    > > >>
    > > >
    > > >
    > >
    > >
    > > --
    > >
    > > ~ Atin (atinm)
    > >
    >
    >
    >
    > --
    >
    > ~ Atin (atinm)
    >




-- 

~ Atin (atinm)

Comment 8 Susant Kumar Palai 2016-12-08 06:21:56 UTC
(In reply to nchilaka from comment #5)


The issue looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1401869. Hence marking this as duplicate.

For more info follow this link [1]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1401869#c4


Please reopen this bug if seen even after 1401869 is fixed.

Comment 9 Susant Kumar Palai 2016-12-08 06:23:02 UTC
Oops, the duplicate flag got removed.

Marking this duplicate of 1401869.

*** This bug has been marked as a duplicate of bug 1401869 ***

Comment 10 Susant Kumar Palai 2016-12-08 06:38:17 UTC
The symptom is the same as in 1401869, but the test case here is slightly different. Marking this bug as dependent on 1401869 and leaving it open, so that there is no confusion.

Comment 13 Karan Sandha 2016-12-14 06:50:49 UTC
While verifying this issue on 3.8.4-8, I made the following observations:

1) Continuous, repeated errors for failed lookups:
<SNIP>
[2016-12-13 11:22:11.661154] E [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_500_.d lookup failed with 2
[2016-12-13 11:22:11.663923] E [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_501_.d lookup failed with 2
</SNIP>

2) Setxattr failed ERROR:-

[2016-12-13 11:28:05.935641] E [dht-rebalance.c:3348:gf_defrag_fix_layout] 0-samsung-dht: Setxattr failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007/_05_9732_.d

3) Fix layout failing on ERRORs:-

[2016-12-13 11:28:05.936034] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007
[2016-12-13 11:28:05.936398] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009
[2016-12-13 11:28:05.936739] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05
[2016-12-13 11:28:05.937916] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com
[2016-12-13 11:28:05.938438] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir


4) Rebalance Failure:-

[2016-12-13 11:28:05.939820] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-samsung-dht: Rebalance is failed. Time taken is 378.00 secs
[2016-12-13 11:28:05.939848] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-samsung-dht: Files migrated: 0, size: 0, lookups: 0, failures: 7, skipped: 0



The original observations are resolved. The new findings are similar to Bug 1368437. Hence marking this bug as verified and adding the new findings to the other bug.

Comment 15 errata-xmlrpc 2017-03-23 05:52:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

Comment 16 Nithya Balachandran 2017-08-22 07:29:40 UTC
> 
> With original observation are resolved. With new findings is similar to Bug
> 1368437. Hence marking this bug as verified and updating the new finding on
> the other bug.


Lookup failures are common and expected if the file/dir in question has been removed or renamed. This is not something we can fix.
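
The "lookup failed with 2" entries quoted in comment 13 fit this explanation if the trailing number is read as a POSIX errno value, since errno 2 is ENOENT ("No such file or directory"). The sketch below decodes such lines under that assumption. The pattern is inferred only from the two lines quoted in this bug, the names `LOOKUP_FAILED` and `decode_lookup_failure` are hypothetical, and paths containing spaces are not handled.

```python
import errno
import os
import re

# Assumption: the number after "lookup failed with" is a POSIX errno value.
LOOKUP_FAILED = re.compile(r"(?P<path>/\S+) lookup failed with (?P<code>\d+)$")


def decode_lookup_failure(line):
    """Return (path, errno name, errno description), or None if no match."""
    match = LOOKUP_FAILED.search(line.strip())
    if match is None:
        return None
    code = int(match.group("code"))
    name = errno.errorcode.get(code, "UNKNOWN")
    return match.group("path"), name, os.strerror(code)


if __name__ == "__main__":
    line = (
        "[2016-12-13 11:22:11.663923] E "
        "[dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: "
        "/file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_501_.d "
        "lookup failed with 2"
    )
    print(decode_lookup_failure(line))
```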

