Bug 1400037

Summary: [Arbiter] Fix layout failed on the volume after remove-brick while rmdir is in progress
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Karan Sandha <ksandha>
Component: replicate Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA QA Contact: Karan Sandha <ksandha>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.2 CC: amukherj, asrivast, nbalacha, nchilaka, rcyriac, rhinduja, rhs-bugs, spalai, storage-qa-internal
Target Milestone: --- Keywords: Reopened
Target Release: RHGS 3.2.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-23 05:52:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1401869    
Bug Blocks: 1351528    

Comment 3 Karan Sandha 2016-12-06 09:05:11 UTC
I tried simply creating small files from the mount while removing the bricks. I am still seeing the same errors that I saw during rmdir. This bug needs to be fixed.


[root@dhcp47-197 ~]# gluster volume remove-brick testvol 10.70.46.142:/bricks/brick3/testvol_brick9 10.70.47.197:/bricks/brick3/testvol_brick10 10.70.47.175:/bricks/brick3/testvol_brick11 start
volume remove-brick start: success
ID: a767153b-2fc4-4228-9c1d-6ec51a6288c0
[root@dhcp47-197 ~]# gluster volume remove-brick testvol 10.70.46.142:/bricks/brick3/testvol_brick9 10.70.47.197:/bricks/brick3/testvol_brick10 10.70.47.175:/bricks/brick3/testvol_brick11 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               failed        0:0:3
       dhcp46-142.lab.eng.blr.redhat.com                0        0Bytes             0             1             0               failed        0:0:2
                            10.70.47.175                0        0Bytes             0             0             0          in progress        0:13:4
 
<snip> from one of the rebalance logs:- 
[2016-12-06 08:43:41.345522] I [MSGID: 109081] [dht-common.c:4000:dht_setxattr] 0-testvol-dht: fixing the layout of /
[2016-12-06 08:43:41.353765] E [MSGID: 109026] [dht-rebalance.c:3756:gf_defrag_start_crawl] 0-testvol-dht: fix layout on / failed
[2016-12-06 08:43:41.354343] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-testvol-dht: Rebalance is failed. Time taken is 3.00 secs
[2016-12-06 08:43:41.354364] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-testvol-dht: Files migrated: 0, size: 0, lookups: 0, failures: 1, skipped: 0
[2016-12-06 08:43:41.354703] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f2130a5ddc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f21320f1c45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f21320f1abb] ) 0-: received signum (15), shutting down

</snip>
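
When triaging a rebalance log like the snippet above, it can help to pull out only the error-level entries together with their MSGID. The following is a minimal sketch, assuming the line layout seen in the entries quoted in this bug (timestamp, one-letter level, optional MSGID, source location, translator name, message). The regular expression and the helper names `parse_line` and `error_records` are illustrative and inferred from these samples only; they are not taken from any GlusterFS tool or format specification.

```python
import re

# Layout inferred from the log lines quoted in this bug report (an assumption,
# not a documented format): [timestamp] LEVEL [MSGID: n] [file:line:func] xlator: message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"   # MSGID is absent on some lines
    r"\[(?P<src>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s+"
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def error_records(lines):
    """Keep only the entries logged at level E (error)."""
    parsed = (parse_line(line) for line in lines)
    return [rec for rec in parsed if rec and rec["level"] == "E"]
```

Lines that do not follow this layout (for example the `cleanup_and_exit` backtrace entry) are simply skipped by `error_records`.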

#####################################################################
My application simply errored out with the message below:-

starting all threads by creating starting gate file /mnt/point/network_shared/starting_gate.tmp
Traceback (most recent call last):
  File "/mnt/point/smallfile/smallfile_remote.py", line 48, in <module>
    run_workload()
  File "/mnt/point/smallfile/smallfile_remote.py", line 39, in run_workload
    return multi_thread_workload.run_multi_thread_workload(params)
  File "/mnt/point/smallfile/multi_thread_workload.py", line 189, in run_multi_thread_workload
    sync_files.write_pickle(result_filename, invok_list)
  File "/mnt/point/smallfile/sync_files.py", line 19, in write_pickle
    with open(fpath, 'wb') as result_file:
IOError: [Errno 107] Transport endpoint is not connected: '/mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle'
ERROR: ssh thread for host dhcp47-116.lab.eng.blr.redhat.com completed with status 256
host dhcp47-116.lab.eng.blr.redhat.com filename /mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle: [Errno 107] Transport endpoint is not connected: '/mnt/point/network_shared/dhcp47-116.lab.eng.blr.redhat.com_result.pickle'

Comment 5 Nag Pavan Chilakam 2016-12-06 11:24:41 UTC
There is one more bug [1] filed today which also complains about setxattr failure.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1401869

On Tue, Dec 6, 2016 at 4:24 PM, Tirumala Satya Prasad Desala <tdesala> wrote:

    I have now tried the same scenario with distrep on 3.8.4-6 and I observed the Setxattr failure, but it is failing on other directories.

    Output snippet below:

    [2016-12-06 10:42:51.400525] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir/10.70.42.156/thrd_02
    [2016-12-06 10:42:51.401220] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir/10.70.42.156
    [2016-12-06 10:42:51.401850] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-distrep-dht: Fix layout failed for /file_srcdir

    Regards,
    Prasad

    ----- Original Message -----
    From: "Susant Palai" <spalai>
    To: "Atin Mukherjee" <amukherj>
    Cc: "Nag Pavan Chilakam" <nchilaka>, "Karan Sandha" <ksandha>, "Tirumala Satya Prasad Desala" <tdesala>, "Rahul Hinduja" <rhinduja>, "Nithya Balachandran" <nbalacha>, "Raghavendra Gowdappa" <rgowdapp>, "Pranith Kumar Karampuri" <pkarampu>
    Sent: Tuesday, December 6, 2016 4:13:49 PM
    Subject: Re: Need info on https://bugzilla.redhat.com/show_bug.cgi?id=1400037

    In one of the logs I found EAGAIN returned by blocking inodelk. I think that should be fixed by http://review.gluster.org/#/c/15984/.
    This is one issue where AFR unwinds EAGAIN for a blocking lk request.

    I have sent a patch for gathering helpful logs for fix-layout failures here: http://review.gluster.org/#/c/16040/1.

    I guess we will need both of them in the new build. In the meantime I plan to reproduce the fix-layout failures in house, and
    if I can reproduce them I will cherry-pick the patches and test them out.

    -Susant

    ----- Original Message -----
    > From: "Atin Mukherjee" <amukherj>
    > To: "Nag Pavan Chilakam" <nchilaka>
    > Cc: "Karan Sandha" <ksandha>, "Susant Palai" <spalai>, "Tirumala Satya Prasad Desala"
    > <tdesala>, "Rahul Hinduja" <rhinduja>, "Nithya Balachandran" <nbalacha>,
    > "Raghavendra Gowdappa" <rgowdapp>
    > Sent: Tuesday, 6 December, 2016 3:45:39 PM
    > Subject: Re: Need info on https://bugzilla.redhat.com/show_bug.cgi?id=1400037
    >
    > My question was more from a dist rep side only, I should have been precise
    > about it.
    >
    > On Tue, Dec 6, 2016 at 3:44 PM, Nag Pavan Chilakam <nchilaka>
    > wrote:
    >
    > > Arbiter is a new feature in 3.2 , hence wouldn't have existed in 3.1.3
    > > Prasad, Do you see this on a regular distrep volume?
    > >
    > > ----- Original Message -----
    > > From: "Atin Mukherjee" <amukherj>
    > > To: "Karan Sandha" <ksandha>, "Susant Palai" <spalai
    > > >
    > > Cc: "Tirumala Satya Prasad Desala" <tdesala>, "Rahul Hinduja" <
    > > rhinduja>, "Nithya Balachandran" <nbalacha>,
    > > "Raghavendra Gowdappa" <rgowdapp>, "Nag Pavan Chilakam" <
    > > nchilaka>
    > > Sent: Tuesday, 6 December, 2016 2:47:09 PM
    > > Subject: Re: Need info on https://bugzilla.redhat.com/
    > > show_bug.cgi?id=1400037
    > >
    > > I guess this issue is there in 3.1.3 too, right?
    > >
    > > Susant - can you please clarify?
    > >
    > > On Tue, Dec 6, 2016 at 2:37 PM, Karan Sandha <ksandha> wrote:
    > >
    > > > Hi all,
    > > >
    > > > I tried a simple test of creating small files with the application
    > > > residing on the volume. While creation was in progress I removed the
    > > > bricks; my application errored out and I saw the same errors in the
    > > > rebalance logs.
    > > >
    > > > Thanks & regards
    > > >
    > > > Karan Sandha
    > > >
    > > >
    > > >
    > > > On 12/06/2016 12:28 PM, Tirumala Satya Prasad Desala wrote:
    > > >
    > > >> I did not see any testcase covering RMDIR +remove-brick scenario but I
    > > >> feel it is a valid use case.
    > > >>
    > > >> Regards,
    > > >> Prasad
    > > >>
    > > >> ----- Original Message -----
    > > >> From: "Rahul Hinduja" <rhinduja>
    > > >> To: "Nithya Balachandran" <nbalacha>, "Raghavendra
    > > Gowdappa" <
    > > >> rgowdapp>
    > > >> Cc: "Atin Mukherjee" <amukherj>, "Karan Sandha" <
    > > >> ksandha>, "Tirumala Satya Prasad Desala" <tdesala
    > > >
    > > >> Sent: Tuesday, December 6, 2016 11:25:54 AM
    > > >> Subject: Need info on https://bugzilla.redhat.com/
    > > show_bug.cgi?id=1400037
    > > >>
    > > >> Hi Nithya, Du,
    > > >>
    > > >> Can you please provide your thoughts on the BZ [1]. I see it is a valid
    > > >> use case where cu can have rmdir and remove brick at the same time, not
    > > >> sure if the dht covers it. @Prasad to add his thoughts on test case.
    > > >>
    > > >> [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1400037
    > > >>
    > > >> Thanks,
    > > >> Rahul
    > > >>
    > > >
    > > >
    > >
    > >
    > > --
    > >
    > > ~ Atin (atinm)
    > >
    >
    >
    >
    > --
    >
    > ~ Atin (atinm)
    >




-- 

~ Atin (atinm)

Comment 8 Susant Kumar Palai 2016-12-08 06:21:56 UTC
(In reply to nchilaka from comment #5)


The issue looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1401869. Hence marking this as duplicate.

For more info follow this link [1]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1401869#c4


Please reopen this bug if seen even after 1401869 is fixed.

Comment 9 Susant Kumar Palai 2016-12-08 06:23:02 UTC
Oops, the duplicate flag got removed.

Marking this duplicate of 1401869.

*** This bug has been marked as a duplicate of bug 1401869 ***

Comment 10 Susant Kumar Palai 2016-12-08 06:38:17 UTC
The symptom is the same as in 1401869, but here the test case is slightly different. Marking this as dependent on 1401869 and leaving this bug open, so that there is no confusion.

Comment 13 Karan Sandha 2016-12-14 06:50:49 UTC
While verifying this issue on 3.8.4-8, I made the following observations:-

1) Continuous, repeated errors for failed lookups:-
<SNIP>
[2016-12-13 11:22:11.661154] E [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_500_.d lookup failed with 2
[2016-12-13 11:22:11.663923] E [dht-rebalance.c:3334:gf_defrag_fix_layout] 0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/_07_501_.d lookup failed with 2
</SNIP>

2) Setxattr failed ERROR:-

[2016-12-13 11:28:05.935641] E [dht-rebalance.c:3348:gf_defrag_fix_layout] 0-samsung-dht: Setxattr failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007/_05_9732_.d

3) Fix layout failed ERRORs:-

[2016-12-13 11:28:05.936034] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009/d_007
[2016-12-13 11:28:05.936398] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05/d_009
[2016-12-13 11:28:05.936739] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_05
[2016-12-13 11:28:05.937916] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com
[2016-12-13 11:28:05.938438] E [MSGID: 109016] [dht-rebalance.c:3378:gf_defrag_fix_layout] 0-samsung-dht: Fix layout failed for /file_srcdir


4) Rebalance Failure:-

[2016-12-13 11:28:05.939820] I [MSGID: 109028] [dht-rebalance.c:4126:gf_defrag_status_get] 0-samsung-dht: Rebalance is failed. Time taken is 378.00 secs
[2016-12-13 11:28:05.939848] I [MSGID: 109028] [dht-rebalance.c:4130:gf_defrag_status_get] 0-samsung-dht: Files migrated: 0, size: 0, lookups: 0, failures: 7, skipped: 0



The original observations are resolved. The new findings are similar to Bug 1368437. Hence marking this bug as verified and updating the other bug with the new findings.
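
In the "Fix layout failed" entries listed under observation 3, each failure is reported again for every parent directory up to /file_srcdir, so only the deepest path points at where the failure started. A minimal sketch for reducing such a chain to its origin is shown here. It assumes the message text "Fix layout failed for <path>" exactly as it appears in the logs quoted in this bug, and that paths contain no whitespace; the helper name `failure_origins` is hypothetical.

```python
import re

# Message text taken from the MSGID 109016 entries quoted in this bug.
FIX_LAYOUT_RE = re.compile(r"Fix layout failed for (\S+)")


def failure_origins(lines):
    """Return the failed paths that have no failed descendant.

    A parent directory is reported as failed whenever one of its children
    failed, so the paths without a failed descendant are the likely origins.
    """
    failed = []
    for line in lines:
        match = FIX_LAYOUT_RE.search(line)
        if match:
            # Normalise trailing slashes but keep "/" itself intact.
            failed.append(match.group(1).rstrip("/") or "/")

    origins = set()
    for path in failed:
        prefix = "/" if path == "/" else path + "/"
        has_failed_child = any(
            other != path and other.startswith(prefix) for other in failed
        )
        if not has_failed_child:
            origins.add(path)
    return sorted(origins)
```

Fed the five entries from observation 3, this would report only the `.../thrd_05/d_009/d_007` directory, which is the parent of the directory named in the Setxattr failure in observation 2.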

Comment 15 errata-xmlrpc 2017-03-23 05:52:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

Comment 16 Nithya Balachandran 2017-08-22 07:29:40 UTC
> 
> The original observations are resolved. The new findings are similar to Bug
> 1368437. Hence marking this bug as verified and updating the other bug with
> the new findings.


Lookup failures are common and expected if the file/dir in question has been removed or renamed. This is not something we can fix.
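
To separate these expected lookup failures from the rest when reading a rebalance log, the number at the end of a "lookup failed with N" entry can be decoded. The sketch below assumes that N is a standard errno value, which the log does not state explicitly; under that assumption, 2 is ENOENT (no such file or directory), which fits a directory removed or renamed while the crawl was running. The helper names `lookup_errno` and `is_expected_lookup_failure` are hypothetical.

```python
import errno
import re

# Message text as it appears in the gf_defrag_fix_layout entries quoted in this bug.
LOOKUP_RE = re.compile(r"lookup failed with (\d+)")


def lookup_errno(line):
    """Return the numeric code from a 'lookup failed with N' entry, or None."""
    match = LOOKUP_RE.search(line)
    return int(match.group(1)) if match else None


def is_expected_lookup_failure(line):
    """True when the lookup failed with ENOENT (assuming N is an errno value)."""
    return lookup_errno(line) == errno.ENOENT


if __name__ == "__main__":
    sample = (
        "[2016-12-13 11:22:11.661154] E [dht-rebalance.c:3334:gf_defrag_fix_layout] "
        "0-samsung-dht: /file_srcdir/dhcp47-116.lab.eng.blr.redhat.com/thrd_07/d_005/"
        "_07_500_.d lookup failed with 2"
    )
    code = lookup_errno(sample)
    print(code, errno.errorcode.get(code), is_expected_lookup_failure(sample))
```

Entries for which `is_expected_lookup_failure` returns False, or that carry no lookup code at all (such as the Setxattr failure), are the ones that would still need a closer look.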