Bug 1428936

Summary: [GSS]Remove-brick operation is slow in a distribute-replicate volume in RHGS 3.1.3
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Cal Calhoun <ccalhoun>
Component: distribute
Assignee: Susant Kumar Palai <spalai>
Status: CLOSED ERRATA
QA Contact: Prasad Desala <tdesala>
Severity: medium
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: amukherj, asrivast, bkunal, ccalhoun, nbalacha, olim, omasek, pousley, ravishankar, rcyriac, rhinduja, rhs-bugs, rnalakka, spalai, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Linux
Whiteboard: dht-rebalance
Fixed In Version: glusterfs-3.8.4-27
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-21 04:33:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417145

Description Cal Calhoun 2017-03-03 16:17:04 UTC
Description of problem:

Observing the below Assertion failed messages in rebalance logs.

[2017-02-27 02:41:30.131290] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:43:14.836106] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:44:27.161614] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:44:33.495690] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:45:09.172526] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)

On a different node, different volume:

[2017-03-01 06:40:32.458913] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:40:51.784109] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:40:52.208967] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:41:01.669186] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:41:15.839481] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)

Version-Release number of selected component (if applicable):

3.7.9-12.el7rhgs.x86_64

How reproducible:

Happens regularly on the customer's trusted storage pool.

Additional info:

23 Node Storage Pool
58 Volumes
All volumes showing the assertion errors have a distribute component (pure distribute, distributed-replicate, etc.).

Comment 6 Susant Kumar Palai 2017-03-09 10:30:45 UTC
Could not reproduce the issue on my test machine.

Created an 8x2 volume, populated it with data (dirs and files), and ran multiple remove-bricks, but found no errors.

Will give it a few more tries.
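For reference, the reproduction attempt above can be sketched with the gluster CLI (a smaller 2x2 layout is shown for brevity; server names and brick paths are illustrative):

```shell
# Distributed-replicate (replica 2) test volume.
# Bricks pair in order: the two b1 bricks mirror each other, as do the b2 bricks.
gluster volume create testvol replica 2 \
  server1:/bricks/b1 server2:/bricks/b1 \
  server1:/bricks/b2 server2:/bricks/b2
gluster volume start testvol

# Populate dirs/files via a fuse mount, then shrink the volume:
gluster volume remove-brick testvol server1:/bricks/b2 server2:/bricks/b2 start
gluster volume remove-brick testvol server1:/bricks/b2 server2:/bricks/b2 status

# Any migration errors land in the rebalance log on each node, e.g.
# /var/log/glusterfs/testvol-rebalance.log
```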

Comment 7 Bipin Kunal 2017-03-13 13:48:46 UTC
@ Susant, do you need any additional information from Cal or the customer that you think might be useful for reproducing this?

@ Cal, can you try to reproduce the issue with a miniature version of the customer environment?

Comment 8 Susant Kumar Palai 2017-03-13 14:42:29 UTC
(In reply to Bipin Kunal from comment #7)
> @ Susant, Do you need any additional information from Cal or customer, which
> you think might be useful for reproducing ?
> 
> @ Cal, Can you even try to to reproduce issue with miniature version of
> customer environment.

The problem at hand points to a memory overrun. I went through the rebalance code and could not find any evidence of such a problem, and whether it was caused by some other translator (e.g. AFR) cannot be confirmed from the logs, as they do not identify the translator responsible.

A reproducer would be highly helpful here.

In the meantime, a few more pieces of information would help:
1- Xattr information on the directories and files.
2- What kinds of operations were running in parallel?

-Susant

Comment 9 Cal Calhoun 2017-03-17 18:40:21 UTC
@ Bipin: I'll try to set up a simplified reproducer tomorrow.

@ Susant: I'll ask the customer to supply the additional information.

-Cal

Comment 10 Cal Calhoun 2017-03-18 18:35:45 UTC
@ Susant: Can you supply a command that will return the Xattr information you need?  I'm not sure exactly what you're looking for.  I've asked the customer about what else might have been running in parallel.
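For reference, gluster's internal extended attributes (e.g. trusted.glusterfs.dht layout ranges, trusted.gfid) are usually dumped directly on the brick filesystem with getfattr, run as root; the brick path below is illustrative:

```shell
# Dump all xattrs (including trusted.*) in hex on a directory as stored
# on the brick; repeat per brick to compare DHT layouts across subvolumes.
getfattr -d -m . -e hex /bricks/brick1/dir1
```

The `-m .` pattern is needed because getfattr only shows user.* attributes by default, and `-e hex` keeps binary values readable.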

Comment 52 Nithya Balachandran 2017-05-08 15:20:28 UTC
Does the customer have hardlinks to his files?

Comment 121 Prasad Desala 2017-07-20 12:51:44 UTC
Verified this BZ on glusterfs version 3.8.4-33.el7rhgs.x86_64. Followed the same steps as in Comment 107; the script didn't throw any errors, and all the files on the bricks migrated successfully as expected.

Moving this BZ to Verified.

Comment 124 Oonkwee Lim 2017-08-02 21:01:37 UTC
*** Bug 1467495 has been marked as a duplicate of this bug. ***

Comment 132 errata-xmlrpc 2017-09-21 04:33:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
