Bug 1428936

Summary: [GSS]Remove-brick operation is slow in a distribute-replicate volume in RHGS 3.1.3
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Cal Calhoun <ccalhoun>
Component: distribute
Assignee: Susant Kumar Palai <spalai>
Status: CLOSED ERRATA
QA Contact: Prasad Desala <tdesala>
Severity: medium
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: amukherj, asrivast, bkunal, ccalhoun, nbalacha, olim, omasek, pousley, ravishankar, rcyriac, rhinduja, rhs-bugs, rnalakka, spalai, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Linux
Whiteboard: dht-rebalance
Fixed In Version: glusterfs-3.8.4-27
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-21 04:33:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1417145

Description Cal Calhoun 2017-03-03 16:17:04 UTC
Description of problem:

Observing the below Assertion failed messages in rebalance logs.

[2017-02-27 02:41:30.131290] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:43:14.836106] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:44:27.161614] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:44:33.495690] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:45:09.172526] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)

On a different node, different volume:

[2017-03-01 06:40:32.458913] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:40:51.784109] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:40:52.208967] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:41:01.669186] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:41:15.839481] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)

Version-Release number of selected component (if applicable):

3.7.9-12.el7rhgs.x86_64

How reproducible:

Happens regularly on the customer's trusted storage pool.

Additional info:

23 Node Storage Pool
58 Volumes
All volumes showing the assertion errors have a distribute component (pure distribute, distributed-replicate, etc.).

Comment 6 Susant Kumar Palai 2017-03-09 10:30:45 UTC
Could not reproduce the issue on my test machine.

Created an 8x2 volume, populated it with data (dirs and files), and ran multiple remove-bricks, but found no errors.

Will give it a few more tries.
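For reference, the reproduction attempt above can be sketched with the gluster CLI (a smaller 2x2 layout is shown for brevity; server names and brick paths are illustrative):

```shell
# Distributed-replicate (replica 2) test volume.
# Bricks pair in order: the two b1 bricks mirror each other, as do the b2 bricks.
gluster volume create testvol replica 2 \
  server1:/bricks/b1 server2:/bricks/b1 \
  server1:/bricks/b2 server2:/bricks/b2
gluster volume start testvol

# Populate dirs/files via a fuse mount, then shrink the volume:
gluster volume remove-brick testvol server1:/bricks/b2 server2:/bricks/b2 start
gluster volume remove-brick testvol server1:/bricks/b2 server2:/bricks/b2 status

# Any migration errors land in the rebalance log on each node, e.g.
# /var/log/glusterfs/testvol-rebalance.log
```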

Comment 7 Bipin Kunal 2017-03-13 13:48:46 UTC
@ Susant, do you need any additional information from Cal or the customer that you think might be useful for reproducing this?

@ Cal, can you try to reproduce the issue with a miniature version of the customer environment?

Comment 8 Susant Kumar Palai 2017-03-13 14:42:29 UTC
(In reply to Bipin Kunal from comment #7)
> @ Susant, Do you need any additional information from Cal or customer, which
> you think might be useful for reproducing ?
> 
> @ Cal, Can you even try to to reproduce issue with miniature version of
> customer environment.

The problem at hand points to a memory overrun. I went through the rebalance code and could not find any evidence of such a problem, and whether it was caused by some other translator (e.g. AFR) cannot be confirmed from the logs, as they do not identify the translator responsible.

A reproducer would be highly helpful here.

In the meantime, a few more pieces of information would help:
1- Xattr information on the directories and files.
2- What kinds of operations were running in parallel?

-Susant

Comment 9 Cal Calhoun 2017-03-17 18:40:21 UTC
@ Bipin: I'll try to set up a simplified reproducer tomorrow.

@ Susant: I'll ask the customer to supply the additional information.

-Cal

Comment 10 Cal Calhoun 2017-03-18 18:35:45 UTC
@ Susant: Can you supply a command that will return the Xattr information you need?  I'm not sure exactly what you're looking for.  I've asked the customer about what else might have been running in parallel.
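For reference, gluster's internal extended attributes (e.g. trusted.glusterfs.dht layout ranges, trusted.gfid) are usually dumped directly on the brick filesystem with getfattr, run as root; the brick path below is illustrative:

```shell
# Dump all xattrs (including trusted.*) in hex on a directory as stored
# on the brick; repeat per brick to compare DHT layouts across subvolumes.
getfattr -d -m . -e hex /bricks/brick1/dir1
```

The `-m .` pattern is needed because getfattr only shows user.* attributes by default, and `-e hex` keeps binary values readable.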

Comment 52 Nithya Balachandran 2017-05-08 15:20:28 UTC
Does the customer have hardlinks to his files?

Comment 121 Prasad Desala 2017-07-20 12:51:44 UTC
Verified this BZ on glusterfs version 3.8.4-33.el7rhgs.x86_64. Followed the same steps as in Comment 107; the script didn't throw any errors, and all the files on the bricks migrated successfully as expected.

Moving this BZ to Verified.

Comment 124 Oonkwee Lim 2017-08-02 21:01:37 UTC
*** Bug 1467495 has been marked as a duplicate of this bug. ***

Comment 132 errata-xmlrpc 2017-09-21 04:33:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
