Bug 1428936 - [GSS]Remove-brick operation is slow in a distribute-replicate volume in RHGS 3.1.3
Summary: [GSS]Remove-brick operation is slow in a distribute-replicate volume in RHGS 3.1.3
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Susant Kumar Palai
QA Contact: Prasad Desala
URL:
Whiteboard: dht-rebalance
Depends On:
Blocks: 1417145
 
Reported: 2017-03-03 16:17 UTC by Cal Calhoun
Modified: 2020-12-14 08:17 UTC
CC List: 15 users

Fixed In Version: glusterfs-3.8.4-27
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-21 04:33:25 UTC
Embargoed:


Links:
Red Hat Product Errata RHBA-2017:2774 (normal, SHIPPED_LIVE): glusterfs bug fix and enhancement update. Last updated: 2017-09-21 08:16:29 UTC

Description Cal Calhoun 2017-03-03 16:17:04 UTC
Description of problem:

Observing the following "Assertion failed" messages in the rebalance logs:

[2017-02-27 02:41:30.131290] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:43:14.836106] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:44:27.161614] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:44:33.495690] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-02-27 02:45:09.172526] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7f303b897750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7f303b896fd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7f303b8de6dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)

On a different node, different volume:

[2017-03-01 06:40:32.458913] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:40:51.784109] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:40:52.208967] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:41:01.669186] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)
[2017-03-01 06:41:15.839481] E [mem-pool.c:314:__gf_free] (-->/lib64/libglusterfs.so.0(dict_destroy+0x40) [0x7fe904ceb750] -->/lib64/libglusterfs.so.0(data_destroy+0x55) [0x7fe904ceafd5] -->/lib64/libglusterfs.so.0(__gf_free+0xfc) [0x7fe904d326dc] ) 0-: Assertion failed: GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size)

Version-Release number of selected component (if applicable):

3.7.9-12.el7rhgs.x86_64

How reproducible:

Happens regularly on the customer's trusted storage pool.

Additional info:

23-node storage pool
58 volumes
All volumes showing the assertion errors have a distribute component: pure distribute, distributed-replicate, etc.

Comment 6 Susant Kumar Palai 2017-03-09 10:30:45 UTC
Could not reproduce the issue on my test machine.

Created an 8x2 volume and created data (dirs and files). Did multiple remove-brick operations, but found no errors.

Will give it a few more tries.
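
For reference, the attempt was roughly of the following shape; a minimal sketch, with hypothetical hostnames (serverA/serverB) and brick paths, not the actual test setup:

# create an 8x2 distributed-replicate volume (8 replica pairs across two hosts)
gluster volume create testvol replica 2 \
    serverA:/bricks/b1 serverB:/bricks/b1 serverA:/bricks/b2 serverB:/bricks/b2 \
    serverA:/bricks/b3 serverB:/bricks/b3 serverA:/bricks/b4 serverB:/bricks/b4 \
    serverA:/bricks/b5 serverB:/bricks/b5 serverA:/bricks/b6 serverB:/bricks/b6 \
    serverA:/bricks/b7 serverB:/bricks/b7 serverA:/bricks/b8 serverB:/bricks/b8
gluster volume start testvol

# populate the volume with directories and files from a client mount, then
# remove one replica pair and watch the rebalance log for the assertion
gluster volume remove-brick testvol serverA:/bricks/b8 serverB:/bricks/b8 start
gluster volume remove-brick testvol serverA:/bricks/b8 serverB:/bricks/b8 status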

Comment 7 Bipin Kunal 2017-03-13 13:48:46 UTC
@Susant: Do you need any additional information from Cal or the customer that might be useful for reproducing this?

@Cal: Can you try to reproduce the issue with a miniature version of the customer environment?

Comment 8 Susant Kumar Palai 2017-03-13 14:42:29 UTC
(In reply to Bipin Kunal from comment #7)
> @Susant: Do you need any additional information from Cal or the customer
> that might be useful for reproducing this?
> 
> @Cal: Can you try to reproduce the issue with a miniature version of the
> customer environment?

The problem at hand points to a memory overrun: GF_MEM_TRAILER_MAGIC is a guard value written just past the end of each allocation, so a failed assertion means something wrote beyond the end of a buffer before it was freed. I went through the rebalance code and could not find any evidence of such a problem, and whether the overrun was caused by some other translator, e.g. AFR, cannot be confirmed from the logs, since they do not identify the translator responsible.

A reproducer would be very helpful here.

In the meantime, a few more pieces of information would help:
1- Xattr information for the directories and files.
2- What kind of operations were running in parallel?

-Susant

Comment 9 Cal Calhoun 2017-03-17 18:40:21 UTC
@ Bipin: I'll try to set up a simplified reproducer tomorrow.

@ Susant: I'll ask the customer to supply the additional information.

-Cal

Comment 10 Cal Calhoun 2017-03-18 18:35:45 UTC
@ Susant: Can you supply a command that will return the Xattr information you need?  I'm not sure exactly what you're looking for.  I've asked the customer about what else might have been running in parallel.
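
In case it helps, my assumption is that something like the following is what is wanted, run against the affected directories and files on each brick (the path below is a placeholder):

# dump all extended attributes, hex-encoded, for a path on a brick
getfattr -d -m . -e hex /bricks/brick1/path/to/dir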

Comment 52 Nithya Balachandran 2017-05-08 15:20:28 UTC
Does the customer have hardlinks to his files?
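
A quick way to check, run from a client mount point (the mount path below is a placeholder):

# list regular files with a link count greater than 1, i.e. hardlinked files
find /mnt/testvol -type f -links +1 -print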

Comment 121 Prasad Desala 2017-07-20 12:51:44 UTC
Verified this BZ on glusterfs version 3.8.4-33.el7rhgs.x86_64. Followed the same steps as in Comment 107; the script did not throw any errors, and all the files on the bricks migrated successfully as expected.

Moving this BZ to Verified.

Comment 124 Oonkwee Lim 2017-08-02 21:01:37 UTC
*** Bug 1467495 has been marked as a duplicate of this bug. ***

Comment 132 errata-xmlrpc 2017-09-21 04:33:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

