Bug 1769315 - Rebalance is causing glusterfs crash on client node
Summary: Rebalance is causing glusterfs crash on client node
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Depends On: 1759141 1756325 1757399 1760779 1786983 1804522
Blocks:
Reported: 2019-11-06 11:26 UTC by Nithya Balachandran
Modified: 2020-02-19 03:48 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1757399
Environment:
Last Closed: 2019-11-27 06:02:05 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Links
System: Gluster.org Gerrit
ID: 23686
Priority: None
Status: Merged
Summary: cluster/dht: Correct fd processing loop
Last Updated: 2019-11-27 06:02:03 UTC

Comment 1 Nithya Balachandran 2019-11-07 03:06:49 UTC
Description of problem:
When add-brick and rebalance are performed on a distributed-dispersed volume while the cmvlt Python script (see attachment for the script) is running as the IO workload, the glusterfs process on the client node crashes.

Version-Release number of selected component (if applicable):
6.0.14

How reproducible:
1/1

Steps to Reproduce:
1. Create a 3X(4+2) volume.
2. Mount the volume on a client node using FUSE.
3. On the client node, create multiple directories (5 in this case) and run the Python script simultaneously from all 5 directories.
4. Add bricks to the volume to convert it into 4X(4+2) and trigger rebalance.
5. Wait for the rebalance to complete.

Actual results:
* Rebalance completes successfully, without any failures.
* However, the client node sees IO errors, and the glusterfs process crashes, generating a core file.

Expected results:
There should be no IO errors and no glusterfs crash on the client node.


An unsafe loop that processed fds in the DHT rebalance check tasks caused the client process to crash because it operated on an fd that had already been freed.
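The report does not show the original loop. One common way such a loop becomes unsafe, assumed here only for illustration, is this: it walks the inode's fd list under the inode lock, drops the lock to do blocking work on the current fd (for example, reopening it on the destination subvolume), and then continues the walk. While the lock is released, another thread can drop the last reference to that fd, or to the saved next entry. The sketch below is a minimal, generic version of one safe pattern, not the change made in review 23506/23686. It pins every fd with a reference while holding the lock, does the work unlocked, and then drops the pins. All types and functions (`toy_inode_t`, `toy_fd_t`, `process_fds`, `reopen_on_dst`, ...) are hypothetical simplifications, not GlusterFS APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for GlusterFS fd_t and inode_t. */
typedef struct toy_fd {
    struct toy_fd *next;   /* link in inode->fds; protected by inode->lock */
    int refcount;          /* protected by inode->lock in this sketch */
    int opened_on_dst;     /* set by the (pretend) blocking work */
} toy_fd_t;

typedef struct {
    pthread_mutex_t lock;
    toy_fd_t *fds;         /* singly linked list of open fds */
} toy_inode_t;

static void toy_inode_init(toy_inode_t *inode) {
    pthread_mutex_init(&inode->lock, NULL);
    inode->fds = NULL;
}

/* Creates an fd holding one reference and links it into the inode. */
static toy_fd_t *toy_fd_open(toy_inode_t *inode) {
    toy_fd_t *fd = calloc(1, sizeof *fd);
    if (!fd)
        return NULL;
    fd->refcount = 1;
    pthread_mutex_lock(&inode->lock);
    fd->next = inode->fds;
    inode->fds = fd;
    pthread_mutex_unlock(&inode->lock);
    return fd;
}

/* Drops one reference; the last unref unlinks and frees the fd. */
static void toy_fd_unref(toy_inode_t *inode, toy_fd_t *fd) {
    int destroy = 0;
    pthread_mutex_lock(&inode->lock);
    assert(fd->refcount > 0);
    if (--fd->refcount == 0) {
        toy_fd_t **pp = &inode->fds;
        while (*pp != fd)
            pp = &(*pp)->next;
        *pp = fd->next;
        destroy = 1;
    }
    pthread_mutex_unlock(&inode->lock);
    if (destroy)
        free(fd);
}

/* Hypothetical stand-in for work that must not run with inode->lock held. */
static void reopen_on_dst(toy_fd_t *fd) { fd->opened_on_dst = 1; }

/* Safe loop: pin every fd with a reference while the lock is held,
 * then do the blocking work unlocked, then release the pins.
 * Returns the number of fds processed, or -1 on allocation failure. */
static int process_fds(toy_inode_t *inode) {
    pthread_mutex_lock(&inode->lock);
    int n = 0;
    for (toy_fd_t *it = inode->fds; it; it = it->next)
        n++;
    toy_fd_t **snap = malloc((size_t)(n ? n : 1) * sizeof *snap);
    if (!snap) {
        pthread_mutex_unlock(&inode->lock);
        return -1;
    }
    int i = 0;
    for (toy_fd_t *it = inode->fds; it; it = it->next) {
        it->refcount++;    /* pin: cannot reach zero while we hold it */
        snap[i++] = it;
    }
    pthread_mutex_unlock(&inode->lock);

    for (i = 0; i < n; i++) {
        reopen_on_dst(snap[i]);          /* safe: snap[i] is pinned */
        toy_fd_unref(inode, snap[i]);    /* may free it if others let go */
    }
    free(snap);
    return n;
}
```

Snapshotting costs one allocation per pass. In exchange, the loop never has to trust a list pointer it read before releasing the lock.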


It looks like the fd has already been freed.


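The crash is a NULL pointer dereference: fd->inode is 0x0 in the fd dump below, and fd->inode is set to NULL in fd_destroy. fds are allocated from the mempools using mem_get, with a pool header placed immediately before each object. Checking the pool header info: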



(gdb) f 1
#1  0x00007f3923004af7 in fd_unref (fd=0x7f3910ccec28) at fd.c:515
515	    LOCK(&fd->inode->lock);
(gdb) p *fd
$1 = {pid = 13340, flags = 33345, refcount = {lk = 0x7f3910ccec38 "\t", value = 9}, inode_list = {next = 0x7f3910ccec40, prev = 0x7f3910ccec40}, inode = 0x0, lock = {
    spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = -1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 16 times>, "\377\377\377\377", '\000' <repeats 19 times>, __align = 0}}, _ctx = 0x7f3910c73b70, xl_count = 39, lk_ctx = 0x7f39100ede90, 
  anonymous = false}
(gdb) p ((pooled_obj_hdr_t *)fd)-1
$2 = (pooled_obj_hdr_t *) 0x7f3910ccec00
(gdb) p sizeof(pooled_obj_hdr_t)
$3 = 40
(gdb) p/x sizeof(pooled_obj_hdr_t)
$4 = 0x28
(gdb) p *$2
$5 = {magic = 3735929054, next = 0x7f3910c3cec0, pool_list = 0x7f3910000960, power_of_two = 8, pool = 0x7f39100605c0}
(gdb) p/x *$2
$6 = {magic = 0xdeadc0de, next = 0x7f3910c3cec0, pool_list = 0x7f3910000960, power_of_two = 0x8, pool = 0x7f39100605c0}
(gdb) 


$6->magic = 0xdeadc0de 

#define GF_MEM_INVALID_MAGIC 0xDEADC0DE

In mem_put:

    hdr->magic = GF_MEM_INVALID_MAGIC;


Since fd_destroy calls mem_put, this magic value indicates that the fd has already been freed (returned to the pool).
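The gdb step `((pooled_obj_hdr_t *)fd)-1` works because the object handed out by mem_get sits immediately after its pool header. Subtracting one header (sizeof = 0x28) from the fd address gives 0x7f3910ccec28 - 0x28 = 0x7f3910ccec00, the value printed as `$2`. The sketch below is a hypothetical toy pool that models the same layout and the poisoning that mem_put applies with GF_MEM_INVALID_MAGIC. The `toy_*` names, the two-field header, and the "live" magic value are assumptions for illustration, not the GlusterFS mem-pool implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* 0xDEADC0DE mirrors GF_MEM_INVALID_MAGIC from the report;
 * the "live" value is an assumption for this sketch. */
#define TOY_LIVE_MAGIC    0xCAFEBABEUL
#define TOY_INVALID_MAGIC 0xDEADC0DEUL

/* Simplified, hypothetical header placed immediately before each object. */
typedef struct toy_hdr {
    unsigned long magic;
    struct toy_hdr *next;   /* free-list link while the object is pooled */
} toy_hdr_t;

typedef struct {
    size_t obj_size;
    toy_hdr_t *free_list;
} toy_pool_t;

/* Hands out memory just past a header, reusing pooled objects first. */
static void *toy_get(toy_pool_t *pool) {
    toy_hdr_t *hdr = pool->free_list;
    if (hdr) {
        pool->free_list = hdr->next;
    } else {
        hdr = malloc(sizeof *hdr + pool->obj_size);
        if (!hdr)
            return NULL;
    }
    hdr->magic = TOY_LIVE_MAGIC;
    hdr->next = NULL;
    return hdr + 1;
}

/* Returns an object to the pool and poisons its header magic. */
static void toy_put(toy_pool_t *pool, void *obj) {
    toy_hdr_t *hdr = (toy_hdr_t *)obj - 1;  /* same step as the gdb "-1" */
    assert(hdr->magic == TOY_LIVE_MAGIC);   /* catches a double put */
    hdr->magic = TOY_INVALID_MAGIC;
    hdr->next = pool->free_list;
    pool->free_list = hdr;
}

/* Reads the header in front of obj; memory stays owned by the pool. */
static int toy_is_live(const void *obj) {
    const toy_hdr_t *hdr = (const toy_hdr_t *)obj - 1;
    return hdr->magic == TOY_LIVE_MAGIC;
}

/* Frees pooled objects only; live objects must be put back first. */
static void toy_pool_destroy(toy_pool_t *pool) {
    while (pool->free_list) {
        toy_hdr_t *hdr = pool->free_list;
        pool->free_list = hdr->next;
        free(hdr);
    }
}
```

Because a pooled object is kept on a free list rather than returned to the system, its poisoned header remains readable. That is also why gdb could still print the stale fd contents and the 0xdeadc0de magic.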

To double-check, inspect the memory header of fd->_ctx, which is allocated using GF_CALLOC:

(gdb) p fd->_ctx
$13 = (struct _fd_ctx *) 0x7f3910c73b70
(gdb) p *(((struct mem_header *)0x7f3910c73b70) -1)
$14 = {type = 269061216, size = 139883059273280, mem_acct = 0x0, magic = 0, padding = {0, 0, 0, 0, 0, 0, 0, 0}}
(gdb) p/x *(((struct mem_header *)0x7f3910c73b70) -1)
$15 = {type = 0x10098c60, size = 0x7f39100ede40, mem_acct = 0x0, magic = 0x0, padding = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
(gdb) 



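The header members are invalid as well (magic is 0 and size holds what looks like a pointer value), which is consistent with fd->_ctx having been freed too.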

--- Additional comment from Worker Ant on 2019-10-01 12:18:39 UTC ---

REVIEW: https://review.gluster.org/23506 (cluster/dht: Correct fd processing loop) posted (#1) for review on master by N Balachandran

--- Additional comment from Worker Ant on 2019-10-02 13:43:37 UTC ---

REVIEW: https://review.gluster.org/23506 (cluster/dht: Correct fd processing loop) merged (#6) on master by Xavi Hernandez

Comment 2 Worker Ant 2019-11-07 03:29:11 UTC
REVIEW: https://review.gluster.org/23686 (cluster/dht: Correct fd processing loop) posted (#1) for review on release-7 by N Balachandran

Comment 3 Worker Ant 2019-11-27 06:02:05 UTC
REVIEW: https://review.gluster.org/23686 (cluster/dht: Correct fd processing loop) merged (#2) on release-7 by N Balachandran

