Bug 1662828

Summary: Longevity: glusterfsd(brick process) crashed when we do volume creates and deletes
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: core
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Rochelle <rallan>
Severity: urgent
Docs Contact:
Priority: high
Version: rhgs-3.4
CC: amukherj, kiyer, nchilaka, rcyriac, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vdas
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.z Batch Update 3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.12.2-37
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1662906 (view as bug list)
Environment:
Last Closed: 2019-02-04 07:41:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1662906
Attachments: script (flags: none)

Description Nag Pavan Chilakam 2019-01-02 06:09:05 UTC
Description of problem:
======================
I was running a test to check whether the memory footprint keeps increasing when we create/delete volumes, and whether an OOM kill can happen. I was also retesting "bz#1661144 - Longevity: Over time brickmux feature not being honored(ie new bricks spawning) and bricks not getting attached to brick process", without heketi.

After about 3 days I found that the brick process had crashed; this was the brick whose memory footprint was increasing.


(gdb) bt
#0  0x00007f3e88e5ce30 in pthread_detach () from /lib64/libpthread.so.0
#1  0x00007f3e77ded1df in posix_spawn_health_check_thread (xl=0x7f328698ae60) at posix-helpers.c:1858
#2  0x00007f3e77de80f1 in init (this=<optimized out>) at posix.c:7935
#3  0x00007f3e89ffa1db in __xlator_init (xl=0x7f328698ae60) at xlator.c:472
#4  xlator_init (xl=xl@entry=0x7f328698ae60) at xlator.c:500
#5  0x00007f3e8a033ae9 in glusterfs_graph_init (graph=graph@entry=0x7f328693f410) at graph.c:363
#6  0x00007f3e8a035314 in glusterfs_graph_attach (orig_graph=0x7f3e78004230, path=<optimized out>, 
    newgraph=newgraph@entry=0x7f3964340748) at graph.c:1248
#7  0x000055ad7dc2813d in glusterfs_handle_attach (req=0x7f3964006cb8) at glusterfsd-mgmt.c:978
#8  0x00007f3e8a0361f0 in synctask_wrap () at syncop.c:375
#9  0x00007f3e8866e010 in ?? () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) t a a bt

Thread 2565 (Thread 0x7f323a833700 (LWP 39875)):
#0  0x00007f3e88e5f965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3e76ca4883 in changelog_ev_connector (data=0x7f321dcc7838) at changelog-ev-handle.c:205
#2  0x00007f3e88e5bdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f3e88723ead in clone () from /lib64/libc.so.6

Thread 2564 (Thread 0x7f2fef88e700 (LWP 43331)):
#0  0x00007f3e88e5f965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3e76862303 in br_stub_signth (arg=<optimized out>) at bit-rot-stub.c:867
#2  0x00007f3e88e5bdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f3e88723ead in clone () from /lib64/libc.so.6

Thread 2563 (Thread 0x7f30030a8700 (LWP 43332)):
#0  0x00007f3e88e5f965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3e76860d2b in br_stub_worker (data=<optimized out>) at bit-rot-stub-helpers.c:375
#2  0x00007f3e88e5bdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f3e88723ead in clone () from /lib64/libc.so.6

Thread 2562 (Thread 0x7f34f4ad5700 (LWP 47099)):
#0  0x00007f3e88e5f965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3e76ca4883 in changelog_ev_connector (data=0x7f311a363d48) at changelog-ev-handle.c:205
#2  0x00007f3e88e5bdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f3e88723ead in clone () from /lib64/libc.so.6

Thread 2561 (Thread 0x7f34e382d700 (LWP 47106)):
#0  0x00007f3e88e5f965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f3e77ded70b in posix_fsyncer_pick (this=this@entry=0x7f322c2a7e00, head=head@entry=0x7f34e382ce80)
    at posix-helpers.c:1988


Version-Release number of selected component (if applicable):
============================
3.12.2-34


How reproducible:
================
hit once

Steps to Reproduce:
1) 3-node setup, brick multiplexing enabled, with the default cluster.max-bricks-per-process of 250
2) Created 1 volume which will NOT be deleted throughout the test
3) Create about 100 volumes and start them, then create the next set of 100 volumes and delete the old 100 volumes (a sketch of this loop follows the list)
4) So at any time at most 201 volumes exist
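
For reference, a hedged sketch of this churn loop as a small C driver over the gluster CLI. The attached script is the authoritative reproducer; the host names (n1/n2/n3), brick paths, and volume names below are assumptions:

/* Sketch of the create/delete churn from the steps above, driven via
 * system(3). The gluster CLI invocations are real commands; hosts,
 * paths and names are illustrative. Runs until stopped (the original
 * test ran for about 3 days before the crash). */
#include <stdio.h>
#include <stdlib.h>

#define BATCH_SIZE 100

int main(void)
{
    char cmd[1024];

    for (int batch = 0; ; batch++) {
        /* create and start the next set of 100 volumes */
        for (int i = 0; i < BATCH_SIZE; i++) {
            snprintf(cmd, sizeof(cmd),
                     "gluster volume create vol_%d_%d replica 3 "
                     "n1:/bricks/b_%d_%d n2:/bricks/b_%d_%d n3:/bricks/b_%d_%d force"
                     " && gluster volume start vol_%d_%d",
                     batch, i, batch, i, batch, i, batch, i, batch, i);
            if (system(cmd) != 0)
                fprintf(stderr, "create/start failed: batch %d vol %d\n",
                        batch, i);
        }
        if (batch == 0)
            continue; /* nothing to delete yet */
        /* delete the previous set, so at most ~201 volumes exist at once
         * (including the one long-lived volume from step 2) */
        for (int i = 0; i < BATCH_SIZE; i++) {
            snprintf(cmd, sizeof(cmd),
                     "gluster --mode=script volume stop vol_%d_%d && "
                     "gluster --mode=script volume delete vol_%d_%d",
                     batch - 1, i, batch - 1, i);
            if (system(cmd) != 0)
                fprintf(stderr, "stop/delete failed: batch %d vol %d\n",
                        batch - 1, i);
        }
    }
    return 0;
}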

Actual results:
===========
After about 3 days of the test, the crash was seen.

Comment 7 Nag Pavan Chilakam 2019-01-04 13:17:33 UTC
Created attachment 1518390 [details]
script

Comment 18 errata-xmlrpc 2019-02-04 07:41:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263