Bug 1790336 - brick crashed when creating and deleting volumes over time (with brick mux enabled only)
Summary: brick crashed when creating and deleting volumes over time (with brick mux enabled only)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.5.z Batch Update 3
Assignee: Mohit Agrawal
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Duplicates: 1653225 1761423 (view as bug list)
Depends On:
Blocks: 1656682 1811631
 
Reported: 2020-01-13 07:18 UTC by Nag Pavan Chilakam
Modified: 2020-12-17 04:51 UTC
CC List: 8 users

Fixed In Version: glusterfs-6.0-38
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1811631 (view as bug list)
Environment:
Last Closed: 2020-12-17 04:51:13 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:5603 0 None None None 2020-12-17 04:51:38 UTC

Description Nag Pavan Chilakam 2020-01-13 07:18:33 UTC
Description of problem:
Was executing the non-functional test "RHG3-13273 - Memory Leak Testing: check for memory leaks when all but one volume are created and deleted in loop".

The idea is to check for a memory leak across volume creates and deletes without disturbing the first volume created (so that glusterfsd is never stopped, since at least one volume is always linked to it).

The test involves creating 100 volumes of arbiter or x3 type and then deleting them.
The above is executed over many iterations.

However, after about 4 days, glusterfsd crashed with the details below.


tail of glusterfsd log:
=======================

[2020-01-13 06:01:12.384086] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-basevol_x3-server: Shutting down connection CTX_ID:0f899e45-573a-4983-be30-3b6c9a583a02-GRAPH_ID:0-PID:27959-HOST:rhsqa12.lab.eng.blr.redhat.com-PC_NAME:arb_4yk1nis18o9rk-client-2-RECON_NO:-0
[2020-01-13 06:01:12.384210] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-basevol_x3-server: Shutting down connection CTX_ID:0f899e45-573a-4983-be30-3b6c9a583a02-GRAPH_ID:0-PID:27959-HOST:rhsqa12.lab.eng.blr.redhat.com-PC_NAME:arb_aj5xr35kgw9ju-client-2-RECON_NO:-0
[2020-01-13 06:01:12.384398] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-basevol_x3-server: Shutting down connection CTX_ID:0f899e45-573a-4983-be30-3b6c9a583a02-GRAPH_ID:0-PID:27959-HOST:rhsqa12.lab.eng.blr.redhat.com-PC_NAME:arb_bdxziwl7awxly-client-1-RECON_NO:-0
[2020-01-13 06:01:14.649922] E [posix-helpers.c:2301:posix_disk_space_check] (-->/lib64/libpthread.so.0(+0x7ea5) [0x7f4a1d347ea5] -->/usr/lib64/glusterfs/6.0/xlator/storage/posix.so(+0xc3c4) [0x7f4a0fbc73c4] -->/usr/lib64/glusterfs/6.0/xlator/storage/posix.so(+0xc269) [0x7f4a0fbc7269] ) 0-rep_rviudtsghmn2j-posix: invalid argument: priv [Invalid argument]
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2020-01-13 06:01:14
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.0
/lib64/libglusterfs.so.0(+0x271e0)[0x7f4a1e5091e0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f4a1e513c04]
/lib64/libc.so.6(+0x363f0)[0x7f4a1cb453f0]
/usr/lib64/glusterfs/6.0/xlator/storage/posix.so(+0xc3c4)[0x7f4a0fbc73c4]
/lib64/libpthread.so.0(+0x7ea5)[0x7f4a1d347ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f4a1cc0d8cd]
---------


gdb /core.5248 


Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/b5/a5458535a1397fa6baaf5e8c13a6395426a1b2

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfsd -s rhsqa10.lab.eng.blr.redhat.com --volfile-id basevol_x3.'.
Program terminated with signal 11, Segmentation fault.
#0  posix_disk_space_check_thread_proc (data=0x7f0440edbb20) at posix-helpers.c:2358
2358	        if (!priv->disk_space_check_active)
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 libacl-2.2.51-14.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7_7.1.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 sqlite-3.7.17-8.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) 
(gdb) bt
#0  posix_disk_space_check_thread_proc (data=0x7f0440edbb20) at posix-helpers.c:2358
#1  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#2  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6
(gdb) 
#0  posix_disk_space_check_thread_proc (data=0x7f0440edbb20) at posix-helpers.c:2358
#1  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#2  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6
(gdb) 
#0  posix_disk_space_check_thread_proc (data=0x7f0440edbb20) at posix-helpers.c:2358
#1  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#2  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6
(gdb) t a  a bt

Thread 1071 (Thread 0x7f051ffff700 (LWP 44662)):
#0  0x00007f4a1d34ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4a0fbc760b in posix_fsyncer_pick (this=this@entry=0x7f060fd2bbe0, head=head@entry=0x7f051fffe140)
    at posix-helpers.c:2417
#2  0x00007f4a0fbc7895 in posix_fsyncer (d=0x7f060fd2bbe0) at posix-helpers.c:2507
#3  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6

Thread 1070 (Thread 0x7f4a072fb700 (LWP 5259)):
#0  0x00007f4a1d34ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4a0ea73df3 in changelog_ev_connector (data=0x7f4a08073678) at changelog-ev-handle.c:221
#2  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6

Thread 1069 (Thread 0x7f01c4e91700 (LWP 50570)):
#0  0x00007f4a1d34ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4a0ea73df3 in changelog_ev_connector (data=0x7f081260ec68) at changelog-ev-handle.c:221
#2  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6

Thread 1068 (Thread 0x7f01c7e97700 (LWP 50349)):
#0  0x00007f4a1d34ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4a0fbc760b in posix_fsyncer_pick (this=this@entry=0x7f03267fd250, head=head@entry=0x7f01c7e96140)
    at posix-helpers.c:2417
---Type <return> to continue, or q <return> to quit--- 
#2  0x00007f4a0fbc7895 in posix_fsyncer (d=0x7f03267fd250) at posix-helpers.c:2507
#3  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6

Thread 1067 (Thread 0x7f01c0688700 (LWP 50576)):
#0  0x00007f4a1d34ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4a0fbc760b in posix_fsyncer_pick (this=this@entry=0x7f0490344540, head=head@entry=0x7f01c0687140)
    at posix-helpers.c:2417
#2  0x00007f4a0fbc7895 in posix_fsyncer (d=0x7f0490344540) at posix-helpers.c:2507
#3  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6

Thread 1066 (Thread 0x7f0199ffb700 (LWP 51193)):
#0  0x00007f4a1d34ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4a0fbc760b in posix_fsyncer_pick (this=this@entry=0x7f099b003070, head=head@entry=0x7f0199ffa140)
    at posix-helpers.c:2417
#2  0x00007f4a0fbc7895 in posix_fsyncer (d=0x7f099b003070) at posix-helpers.c:2507
#3  0x00007f4a1d347ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f4a1cc0d8cd in clone () from /lib64/libc.so.6

Thread 1065 (Thread 0x7f02edf93700 (LWP 46493)):
#0  0x00007f4a1d34ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4a0fbc760b in posix_fsyncer_pick (this=this@entry=0x7f0796332880, head=head@entry=0x7f02edf92140)
    at posix-helpers.c:2417
#2  0x00007f4a0fbc7895 in posix_fsyncer (d=0x7f0796332880) at posix-helpers.c:2507






Version-Release number of selected component (if applicable):
=================
6.0.25


How reproducible:
=============
Given that it is a non-functional test run over days, this was hit on the first run of the test.


Steps to Reproduce:
===============
1. Created a 3-node setup with brick mux enabled.
2. Created 2 volumes: one called base_x3 of type x3, and another of type arbiter named "basevol_bitter".
The above 2 volumes are not deleted at all throughout the test.

3. Now started creating volumes, starting them in batches of 100, and then deleting them.

4. The above step 3 was to go on for a few days.

Actual results:
===============
The brick on n1 crashed, with a coredump file size of *256GB*.



Expected results:


Additional info:

Comment 1 Nag Pavan Chilakam 2020-01-13 07:19:46 UTC
Marking as a regression, as this test passed in 3.5.0.
Also, proposing as a blocker.
Will attach the logs; however, note that the core file is >256GB in size.

Comment 2 Mohit Agrawal 2020-01-13 07:36:17 UTC
Can you please share the logs?

Thanks,
Mohit Agrawal

Comment 4 Mohit Agrawal 2020-01-13 08:27:34 UTC
Hi,

As per the coredump, glusterfsd crashed because the posix disk-space-check thread was not able to access the priv object.


p data
$1 = (void *) 0x7f0440edbb20
(gdb) p (xlator_t*) data
$2 = (xlator_t *) 0x7f0440edbb20
(gdb) p *(xlator_t*) data
$3 = {name = 0x7f0cf75b1b00 "rep_rviudtsghmn2j-posix", type = 0x7f09123c7a70 "storage/posix", instance_name = 0x0, next = 0x0, 
  prev = 0x7f091266b410, parents = 0x7f0913b82f60, children = 0x0, options = 0x7f09123c7948, dlhandle = 0x7f4a08008f10, 
  fops = 0x7f4a0fe050e0 <fops>, cbks = 0x7f4a0fe05080 <cbks>, dumpops = 0x7f4a0fe05480 <dumpops>, volume_options = {next = 0x7f0cf774d520, 
    prev = 0x7f00cd02b120}, fini = 0x7f4a0fbf8260 <posix_fini>, init = 0x7f4a0fbf68e0 <posix_init>, 
  reconfigure = 0x7f4a0fbf5f60 <posix_reconfigure>, mem_acct_init = 0x7f4a0fbf5f40 <mem_acct_init>, dump_metrics = 0x0, 
  notify = 0x7f4a0fbf5d90 <posix_notify>, loglevel = GF_LOG_NONE, stats = {total = {metrics = {{fop = {lk = 0x7f0440edbbc8 "", value = 0}, 
          cbk = {lk = 0x7f0440edbbd0 "", value = 0}} <repeats 59 times>}, count = {lk = 0x7f0440edbf78 "", value = 0}}, interval = {latencies = {
        {min = 0, max = 0, total = 0, count = 0} <repeats 59 times>}, metrics = {{fop = {lk = 0x7f0440edc6e0 "", value = 0}, cbk = {
            lk = 0x7f0440edc6e8 "", value = 0}} <repeats 59 times>}, count = {lk = 0x7f0440edca90 "", value = 0}}}, history = 0x0, 
  ctx = 0x55ceca7a5010, graph = 0x7f0cf774d400, itable = 0x0, init_succeeded = 0 '\000', private = 0x0, mem_acct = 0x7f0440eea4f0, winds = 0, 
  switched = 0 '\000', local_pool = 0x0, is_autoloaded = false, volfile_id = 0x0, xl_id = 1, op_version = {1, 0, 0, 0}, flags = 0, id = 0, 
  identifier = 0x7f4a0fbfa520 "posix", pass_through = false, pass_through_fops = 0x7f4a1e7f0c40 <_default_fops>, cleanup_starting = 0, 
  call_cleanup = 0, category = GF_MAINTAINED, xprtrefcnt = {lk = 0x7f0440edcb40 "", value = 0}, notify_down = 0, parent_down = 0}

The core dump shows the brick xlator name is "rep_rviudtsghmn2j". As we can see in the brick logs, the graph for brick rep_rviudtsghmn2j was not built successfully because a thread could not be spawned, after throwing the error "Cannot allocate memory".

>>>>>>>>>>>>

[2020-01-13 06:01:09.544728] I [login.c:110:gf_auth] 0-auth/login: allowed user names: 6d70754d-230b-494f-a981-fc3a31c632d0
[2020-01-13 06:01:09.544744] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-basevol_x3-server: accepted client from CTX_ID:0f899e45-573a-4983-be30-3b6c9a583a02-GRAPH_ID:0-PID:27959-HOST:rhsqa12.lab.eng.blr.redhat.com-PC_NAME:arb_bdxziwl7awxly-client-1-RECON_NO:-0 (version: 6.0) with subvol /gluster/brick1/arb_bdxziwl7awxly
[2020-01-13 06:01:09.640607] I [glusterfsd-mgmt.c:928:glusterfs_handle_attach] 0-glusterfs: got attach for /var/lib/glusterd/vols/rep_rviudtsghmn2j/rep_rviudtsghmn2j.rhsqa10.lab.eng.blr.redhat.com.gluster-brick1-rep_rviudtsghmn2j.vol
[2020-01-13 06:01:09.648163] I [socket.c:962:__socket_server_bind] 31293-socket.rep_rviudtsghmn2j-changelog: closing (AF_UNIX) reuse check socket 185
[2020-01-13 06:01:09.648414] I [MSGID: 121050] [ctr-helper.c:248:extract_ctr_options] 31293-gfdbdatastore: CTR Xlator is disabled.
[2020-01-13 06:01:09.648446] I [MSGID: 121050] [changetimerecorder.c:2144:init] 31293-gfdbdatastore: CTR Xlator is not enabled so skip ctr init
[2020-01-13 06:01:09.648489] I [trash.c:2450:init] 31293-rep_rviudtsghmn2j-trash: no option specified for 'eliminate', using NULL
[2020-01-13 06:01:09.649592] E [MSGID: 101072] [common-utils.c:4030:gf_thread_vcreate] 31293-rep_rviudtsghmn2j-posix: Thread creation failed [Resource temporarily unavailable]
[2020-01-13 06:01:09.649626] E [MSGID: 113075] [posix-helpers.c:2276:posix_spawn_health_check_thread] 31293-rep_rviudtsghmn2j-posix: unable to setup health-check thread [Cannot allocate memory]
[2020-01-13 06:01:09.649647] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-rep_rviudtsghmn2j-posix: Initialization of volume 'rep_rviudtsghmn2j-posix' failed, review your volfile again
[2020-01-13 06:01:09.649661] E [MSGID: 101066] [graph.c:362:glusterfs_graph_init] 0-rep_rviudtsghmn2j-posix: initializing translator failed
[2020-01-13 06:01:09.649674] W [graph.c:1235:glusterfs_graph_attach] 0-glusterfs: failed to initialize graph for xlator /gluster/brick1/rep_rviudtsghmn2j
[2020-01-13 06:01:09.649731] E [rpcsvc.c:631:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor (4867634:2:12) failed to complete successfully
[2020-01-13 06:01:09.906526] W [socket.c:781:__socket_rwv] 0-tcp.basevol_x3-server: readv on 10.70.44.10:45580 failed (No data available)
[2020-01-13 06:01:09.906526] W [socket.c:781:__socket_rwv] 0-tcp.basevol_x3-server: readv on 10.70.44.10:45588 failed (No data available)
[2020-01-13 06:01:09.906597] I [MSGID: 115036] [server.c:498:server_rpc_notify] 0-basevol_x3-server: disconnecting connection from CTX_ID:4f05dd6d-85f2-445d-87a2-59cb5d8b3e9f-GRAPH_ID:0-PID:61541-HOST:rhsqa10.lab.eng.blr.redhat.com-PC_NAME:arb_1ubdr7lvu47a0-client-2-RECON_NO:-0
[2020-01-13 06:01:09.906613] I [MSGID: 115036] [server.c:498:server_rpc_notify] 0-basevol_x3-server: disconnecting connection from CTX_ID:4f05dd6d-85f2-445d-87a2-59cb5d8b3e9f-GRAPH_ID:0-PID:61541-HOST:rhsqa10.lab.eng.blr.redhat.com-PC_NAME:arb_0swl6dsidwusm-client-1-RECON_NO:-0
[2020-01-13 06:01:09.906692] W [socket.c:781:__socket_rwv] 0-tcp.basevol_x3-server: readv on 10.70.44.10:45576 failed (No data available)
[2020-01-13 06:01:09.906744] I [MSGID: 115036] [server.c:498:server_rpc_notify] 0-basevol_x3-server: disconnecting connection from CTX_ID:4f05dd6d-85f2-445d-87a2-59cb5d8b3e9f-GRAPH_ID:0-PID:61541-HOST:rhsqa10.lab.eng.blr.redhat.com-PC_NAME:arb_2c7dd68qsbr7i-client-1-RECON_NO:-0
[2020-01-13 06:01:09.906741] W [socket.c:781:__socket_rwv] 0-tcp.basevol_x3-server: readv on 10.70.44.10:45568 failed (No data available)

>>>>>>>>>>>>>>>
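
To make the failure mode above easier to follow, below is a minimal, self-contained C sketch of the pattern. The struct and function names are simplified stand-ins, not the actual glusterfs posix xlator source: a background checker thread holds a pointer to a brick xlator whose init failed, so its private data is NULL, and dereferencing that pointer without a guard is the kind of access that produced the SIGSEGV at "if (!priv->disk_space_check_active)".

/*
 * Hedged sketch only: simplified stand-ins for xlator_t and the posix
 * private struct, illustrating why a NULL private pointer crashes a
 * background checker thread in a multiplexed brick process.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct posix_private {                /* stand-in for the real posix priv */
    int disk_space_check_active;
};

struct xlator {                       /* stand-in for xlator_t */
    const char *name;
    void *private_data;               /* NULL when init failed or fini ran */
};

static void *disk_space_check_thread(void *data)
{
    struct xlator *this = data;

    for (;;) {
        struct posix_private *priv = this->private_data;

        /* Defensive guard: without this check, a NULL priv makes the next
         * dereference fault, which matches the observed SIGSEGV. */
        if (priv == NULL) {
            fprintf(stderr, "%s: priv is NULL, stopping checker\n", this->name);
            return NULL;
        }
        if (!priv->disk_space_check_active)
            return NULL;
        sleep(1);                     /* real code sleeps a configured interval */
    }
}

int main(void)
{
    /* Simulate a brick whose init failed: private_data stays NULL,
     * but a checker thread referencing the xlator still runs. */
    struct xlator brick = { .name = "rep_xxxx-posix", .private_data = NULL };
    pthread_t tid;

    if (pthread_create(&tid, NULL, disk_space_check_thread, &brick) != 0)
        return EXIT_FAILURE;
    pthread_join(tid, NULL);
    return 0;
}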

After checking the script logs, brick memory consumption at that moment was around 19.2g; it seems the brick has some leak.

#### top output and volume count after creating additional 100 volumes and before deleting them in iteration 313
Mon Jan 13 11:50:53 IST 2020
 5248 root      20   0  291.6g  19.2g   5008 D  22.2 30.6   2301:06 glusterfsd
 4326 root      20   0  629532 133116   4192 S   5.6  0.2 317:11.07 glusterd
61877 root      20   0 1125156  16312   4316 S   0.0  0.0   0:00.43 glusterfsd
62166 root      20   0 1123908  16540   4336 S   0.0  0.0   0:00.40 glusterfsd
62630 root      20   0 2929700  43540   4348 S   0.0  0.1   0:00.90 glusterfsd
64488 root      20   0  266996   4004   2852 S   0.0  0.0   0:00.00 gluster
64490 root      20   0 4083240 152816   3204 S   0.0  0.2   0:03.51 glusterfs


To fix this I have to reproduce the issue and find the leaking code path; then I can provide the fix.

Thanks,
Mohit Agrawal

Comment 6 Mohit Agrawal 2020-01-16 12:17:55 UTC
There are some known leaks, and those existed in the previous release also.
It is a coincidence that the brick did not crash in previous releases when memory was unavailable, but I believe
it is not a regression. I am still testing the patch; it will take some time.
I will provide an update as soon as the testing finishes.

Two known patches specific to resolving a memory leak:
https://review.gluster.org/#/c/glusterfs/+/23642/
https://review.gluster.org/#/c/glusterfs/+/22738/

Comment 7 Mohit Agrawal 2020-01-17 05:06:42 UTC
Nag,

Can you please try to reproduce this on 3.5 to prove it is not a regression?
I believe the memory leak is not specific to 3.5.1?

Comment 8 Nag Pavan Chilakam 2020-01-17 05:47:45 UTC
(In reply to Mohit Agrawal from comment #7)
> Nag,
> 
> Can you please try to reproduce this on 3.5 to prove it is not a regression?
> I believe the memory leak is not specific to 3.5.1?

Mohit, as per previous test results, the test was run (and passed) on 3.5.0, and hence it is marked as a regression.

Comment 9 Mohit Agrawal 2020-01-20 09:15:59 UTC
Hi,

This is not a regression; on 3.5 also the brick process was consuming similar memory,
we only need to wait for the crash. It depends on the system, but after some time,
when a brick is not able to get memory it will also crash.

As we can see, on the same system on 3.5.1, for attempt 277 the brick usage was 17.6G:

grep -iw "volume count at start of iteration 277" 351_memleak-RHG3-13273.log -A 5
#### top output and volume count at start of iteration 277 
Mon Jan 13 01:10:25 IST 2020
 4326 root      20   0  624796 126136   5368 S   0.0  0.2 267:11.85 glusterd
 5248 root      20   0  255.3g  17.6g   4988 S   0.0 28.0   2013:52 glusterfsd
74732 root      20   0  704148   7740   3096 S   0.0  0.0   0:00.08 glusterfs
#### top output and volume count after creating additional 100 volumes and before deleting them in iteration 277


And on 3.5, where the current test case is running, consumption is 17.5g, which is almost the same:

#### top output and volume count at start of iteration 277 
Mon Jan 20 14:28:14 IST 2020
 4365 root      20   0  624796 125896   5416 S   0.0  0.2 262:45.19 glusterd
 5295 root      20   0  255.1g  17.5g   4996 S   0.0 27.9   1774:22 glusterfsd
74508 root      20   0  704148   7740   3092 S   0.0  0.0   0:00.08 glusterfs

This leak has existed in brick_mux from day 1. At this point in time it would be difficult to fix, because
regression testing is required to validate the patch.
The fix is in the core part: to avoid the leak we have to delete detached xlator objects completely, or we need to implement some other mechanism so that the brick does not crash if two threads access a detached xlator object at the same time.
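
As an illustration only (this is not the actual shipped fix), one common way to let worker threads and a detach path share such an object safely is a lock plus a "cleanup started" flag, checked before every use of the private pointer. All names in the C sketch below are invented for the example:

/*
 * Hedged sketch: a worker checks a cleanup flag under a lock before using
 * the brick's private data, and the detach path flips the flag before
 * freeing anything, so a detached object is never dereferenced.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct brick_ctx {
    pthread_mutex_t lock;
    bool cleanup_starting;            /* set when the brick is being detached */
    void *priv;                       /* freed only after workers back off */
};

/* One worker iteration: returns false once the brick is detached. */
static bool brick_worker_step(struct brick_ctx *ctx)
{
    bool keep_running = true;

    pthread_mutex_lock(&ctx->lock);
    if (ctx->cleanup_starting || ctx->priv == NULL)
        keep_running = false;         /* brick detached: stop cleanly */
    /* else: safe to use ctx->priv here, still under the lock */
    pthread_mutex_unlock(&ctx->lock);
    return keep_running;
}

/* Detach path: flag first, then free, so late workers see the flag. */
static void brick_detach(struct brick_ctx *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    ctx->cleanup_starting = true;
    pthread_mutex_unlock(&ctx->lock);

    /* In a real brick this is where worker threads would be joined. */
    free(ctx->priv);
    ctx->priv = NULL;
}

int main(void)
{
    /* Single-threaded simulation of the worker/detach interleaving. */
    struct brick_ctx ctx = { .lock = PTHREAD_MUTEX_INITIALIZER,
                             .cleanup_starting = false,
                             .priv = malloc(64) };

    while (brick_worker_step(&ctx))
        brick_detach(&ctx);           /* detach happens while the worker loops */

    pthread_mutex_destroy(&ctx.lock);
    return 0;
}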

@Nag,
Can you please remove this bug from the blocker list for 3.5.1?


Regards,
Mohit Agrawal

Comment 12 Mohit Agrawal 2020-01-21 06:46:55 UTC
Hi,

Yes, it is the same as I mentioned earlier.
As we can see in the mem-script logs, for attempt 318 new brick processes are spawned.

>>>>>>>>>>>>>

#### top output and volume count at start of iteration 318
Tue Jan 21 01:38:02 IST 2020
 4365 root      20   0  629916 132380   5436 S   0.0  0.2 314:55.83 glusterd
 5295 root      20   0  291.3g  19.2g   4996 S   0.0 30.6   2026:53 glusterfsd
38392 root      20   0  458364   9668   3064 S   0.0  0.0   0:00.05 glusterfs
#### top output and volume count after creating additional 100 volumes and before deleting them in iteration 318
Tue Jan 21 02:32:39 IST 2020
62047 root      20   0 3926132 151844   3156 S 177.8  0.2   0:04.14 glusterfs
39566 root      20   0   33.8g 777152   4712 S  11.1  1.2   2:32.47 glusterfsd
 4365 root      20   0  630172 131292   4236 S   0.0  0.2 315:19.46 glusterd
39334 root      20   0 1125156  24836   4556 S   0.0  0.0   0:02.58 glusterfsd
39396 root      20   0 1125156  24092   4564 S   0.0  0.0   0:02.65 glusterfsd
39463 root      20   0 1125156  22760   4560 S   0.0  0.0   0:02.39 glusterfsd
62046 root      20   0  146712   2084   1456 S   0.0  0.0   0:00.02 glusterfs
#### top output and volume count after deleting all the volumes created in  iteration 318
Tue Jan 21 02:40:34 IST 2020
#### top output and volume count at start of iteration 319
Tue Jan 21 02:40:34 IST 2020
 4365 root      20   0  630300 131316   4236 S   0.0  0.2 315:37.06 glusterd
78430 root      20   0  458364   7624   3068 S   0.0  0.0   0:00.06 glusterfs

>>>>>>>>>>>>

The logs also confirm that the brick crashed because it was not able to get memory.

>>>>>>>>>>>>>>>

[2020-01-20 20:08:13.142913] I [glusterfsd-mgmt.c:928:glusterfs_handle_attach] 0-glusterfs: got attach for /var/lib/glusterd/vols/rep_2szcvqrcdn6xt/rep_2szcvqrcdn6xt.rhsqa10.lab.eng.blr.redhat.com.gluster-brick1-rep_2szcvqrcdn6xt.vol
[2020-01-20 20:08:13.159599] A [MSGID: 0] [mem-pool.c:145:__gf_calloc] : no memory available for size (9437112) current memory usage in kilobytes 20114652 [call stack follows]
/lib64/libglusterfs.so.0(+0x271e0)[0x7fdff80811e0]

>>>>>>>>>>>>>>>

>>>>>>>>>>>>.
[2020-01-21 05:59:10.756378] I [glusterfsd-mgmt.c:928:glusterfs_handle_attach] 0-glusterfs: got attach for /var/lib/glusterd/vols/rep_pr9ynhpxpyk5s/rep_pr9ynhpxpyk5s.rhsqa12.lab.eng.blr.redhat.com.gluster-brick1-rep_pr9ynhpxpyk5s.vol
[2020-01-21 05:59:10.771123] A [MSGID: 0] [mem-pool.c:145:__gf_calloc] : no memory available for size (9437112) current memory usage in kilobytes 20425156 [call stack follows]
/lib64/libglusterfs.so.0(+0x271e0)[0x7f5e76f201e0]
/lib64/libglusterfs.so.0(_gf_msg_nomem+0x401)[0x7f5e76f207e1]
/lib64/libglusterfs.so.0(__gf_calloc+0xf8)[0x7f5e76f4be08]
/usr/lib64/glusterfs/6.0/xlator/debug/io-stats.so(+0xf425)[0x7f5e61320425]
/usr/lib64/glusterfs/6.0/xlator/debug/io-stats.so(+0xf4a5)[0x7f5e613204a5]
[2020-01-21 05:59:10.771711] E [io-stats.c:3955:init] 31573-rep_pr9ynhpxpyk5s-io-stats: Out of memory.

>>>>>>>>>>>>>>>>>>
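
For context, the lines above also show one case where an allocation failure was handled: the allocation wrapper logs "no memory available" and returns NULL, and the io-stats init bails out with "Out of memory" so the attach is rejected rather than crashing. A tiny illustrative C sketch of that pattern follows (the function names are made up and the 9437112-byte size is taken from the log only as an example, not from the glusterfs implementation):

/*
 * Hedged sketch: a logging calloc wrapper plus an init function that fails
 * cleanly on allocation failure instead of dereferencing a NULL pointer.
 */
#include <stdio.h>
#include <stdlib.h>

static void *xcalloc_logged(size_t nmemb, size_t size)
{
    void *p = calloc(nmemb, size);
    if (p == NULL)
        fprintf(stderr, "no memory available for size (%zu)\n", nmemb * size);
    return p;
}

static int iostats_like_init(void **out_priv)
{
    /* ~9 MB, roughly the allocation size reported in the log above */
    void *priv = xcalloc_logged(1, 9437112);
    if (priv == NULL) {
        fprintf(stderr, "init: Out of memory.\n");
        return -1;                    /* caller rejects the brick attach */
    }
    *out_priv = priv;
    return 0;
}

int main(void)
{
    void *priv = NULL;
    if (iostats_like_init(&priv) != 0)
        return EXIT_FAILURE;
    free(priv);
    return 0;
}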

Thanks,
Mohit Agrawal

Comment 15 Mohit Agrawal 2020-02-14 13:53:47 UTC
*** Bug 1653225 has been marked as a duplicate of this bug. ***

Comment 17 Yaniv Kaul 2020-04-30 12:59:19 UTC
(In reply to Mohit Agrawal from comment #12)


> >>>>>>>>>>>>>>>
> 
> [2020-01-20 20:08:13.142913] I
> [glusterfsd-mgmt.c:928:glusterfs_handle_attach] 0-glusterfs: got attach for
> /var/lib/glusterd/vols/rep_2szcvqrcdn6xt/rep_2szcvqrcdn6xt.rhsqa10.lab.eng.
> blr.redhat.com.gluster-brick1-rep_2szcvqrcdn6xt.vol
> [2020-01-20 20:08:13.159599] A [MSGID: 0] [mem-pool.c:145:__gf_calloc] : no
> memory available for size (9437112) current memory usage in kilobytes
> 20114652 [call stack follows]

What is trying to allocate so much memory (9437112 bytes)?
Does the io-stats xlator with so many bricks require that much memory?

Comment 29 Mohit Agrawal 2020-06-10 07:56:41 UTC
*** Bug 1761423 has been marked as a duplicate of this bug. ***

Comment 32 errata-xmlrpc 2020-12-17 04:51:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5603

