Bug 1637743 - Glusterd seems to be attempting to start the same brick process twice
Summary: Glusterd seems to be attempting to start the same brick process twice
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-09 22:35 UTC by Shyamsundar
Modified: 2020-02-24 11:00 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-24 11:00:31 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Shyamsundar 2018-10-09 22:35:49 UTC
Description of problem:

A brick process log indicates that the same CLI arguments are being processed twice, and hence that the same brick is being started twice, possibly in different processes. This leads to the second instance of the changelog xlator failing to bind to its socket (or failing to initialize its RPC session), resulting in xlator init failure. This in turn causes glusterfs_process_volfp to fail and hence to call cleanup_and_exit.

When the above happens, it causes several SEGV-induced thread faults. Because the xlators have been unloaded and their symbols are missing, facilities like gcov and backtrace themselves SEGV when address-to-symbol resolution fails.

One delta is that the CLI arguments for the second brick instance carry a different brick port, but all other arguments (including the log file) are the same.

It is entirely possible that two processes are using the same log file, or that two instances of the same brick are being started within one process (with different ports). What is causing this needs further investigation in order to address the resultant cores.

### Tests where this pattern of failure is noticed ###
https://build.gluster.org/job/line-coverage/530/consoleText
  - Test: ./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
https://build.gluster.org/job/regression-test-burn-in/4127/consoleText
  - Test: ./tests/bugs/glusterd/quorum-validation.t

Comment 1 Shyamsundar 2018-10-09 22:36:13 UTC
### Log analysis ###
###### Logs pointing to changelog initialization failure ######
[2018-10-06 18:20:11.032759] E [socket.c:905:__socket_server_bind] 0-socket.0627e719755f46f294fee9f266f72d60-changelog: binding to  failed: Address already in use
[2018-10-06 18:20:11.032784] E [socket.c:907:__socket_server_bind] 0-socket.0627e719755f46f294fee9f266f72d60-changelog: Port is already in use
[2018-10-06 18:20:11.032804] W [rpcsvc.c:1795:rpcsvc_create_listener] 0-rpc-service: listening on transport failed

(other instance)
[2018-10-05 16:57:15.646645] E [socket.c:905:__socket_server_bind] 0-socket.patchy-changelog: binding to  failed: Address already in use
[2018-10-05 16:57:15.646667] E [socket.c:907:__socket_server_bind] 0-socket.patchy-changelog: Port is already in use
[2018-10-05 16:57:15.646665] I [MSGID: 121050] [changetimerecorder.c:2145:init] 0-gfdbdatastore: CTR Xlator is not enabled so skip ctr init
[2018-10-05 16:57:15.646686] W [rpcsvc.c:1795:rpcsvc_create_listener] 0-rpc-service: listening on transport failed

###### Startup sequence logs ######
[2018-10-05 16:57:12.383081] W [glusterfsd.c:1481:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7fa7d12bbe25] -->/build/install/sbin/glusterfsd(glusterfs_sigwaiter+0xe6) [0x40a7da] -->/build/install/sbin/glusterfsd(cleanup_and_exit+0x88) [0x408b07] ) 0-: received signum (15), shutting down

[2018-10-05 16:57:13.395291]:++++++++++ G_LOG:./tests/bugs/glusterd/quorum-validation.t: TEST: 95 0 cluster_brick_up_status 1 patchy 127.1.1.1 /d/backends/1/patchy0 ++++++++++

[2018-10-05 16:57:13.497347]:++++++++++ G_LOG:./tests/bugs/glusterd/quorum-validation.t: TEST: 97 glusterd --xlator-option management.working-directory=/d/backends/2/glusterd --xlator-option management.transport.socket.bind-address=127.1.1.2 --xlator-option management.run-directory=/d/backends/2/run/gluster --xlator-option management.glusterd-sockfile=/d/backends/2/glusterd/gd.sock --log-file=/var/log/glusterfs/quorum-validation.t_glusterd2.log --pid-file=/d/backends/2/glusterd.pid ++++++++++

[2018-10-05 16:57:15.447758]:++++++++++ G_LOG:./tests/bugs/glusterd/quorum-validation.t: TEST: 98 1 cluster_brick_up_status 1 patchy 127.1.1.1 /d/backends/1/patchy0 ++++++++++

<CLI instance 1>
[2018-10-05 16:57:15.544023] I [MSGID: 100030] [glusterfsd.c:2691:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 6dev (args: /build/install/sbin/glusterfsd -s 127.1.1.1 --volfile-id patchy.127.1.1.1.d-backends-1-patchy0 -p /d/backends/1/run/gluster/vols/patchy/127.1.1.1-d-backends-1-patchy0.pid -S /var/run/gluster/f96de9a0bebaefd6.socket --brick-name /d/backends/1/patchy0 -l /var/log/glusterfs/bricks/d-backends-1-patchy0.log --xlator-option *-posix.glusterd-uuid=c8779063-a544-453b-925f-6eaf388bb107 --process-name brick --brick-port 49154 --xlator-option patchy-server.listen-port=49154 --xlator-option transport.socket.bind-address=127.1.1.1)

[2018-10-05 16:57:15.552571] W [MSGID: 101002] [options.c:975:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction

[2018-10-05 16:57:15.552961] I [dict.c:1178:data_to_uint16] (-->/build/install/lib/glusterfs/6dev/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x1dc) [0x7ff8ab57025a] -->/build/install/lib/glusterfs/6dev/rpc-transport/socket.so(+0x12773) [0x7ff8ab56f773] -->/build/install/lib/libglusterfs.so.0(data_to_uint16+0x11a) [0x7ff8b698adfb] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]

[2018-10-05 16:57:15.557053] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1

<CLI instance 2>
[2018-10-05 16:57:15.599487] I [MSGID: 100030] [glusterfsd.c:2691:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 6dev (args: /build/install/sbin/glusterfsd -s 127.1.1.1 --volfile-id patchy.127.1.1.1.d-backends-1-patchy0 -p /d/backends/1/run/gluster/vols/patchy/127.1.1.1-d-backends-1-patchy0.pid -S /var/run/gluster/f96de9a0bebaefd6.socket --brick-name /d/backends/1/patchy0 -l /var/log/glusterfs/bricks/d-backends-1-patchy0.log --xlator-option *-posix.glusterd-uuid=c8779063-a544-453b-925f-6eaf388bb107 --process-name brick --brick-port 49155 --xlator-option patchy-server.listen-port=49155 --xlator-option transport.socket.bind-address=127.1.1.1)

[2018-10-05 16:57:15.607454] W [MSGID: 101002] [options.c:975:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction

[2018-10-05 16:57:15.607880] I [dict.c:1178:data_to_uint16] (-->/build/install/lib/glusterfs/6dev/rpc-transport/socket.so(socket_client_get_remote_sockaddr+0x1dc) [0x7f26f7afc25a] -->/build/install/lib/glusterfs/6dev/rpc-transport/socket.so(+0x12773) [0x7f26f7afb773] -->/build/install/lib/libglusterfs.so.0(data_to_uint16+0x11a) [0x7f2702f16dfb] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]

[2018-10-05 16:57:15.612344] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1

<NOTE: Double dlsym of various xlator shared libraries>
[2018-10-05 16:57:15.639477] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /build/install/lib/glusterfs/6dev/xlator/storage/posix.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-10-05 16:57:15.639665] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /build/install/lib/glusterfs/6dev/xlator/features/trash.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-10-05 16:57:15.639835] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /build/install/lib/glusterfs/6dev/xlator/storage/posix.so: undefined symbol: xlator_api. Fall back to old symbols
[2018-10-05 16:57:15.640019] I [MSGID: 101097] [xlator.c:341:xlator_dynload_newway] 0-xlator: dlsym(xlator_api) on /build/install/lib/glusterfs/6dev/xlator/features/trash.so: undefined symbol: xlator_api. Fall back to old symbols

Comment 2 Shyamsundar 2018-10-09 22:36:33 UTC
### Basic core pattern where symbol resolution fails ###

Test: https://build.gluster.org/job/line-coverage/530/consoleText

Thread 1 (Thread 0x7fa197877700 (LWP 25856)):
#0  x86_64_fallback_frame_state (context=0x7fa197875c70, context=0x7fa197875c70, fs=0x7fa197875d60) at ./md-unwind-support.h:58
#1  uw_frame_state_for (context=context@entry=0x7fa197875c70, fs=fs@entry=0x7fa197875d60) at ../../../libgcc/unwind-dw2.c:1253
#2  0x00007fa1a26a6fb9 in _Unwind_Backtrace (trace=0x7fa1adabbaa0 <backtrace_helper>, trace_argument=0x7fa197875f20) at ../../../libgcc/unwind.inc:290
#3  0x00007fa1adabbc16 in __GI___backtrace (array=<optimized out>, size=200) at ../sysdeps/x86_64/backtrace.c:109
#4  0x00007fa1af423eda in _gf_msg_backtrace_nomem (level=GF_LOG_ALERT, stacksize=200) at logging.c:1115
#5  0x00007fa1af431717 in gf_print_trace (signum=11, ctx=0x159a010) at common-utils.c:954
#6  0x000000000040e09c in glusterfsd_print_trace (signum=11) at glusterfsd.c:2316
#7  <signal handler called>
#8  0x00007fa1a084b0d3 in ?? ()
#9  0x00007fa1ae1dce25 in start_thread (arg=0x7fa197877700) at pthread_create.c:308
#10 0x00007fa1adaa5bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) disassemble 
Dump of assembler code for function uw_frame_state_for:
...
   0x00007fa1a26a60ac <+1068>:	mov    0x90(%r13),%rdx
   0x00007fa1a26a60b3 <+1075>:	mov    $0x5,%eax
=> 0x00007fa1a26a60b8 <+1080>:	cmpb   $0x48,(%rcx)
   0x00007fa1a26a60bb <+1083>:	jne    0x7fa1a26a5e8d <uw_frame_state_for+525>
...

(gdb) info registers 
rax            0x5	5
rbx            0x7fa197875c70	140332008692848
rcx            0x7fa1a084b0d3	140332159512787
...

(gdb) x/x $rcx
0x7fa1a084b0d3:	Cannot access memory at address 0x7fa1a084b0d3

(gdb) t 1
[Switching to thread 1 (Thread 0x7fa197877700 (LWP 25856))]
#0  x86_64_fallback_frame_state (context=0x7fa197875c70, context=0x7fa197875c70, fs=0x7fa197875d60) at ./md-unwind-support.h:58
58	  if (*(unsigned char *)(pc+0) == 0x48
(gdb) info locals 
pc = 0x7fa1a084b0d3 <Address 0x7fa1a084b0d3 out of bounds>
sc = <optimized out>
new_cfa = <optimized out>

Test: https://build.gluster.org/job/regression-test-burn-in/4127/consoleText

Core was generated by `/build/install/sbin/glusterfsd -s 127.1.1.1 --volfile-id patchy.127.1.1.1.d-bac'.
Program terminated with signal 11, Segmentation fault.
#0  x86_64_fallback_frame_state (context=0x7f26e9ea9c70, context=0x7f26e9ea9c70, fs=0x7f26e9ea9d60) at ./md-unwind-support.h:58
58	  if (*(unsigned char *)(pc+0) == 0x48
(gdb) bt
#0  x86_64_fallback_frame_state (context=0x7f26e9ea9c70, context=0x7f26e9ea9c70, fs=0x7f26e9ea9d60) at ./md-unwind-support.h:58
#1  uw_frame_state_for (context=context@entry=0x7f26e9ea9c70, fs=fs@entry=0x7f26e9ea9d60) at ../../../libgcc/unwind-dw2.c:1253
#2  0x00007f26f6250fb9 in _Unwind_Backtrace (trace=0x7f270164faa0 <backtrace_helper>, trace_argument=0x7f26e9ea9f20) at ../../../libgcc/unwind.inc:290
#3  0x00007f270164fc16 in __GI___backtrace (array=<optimized out>, size=200) at ../sysdeps/x86_64/backtrace.c:109
#4  0x00007f2702f2429b in _gf_msg_backtrace_nomem (level=GF_LOG_ALERT, stacksize=200) at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/logging.c:1115
#5  0x00007f2702f2d1dd in gf_print_trace (signum=11, ctx=0x125a010) at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/common-utils.c:954
#6  0x000000000040a84a in glusterfsd_print_trace (signum=11) at /home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd.c:2316
#7  <signal handler called>
#8  0x00007f26f4497a07 in ?? ()
#9  0x00007f2701d70e25 in start_thread (arg=0x7f26e9eab700) at pthread_create.c:308
#10 0x00007f2701639bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) info locals 
pc = 0x7f26f4497a07 <Address 0x7f26f4497a07 out of bounds>
sc = <optimized out>
new_cfa = <optimized out>

Comment 3 Shyamsundar 2018-10-09 22:36:48 UTC
Other threads of interest (showing cleanup sequence etc. in the above core):

Thread 12 (Thread 0x7f26f602d700 (LWP 15527)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f2701d72e1b in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007f2701d72ce8 in __GI___pthread_mutex_lock (mutex=0x125a2f8) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f2702f269d0 in _gf_msg (domain=0x414457 "", file=0x413b58 "/home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd.c", 
    function=0x4154d0 <__FUNCTION__.22322> "cleanup_and_exit", line=1481, level=GF_LOG_WARNING, errnum=0, trace=1, msgid=100032, fmt=0x414a18 "received signum (%d), shutting down")
    at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/logging.c:2097
#4  0x0000000000408b07 in cleanup_and_exit (signum=-1) at /home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd.c:1480
#5  0x000000000040affe in glusterfs_process_volfp (ctx=0x125a010, fp=0x7f26f0003850) at /home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd.c:2545
#6  0x000000000041094c in mgmt_getspec_cbk (req=0x7f26f0002028, iov=0x7f26f0002068, count=1, myframe=0x7f26f00015a8) at /home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd-mgmt.c:2076
#7  0x00007f2702ce717d in rpc_clnt_handle_reply (clnt=0x12db800, pollin=0x7f26f0002610) at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-lib/src/rpc-clnt.c:755
#8  0x00007f2702ce76ad in rpc_clnt_notify (trans=0x12dbb60, mydata=0x12db830, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f26f0002610)
    at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-lib/src/rpc-clnt.c:923
#9  0x00007f2702ce3895 in rpc_transport_notify (this=0x12dbb60, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f26f0002610) at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-lib/src/rpc-transport.c:525
#10 0x00007f26f7af5547 in socket_event_poll_in (this=0x12dbb60, notify_handled=true) at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-transport/socket/src/socket.c:2523
#11 0x00007f26f7af6528 in socket_event_handler (fd=9, idx=1, gen=1, data=0x12dbb60, poll_in=1, poll_out=0, poll_err=0)
    at /home/jenkins/root/workspace/regression-test-burn-in/rpc/rpc-transport/socket/src/socket.c:2924
#12 0x00007f2702f9e3d6 in event_dispatch_epoll_handler (event_pool=0x1291c40, event=0x7f26f602cea0) at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/event-epoll.c:591
#13 0x00007f2702f9e6cf in event_dispatch_epoll_worker (data=0x12dd6d0) at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/event-epoll.c:668
#14 0x00007f2701d70e25 in start_thread (arg=0x7f26f602d700) at pthread_create.c:308
#15 0x00007f2701639bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 9 (Thread 0x7f26ebbb3700 (LWP 15539)):
#0  _int_malloc (av=av@entry=0x7f26dc000020, bytes=bytes@entry=8192) at malloc.c:3784
#1  0x00007f27015c12c4 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3245
#2  0x00007f27015b2863 in __GI_open_memstream (bufloc=bufloc@entry=0x7f26ebbb23a0, sizeloc=sizeloc@entry=0x7f26ebbb23b0) at memstream.c:85
#3  0x00007f27016336fa in __GI___vsyslog_chk (pri=25, flag=flag@entry=-1, fmt=0x7f2703000c4e "%s\n", ap=ap@entry=0x7f26ebbb2488) at ../misc/syslog.c:167
#4  0x00007f2701633c5f in __syslog (pri=<optimized out>, fmt=<optimized out>) at ../misc/syslog.c:117
#5  0x00007f2702f23f0b in _gf_msg_plain_internal (level=GF_LOG_ALERT, msg=0x7f2703001b2a "pending frames:") at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/logging.c:998
#6  0x00007f2702f241dd in _gf_msg_plain_nomem (level=GF_LOG_ALERT, msg=0x7f2703001b2a "pending frames:") at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/logging.c:1087
#7  0x00007f2702f2d038 in gf_print_trace (signum=11, ctx=0x125a010) at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/common-utils.c:925
#8  0x000000000040a84a in glusterfsd_print_trace (signum=11) at /home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd.c:2316
#9  <signal handler called>
#10 0x00007f26f42573aa in ?? ()
#11 0x00007f2701d70e25 in start_thread (arg=0x7f26ebbb3700) at pthread_create.c:308
#12 0x00007f2701639bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f26ec3b4700 (LWP 15538)):
#0  0x00007f270163406a in __mmap (addr=0x0, len=134217728, prot=<optimized out>, flags=16418, fd=<optimized out>, offset=0) at ../sysdeps/unix/sysv/linux/wordsize-64/mmap.c:34
#1  0x00007f27015ba011 in new_heap (size=135168, size@entry=2248, top_pad=<optimized out>) at arena.c:557
#2  0x00007f27015baa44 in _int_new_arena (size=8) at arena.c:752
#3  arena_get2 (size=size@entry=8, avoid_arena=avoid_arena@entry=0x0, a_tsd=<optimized out>) at arena.c:991
#4  0x00007f27015c132a in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3215
#5  0x00007f2702f64fe5 in __gf_default_calloc (cnt=1, size=8) at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/mem-pool.h:118
#6  0x00007f2702f653c4 in __glusterfs_this_location () at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/globals.c:252
#7  0x00007f2702f22405 in gf_log_flush () at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/logging.c:170
#8  0x00007f2702f2d018 in gf_print_trace (signum=11, ctx=0x125a010) at /home/jenkins/root/workspace/regression-test-burn-in/libglusterfs/src/common-utils.c:920
#9  0x000000000040a84a in glusterfsd_print_trace (signum=11) at /home/jenkins/root/workspace/regression-test-burn-in/glusterfsd/src/glusterfsd.c:2316
#10 <signal handler called>
#11 0x00007f26f425be0b in ?? ()
#12 0x00007f2701d70e25 in start_thread (arg=0x7f26ec3b4700) at pthread_create.c:308
#13 0x00007f2701639bad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Comment 4 Shyamsundar 2018-10-09 22:37:00 UTC
### Other observations in the code ###

- In gf_log_flush, attempt the flush only if THIS is already allocated, else skip (otherwise THIS invokes allocation routines, which we want to avoid in signal handlers)

- _gf_msg_plain_internal should not invoke syslog, as syslog allocates memory (a problem when called from a signal handler)

- The signal disposition should be reset to default as soon as the handler starts addressing the signal, rather than at the end, to avoid cascading signals

- cleanup_and_exit could reset signal dispositions to default and generate a core on failure, rather than causing cascading cleanup SEGV signals when such events occur

Comment 5 Shyamsundar 2018-10-09 22:43:32 UTC
Attempts to induce such failures, such as starting the same brick process with different ports, or forcing init failures in the changelog xlator in code, did not reproduce the crash in limited testing.

We need some questions answered:
- Are the above observations correct?
- Is the same brick instance being started twice against the same log file?
- If so, can this be root-caused, or induced, to see whether the process crashes?

Comment 6 Sanju 2020-02-24 11:00:31 UTC
This issue is not seen with latest master, closing the bug.

