Bug 1744420 - glusterd crashing with core dump on the latest nightly builds.
Summary: glusterd crashing with core dump on the latest nightly builds.
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Sanju
QA Contact: Kshithij Iyer
URL:
Whiteboard:
Depends On: 1745965
Blocks:
 
Reported: 2019-08-22 07:08 UTC by Kshithij Iyer
Modified: 2019-08-28 02:17 UTC (History)
5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-08-28 02:17:16 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kshithij Iyer 2019-08-22 07:08:35 UTC
Description of problem:
While trying to verify glusto-tests patches, we observed that glusterd fails to start during installation. This has been seen with the last 2 nightly builds.

https://ci.centos.org/job/gluster_glusto-patch-check/1537/consoleFull 

# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-08-21 12:19:20 IST; 24h ago
     Docs: man:glusterd(8)
  Process: 13852 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=1/FAILURE)

Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com glusterd[13853]: spinlock 1
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com glusterd[13853]: epoll.h 1
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com glusterd[13853]: xattr.h 1
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com glusterd[13853]: st_atim.tv_nsec 1
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com glusterd[13853]: package-string: glusterfs 20190820.95f71df
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com glusterd[13853]: ---------
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com systemd[1]: glusterd.service: control process exited, code=exited status=1
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com systemd[1]: Failed to start GlusterFS, a clustered file-system server.
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com systemd[1]: Unit glusterd.service entered failed state.
Aug 21 12:19:20 dhcp35-114.lab.eng.blr.redhat.com systemd[1]: glusterd.service failed.

Program terminated with signal 6, Aborted.
#0  0x00007fcb85868207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt 
#0  0x00007fcb85868207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fcb858698f8 in __GI_abort () at abort.c:90
#2  0x00007fcb858aad27 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fcb859bb312 "*** %s ***: %s terminated\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fcb859499e7 in __GI___fortify_fail (msg=msg@entry=0x7fcb859bb2b8 "buffer overflow detected") at fortify_fail.c:30
#4  0x00007fcb85947b62 in __GI___chk_fail () at chk_fail.c:28
#5  0x00007fcb8594727b in ___vsnprintf_chk (s=<optimized out>, maxlen=<optimized out>, flags=<optimized out>, slen=<optimized out>, format=<optimized out>, 
    args=args@entry=0x7fffd7e97708) at vsnprintf_chk.c:37
#6  0x00007fcb85947198 in ___snprintf_chk (s=s@entry=0x7fffd7e97a40 "", maxlen=maxlen@entry=4096, flags=flags@entry=1, slen=slen@entry=3776, 
    format=format@entry=0x7fcb7b48dd4b "%s") at snprintf_chk.c:35
#7  0x00007fcb7b34efb9 in snprintf (__fmt=0x7fcb7b48dd4b "%s", __n=4096, __s=0x7fffd7e97a40 "") at /usr/include/bits/stdio2.h:64
#8  init (this=0x561402cb9520) at glusterd.c:1450
#9  0x00007fcb87222ea1 in __xlator_init (xl=0x561402cb9520) at xlator.c:597
#10 xlator_init (xl=xl@entry=0x561402cb9520) at xlator.c:623
#11 0x00007fcb8725fb29 in glusterfs_graph_init (graph=graph@entry=0x561402cb50f0) at graph.c:422
#12 0x00007fcb87260195 in glusterfs_graph_activate (graph=graph@entry=0x561402cb50f0, ctx=ctx@entry=0x561402c70010) at graph.c:776
#13 0x00005614017c3182 in glusterfs_process_volfp (ctx=ctx@entry=0x561402c70010, fp=fp@entry=0x561402cb4e70) at glusterfsd.c:2728
#14 0x00005614017c333d in glusterfs_volumes_init (ctx=ctx@entry=0x561402c70010) at glusterfsd.c:2800
#15 0x00005614017bea3a in main (argc=4, argv=<optimized out>) at glusterfsd.c:2962
(gdb) t a a bt

Thread 7 (Thread 0x7fcb7c002700 (LWP 13727)):
#0  0x00007fcb85926f73 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fcb872a4224 in runner (arg=0x561402cb2bf0) at ../../contrib/timer-wheel/timer-wheel.c:186
#2  0x00007fcb86067dd5 in start_thread (arg=0x7fcb7c002700) at pthread_create.c:307
#3  0x00007fcb8592fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7fcb7e807700 (LWP 13722)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007fcb872337ab in gf_timer_proc (data=0x561402cae280) at timer.c:140
#2  0x00007fcb86067dd5 in start_thread (arg=0x7fcb7e807700) at pthread_create.c:307
#3  0x00007fcb8592fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7fcb7e006700 (LWP 13723)):
#0  0x00007fcb8606f361 in do_sigwait (sig=0x7fcb7e0050dc, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:60
#1  __sigwait (set=set@entry=0x7fcb7e0050e0, sig=sig@entry=0x7fcb7e0050dc) at ../sysdeps/unix/sysv/linux/sigwait.c:95
#2  0x00005614017c277b in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2463
#3  0x00007fcb86067dd5 in start_thread (arg=0x7fcb7e006700) at pthread_create.c:307
#4  0x00007fcb8592fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fcb7d805700 (LWP 13724)):
#0  0x00007fcb858f6e2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fcb858f6cc4 in __sleep (seconds=0, seconds@entry=30) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x00007fcb87250868 in pool_sweeper (arg=<optimized out>) at mem-pool.c:446
#3  0x00007fcb86067dd5 in start_thread (arg=0x7fcb7d805700) at pthread_create.c:307
#4  0x00007fcb8592fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7fcb7d004700 (LWP 13725)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007fcb872659b0 in syncenv_task (proc=proc@entry=0x561402caea70) at syncop.c:517
#2  0x00007fcb87266860 in syncenv_processor (thdata=0x561402caea70) at syncop.c:584
#3  0x00007fcb86067dd5 in start_thread (arg=0x7fcb7d004700) at pthread_create.c:307
#4  0x00007fcb8592fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7fcb7c803700 (LWP 13726)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007fcb872659b0 in syncenv_task (proc=proc@entry=0x561402caee30) at syncop.c:517
#2  0x00007fcb87266860 in syncenv_processor (thdata=0x561402caee30) at syncop.c:584
#3  0x00007fcb86067dd5 in start_thread (arg=0x7fcb7c803700) at pthread_create.c:307
#4  0x00007fcb8592fead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fcb8772a4c0 (LWP 13721)):
#0  0x00007fcb85868207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fcb858698f8 in __GI_abort () at abort.c:90
#2  0x00007fcb858aad27 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fcb859bb312 "*** %s ***: %s terminated\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007fcb859499e7 in __GI___fortify_fail (msg=msg@entry=0x7fcb859bb2b8 "buffer overflow detected") at fortify_fail.c:30
#4  0x00007fcb85947b62 in __GI___chk_fail () at chk_fail.c:28
#5  0x00007fcb8594727b in ___vsnprintf_chk (s=<optimized out>, maxlen=<optimized out>, flags=<optimized out>, slen=<optimized out>, format=<optimized out>, 
    args=args@entry=0x7fffd7e97708) at vsnprintf_chk.c:37
#6  0x00007fcb85947198 in ___snprintf_chk (s=s@entry=0x7fffd7e97a40 "", maxlen=maxlen@entry=4096, flags=flags@entry=1, slen=slen@entry=3776, 
    format=format@entry=0x7fcb7b48dd4b "%s") at snprintf_chk.c:35
#7  0x00007fcb7b34efb9 in snprintf (__fmt=0x7fcb7b48dd4b "%s", __n=4096, __s=0x7fffd7e97a40 "") at /usr/include/bits/stdio2.h:64
#8  init (this=0x561402cb9520) at glusterd.c:1450
#9  0x00007fcb87222ea1 in __xlator_init (xl=0x561402cb9520) at xlator.c:597
#10 xlator_init (xl=xl@entry=0x561402cb9520) at xlator.c:623
#11 0x00007fcb8725fb29 in glusterfs_graph_init (graph=graph@entry=0x561402cb50f0) at graph.c:422
#12 0x00007fcb87260195 in glusterfs_graph_activate (graph=graph@entry=0x561402cb50f0, ctx=ctx@entry=0x561402c70010) at graph.c:776
#13 0x00005614017c3182 in glusterfs_process_volfp (ctx=ctx@entry=0x561402c70010, fp=fp@entry=0x561402cb4e70) at glusterfsd.c:2728
#14 0x00005614017c333d in glusterfs_volumes_init (ctx=ctx@entry=0x561402c70010) at glusterfsd.c:2800
#15 0x00005614017bea3a in main (argc=4, argv=<optimized out>) at glusterfsd.c:2962

Version-Release number of selected component (if applicable):
Upstream mainline nightly (package-string: glusterfs 20190820.95f71df).

How reproducible:
Always

Steps to Reproduce:
1. service glusterd start

Actual results:
glusterd crashes with a core dump (SIGABRT, "buffer overflow detected").

Expected results:
glusterd should start normally; it shouldn't crash or leave core files.

Additional info:

Comment 1 Sanju 2019-08-22 09:22:28 UTC
Kshithij,

Can you please list the full reproducer steps, i.e. what was performed on the cluster before glusterd crashed?

I'm not sure, but this might be related to the bugs filed under the shd-multiplexing feature (saying this only because this is also a SIGABRT).

Thanks,
Sanju

Comment 2 Kshithij Iyer 2019-08-22 09:25:48 UTC
(In reply to Sanju from comment #1)
> Kshithij,
> 
> Can you please mention all steps of reproducer? i.e, what are the steps
> performed on the cluster before glusterd crashed?

Just installed glusterfs using the nightly builds and tried to start glusterd. That's it! I didn't perform any other steps.


> I'm not sure but this might be having a relationship with the bugs that are
> filed under shd-multiplexing feature (saying this just because this is also
> a sigabrt).
> 
> Thanks,
> Sanju

Comment 3 Amar Tumballi 2019-08-23 13:18:13 UTC
I too saw this last week. Kshithij, can you try restarting glusterd? It worked fine for me after a restart.

Comment 4 Kshithij Iyer 2019-08-23 13:22:41 UTC
(In reply to Amar Tumballi from comment #3)
> I too saw this last week. Kshithij, can you try restarting glusterd? It
> worked fine for me after a restart.

I tried restarting as well; it didn't help, Amar.

Comment 5 Atin Mukherjee 2019-08-26 03:21:58 UTC
So does this mean it happens every time we try to start glusterd? If so, can you please pass along the setup (ping me offline)?

Comment 6 Kshithij Iyer 2019-08-26 04:54:03 UTC
(In reply to Atin Mukherjee from comment #5)
> So does this mean it happens every time we try to start glusterd? 

Yes! This happens every time.

> If so can
> you please pass the setup (ping offline) ?

Have shared the details with you offline.

Comment 7 Atin Mukherjee 2019-08-27 12:06:04 UTC
Fix posted https://review.gluster.org/#/c/glusterfs/+/23309

