glusterfsd on one of the bricks just crashed and it can't be started again. What can I do to start it again?

crash dump (gdb):

Program terminated with signal SIGSEGV, Segmentation fault.
#0  up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
239         local = upcall_local_init (frame, this, NULL, NULL, fd->inode, NULL);
[Current thread is 1 (Thread 0x7feb0031e700 (LWP 12319))]
(gdb) bt
#0  up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
#1  0x00007feb3e1cf65d in default_lk_resume (frame=0x7feb0d174ae0, this=0x7feb3401e060, fd=0x0, cmd=6, lock=0x7feb0d174d40, xdata=0x0) at defaults.c:1833
#2  0x00007feb3e166f35 in call_resume (stub=0x7feb0d174bf0) at call-stub.c:2508
#3  0x00007feb31e00d74 in iot_worker (data=0x7feb34058480) at io-threads.c:222
#4  0x00007feb3d8ca6ba in start_thread (arg=0x7feb0031e700) at pthread_create.c:333
#5  0x00007feb3d60041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) bt full
#0  up_lk (frame=0x7fea88193f30, this=0x7feb3401c770, fd=0x0, cmd=6, flock=0x7feb0d174d40, xdata=0x0) at upcall.c:239
        op_errno = -1
        local = 0x0
        __FUNCTION__ = "up_lk"
#1  0x00007feb3e1cf65d in default_lk_resume (frame=0x7feb0d174ae0, this=0x7feb3401e060, fd=0x0, cmd=6, lock=0x7feb0d174d40, xdata=0x0) at defaults.c:1833
        _new = 0x7fea88193f30
        old_THIS = 0x7feb3401e060
        tmp_cbk = 0x7feb3e1bafa0 <default_lk_cbk>
        __FUNCTION__ = "default_lk_resume"
#2  0x00007feb3e166f35 in call_resume (stub=0x7feb0d174bf0) at call-stub.c:2508
        old_THIS = 0x7feb3401e060
        __FUNCTION__ = "call_resume"
#3  0x00007feb31e00d74 in iot_worker (data=0x7feb34058480) at io-threads.c:222
        conf = 0x7feb34058480
        this = <optimized out>
        stub = 0x7feb0d174bf0
        sleep_till = {tv_sec = 1556637893, tv_nsec = 0}
        ret = <optimized out>
        pri = 1
        bye = _gf_false
        __FUNCTION__ = "iot_worker"
#4  0x00007feb3d8ca6ba in start_thread (arg=0x7feb0031e700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb0031e700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647297312512, 5756482990956014801, 0, 140648089937359, 140647297313216, 140648166818944, -5749651260269466415, -5749590536105693999}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#5  0x00007feb3d60041d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb)

# config
# gluster volume info

Volume Name: hadoop_volume
Type: Disperse
Volume ID: f13b43b0-ff9e-429b-81ed-15c92cdd1181
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: hdd1:/hadoop
Brick2: hdd2:/hadoop
Brick3: hdd3:/hadoop
Options Reconfigured:
cluster.disperse-self-heal-daemon: enable
server.statedump-path: /tmp
performance.client-io-threads: on
server.event-threads: 16
client.event-threads: 16
cluster.lookup-optimize: on
performance.parallel-readdir: on
transport.address-family: inet
nfs.disable: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 500000
features.lock-heal: on

# status
# gluster volume status

Status of volume: hadoop_volume
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick hdd1:/hadoop                          49152     0          Y       5085
Brick hdd2:/hadoop                          49152     0          Y       4044
Self-heal Daemon on localhost               N/A       N/A        Y       2383
Self-heal Daemon on serv3                   N/A       N/A        Y       2423
Self-heal Daemon on serv2                   N/A       N/A        Y       3429
Self-heal Daemon on hdd2                    N/A       N/A        Y       4035
Self-heal Daemon on hdd1                    N/A       N/A        Y       5076

Task Status of Volume hadoop_volume
------------------------------------------------------------------------------
There are no active volume tasks
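The backtrace above shows up_lk() being invoked with fd=0x0 and then dereferencing fd->inode at upcall.c:239, which is a straight NULL-pointer dereference. Below is a minimal standalone sketch of that failure pattern, not the actual GlusterFS source: the sketch_* types and functions are simplified stand-ins, and the NULL check is only a hypothetical guard illustrating how the dereference could be avoided at that call site.

/* Sketch of the crash seen in the backtrace: up_lk() receives fd == NULL
 * from the io-threads resume path and dereferences fd->inode.
 * All names below are simplified stand-ins, not GlusterFS symbols. */
#include <stddef.h>
#include <stdio.h>

typedef struct sketch_inode { int unused; } sketch_inode_t;
typedef struct sketch_fd    { sketch_inode_t *inode; } sketch_fd_t;

/* Simplified stand-in for upcall_local_init(); only the inode argument matters here. */
static void *upcall_local_init_sketch(sketch_inode_t *inode)
{
    return inode;
}

static int up_lk_sketch(sketch_fd_t *fd)
{
    /* Hypothetical guard: without this check, fd->inode below segfaults
     * exactly like the brick did when fd == 0x0. */
    if (fd == NULL) {
        fprintf(stderr, "up_lk: NULL fd, rejecting lock request\n");
        return -1;
    }

    void *local = upcall_local_init_sketch(fd->inode);
    (void)local;
    return 0;
}

int main(void)
{
    /* Reproduces the shape of the crash: the lock call arrives with a NULL fd. */
    return up_lk_sketch(NULL) == -1 ? 0 : 1;
}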
Can you upload the coredump so that I can analyze it? I will also need to know the exact version of Gluster and the operating system you are using.

To restart the crashed brick, the following command should help:

# gluster volume start hadoop_volume force
https://drive.google.com/file/d/1n2IeRNqwXYmF1q664Rvtr5RuDu5taDz9/view?usp=sharing
Thanks for sharing the coredump. I'll take a look as soon as I can.
Sorry for the late answer. I've checked the core dump and it seems to belong to glusterfs 3.10.10. This is a very old version and it's already EOL. Is it possible to upgrade to a newer supported version and check whether the problem persists? At first sight I don't see a similar bug, but many things have changed since then. If you are unable to upgrade, let me know which version of the operating system you are using and which source you used to install the gluster packages, so that I can find the appropriate symbols to analyze the core.
Hi waza123,

Did you get a chance to upgrade? We have fixed many issues since 3.10.10; we are already at version 6.3 and about to release glusterfs-7.0. I will be closing this issue as EOL, since the reported version is no longer supported. Please re-open if the issue persists on a higher version.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days