Description of problem:
Randomly, the glusterfs pods crash, create a core.* file, and fill up the / partition. At that point the gluster pod stops working and we have to kill it manually.

Version-Release number of selected component (if applicable):
CNS 3.9

How reproducible:
Customer environment

Additional info:
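A minimal sketch (not part of the original report; the helper name and `find` options are assumptions) of how the core.* files filling the / partition could be located inside the pod before cleanup:

```shell
# Hypothetical helper, for illustration only: list core dumps under a
# directory (default /) together with their size in KiB, so the space
# they consume on the pod's root filesystem can be inspected.
list_cores() {
    dir="${1:-/}"
    # -xdev stays on one filesystem; 'core.*' matches the dumps seen
    # in this report (e.g. core.36187, core.121082)
    find "$dir" -xdev -type f -name 'core.*' -exec du -k {} + 2>/dev/null
}
```

For example, `list_cores /` run inside the affected pod would show which dumps are eating the root partition.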
# gdb /usr/sbin/glusterfs core.36187
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...
warning: the debug information found in "/usr/lib/debug//usr/sbin/glusterfsd.debug" does not match "/usr/sbin/glusterfsd" (CRC mismatch).
warning: the debug information found in "/usr/lib/debug/usr/sbin/glusterfsd.debug" does not match "/usr/sbin/glusterfsd" (CRC mismatch).
Missing separate debuginfo for /usr/sbin/glusterfsd
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/e7/fa7c0b09c86663966ceeb6320e43e760a521ba.debug
Reading symbols from /usr/sbin/glusterfsd...(no debugging symbols found)...done.
(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 36191]
[New LWP 36187]
[New LWP 36192]
[New LWP 36188]
[New LWP 36189]
[New LWP 36193]
[New LWP 36190]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gl'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000557e5ca70051 in glusterfs_handle_translator_op ()
(gdb) thread apply all bt

Thread 7 (Thread 0x7fa5ff44d700 (LWP 36190)):
#0 0x00007fa6010154fd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fa601015394 in __sleep (seconds=0, seconds@entry=30) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2 0x00007fa60294c3fd in pool_sweeper (arg=<optimized out>) at mem-pool.c:464
#3 0x00007fa601785dd5 in start_thread (arg=0x7fa5ff44d700) at pthread_create.c:308
#4 0x00007fa60104eb3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 6 (Thread 0x7fa5fc18d700 (LWP 36193)):
#0 0x00007fa60104f113 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fa6029806d2 in event_dispatch_epoll_worker (data=0x557e5ebf74e0) at event-epoll.c:638
#2 0x00007fa601785dd5 in start_thread (arg=0x7fa5fc18d700) at pthread_create.c:308
#3 0x00007fa60104eb3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7fa5ffc4e700 (LWP 36189)):
#0 0x00007fa60178d411 in do_sigwait (sig=0x7fa5ffc4de1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#1 __sigwait (set=0x7fa5ffc4de20, sig=0x7fa5ffc4de1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99
#2 0x0000557e5ca6c07b in glusterfs_sigwaiter ()
#3 0x00007fa601785dd5 in start_thread (arg=0x7fa5ffc4e700) at pthread_create.c:308
#4 0x00007fa60104eb3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7fa60044f700 (LWP 36188)):
#0 0x00007fa60178ceed in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007fa602932f2e in gf_timer_proc (data=0x557e5ebb9250) at timer.c:176
#2 0x00007fa601785dd5 in start_thread (arg=0x7fa60044f700) at pthread_create.c:308
#3 0x00007fa60104eb3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7fa5fe44b700 (LWP 36192)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007fa60295e9d8 in syncenv_task (proc=proc@entry=0x557e5ebba770) at syncop.c:603
#2 0x00007fa60295f820 in syncenv_processor (thdata=0x557e5ebba770) at syncop.c:695
#3 0x00007fa601785dd5 in start_thread (arg=0x7fa5fe44b700) at pthread_create.c:308
#4 0x00007fa60104eb3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7fa602e05780 (LWP 36187)):
#0 0x00007fa601786f47 in pthread_join (threadid=140350875817728, thread_return=thread_return@entry=0x0) at pthread_join.c:92
#1 0x00007fa602980b90 in event_dispatch_epoll (event_pool=0x557e5ebb2f40) at event-epoll.c:732
#2 0x0000557e5ca68ea3 in main ()

Thread 1 (Thread 0x7fa5fec4c700 (LWP 36191)):
#0 0x0000557e5ca70051 in glusterfs_handle_translator_op ()
#1 0x00007fa60295c4a2 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#2 0x00007fa600f97fc0 in ?? () from /lib64/libc.so.6
#3 0x0000000000000000 in ?? ()
# gdb /usr/sbin/glusterfs core.121082
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...
warning: the debug information found in "/usr/lib/debug//usr/sbin/glusterfsd.debug" does not match "/usr/sbin/glusterfsd" (CRC mismatch).
warning: the debug information found in "/usr/lib/debug/usr/sbin/glusterfsd.debug" does not match "/usr/sbin/glusterfsd" (CRC mismatch).
Missing separate debuginfo for /usr/sbin/glusterfsd
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/e7/fa7c0b09c86663966ceeb6320e43e760a521ba.debug
Reading symbols from /usr/sbin/glusterfsd...(no debugging symbols found)...done.
(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 121086]
[New LWP 121087]
[New LWP 121088]
[New LWP 121083]
[New LWP 121082]
[New LWP 121084]
[New LWP 121085]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gl'.
Program terminated with signal 11, Segmentation fault.
#0 0x000055c8eebda051 in glusterfs_handle_translator_op ()
(gdb) thread apply all bt

Thread 7 (Thread 0x7f39a1d1f700 (LWP 121085)):
#0 0x00007f39a38e74fd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f39a38e7394 in __sleep (seconds=0, seconds@entry=30) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2 0x00007f39a521e3fd in pool_sweeper (arg=<optimized out>) at mem-pool.c:464
#3 0x00007f39a4057dd5 in start_thread (arg=0x7f39a1d1f700) at pthread_create.c:308
#4 0x00007f39a3920b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 6 (Thread 0x7f39a2520700 (LWP 121084)):
#0 0x00007f39a405f411 in do_sigwait (sig=0x7f39a251fe1c, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:61
#1 __sigwait (set=0x7f39a251fe20, sig=0x7f39a251fe1c) at ../sysdeps/unix/sysv/linux/sigwait.c:99
#2 0x000055c8eebd607b in glusterfs_sigwaiter ()
#3 0x00007f39a4057dd5 in start_thread (arg=0x7f39a2520700) at pthread_create.c:308
#4 0x00007f39a3920b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7f39a56d7780 (LWP 121082)):
#0 0x00007f39a4058f47 in pthread_join (threadid=139885451540224, thread_return=thread_return@entry=0x0) at pthread_join.c:92
#1 0x00007f39a5252b90 in event_dispatch_epoll (event_pool=0x55c8eee26f40) at event-epoll.c:732
#2 0x000055c8eebd2ea3 in main ()

Thread 4 (Thread 0x7f39a2d21700 (LWP 121083)):
#0 0x00007f39a405eeed in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f39a5204f2e in gf_timer_proc (data=0x55c8eee2d250) at timer.c:176
#2 0x00007f39a4057dd5 in start_thread (arg=0x7f39a2d21700) at pthread_create.c:308
#3 0x00007f39a3920b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f399ea5f700 (LWP 121088)):
#0 0x00007f39a3921113 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f39a52526d2 in event_dispatch_epoll_worker (data=0x55c8eee6b4e0) at event-epoll.c:638
#2 0x00007f39a4057dd5 in start_thread (arg=0x7f399ea5f700) at pthread_create.c:308
#3 0x00007f39a3920b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f39a0d1d700 (LWP 121087)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007f39a52309d8 in syncenv_task (proc=proc@entry=0x55c8eee2e770) at syncop.c:603
#2 0x00007f39a5231820 in syncenv_processor (thdata=0x55c8eee2e770) at syncop.c:695
#3 0x00007f39a4057dd5 in start_thread (arg=0x7f39a0d1d700) at pthread_create.c:308
#4 0x00007f39a3920b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f39a151e700 (LWP 121086)):
#0 0x000055c8eebda051 in glusterfs_handle_translator_op ()
#1 0x00007f39a522e4a2 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#2 0x00007f39a3869fc0 in ?? () from /lib64/libc.so.6
#3 0x0000000000000000 in ?? ()
Upstream patch: https://review.gluster.org/20422
Downstream patch on rhgs-3.3.1 branch: https://code.engineering.redhat.com/gerrit/#/c/143109/
I'm not an expert on the CNS workflow, so I cannot comment on that. But if you have a consistent reproducer that gives the same shd crash and backtrace, I suppose it should be fine. FWIW, the steps I carried out on a plain glusterfs setup (no CNS) are described here: https://bugzilla.redhat.com/show_bug.cgi?id=1596513#c0.
Bug report changed to ON_QA status by Errata System. A QE request has been submitted for advisory RHBA-2018:34436-01 https://errata.devel.redhat.com/advisory/34436
I have run the steps as mentioned in comment #20, i.e.:

1. Create a replica 2 volume and start it.
2. Run `while true; do gluster volume heal <volname>; sleep 0.5; done` in one terminal.
3. In another terminal, keep running `service glusterd restart`.

I saw the crash frequently before the fix, but with the fix I did not see this problem after running the test for an hour. Hence moving to VERIFIED.

Tested version: 3.8.4-54.14
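The heal loop from the steps above can be sketched as a small shell helper (the function name, volume name, and iteration bound are placeholders for illustration; the `service glusterd restart` loop runs in a second terminal as described):

```shell
# Sketch of the reproducer's heal loop: repeatedly trigger a heal on
# the volume while glusterd is being restarted elsewhere. Before the
# fix this raced glustershd into the SIGSEGV seen in
# glusterfs_handle_translator_op().
heal_loop() {
    volname="$1"
    iterations="$2"   # bounded here instead of 'while true' for illustration
    i=0
    while [ "$i" -lt "$iterations" ]; do
        gluster volume heal "$volname"
        sleep 0.5
        i=$((i + 1))
    done
}

# In a second terminal, keep restarting the management daemon:
#   while true; do service glusterd restart; done
```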
LGTM.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2222