Bug 1272436
| Summary: | glusterd crashing | | |
| --- | --- | --- | --- |
| Product: | [Community] GlusterFS | Reporter: | gene |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.7.4 | CC: | amukherj, bugs, florian.leduc, gene, mselvaga, nicolas, smohan |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-06-22 05:06:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1288060 | | |
| Bug Blocks: | | | |
Description
gene, 2015-10-16 12:05:28 UTC
Hello guys, I've been experiencing the same issue lately. I've got a 255 GB core dump ... that I couldn't exploit. The symptoms are much the same as explained in the original report (https://www.gluster.org/pipermail/gluster-users/2015-October/023784.html):

    glustershd.log.2.gz:[2015-11-26 06:47:59.053991] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fbe53b8f182] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fbe548cc7c5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7fbe548cc659] ) 0-: received signum (15), shutting down

After that message, glusterd has crashed. I'm running glusterfs on Ubuntu 14.04:

    ii glusterfs-client 3.7.6-ubuntu1~trusty1 amd64 clustered file-system (client package)
    ii glusterfs-common 3.7.6-ubuntu1~trusty1 amd64 GlusterFS common libraries and translator modules
    ii glusterfs-server 3.7.6-ubuntu1~trusty1 amd64 clustered file-system (server package)

I will follow this thread; if you need more input, feel free to let me know.

Hello guys, Any updates on this bug? What do you suggest? Should I wait for an imminent patch or downgrade my servers to a prior version (<= 3.7.4)?

Could you mention the configuration values for the following options in the glusterd.vol file?

    ping-timeout
    event-threads

We observed a few crashes when multi-threaded epoll support was enabled in glusterd, and I suspect this could be one of them. We had decided to revert the settings. You shouldn't be seeing this crash from 3.7.6 onwards.

Hello, thanks for your quick answers. Here's a sample of glusterd.vol:

    volume management
        type mgmt/glusterd
        option working-directory /var/lib/glusterd
        option transport-type socket,rdma
        option transport.socket.keepalive-time 10
        option transport.socket.keepalive-interval 2
        option transport.socket.read-fail-log off
        option ping-timeout 30
    #   option base-port 49152
    end-volume

Are you OK to upgrade to 3.7.6 and try it out?

Hello, The main problem is that I'm already using that version (see comment #1). Should I downgrade?

(In reply to florian.leduc from comment #6)
> Hello,
>
> The main problem is that I'm already using that version (see comment #1).
> Should I downgrade?

Hi Florian, Could you make a configuration change in the glusterd.vol file, which is present at /usr/local/etc/glusterfs/glusterd.vol? In that file, add/modify the entries below:

    option event-threads 1
    option ping-timeout 0

Then restart glusterd and let me know if you face the glusterd crash problem again.

I have one more point to share. I thought we had already disabled multi-threaded epoll support in GlusterD, but it seems we missed doing so; we will surely do it in the next 3.7.x release. #c7 is actually a workaround to disable it.
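For reference, a minimal sketch of what the management volume definition could look like with this workaround applied, based on the sample posted earlier. The file path varies between installations (Debian/Ubuntu packages usually ship it under /etc/glusterfs/, source installs under /usr/local/etc/glusterfs/), so treat this as illustrative rather than authoritative:

```
# Sketch of glusterd.vol with the suggested workaround applied.
# Existing options are copied from the sample above; adjust to match
# your own installation before using this.
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    # workaround: single epoll thread, ping timer disabled
    option event-threads 1
    option ping-timeout 0
end-volume
```

glusterd has to be restarted afterwards for the options to take effect; on Ubuntu 14.04 the packaged service is typically named glusterfs-server.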
Hello guys, I'll do that today or tomorrow. I'll keep you up to date.

Hi Florian, the patch http://review.gluster.org/#/c/12874/ will be available soon in the gluster codebase. Meanwhile, you can apply the glusterd.vol configuration manually and let us know whether the issue still reproduces after the change.

Perfect, I've just modified the settings. We will monitor our systems intensively and let you know if the crashes still occur. Thanks for your quick replies.

Hello guys, No glusterd crashes during the whole weekend :). Should I maintain those options in my CMDB or should I wait for the next patch to get it? Regards,

(In reply to florian.leduc from comment #12)
> Hello Guys,
>
> No glusterd crashing during the whole weekend :). Should I maintain those
> options in my CMDB or should I wait for the next patch to get it?
>
> Regards,

Florian, we'd encourage you to maintain the same configuration till we release 3.7.7. Thanks, Atin

Hello guys, For some time there were no crashes, but after enabling the quota feature we started to see crashes of glusterfsd (though no more glusterd crashes) and we experienced weird behavior:

1. glusterfsd crashes from time to time (see backtrace below)
2. after enabling quotas, a lot of CPU was consumed (around 60% of 32 vCPUs)
3. a lot of split-brain and unsynced entries appeared in 'gluster volume heal info'

    [2015-12-15 17:35:54.236684] I [glusterfsd-mgmt.c:57:mgmt_cbk_spec] 0-mgmt: Volume file changed
    [2015-12-15 17:35:54.241767] I [graph.c:269:gf_add_cmdline_options] 0-data-01-server: adding option 'listen-port' for volume 'data-01-server' with value '49154'
    [2015-12-15 17:35:54.241810] I [graph.c:269:gf_add_cmdline_options] 0-data-01-posix: adding option 'glusterd-uuid' for volume 'data-01-posix' with value 'e2a44035-0e7d-4796-819a-062f916b0d49'
    [2015-12-15 17:35:54.248617] I [MSGID: 121037] [changetimerecorder.c:1686:reconfigure] 0-data-01-changetimerecorder: set!
    [2015-12-15 17:35:54.249140] W [socket.c:3636:reconfigure] 0-data-01-quota: NBIO on -1 failed (Bad file descriptor)
    [2015-12-15 17:35:54.249388] I [MSGID: 115034] [server.c:403:_check_for_auth_option] 0-/var/opt/hosting/data/volume_data-01: skip format check for non-addr auth option auth.login./var/opt/hosting/data/volume_data-01.allow
    [2015-12-15 17:35:54.249442] I [MSGID: 115034] [server.c:403:_check_for_auth_option] 0-/var/opt/hosting/data/volume_data-01: skip format check for non-addr auth option auth.login.8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e.password
    [2015-12-15 17:35:54.249648] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249686] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249713] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249741] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249771] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    [2015-12-15 17:35:54.249795] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 8d63107f-2fe9-40ce-99e6-6a7a6ac0d49e
    pending frames:
    frame : type(0) op(14)
    frame : type(0) op(0)
    patchset: git://git.gluster.com/glusterfs.git
    signal received: 11
    time of crash:
    2015-12-15 17:35:54
    configuration details:
    argp 1
    backtrace 1
    dlfcn 1
    libpthread 1
    llistxattr 1
    setfsid 1
    spinlock 1
    epoll.h 1
    xattr.h 1
    st_atim.tv_nsec 1
    package-string: glusterfs 3.7.6
    /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x92)[0x7f9aced33562]
    /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f9aced4f51d]
    /lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7f9ace131d40]
    /lib/x86_64-linux-gnu/libpthread.so.0(pthread_spin_lock+0x0)[0x7f9ace4cd0f0]
    ---------

Hi Vijaikumar, Can you please take a look at it?

Hi Florian, Could you please provide the stack trace from the glusterfsd core dump? Thanks, Vijay
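As an illustration, one generic way to collect such a stack trace from a core file with gdb is sketched below; this is not a step prescribed in the thread, and the core file path and output file name are placeholders, assuming gdb and the matching glusterfs debug symbols are installed:

```sh
# Sketch: dump all thread backtraces from a glusterfsd core file.
# Replace /path/to/core with the actual core file location.
gdb --batch \
    -ex "set pagination off" \
    -ex "thread apply all bt full" \
    /usr/sbin/glusterfsd /path/to/core > glusterfsd-backtrace.txt
```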
Hello Vijaikumar, thanks for your reply. After a quick look at the system, I couldn't find any core dumps; can you give me a hint of where they should be located? (I tried to google it, but no luck so far.) I once got a core dump in a brick, which is: /var/opt/hosting/data/volume_data-01.

BTW, here's our configuration:

    Volume Name: data-01
    Type: Replicate
    Volume ID: 4b2b4dbe-a8dd-4988-b76e-0e1fc7c0dda9
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: 10.234.208.154:/var/opt/hosting/data/volume_data-01
    Brick2: 10.234.208.155:/var/opt/hosting/data/volume_data-01
    Options Reconfigured:
    features.quota-deem-statfs: on
    features.inode-quota: on
    features.quota: on
    performance.readdir-ahead: on
    nfs.disable: on
    cluster.self-heal-window-size: 128
    cluster.data-self-heal-algorithm: diff
    cluster.min-free-disk: 5
    network.frame-timeout: 600
    network.ping-timeout: 60
    performance.write-behind-window-size: 128MB
    performance.cache-max-file-size: 100MB
    performance.cache-min-file-size: 1KB
    performance.cache-size: 10GB
    performance.cache-refresh-timeout: 5
    cluster.self-heal-daemon: on

Hi Florian, Usually the core file will be generated under the root dir '/' (which is the cwd of a brick process). If the core pattern is set in the kernel parameters to generate the core file in a directory other than the cwd, it will be in the specified dir. In RHEL, the core pattern may be set to '/var/crash' or '/var/log/crash'. The command to check the core pattern is 'sysctl kernel.core_pattern'. Also check 'ulimit -c'; if it is zero, then no core file would have been generated. We will also try to re-create this problem in-house. Thanks, Vijay
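Spelled out as commands, those checks might look roughly like the following sketch; the target directory in the last line is only an example and must already exist:

```sh
# Where does the kernel write core files? A plain 'core' means the
# process's current working directory ('/' for a brick process).
sysctl kernel.core_pattern

# Maximum core file size for this shell; 0 means no core is written.
ulimit -c

# Example of temporarily allowing cores and steering them to a fixed
# directory (illustrative values; run as root, directory must exist).
ulimit -c unlimited
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
```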
Hi, I haven't found any trace of core files on that system (they should be named "core" according to sysctl). I'll do more searching on the next crash. Here's a pastebin of the alerts sent through syslog: http://pastebin.com/1JZZuz86

Hi everyone, We're still experiencing a lot of severe crashes (no trace of a core dump on the volume) and then a lot of unsynced entries after healing has passed, even after reinstalling the whole volume from scratch.

==== Logs:

    red-ack Dec 22 21:10:30: Program: ssh%3A%2F%2Froot%4010.234.208.15 [2015-12-22 20, Facility: daemon, Level: crit
    10:30.601517] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    10:30.601517] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    red-ack Dec 22 21:10:21: Program: glustershd[40694], Facility: daemon, Level: crit
    [2015-12-22 20:10:21.209994] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    [2015-12-22 20:10:21.209994] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    red-ack Dec 22 21:10:15: Program: ssh%3A%2F%2Froot%4010.234.144.57 [2015-12-22 20, Facility: daemon, Level: crit
    10:15.976956] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    10:15.976956] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    red-ack Dec 22 21:09:30: Program: var-opt-hosting-shared-volumes-d [2015-12-22 20, Facility: daemon, Level: crit
    09:30.414887] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.
    09:30.414887] C [rpc-clnt-ping.c:165:rpc_clnt_ping_timer_expired] 0-data-01-client-0: server 10.234.240.57:49153 has not responded in the last 60 seconds, disconnecting.

==== Volume heal info output:

    ....
    <gfid:e2d18ab9-a607-499d-babf-8fdaa90dd0bb>
    <gfid:199ba193-0788-4e3b-8951-26f0841c7e45>
    <gfid:77e2401a-2b98-4713-99b3-444bff26a222>
    <gfid:aa47948d-cd91-4d70-941d-21342d4acf06>
    <gfid:ef1f3a4f-6c7b-4741-a846-e8e78174369a>
    <gfid:38856f67-d776-4000-ab42-e548a0ab5f09>
    <gfid:7aa8f688-a53b-4962-81da-ffe5c45ac025>
    <gfid:b9d4bef4-bdee-45dc-bac5-85fdb45f6f41>
    <gfid:ba930fd2-3f46-4c32-99f4-6b6f344b649b>
    <gfid:4d6b8109-cf72-4837-bc48-45158785227a>
    <gfid:62025fc2-e011-4ce0-a3bb-2815bceaaac4>
    Number of entries: 853

Could you please advise. Thanks.

Is the crash from glusterd or the brick process?

Hello Atin, I'd say the brick process, but I have the feeling that ping-timeout set to 0 may be related to those crashes/timeouts. What do you suggest? Keep feeding this thread or open a new one?

(In reply to florian.leduc from comment #23)
> Hello Atin,
>
> I'd say the brick process but I have the feeling that ping-timeout set to 0
> may be related to those crashes/timeouts.

I don't think the ping timeout will contribute to it.

> What do you suggest ? keep feeding this thread or opening a new one ?

I highly recommend opening a new bug for this, as otherwise it will be misleading, since this bug talks about a crash in the glusterd process.

Since I've not received any further details around this bug, I am closing it right now; feel free to reopen if the issue persists.