Hi, I'm using glusterfs-3.2.3 and I have the following problem.

-------------------------------------------------------------------------
Problem
-------------------------------------------------------------------------
When reading data via FUSE, I see many of the following messages in the log file; the gluster server hangs and the CPU load average gets very high.

glusterfs server: CentOS-5.5 x86_64
glusterfs client: Debian-5.0 Lenny

-------------------------------------------------------------------------
Messages
-------------------------------------------------------------------------
[2011-09-15 21:50:08.285585] I [socket.c:2338:socket_submit_reply] 0-tcp.volume00-server: not connected (priv->connected = 255)
[2011-09-15 21:50:08.285599] E [rpcsvc.c:1033:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x520988x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 30) to rpc-transport (tcp.volume00-server)
[2011-09-15 21:50:08.285678] E [server.c:136:server_submit_reply] (-->/usr/local/glusterfs-3.2.3/lib/libglusterfs.so.0(default_finodelk_cbk+0x81) [0x2b82c16d19c1] (-->/usr/local/glusterfs-3.2.3/lib/glusterfs/3.2.3/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xaf) [0x2aaaabcb250f] (-->/usr/local/glusterfs-3.2.3/lib/glusterfs/3.2.3/xlator/protocol/server.so(server_finodelk_cbk+0xb7) [0x2aaaabed2af7]))) 0-: Reply submission failed

Is this a bug? When this occurs, we have to reboot the server or restart glusterfsd. The following are the configuration files for the server and client. Is there something wrong with the settings?
-------------------------------------------------------------------------
glusterfs server setting files
-------------------------------------------------------------------------
$ cat glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /usr/local/glusterfs/etc/glusterfs
    option transport-type tcp
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
end-volume

-------------------------------------------------------------------------
glusterfs client setting files
-------------------------------------------------------------------------
$ cat volumename.vol
volume disk1
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.1.1
    option ping-timeout 5
    option frame-timeout 1
    option remote-subvolume /brick01
end-volume

volume disk2
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.1.2
    option ping-timeout 5
    option frame-timeout 1
    option remote-subvolume /brick01
end-volume

volume replicate1
    type cluster/replicate
    subvolumes disk1 disk2
end-volume

volume cache
    type performance/io-cache
    option cache-size 256MB
    subvolumes replicate1
end-volume

volume writeback
    type performance/write-behind
    option cache-size 128MB
    subvolumes cache
end-volume

volume iothreads
    type performance/io-threads
    option thread-count 16
    subvolumes writeback
end-volume
The error messages you have pasted (esp. "not connected") indicate that the server is unable to reach the client. We can comment on the actual nature of the problem only if you provide the client and server log files from when you observe the "hang up" or the spike in CPU load average.

A couple of things about the volume file: "option ping-timeout 5" makes the client disconnect if no 'activity' is seen within the last 5 seconds; a higher value should reduce the number of disconnects. "option frame-timeout 1" makes a frame be unwound within a second of no response from the server.
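As a rough sketch (these exact values are not from this report; 42 seconds and 1800 seconds are believed to be the shipped defaults for these two options), the first client volume with less aggressive timeouts might look like:

```
volume disk1
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.1.1
    # believed to be the default; 5s makes the client
    # declare the server dead on minor network latency
    option ping-timeout 42
    # believed to be the default; 1s makes frames bail
    # out before the server can possibly respond
    option frame-timeout 1800
    option remote-subvolume /brick01
end-volume
```

The same change would apply to the disk2 volume.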
Created attachment 677 server and client logs
Thank you for the reply. I attached the server and client logs from when the volume crashed and the CPU load average spiked. I've seen other people posting similar error messages and hitting the same situation as in my environment. Can you tell me whether this is a bug or normal behavior, and how to prevent this problem?
The logs only tell us that frames seem to be bailing out after every 1s of no response. frame-timeout is the time the client waits for a response from the server for a particular frame before giving up; a frame-timeout of 1s is very low in practice. It will only result in the client simply 'giving up' on frames and not making any progress. ping-timeout is there for the client to detect whether the server is alive/responding. A ping-timeout as low as 5s is only going to make the client think the server is down when you might be facing genuine network latency. From the information you have provided, there seems to be nothing 'abnormal'. I would recommend configuring frame-timeout and ping-timeout to their default values to begin with.
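Concretely (a sketch of the suggestion above, not an exact diff), restoring the defaults is just a matter of deleting the two option lines from each protocol/client volume, since an omitted option falls back to its built-in default:

```
volume disk1
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.1.1
    # ping-timeout and frame-timeout omitted:
    # the built-in defaults apply
    option remote-subvolume /brick01
end-volume
```

The same deletion applies to disk2; remember that the client must be remounted for volume-file changes to take effect.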
I used to run glusterfs with the default timeout settings. In my environment I see many "Reply submission failed" messages in the log file that I attached before. Is this related to the CPU load spike? And any ideas to prevent this problem?
We would need the client/server log files from when you see the problem with the default timeouts to tell if anything is wrong.
Created attachment 697
>We would need the client/server log files from when you see the problem with
>the default timeouts to tell if anything is wrong.

I attached the log files from when the default timeout settings were in use.
The small portion of the client/server log files you have attached is insufficient to investigate the problem you might be facing; the entire client and server log files are needed. Please try to provide all of the following information:

- What is the volume configuration (gluster volume info <volname>)?
- What kind of 'workload' was seen on the glusterfs client?
- Did any of the servers 'go down', or were there any network outages, when you saw these messages?
- Attach the log files of the client and server(s).
- When you observe a hang, send signal USR1 to the glusterfs server process(es) - 'kill -s USR1 <pid>'. This dumps the process state to '/tmp/glusterdump.<pid>'. Attach the resulting glusterdump.<pid> file(s).
We need all the information asked for in the above comment to proceed on this issue. Please feel free to upgrade to a newer version (either the master branch through git, or the highest available version in the v3.2.x branch).
Closing as there has been no activity. Please try with the latest git head (or the latest release at the time of your next test) and open a new bug (or reopen this one if the issue persists).