Bug 765376 (GLUSTER-3644)

Summary: Glusterfs volume crash
Product: [Community] GlusterFS
Reporter: KentaroNishizawa <kentaro.nishizawa>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED WONTFIX
Severity: medium
Priority: low
Version: 3.2.3
CC: amarts, gluster-bugs, nsathyan
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2012-03-12 05:14:11 EDT
Attachments:
server and client logs (glusterfs server and client log), flags: none

Description KentaroNishizawa 2011-09-27 03:48:54 EDT

I'm using glusterfs-3.2.3 and have the following problem.

When reading data via FUSE, I see many messages like the following in the log file,
the gluster server hangs, and the CPU load average gets very high.

glusterfs server  CentOS-5.5 x86_64
glusterfs client  Debian-5.0 Lenny

[2011-09-15 21:50:08.285585] I [socket.c:2338:socket_submit_reply] 0-tcp.volume00-server: not connected (priv->connected = 255)
[2011-09-15 21:50:08.285599] E [rpcsvc.c:1033:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x520988x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 30) to rpc-transport (tcp.volume00-server)
[2011-09-15 21:50:08.285678] E [server.c:136:server_submit_reply] (-->/usr/local/glusterfs-3.2.3/lib/libglusterfs.so.0(default_finodelk_cbk+0x81) [0x2b82c16d19c1] (-->/usr/local/glusterfs-3.2.3/lib/glusterfs/3.2.3/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xaf) [0x2aaaabcb250f] (-->/usr/local/glusterfs-3.2.3/lib/glusterfs/3.2.3/xlator/protocol/server.so(server_finodelk_cbk+0xb7) [0x2aaaabed2af7]))) 0-: Reply submission failed

Is this a bug?
When this occurs, we have to reboot the server or restart glusterfsd.

The following are the configuration files for the server and the client.
Is there something wrong with the settings?

glusterfs server configuration file

$ cat glusterd.vol

volume management
    type mgmt/glusterd
    option working-directory /usr/local/glusterfs/etc/glusterfs
    option transport-type tcp
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2

glusterfs client configuration file

$ cat volumename.vol

volume disk1
    type protocol/client
    option transport-type tcp/client
    option remote-host
    option ping-timeout 5
    option frame-timeout 1
    option remote-subvolume /brick01

volume disk2
    type protocol/client
    option transport-type tcp/client
    option remote-host
    option ping-timeout 5
    option frame-timeout 1
    option remote-subvolume /brick01

volume replicate1
   type cluster/replicate
   subvolumes disk1 disk2

volume cache
  type performance/io-cache
  option cache-size 256MB
  subvolumes replicate1

volume writeback
  type performance/write-behind
  option cache-size 128MB
  subvolumes cache

volume iothreads
  type performance/io-threads
  option thread-count 16
  subvolumes writeback
Comment 1 krishnan parthasarathi 2011-09-27 05:29:05 EDT
The error messages you have pasted (especially "not connected") indicate that the server is unable to reach the client. We can comment on the actual nature of the problem only if you provide the client and server log files from the time when you observe the hang or the spike in CPU load average.

A couple of things about the volume file:
    "option ping-timeout 5"
This makes the client disconnect if no activity is seen within the last 5 seconds. A higher value should reduce the number of disconnects.

    "option frame-timeout 1"
This makes the frame unwind (the client gives up on the request) after one second without a response from the server.
Comment 2 KentaroNishizawa 2011-09-27 22:56:46 EDT
Created attachment 677

server and client logs
Comment 3 KentaroNishizawa 2011-09-27 22:57:20 EDT
Thank you for the reply.

I attached the server and client logs from when the volume crashed
and the CPU load average spiked.

I've seen other people posting similar error messages and hitting
the same situation as in my environment. Can you tell me whether this is a bug
or normal behavior, and how to prevent this problem?
Comment 4 krishnan parthasarathi 2011-09-28 00:06:31 EDT
The logs only tell us that frames seem to be bailing out after every 1s of no response.

frame-timeout is how long the client waits for a response from the server for a particular frame before giving up. A frame-timeout of 1s is impractically low. It will only result in the client simply giving up on frames and not making any progress.

ping-timeout exists so that the client can detect whether the server is alive and responding. A ping-timeout as low as 5s will only make the client think that the server is down when you may just be facing genuine network latency.

From the information you provided, there seems to be nothing 'abnormal'. I would recommend setting frame-timeout and ping-timeout to their default values to begin with.
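
The effect of the two timeouts discussed in this comment can be illustrated with a toy model. This is only a sketch, not GlusterFS code: the function names and the latency values are hypothetical, and the larger values used for comparison (1800 s for frame-timeout, 42 s for ping-timeout) are assumed here to be the defaults, so check them against the documentation for your version.

```python
def bailed_out_frames(latencies_s, frame_timeout_s):
    """Return indices of requests whose reply arrives after frame_timeout_s."""
    return [i for i, latency in enumerate(latencies_s) if latency > frame_timeout_s]


def server_presumed_down(silence_s, ping_timeout_s):
    """Return True if the client would treat the server as unreachable."""
    return silence_s > ping_timeout_s


# Hypothetical reply latencies (in seconds) from a heavily loaded server.
latencies = [0.2, 1.5, 3.0, 0.8, 12.0]

# With the reported setting (frame-timeout 1), three of five frames bail out.
print(bailed_out_frames(latencies, 1))

# With the assumed default (1800 s), no frame bails out.
print(bailed_out_frames(latencies, 1800))

# A 7 s stall trips a 5 s ping-timeout but not the assumed 42 s default.
print(server_presumed_down(7, 5), server_presumed_down(7, 42))
```

In this model a slow but healthy server loses most of its frames under the 1 s setting, which matches the "giving up on frames and not making any progress" behavior described above.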
Comment 5 KentaroNishizawa 2011-09-30 04:14:52 EDT
I used to run glusterfs with the default timeout settings.

In my environment, I see many "Reply submission failed" messages in the log file
that I attached before. Is this related to the CPU load spike?
And are there any ideas for preventing this problem?
Comment 6 krishnan parthasarathi 2011-09-30 04:21:57 EDT
We would need the client/server log files from when you see the problem with the default timeouts to tell if anything is wrong.
Comment 7 KentaroNishizawa 2011-10-05 22:46:05 EDT
Created attachment 697
Comment 8 KentaroNishizawa 2011-10-05 22:47:32 EDT
> We would need client/server log files when you see the problem with default
> timeouts to tell if anything is wrong.

I attached log files from when the default timeout settings were in use.
Comment 9 krishnan parthasarathi 2011-11-02 02:24:33 EDT
The small portion of the client/server log files you have attached is insufficient to investigate the problem you might be facing. The entire client and server log files are needed.

Please try to provide all of the following information.

- What is the volume configuration (gluster volume info <volname>)?
- What kind of 'workload' was seen on the glusterfs client?
- Did any of the servers 'go down', or were there any network outages,
  when you saw these messages?
- Attach the log files of the client and server(s).
- When you observe a hang, send signal USR1 to the glusterfs server process(es):
  'kill -s USR1 <pid>'
  This writes a process state dump to '/tmp/glusterdump.<pid>'.
  Attach the resulting glusterdump.<pid> file(s).
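
The state dump step above could be scripted roughly as follows. This is a minimal sketch: the helper names, the polling interval, and the wait time are hypothetical choices, and the dump location '/tmp/glusterdump.<pid>' is taken from this comment and may differ in other versions.

```python
import os
import signal
import time


def statedump_path(pid, dump_dir="/tmp"):
    """Build the expected dump file path for a given process ID."""
    return os.path.join(dump_dir, f"glusterdump.{pid}")


def request_statedump(pid, wait_s=5.0, poll_s=0.5):
    """Send SIGUSR1 to pid and wait up to wait_s seconds for the dump file.

    Returns the dump path if the file appears, otherwise None.
    """
    os.kill(pid, signal.SIGUSR1)  # same effect as 'kill -s USR1 <pid>'
    path = statedump_path(pid)
    deadline = time.monotonic() + wait_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return path
        time.sleep(poll_s)
    return None
```

For example, `request_statedump(12345)` (with 12345 standing in for a real glusterfsd PID) would signal that process and return the dump path once the file exists. The caller needs permission to signal the process, typically root.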
Comment 10 Amar Tumballi 2012-02-27 07:18:59 EST
We need all the information requested in the comment above to proceed on this issue. Please feel free to upgrade to a higher version (either the master branch through git, or the highest available version in the v3.2.x branch).
Comment 11 Amar Tumballi 2012-03-12 05:14:11 EDT
Closing, as there is no activity. Next time, please try with the latest git head (or the latest release at the time of testing), and open a new bug (or reopen this one if the problem still exists).