Hi, I'm using glusterfs-3.2.3 and I have the following problem.

-------------------------------------------------------------------------
Problem
-------------------------------------------------------------------------
When reading data via FUSE, I see many of the following messages in the log file; the gluster server hangs and the CPU load average gets very high.

glusterfs server: CentOS-5.5 x86_64
glusterfs client: Debian-5.0 Lenny

-------------------------------------------------------------------------
Messages
-------------------------------------------------------------------------
[2011-09-15 21:50:08.285585] I [socket.c:2338:socket_submit_reply] 0-tcp.volume00-server: not connected (priv->connected = 255)
[2011-09-15 21:50:08.285599] E [rpcsvc.c:1033:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x520988x, Program: GlusterFS-3.1.0, ProgVers: 310, Proc: 30) to rpc-transport (tcp.volume00-server)
[2011-09-15 21:50:08.285678] E [server.c:136:server_submit_reply] (-->/usr/local/glusterfs-3.2.3/lib/libglusterfs.so.0(default_finodelk_cbk+0x81) [0x2b82c16d19c1] (-->/usr/local/glusterfs-3.2.3/lib/glusterfs/3.2.3/xlator/debug/io-stats.so(io_stats_finodelk_cbk+0xaf) [0x2aaaabcb250f] (-->/usr/local/glusterfs-3.2.3/lib/glusterfs/3.2.3/xlator/protocol/server.so(server_finodelk_cbk+0xb7) [0x2aaaabed2af7]))) 0-: Reply submission failed

Is this a bug? When this occurs, we have to reboot the server or restart glusterfsd. The following are the configuration files for the server and client. Is there something wrong with the settings?
-------------------------------------------------------------------------
glusterfs server setting files
-------------------------------------------------------------------------
$ cat glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /usr/local/glusterfs/etc/glusterfs
    option transport-type tcp
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
end-volume

-------------------------------------------------------------------------
glusterfs client setting files
-------------------------------------------------------------------------
$ cat volumename.vol
volume disk1
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.1.1
    option ping-timeout 5
    option frame-timeout 1
    option remote-subvolume /brick01
end-volume

volume disk2
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.1.2
    option ping-timeout 5
    option frame-timeout 1
    option remote-subvolume /brick01
end-volume

volume replicate1
    type cluster/replicate
    subvolumes disk1 disk2
end-volume

volume cache
    type performance/io-cache
    option cache-size 256MB
    subvolumes replicate1
end-volume

volume writeback
    type performance/write-behind
    option cache-size 128MB
    subvolumes cache
end-volume

volume iothreads
    type performance/io-threads
    option thread-count 16
    subvolumes writeback
end-volume
The error messages you have pasted (esp. "not connected") indicate that the server is unable to reach the client. We can comment on the actual nature of the problem only if you provide the client and server log files from when you observe the "hang up" or the spike in CPU load average.

A couple of things about the volume file: "option ping-timeout 5" makes the client disconnect if no 'activity' is seen within the last 5 seconds; a higher value should reduce the number of disconnects. "option frame-timeout 1" makes a frame be unwound within a second of no response from the server.
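As a rough sketch (these exact values are not from this report; 42 seconds and 1800 seconds are believed to be the shipped defaults for these two options), the first client volume with less aggressive timeouts might look like:

```
volume disk1
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.1.1
    # believed to be the default; 5s makes the client
    # declare the server dead on minor network latency
    option ping-timeout 42
    # believed to be the default; 1s makes frames bail
    # out before the server can possibly respond
    option frame-timeout 1800
    option remote-subvolume /brick01
end-volume
```

The same change would apply to the disk2 volume.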
Created attachment 677 server and client logs
Thank you for the reply. I attached the server and client logs from when the volume crashed and the CPU load average spiked. I've seen other people posting similar error messages and hitting the same situation as in my environment. Can you tell me whether this is a bug or normal behavior, and how to prevent this problem?
The logs only tell us that frames seem to be bailing out after every 1s of no response. frame-timeout is the time the client waits for a response from the server for a particular frame before giving up; a frame-timeout of 1s is very low in practice. It will only result in the client simply 'giving up' on frames and not making any progress. ping-timeout is there for the client to detect whether the server is alive/responding. A ping-timeout as low as 5s is only going to make the client think the server is down when you might be facing genuine network latency. From the information you have provided, there seems to be nothing 'abnormal'. I would recommend configuring frame-timeout and ping-timeout to their default values to begin with.
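Concretely (a sketch of the suggestion above, not an exact diff), restoring the defaults is just a matter of deleting the two option lines from each protocol/client volume, since an omitted option falls back to its built-in default:

```
volume disk1
    type protocol/client
    option transport-type tcp/client
    option remote-host 192.168.1.1
    # ping-timeout and frame-timeout omitted:
    # the built-in defaults apply
    option remote-subvolume /brick01
end-volume
```

The same deletion applies to disk2; remember that the client must be remounted for volume-file changes to take effect.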
I used to run glusterfs with the default timeout settings. In my environment I see many "Reply submission failed" messages in the log file that I attached before. Is this related to the CPU load spike? And any ideas to prevent this problem?
We would need the client/server log files from when you see the problem with the default timeouts to tell if anything is wrong.
Created attachment 697
>We would need the client/server log files from when you see the problem with
>the default timeouts to tell if anything is wrong.

I attached the log files from when the default timeout settings were in use.
The small portion of the client/server log files you have attached is insufficient to investigate the problem you might be facing; the entire client and server log files are needed. Please try to provide all of the following information:

- What is the volume configuration (gluster volume info <volname>)?
- What kind of 'workload' was seen on the glusterfs client?
- Did any of the servers 'go down', or were there any network outages, when you saw these messages?
- Attach the log files of the client and server(s).
- When you observe a hang, send signal USR1 to the glusterfs server process(es) - 'kill -s USR1 <pid>'. This dumps the process state to '/tmp/glusterdump.<pid>'. Attach the resulting glusterdump.<pid> file(s).
We need all the information asked for in the above comment to proceed on this issue. Please feel free to upgrade to a newer version (either the master branch through git, or the highest available version in the v3.2.x branch).
Closing as there has been no activity. Please try with the latest git head (or the latest release at the time of your next test) and open a new bug (or reopen this one if the issue persists).