Bug 765219 (GLUSTER-3487)

Summary: Freeze of clients by intensive parallel writes
Product: [Community] GlusterFS
Component: locks
Version: 3.2.2
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: low
Reporter: Alex Aster <alrond>
Assignee: Pranith Kumar K <pkarampu>
CC: gluster-bugs
Doc Type: Bug Fix
Mount Type: fuse
Last Closed: 2012-06-13 07:17:21 UTC

Description Alex Aster 2011-08-28 20:37:10 UTC
Hello, I have a very bad problem.
After some time, some of my client applications freeze.
The folders/files those clients were working in then hang for all other users too, and commands like ls and touch block as well.
kill -9 of the applications doesn't help. Only a server restart or kill -9 of the mounted glusterfs process helps.
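A hung glusterfs client process can usually be asked for a statedump before it is killed; the dump records the call frames and locks it is stuck on, which is exactly what a freeze like this needs. A sketch (the dump path varies by version; older 3.x releases typically write /tmp/glusterdump.<pid>):

kill -USR1 $(pidof glusterfs)
ls -l /tmp/glusterdump.*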

OS: Ubuntu Server 11.04, 64-bit
Setup: 2 replicated content servers; each server has an 8 TB RAID-6 with ext4 (mounted with user_xattr).
GlusterFS: 3.2.3

Mounting via fstab:
serv4:/vol-content /content glusterfs auto,noatime,nodiratime,nosuid,noexec,rw,allow_other,default_permissions,max_read=131072,_netdev 0 0
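For reference, the equivalent manual mount is roughly the following; running it by hand makes mount-time errors visible on the console instead of hiding them at boot. A sketch using only the generic options from the fstab entry (the FUSE-specific ones may not all be accepted on the command line):

mount -t glusterfs -o noatime,nodiratime,nosuid,noexec,rw serv4:/vol-content /content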

I already had this problem with 3.1.3 on XFS and with 3.2.2 on ext4.

I cannot always reproduce this situation; it takes a lot of parallel reads/writes from the clients' PHP scripts.
The filesystem currently works because I have moved those clients to a local disk.

I have tried with a clean configuration and with a tuned one - the same results.
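If it helps to rule the tuning out, all reconfigured options can be dropped back to their defaults in one step rather than unset one by one. A sketch ("gluster volume reset" should be available on 3.2; check "gluster volume help" on your version):

gluster volume reset vol-content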

Current configuration:

Volume Name: vol-content
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: serv3:/media/content
Brick2: serv4:/media/content
Options Reconfigured:
diagnostics.dump-fd-stats: no
diagnostics.latency-measurement: no
diagnostics.client-log-level: INFO
diagnostics.brick-log-level: INFO
performance.io-cache: on
auth.allow: 192.168.0.*
performance.io-thread-count: 64
performance.write-behind-window-size: 1GB

Specifically, after the upgrade to 3.2.3 I enabled debug mode and moved my clients back to Gluster. After a few minutes, half of the clients had become zombies.

For debugging I enabled:
gluster volume set vol-content diagnostics.brick-log-level DEBUG
gluster volume set vol-content diagnostics.client-log-level DEBUG
gluster volume set vol-content diagnostics.latency-measurement yes
gluster volume set vol-content diagnostics.dump-fd-stats yes

I have both logs, server and bricks, but I don't know how to find the problem in them: the logs are 400 MB for just those two hours, and they contain no occurrences of the word "error".
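GlusterFS log lines carry a one-letter severity (E for errors, W for warnings) rather than the word "error", so filtering by level tends to find problems that a keyword search misses. A sketch, assuming the default log locations and the 3.x line format "[timestamp] E [file:line:function] ...":

grep ' E ' /var/log/glusterfs/content.log
grep -E ' (E|W) ' /var/log/glusterfs/bricks/media-content.log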

Comment 1 Pranith Kumar K 2011-08-29 03:36:36 UTC
(In reply to comment #0)

Could you please zip and attach the logs to the bug? I don't see any problem with your current configuration.
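Compressing the logs before attaching them keeps the sizes manageable; it is also worth verifying each archive before sending it, since a truncated upload is easy to produce with files this large. For example (gzip -t only tests integrity, it extracts nothing):

gzip -9 content.log
gzip -t content.log.gz && echo "archive OK"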

Comment 2 Alex Aster 2011-08-29 06:02:22 UTC
I am sending you the logs via email because they contain some sensitive information.
13:42 - upgraded GlusterFS to 3.2.3 and added the debug options to the config
up to 15:01 - some scripts were started manually without problems
ca. 15:12 - autostart of all scripts
After one or two minutes, some scripts had already frozen.
15:22-15:23 - I stopped everything

I can reproduce this situation at any time for additional tests.

Comment 3 Alex Aster 2011-08-29 06:04:13 UTC
I forgot to mention:
The test folder is "/vol-rest/userupload_cluster/".

Comment 4 Pranith Kumar K 2011-09-19 13:00:34 UTC
(In reply to comment #3)
> I forgot to mention:
> The test folder is "/vol-rest/userupload_cluster/".

Hi Alex,
     I don't seem to have received any mail with the logs; could you re-send it?

Sorry for the inconvenience.
Pranith.

Comment 5 Alex Aster 2011-09-19 15:20:48 UTC
I sent two emails on 29th August.

Aug 29 04:20:11 ubuntu postfix/smtp[25161]: 58647181566: to=<pranithk>, relay=east.smtp.exch024.serverdata.net[206.225.164.180]:25, delay=121, delays=107/0.02/0.6/13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 63A03153)
Aug 29 04:23:00 ubuntu postfix/smtp[26680]: 2B480181566: to=<pranithk>, relay=east.smtp.exch024.serverdata.net[206.225.164.180]:25, delay=101, delays=88/0.01/0.34/13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 9335E64)


And today I re-sent them:

Sep 19 13:15:39 ubuntu postfix/smtp[2371]: E2D28180F41: to=<pranithk>, relay=east.smtp.exch024.serverdata.net[206.225.164.180]:25, delay=129, delays=110/0.01/0.57/18, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 8F7EE24C)
Sep 19 13:18:41 ubuntu postfix/smtp[2703]: 32BAC180F41: to=<pranithk>, relay=east.smtp.exch024.serverdata.net[206.225.164.180]:25, delay=105, delays=92/0.01/0.46/13, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 924B3D1)

Comment 6 Pranith Kumar K 2011-09-21 13:34:34 UTC
(In reply to comment #5)

Hi Alex,
     I went through the media-content logs and did not find anything wrong in them. The other log does not extract:
pranith @ ~/Downloads/3487/content
21:54:11 :) $ gunzip content.log.gz 

gzip: content.log.gz: unexpected end of file

pranith @ ~/Downloads/3487/content
21:54:17 :( $ ls -l !$
ls -l content.log.gz
-rw-r--r-- 1 pranith pranith 6957684 2011-09-21 21:40 content.log.gz

Could you re-send an intact copy of that log?

If there are 2 bricks and 1 client, there should be 3 logs: 2 brick (media-content) logs and 1 mount (content) log.
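Assuming the default log locations, the three files would typically be the following (the names are derived from the brick path and the mount point, so treat this as a sketch rather than a guarantee):

# on serv3 and serv4, one log per /media/content brick:
/var/log/glusterfs/bricks/media-content.log
# on the client, named after the /content mount point:
/var/log/glusterfs/content.log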

Do you hang out on IRC? If so, what is your nick?

Comment 7 Pranith Kumar K 2012-06-13 07:17:21 UTC
Please feel free to re-open with the necessary logs.