Created attachment 1229281 [details]
ftp gluster fuse client log (redacted personal information)

Description of problem:

We have a problem that has occurred twice in two days, and more than once before that.

3-node cluster in AWS (m4.xlarge, Fedora 23 Cloud Edition)
2.5TB volume

Volume Name: marketplace_nfs
Type: Distributed-Replicate
Volume ID: 528de1b5-0bd5-488b-83cf-c4f3f747e6cd
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.90.5.105:/data/data0/marketplace_nfs
Brick2: 10.90.3.14:/data/data3/marketplace_nfs
Brick3: 10.90.4.195:/data/data0/marketplace_nfs
Brick4: 10.90.5.105:/data/data1/marketplace_nfs
Brick5: 10.90.3.14:/data/data1/marketplace_nfs
Brick6: 10.90.4.195:/data/data1/marketplace_nfs
Options Reconfigured:
server.outstanding-rpc-limit: 128
cluster.self-heal-readdir-size: 16KB
cluster.self-heal-window-size: 3
diagnostics.brick-log-level: INFO
network.ping-timeout: 15
cluster.quorum-type: none
performance.readdir-ahead: on
cluster.self-heal-daemon: enable
performance.cache-size: 512MB
cluster.lookup-optimize: on
cluster.data-self-heal-algorithm: diff
cluster.server-quorum-ratio: 51%

Status of volume: marketplace_nfs
Gluster process                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.90.5.105:/data/data0/marketplace_nfs     49152     0          Y       3426
Brick 10.90.3.14:/data/data3/marketplace_nfs      49154     0          Y       3402
Brick 10.90.4.195:/data/data0/marketplace_nfs     49152     0          Y       4868
Brick 10.90.5.105:/data/data1/marketplace_nfs     49153     0          Y       31636
Brick 10.90.3.14:/data/data1/marketplace_nfs      49153     0          Y       348
Brick 10.90.4.195:/data/data1/marketplace_nfs     49153     0          Y       31238
NFS Server on localhost                           2049      0          Y       3999
Self-heal Daemon on localhost                     N/A       N/A        Y       4008
NFS Server on ip-10-90-5-105.ec2.internal         2049      0          Y       1488
Self-heal Daemon on ip-10-90-5-105.ec2.internal   N/A       N/A        Y       1496
NFS Server on ip-10-90-4-195.ec2.internal         2049      0          Y       20526
Self-heal Daemon on ip-10-90-4-195.ec2.internal   N/A       N/A        Y       20534

Task Status of Volume marketplace_nfs
------------------------------------------------------------------------------
There are no active volume tasks


Version-Release number of selected component (if applicable):
3.7.16

How reproducible:
Cannot be reproduced on demand, but it occurs frequently.

Actual results:
Client processes hang and cannot list the GlusterFS mount.

$ gluster volume heal marketplace_nfs info

hangs and cannot list healing information.

After we shut down the clients (not just umount - we halt the client hosts),

$ gluster volume heal

completes, load starts reducing, and we can remount. Recovery time is around 20 minutes and causes significant problems.

Expected results:
This does not happen.

Additional info:
The average file size is 13MB; around 5GB is the largest. We do some post-processing after the initial upload (mv, unzip, mv, delete). We have the logs from the FTP server; web servers also mount and work off this volume, but we do not have logs from them. The Gluster servers provide no useful logging during this time. I will attach statedumps as well as the client log.
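For anyone who wants to capture the same data, here is a minimal sketch of how statedumps like the attached ones can be gathered (the exact commands were not stated in the report; the pgrep pattern is a placeholder, and /var/run/gluster is the default dump location):

# Server side: dump the state of all brick processes for the volume
# (files land under /var/run/gluster by default)
$ sudo gluster volume statedump marketplace_nfs

# FUSE client side: the glusterfs client process writes its own
# statedump when it receives SIGUSR1
$ sudo kill -USR1 $(pgrep -f 'glusterfs.*marketplace_nfs')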
Created attachment 1229282 [details]
statedump from gluster node with high load

statedump from gluster node
As per the actual results in comment 0, the user is seeing that the heal info command hangs, and that load starts reducing once the heal completes. On initial thought this looks like a 'replica' issue. Moving this bug to the appropriate component.
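One way to confirm a replica (AFR) problem would be to inspect the pending-changelog xattrs directly on a brick for one of the affected files. A sketch (the file path below is a placeholder; the trusted.afr.* keys are what AFR uses to track pending heals):

# On a gluster server, dump all xattrs for a file reported by heal info
$ sudo getfattr -d -m . -e hex /data/data0/marketplace_nfs/ftpdata/example.zip
# Non-zero trusted.afr.marketplace_nfs-client-N values indicate pending
# data/metadata/entry heals blamed on the corresponding brick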
$ sudo gluster volume heal marketplace_nfs info
Brick 10.90.5.105:/data/data0/marketplace_nfs
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.90.3.14:/data/data3/marketplace_nfs
<gfid:5bba3981-5a34-4fae-9efc-12dc4638baaa>
... <output removed> ...
Status: Connected
Number of entries: 146

Brick 10.90.4.195:/data/data0/marketplace_nfs
<gfid:53834b40-8bb6-4d79-a393-46daaaf36f13>
... <output removed> ...
Status: Connected
Number of entries: 142

Brick 10.90.5.105:/data/data1/marketplace_nfs
Status: Connected
Number of entries: 0

Brick 10.90.3.14:/data/data1/marketplace_nfs
Status: Connected
Number of entries: 0

Brick 10.90.4.195:/data/data1/marketplace_nfs
Status: Connected
Number of entries: 0

$ sudo gluster v status
Status of volume: marketplace_nfs
Gluster process                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.90.5.105:/data/data0/marketplace_nfs     49152     0          Y       3426
Brick 10.90.3.14:/data/data3/marketplace_nfs      49154     0          Y       3402
Brick 10.90.4.195:/data/data0/marketplace_nfs     49152     0          Y       4868
Brick 10.90.5.105:/data/data1/marketplace_nfs     49153     0          Y       31636
Brick 10.90.3.14:/data/data1/marketplace_nfs      49153     0          Y       348
Brick 10.90.4.195:/data/data1/marketplace_nfs     49153     0          Y       31238
NFS Server on localhost                           2049      0          Y       20526
Self-heal Daemon on localhost                     N/A       N/A        Y       20534
NFS Server on ip-10-90-5-105.ec2.internal         2049      0          Y       1488
Self-heal Daemon on ip-10-90-5-105.ec2.internal   N/A       N/A        Y       1496
NFS Server on 10.90.3.14                          2049      0          Y       3999
Self-heal Daemon on 10.90.3.14                    N/A       N/A        Y       4008

Task Status of Volume marketplace_nfs
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: marketplace_uploads
Gluster process                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.90.4.195:/data/data2/uploads             49154     0          Y       20506
Brick 10.90.3.14:/data/data2/uploads              49155     0          Y       3976
Brick 10.90.5.105:/data/data2/uploads             49154     0          Y       1468
NFS Server on localhost                           2049      0          Y       20526
Self-heal Daemon on localhost                     N/A       N/A        Y       20534
NFS Server on ip-10-90-5-105.ec2.internal         2049      0          Y       1488
Self-heal Daemon on ip-10-90-5-105.ec2.internal   N/A       N/A        Y       1496
NFS Server on 10.90.3.14                          2049      0          Y       3999
Self-heal Daemon on 10.90.3.14                    N/A       N/A        Y       4008

Task Status of Volume marketplace_uploads
------------------------------------------------------------------------------
There are no active volume tasks

It has happened again. We shut down some hosts and the heal info started completing. You can see that the host with the high load is reporting:

Brick 10.90.5.105:/data/data0/marketplace_nfs
Status: Transport endpoint is not connected
Number of entries: -

I will attach another statedump from this node.
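Since heal info reports "Transport endpoint is not connected" for a brick that volume status shows as online, it might be worth checking what each brick actually has connected. A sketch using the standard status sub-commands (offered as a suggestion, not something run in this report):

# List the clients connected to each brick of the volume
$ sudo gluster volume status marketplace_nfs clients

# Check the self-heal daemon processes for the volume
$ sudo gluster volume status marketplace_nfs shd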
Created attachment 1229816 [details]
statedump from gluster node with high load
Further to this: these errors appear in the log on one node - always the same node. Thousands of these errors:

[2016-12-14 12:11:26.815832] I [MSGID: 115072] [server-rpc-fops.c:1640:server_setattr_cbk] 0-marketplace_nfs-server: 458301: SETATTR /ftpdata/<removed>/60_VW50aXRsZWQxMQ.zip (c0196410-246a-4de0-ab18-386e13db088c) ==> (Operation not permitted) [Operation not permitted]
[2016-12-14 12:11:30.196858] I [MSGID: 115072] [server-rpc-fops.c:1640:server_setattr_cbk] 0-marketplace_nfs-server: 68073: SETATTR /ftpdata/<removed>/283_TmVzdGVkIFNlcXVlbmNlIDk1XzE.zip (3fc3f663-0480-41be-b448-b7a3373e6b5d) ==> (Operation not permitted) [Operation not permitted]
[2016-12-14 12:11:30.677535] I [MSGID: 115072] [server-rpc-fops.c:1640:server_setattr_cbk] 0-marketplace_nfs-server: 458326: SETATTR /ftpdata/<removed>/uhd_1748_MTAyMF9XYXRhXzRLX19fMDFfbHV0.zip (0d00c1d1-4598-4789-89e1-723325bb92dc) ==> (Operation not permitted) [Operation not permitted]

These disappear if metadata healing is turned off. The only way to get healing to complete is to umount or halt the client systems.
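For reference, assuming "metadata healing" above refers to the metadata self-heal option, the toggle would look like this (standard volume-set options; re-enable once recovery completes):

# Disable metadata self-heal (the SETATTR storm stops while this is off)
$ sudo gluster volume set marketplace_nfs cluster.metadata-self-heal off

# Restore once the hang clears
$ sudo gluster volume set marketplace_nfs cluster.metadata-self-heal on

# Alternatively, stop the self-heal daemon entirely
$ sudo gluster volume set marketplace_nfs cluster.self-heal-daemon off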
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.