Created attachment 1229281 [details]
ftp gluster fuse client log (redacted personal information)

Description of problem:

We have a problem that has occurred twice in two days, and more than once before that.

3-node cluster in AWS (m4.xlarge, Fedora 23 Cloud Edition)
2.5TB volume

Volume Name: marketplace_nfs
Type: Distributed-Replicate
Volume ID: 528de1b5-0bd5-488b-83cf-c4f3f747e6cd
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.90.5.105:/data/data0/marketplace_nfs
Brick2: 10.90.3.14:/data/data3/marketplace_nfs
Brick3: 10.90.4.195:/data/data0/marketplace_nfs
Brick4: 10.90.5.105:/data/data1/marketplace_nfs
Brick5: 10.90.3.14:/data/data1/marketplace_nfs
Brick6: 10.90.4.195:/data/data1/marketplace_nfs
Options Reconfigured:
server.outstanding-rpc-limit: 128
cluster.self-heal-readdir-size: 16KB
cluster.self-heal-window-size: 3
diagnostics.brick-log-level: INFO
network.ping-timeout: 15
cluster.quorum-type: none
performance.readdir-ahead: on
cluster.self-heal-daemon: enable
performance.cache-size: 512MB
cluster.lookup-optimize: on
cluster.data-self-heal-algorithm: diff
cluster.server-quorum-ratio: 51%

Status of volume: marketplace_nfs
Gluster process                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.90.5.105:/data/data0/marketplace_nfs     49152     0          Y       3426
Brick 10.90.3.14:/data/data3/marketplace_nfs      49154     0          Y       3402
Brick 10.90.4.195:/data/data0/marketplace_nfs     49152     0          Y       4868
Brick 10.90.5.105:/data/data1/marketplace_nfs     49153     0          Y       31636
Brick 10.90.3.14:/data/data1/marketplace_nfs      49153     0          Y       348
Brick 10.90.4.195:/data/data1/marketplace_nfs     49153     0          Y       31238
NFS Server on localhost                           2049      0          Y       3999
Self-heal Daemon on localhost                     N/A       N/A        Y       4008
NFS Server on ip-10-90-5-105.ec2.internal         2049      0          Y       1488
Self-heal Daemon on ip-10-90-5-105.ec2.internal   N/A       N/A        Y       1496
NFS Server on ip-10-90-4-195.ec2.internal         2049      0          Y       20526
Self-heal Daemon on ip-10-90-4-195.ec2.internal   N/A       N/A        Y       20534

Task Status of Volume marketplace_nfs
------------------------------------------------------------------------------
There are no active volume tasks


Version-Release number of selected component (if applicable):
3.7.16

How reproducible:
Cannot be reproduced on demand, but it occurs frequently.

Actual results:
Client processes hang and cannot list the GlusterFS mount.

$ gluster volume heal marketplace_nfs info

hangs and cannot list healing information.

After we shut down the clients (not just umount - we halt the client hosts),

$ gluster volume heal

completes, load starts reducing, and we can remount. Recovery time is around 20 minutes and causes significant problems.

Expected results:
This does not happen.

Additional info:
The average file size is 13MB; around 5GB is the largest. We do some post-processing after the initial upload (mv, unzip, mv, delete). We have the logs from the FTP server; web servers also mount and work off this volume, but we do not have logs from them. The Gluster servers provide no useful logging during this time. I will attach statedumps as well as the client log.
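For anyone who wants to capture the same data, here is a minimal sketch of how statedumps like the attached ones can be gathered (the exact commands were not stated in the report; the pgrep pattern is a placeholder, and /var/run/gluster is the default dump location):

# Server side: dump the state of all brick processes for the volume
# (files land under /var/run/gluster by default)
$ sudo gluster volume statedump marketplace_nfs

# FUSE client side: the glusterfs client process writes its own
# statedump when it receives SIGUSR1
$ sudo kill -USR1 $(pgrep -f 'glusterfs.*marketplace_nfs')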
Created attachment 1229282 [details]
statedump from gluster node with high load

statedump from gluster node
As per the actual results in comment 0, the user is seeing that the heal info command hangs, and that load starts reducing once the heal completes. On initial thought this looks like a 'replica' issue. Moving this bug to the appropriate component.
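One way to confirm a replica (AFR) problem would be to inspect the pending-changelog xattrs directly on a brick for one of the affected files. A sketch (the file path below is a placeholder; the trusted.afr.* keys are what AFR uses to track pending heals):

# On a gluster server, dump all xattrs for a file reported by heal info
$ sudo getfattr -d -m . -e hex /data/data0/marketplace_nfs/ftpdata/example.zip
# Non-zero trusted.afr.marketplace_nfs-client-N values indicate pending
# data/metadata/entry heals blamed on the corresponding brick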
$ sudo gluster volume heal marketplace_nfs info
Brick 10.90.5.105:/data/data0/marketplace_nfs
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.90.3.14:/data/data3/marketplace_nfs
<gfid:5bba3981-5a34-4fae-9efc-12dc4638baaa>
... <output removed> ...
Status: Connected
Number of entries: 146

Brick 10.90.4.195:/data/data0/marketplace_nfs
<gfid:53834b40-8bb6-4d79-a393-46daaaf36f13>
... <output removed> ...
Status: Connected
Number of entries: 142

Brick 10.90.5.105:/data/data1/marketplace_nfs
Status: Connected
Number of entries: 0

Brick 10.90.3.14:/data/data1/marketplace_nfs
Status: Connected
Number of entries: 0

Brick 10.90.4.195:/data/data1/marketplace_nfs
Status: Connected
Number of entries: 0

$ sudo gluster v status
Status of volume: marketplace_nfs
Gluster process                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.90.5.105:/data/data0/marketplace_nfs     49152     0          Y       3426
Brick 10.90.3.14:/data/data3/marketplace_nfs      49154     0          Y       3402
Brick 10.90.4.195:/data/data0/marketplace_nfs     49152     0          Y       4868
Brick 10.90.5.105:/data/data1/marketplace_nfs     49153     0          Y       31636
Brick 10.90.3.14:/data/data1/marketplace_nfs      49153     0          Y       348
Brick 10.90.4.195:/data/data1/marketplace_nfs     49153     0          Y       31238
NFS Server on localhost                           2049      0          Y       20526
Self-heal Daemon on localhost                     N/A       N/A        Y       20534
NFS Server on ip-10-90-5-105.ec2.internal         2049      0          Y       1488
Self-heal Daemon on ip-10-90-5-105.ec2.internal   N/A       N/A        Y       1496
NFS Server on 10.90.3.14                          2049      0          Y       3999
Self-heal Daemon on 10.90.3.14                    N/A       N/A        Y       4008

Task Status of Volume marketplace_nfs
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: marketplace_uploads
Gluster process                                   TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.90.4.195:/data/data2/uploads             49154     0          Y       20506
Brick 10.90.3.14:/data/data2/uploads              49155     0          Y       3976
Brick 10.90.5.105:/data/data2/uploads             49154     0          Y       1468
NFS Server on localhost                           2049      0          Y       20526
Self-heal Daemon on localhost                     N/A       N/A        Y       20534
NFS Server on ip-10-90-5-105.ec2.internal         2049      0          Y       1488
Self-heal Daemon on ip-10-90-5-105.ec2.internal   N/A       N/A        Y       1496
NFS Server on 10.90.3.14                          2049      0          Y       3999
Self-heal Daemon on 10.90.3.14                    N/A       N/A        Y       4008

Task Status of Volume marketplace_uploads
------------------------------------------------------------------------------
There are no active volume tasks

It has happened again. We shut down some hosts and the heal info started completing. You can see that the host with the high load is reporting:

Brick 10.90.5.105:/data/data0/marketplace_nfs
Status: Transport endpoint is not connected
Number of entries: -

I will attach another statedump from this node.
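Since heal info reports "Transport endpoint is not connected" for a brick that volume status shows as online, it might be worth checking what each brick actually has connected. A sketch using the standard status sub-commands (offered as a suggestion, not something run in this report):

# List the clients connected to each brick of the volume
$ sudo gluster volume status marketplace_nfs clients

# Check the self-heal daemon processes for the volume
$ sudo gluster volume status marketplace_nfs shd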
Created attachment 1229816 [details]
statedump from gluster node with high load
Further to this: these errors appear in the log on one node - always the same node. Thousands of these errors:

[2016-12-14 12:11:26.815832] I [MSGID: 115072] [server-rpc-fops.c:1640:server_setattr_cbk] 0-marketplace_nfs-server: 458301: SETATTR /ftpdata/<removed>/60_VW50aXRsZWQxMQ.zip (c0196410-246a-4de0-ab18-386e13db088c) ==> (Operation not permitted) [Operation not permitted]
[2016-12-14 12:11:30.196858] I [MSGID: 115072] [server-rpc-fops.c:1640:server_setattr_cbk] 0-marketplace_nfs-server: 68073: SETATTR /ftpdata/<removed>/283_TmVzdGVkIFNlcXVlbmNlIDk1XzE.zip (3fc3f663-0480-41be-b448-b7a3373e6b5d) ==> (Operation not permitted) [Operation not permitted]
[2016-12-14 12:11:30.677535] I [MSGID: 115072] [server-rpc-fops.c:1640:server_setattr_cbk] 0-marketplace_nfs-server: 458326: SETATTR /ftpdata/<removed>/uhd_1748_MTAyMF9XYXRhXzRLX19fMDFfbHV0.zip (0d00c1d1-4598-4789-89e1-723325bb92dc) ==> (Operation not permitted) [Operation not permitted]

These disappear if metadata healing is turned off. The only way to get healing to complete is to umount or halt the client systems.
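For reference, assuming "metadata healing" above refers to the metadata self-heal option, the toggle would look like this (standard volume-set options; re-enable once recovery completes):

# Disable metadata self-heal (the SETATTR storm stops while this is off)
$ sudo gluster volume set marketplace_nfs cluster.metadata-self-heal off

# Restore once the hang clears
$ sudo gluster volume set marketplace_nfs cluster.metadata-self-heal on

# Alternatively, stop the self-heal daemon entirely
$ sudo gluster volume set marketplace_nfs cluster.self-heal-daemon off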
This bug is being closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.