Bug 1406322

Summary: Repeated "operation failed" warnings in gluster mount logs with disperse volume

Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Manoj Pillai <mpillai>
Component: disperse
Assignee: Ashish Pandey <aspandey>
Status: CLOSED ERRATA
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: unspecified
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, asoman, mpillai, pkarampu, rcyriac, rgowdapp, rhs-bugs, rsussman, storage-qa-internal
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.8.4-13
Doc Type: If docs needed, set a value
Clones: 1414287, 1426559
Last Closed: 2017-03-23 05:58:31 UTC
Type: Bug
Bug Depends On: 1414287, 1415082
Bug Blocks: 1351528

Description Manoj Pillai 2016-12-20 09:30:16 UTC
Description of problem:

Tests in the Scale Lab targeting support for 36 drives per server in JBOD mode are showing repeated warnings in the gluster mount logs when running on a distributed-disperse volume.

From: /var/log/glusterfs/<mount-point>.log:
<excerpt>
[2016-12-20 06:12:09.768372] I [fuse-bridge.c:5241:fuse_graph_setup] 0-fuse: switched to graph 0
[2016-12-20 06:12:09.768423] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-perfvol-client-215: Server lk version = 1
[2016-12-20 06:12:09.768528] I [fuse-bridge.c:4153:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2016-12-20 06:23:44.204340] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-perfvol-disperse-4: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-12-20 06:23:44.213433] I [MSGID: 122058] [ec-heal.c:2361:ec_heal_do] 0-perfvol-disperse-4: /smf1/file_srcdir/h17-priv/thrd_05/d_001/d_002/d_009: name heal successful on 3F
[2016-12-20 06:28:30.254713] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-perfvol-disperse-19: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=3C, bad=3)
[2016-12-20 06:28:30.256043] I [MSGID: 122058] [ec-heal.c:2361:ec_heal_do] 0-perfvol-disperse-19: /smf1/file_srcdir/h17-priv/thrd_07/d_001/d_007/d_002: name heal successful on 3F
[2016-12-20 06:39:22.749543] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-perfvol-disperse-13: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
[2016-12-20 06:39:22.751316] I [MSGID: 122058] [ec-heal.c:2361:ec_heal_do] 0-perfvol-disperse-13: /smf1/file_srcdir/h17-priv/thrd_06/d_002/d_008/d_002: name heal successful on 3F
</excerpt>
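A quick way to see how often each disperse subvolume logs this warning is to count the MSGID 122053 lines per translator name. A minimal sketch, assuming the same log path as above (field 7 of these log lines is the subvolume name):

# Count "Operation failed on some subvolumes" warnings per disperse subvolume
grep "Operation failed on some subvolumes" /var/log/glusterfs/<mount-point>.log | \
    cut -d" " -f 7 | sort | uniq -c | sort -rn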

One of the concerns with the 36-drive JBOD mode is the large number of brick processes and the memory and CPU pressure they create. The warnings above seem to cause the self-heal process to become active, which in turn adds to the stress on the system:

<top output excerpt>
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
[...]
17676 root      20   0 17.661g 1.015g   3236 S  0.0  0.4   0:00.00 glusterfs
</top output excerpt>
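To check whether these warnings line up with entries pending heal on the disperse subvolumes, the standard heal-info command can be used. A sketch, assuming the volume name from the logs above:

# Summarize bricks and their pending-heal entry counts
gluster volume heal perfvol info | grep -E "^(Brick|Number of entries)"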

Version-Release number of selected component (if applicable):
glusterfs-*-3.8.4-8.el7rhgs.x86_64
kernel-3.10.0-514.el7.x86_64 (RHEL 7.3)

How reproducible:

The Scale Lab servers being used have 36 SATA drives, 40GbE networking, and 256GB RAM each. The servers also act as clients to run the performance benchmarks. There are 6 servers hosting a 36x(4+2) distributed-disperse gluster volume. In this configuration, the problem is consistently reproducible with the smallfile benchmark.

Steps to Reproduce:
The warnings seen above are from a smallfile run:
smallfile_cli.py --top ${top_dir} --host-set ${hosts_str} --threads 8 --files 32768 --file-size 32 --record-size 32 --fsync Y --response-times N --operation create

Additional info:

Volume options changed:
cluster.lookup-optimize: on
client.event-threads: 8
performance.client-io-threads: on
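
For reference, a volume of the shape described above (36x(4+2) across 6 servers) could be created and tuned roughly as follows. The hostnames and brick paths are placeholders, not the actual lab configuration:

# Build the brick list so that each (4+2) disperse set spans all 6 servers
bricks=""
for d in $(seq -w 1 36); do            # one disperse set per drive index
    for h in h1 h2 h3 h4 h5 h6; do     # placeholder hostnames
        bricks="$bricks ${h}:/bricks/d${d}/brick"
    done
done

gluster volume create perfvol disperse 6 redundancy 2 $bricks
gluster volume set perfvol cluster.lookup-optimize on
gluster volume set perfvol client.event-threads 8
gluster volume set perfvol performance.client-io-threads on
gluster volume start perfvol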

More Additional info:
The situation is much worse in tests with the more complex SPECsfs2014 VDA benchmark, which generates a lot more data and adds to memory pressure from page-cache usage. I'll add info from those runs later; briefly, we see a steady stream of "operation failed" warnings in the gluster mount log there.

Comment 2 Manoj Pillai 2016-12-23 04:30:20 UTC
Update on this bz: I was able to reproduce this issue on our BAGL setup -- 12-drives per server, 10GbE. So this is not specific to the Scale Lab setup and 36-drive config I am currently working with.

I will send an email to Pranith and Nag with access details to the BAGL setup so they can proceed with RCA. I need to continue with the Scale Lab work.

Comment 3 Manoj Pillai 2016-12-23 04:32:51 UTC
Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1358606.

I am wondering if a fix for the above bz would help here as well.

Comment 4 Pranith Kumar K 2016-12-23 10:10:32 UTC
We see a lot of disconnects based on the following logs:
erd until brick's port is available
[2016-12-22 10:11:00.615628] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-40: disconnected from perfvol-client-40. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615675] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-41: disconnected from perfvol-client-41. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615777] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-42: disconnected from perfvol-client-42. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615804] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-43: disconnected from perfvol-client-43. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615824] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-44: disconnected from perfvol-client-44. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615844] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-45: disconnected from perfvol-client-45. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615875] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-46: disconnected from perfvol-client-46. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615896] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-47: disconnected from perfvol-client-47. Client process will keep trying to connect to glusterd until brick's port is available

I also checked /var/log/messages to see if there were any problems; it looked fine. We need to find the root cause of why the disconnections are happening.

Comment 5 Atin Mukherjee 2016-12-23 10:26:28 UTC
Manoj - could you specify the exact build version with which you hit the issue?

Comment 6 Manoj Pillai 2016-12-23 11:16:55 UTC
(In reply to Pranith Kumar K from comment #4)
> We see a lot of disconnects based on the following logs:
> erd until brick's port is available
> [2016-12-22 10:11:00.615628] I [MSGID: 114018]
[...]
> [2016-12-22 10:11:00.615896] I [MSGID: 114018]
> [client.c:2280:client_rpc_notify] 0-perfvol-client-47: disconnected from
> perfvol-client-47. Client process will keep trying to connect to glusterd
> until brick's port is available
> 
> I also checked /var/log/messages to find if there are any problems it looked
> fine. We need to find RC why the disconnections are happening.

These BAGL systems had an older upstream gluster version on them. For ease of analysis, I put them on a recent downstream version today (12-23). Please look at the logs from today; I don't see any such disconnect messages in them.

The disconnects might be because of tear-down of an existing volume rather than anything else.

Comment 7 Manoj Pillai 2016-12-23 11:29:41 UTC
(In reply to Atin Mukherjee from comment #5)
> Manoj - could you specify the exact build version with which you hit the
> issue?

glusterfs-api-3.8.4-8.el7rhgs.x86_64
glusterfs-3.8.4-8.el7rhgs.x86_64
glusterfs-server-3.8.4-8.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-8.el7rhgs.x86_64
glusterfs-cli-3.8.4-8.el7rhgs.x86_64
glusterfs-libs-3.8.4-8.el7rhgs.x86_64
glusterfs-fuse-3.8.4-8.el7rhgs.x86_64

Comment 8 Pranith Kumar K 2016-12-23 11:41:48 UTC
Changing back the component as I was not looking at the correct logs.

Comment 9 Raghavendra G 2016-12-23 11:45:02 UTC
Some data from logs:

* The number of disconnects were 240
[root@gprfc069 ~]# grep "disconnected from" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 |  sort |  wc -l
240

* Of which 144 were from 1st graph
[root@gprfc069 ~]# grep "disconnected from" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "0-" | sort |  wc -l
144

* and 96 were from 3rd graph
[root@gprfc069 ~]# grep "disconnected from" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "2-" | sort |  wc -l
96

(where did 2nd graph go?)

Some other interesting facts (I manually verified these by looking through the logs as well):

* there were 3 disconnects per transport in the 1st graph (144/48 = 3)
* there were 2 disconnects per transport in the 3rd graph (96/48 = 2)

Of these, 1 disconnect per transport can be attributed to the graph switch.

[root@gprfc069 ~]# grep "current graph is no longer active" /var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 | grep "0-" | wc -l
48

[root@gprfc069 ~]# grep "current graph is no longer active" /var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 | grep "2-" | wc -l
48

There were a bunch of messages indicating that the client was not able to get the remote brick's port number:

[root@gprfc069 ~]# grep "failed to get the port number for" /var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 |sort| grep "0-" | wc -l
48

[root@gprfc069 ~]# grep "failed to get the port number for" /var/log/glusterfs/mnt-glustervol.log-20161223  | cut -d" " -f 7 |sort| grep "2-" | wc -l
48

Note that we disconnect when we cannot get the port number and then start a fresh connection sequence. So, 1 disconnect per transport in both graphs can be attributed to port-map failure.
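
A per-transport breakdown of the disconnects (rather than totals) can be obtained along the same lines; a sketch using the same log file:

# Count disconnect messages per client translator (field 7, e.g. "0-perfvol-client-40:")
grep "disconnected from" /var/log/glusterfs/mnt-glustervol.log-20161223 | \
    cut -d" " -f 7 | sort | uniq -c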

Comment 10 Pranith Kumar K 2016-12-23 11:46:49 UTC
(In reply to Raghavendra G from comment #9)
> Some data from logs:
> 
> * The number of disconnects were 240
> * Of which 144 were from 1st graph
> * and 96 were from 3rd graph
> 
[...]
> 
> Note that we disconnect when we cannot get the port number and then start a
> fresh connection sequence. So, 1 disconnect per transport in both graphs can
> be attributed to port-map failure.

Raghavendra,
     This is not an RPC issue: https://bugzilla.redhat.com/show_bug.cgi?id=1406322#c6

We can't believe old logs :-).

Pranith

Comment 11 Raghavendra G 2016-12-23 11:51:51 UTC
> Raghavendra,
>      This is not an RPC issue:
> https://bugzilla.redhat.com/show_bug.cgi?id=1406322#c6
> 
> We can't believe old logs :-).

I know. Just posted other evidence to corroborate comment 6 :).

> 
> Pranith

Comment 14 Ashish Pandey 2017-01-18 07:27:10 UTC
I reproduced this issue easily on my laptop.

What I have been seeing is that while files are being created and written on the mount point, some lookups are also happening. These lookups happen without locks.

At this point, if a parallel version or size update is in progress, there is a possibility that some bricks will end up with different xattrs. Now ec_check_status, called by ec_look->ec_complete, will trigger this message logging.
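
One way to observe such divergence directly is to dump the erasure-coding xattrs of the same file from every brick of the affected disperse set and compare them. A sketch with placeholder brick and file paths (trusted.ec.version and trusted.ec.size are among the xattrs the disperse translator maintains):

# Run as root on each server hosting a brick of the disperse set
for b in /bricks/d01/brick /bricks/d02/brick; do    # placeholder brick paths
    echo "== $b =="
    getfattr -d -m 'trusted.ec' -e hex "$b/path/to/file"
done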

Comment 16 Pranith Kumar K 2017-01-23 10:14:29 UTC
https://code.engineering.redhat.com/gerrit/95854/

Comment 18 Nag Pavan Chilakam 2017-02-07 10:44:52 UTC
I have run I/O on the fuse mount in the following ways:
1) same client with 5 terminals pumping small-file I/O (kernel untar)
2) different clients pumping small-file I/O (kernel untar)

In both cases I didn't see any such warnings as long as all bricks were online and pingable from all clients.

Hence moving to verified.

Test build: 3.8.4-13
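
For reference, a kernel-untar workload of the kind described above can be driven with something like the following; the mount path and tarball are examples, not the exact commands used in verification:

# Untar a kernel source tree into 5 directories on the fuse mount in parallel
for i in 1 2 3 4 5; do
    mkdir -p /mnt/perfvol/untar_$i
    tar -xf /tmp/linux-4.9.tar.xz -C /mnt/perfvol/untar_$i &
done
wait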

Comment 23 errata-xmlrpc 2017-03-23 05:58:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html