Bug 1406322
| Summary: | repeated operation failed warnings in gluster mount logs with disperse volume | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Manoj Pillai <mpillai> |
| Component: | disperse | Assignee: | Ashish Pandey <aspandey> |
| Status: | CLOSED ERRATA | QA Contact: | Nag Pavan Chilakam <nchilaka> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, asoman, mpillai, pkarampu, rcyriac, rgowdapp, rhs-bugs, rsussman, storage-qa-internal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.4-13 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1414287 1426559 (view as bug list) | Environment: | |
| Last Closed: | 2017-03-23 05:58:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1414287, 1415082 | | |
| Bug Blocks: | 1351528 | | |
Description (Manoj Pillai, 2016-12-20 09:30:16 UTC)
Update on this bz: I was able to reproduce this issue on our BAGL setup (12 drives per server, 10GbE), so this is not specific to the Scale Lab setup and 36-drive config I am currently working with. I will send an email to Pranith and Nag with access details to the BAGL setup so they can proceed with RCA; I need to continue with the Scale Lab work.

Also see: https://bugzilla.redhat.com/show_bug.cgi?id=1358606. I am wondering if a fix for the above bz would help here as well.

---

Pranith Kumar K (comment #4):

We see a lot of disconnects based on the following logs:

```
[2016-12-22 10:11:00.615628] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-40: disconnected from perfvol-client-40. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615675] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-41: disconnected from perfvol-client-41. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615777] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-42: disconnected from perfvol-client-42. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615804] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-43: disconnected from perfvol-client-43. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615824] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-44: disconnected from perfvol-client-44. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615844] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-45: disconnected from perfvol-client-45. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615875] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-46: disconnected from perfvol-client-46. Client process will keep trying to connect to glusterd until brick's port is available
[2016-12-22 10:11:00.615896] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-perfvol-client-47: disconnected from perfvol-client-47. Client process will keep trying to connect to glusterd until brick's port is available
```

I also checked /var/log/messages for any problems; it looked fine. We need to find the root cause of why the disconnections are happening.

---

Atin Mukherjee (comment #5):

Manoj - could you specify the exact build version with which you hit the issue?

---

(In reply to Pranith Kumar K from comment #4)
> We see a lot of disconnects based on the following logs:
> [2016-12-22 10:11:00.615628] I [MSGID: 114018] [...]
> [2016-12-22 10:11:00.615896] I [MSGID: 114018]
> [client.c:2280:client_rpc_notify] 0-perfvol-client-47: disconnected from
> perfvol-client-47. Client process will keep trying to connect to glusterd
> until brick's port is available
>
> I also checked /var/log/messages to find if there are any problems it looked
> fine. We need to find RC why the disconnections are happening.

These BAGL systems had an older upstream gluster version on them. For ease of analysis, I put them on a recent downstream version today, i.e. 12-23. Please look at the logs from today; I don't see any such disconnect messages in them. The disconnects might be because of tear-down of an existing volume rather than anything else.
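Note that all eight disconnect lines in comment #4 fall within the same second, which is more consistent with a one-shot event (such as the volume tear-down suggested above) than with flaky transports. A minimal sketch for grouping disconnect events by second, assuming the timestamp layout shown in the excerpt above (the log path is an example; substitute your volume's mount log):

```sh
# Group "disconnected from" events by second. A burst at a single
# timestamp points at one event (volume tear-down, graph switch)
# rather than ongoing network trouble.
LOG=/var/log/glusterfs/mnt-glustervol.log   # example path
grep "disconnected from" "$LOG" | cut -c2-20 | sort | uniq -c | sort -rn
```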
---

(In reply to Atin Mukherjee from comment #5)
> Manoj - could you specify the exact build version with which you hit the
> issue?

```
glusterfs-api-3.8.4-8.el7rhgs.x86_64
glusterfs-3.8.4-8.el7rhgs.x86_64
glusterfs-server-3.8.4-8.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-8.el7rhgs.x86_64
glusterfs-cli-3.8.4-8.el7rhgs.x86_64
glusterfs-libs-3.8.4-8.el7rhgs.x86_64
glusterfs-fuse-3.8.4-8.el7rhgs.x86_64
```

---

Changing back the component, as I was not looking at the correct logs.

---

Raghavendra G (comment #9):

Some data from logs:

* The number of disconnects was 240:

```
[root@gprfc069 ~]# grep "disconnected from" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | sort | wc -l
240
```

* Of those, 144 were from the 1st graph:

```
[root@gprfc069 ~]# grep "disconnected from" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "0-" | sort | wc -l
144
```

* and 96 were from the 3rd graph:

```
[root@gprfc069 ~]# grep "disconnected from" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "2-" | sort | wc -l
96
```

(Where did the 2nd graph go?)

Some random interesting facts (I manually verified them by looking through the logs too):

* There were 3 disconnects per transport in the 1st graph: 144/48 = 3.
* There were 2 disconnects per transport in the 3rd graph: 96/48 = 2.

Among these, 1 disconnect per transport can be attributed to the graph switch:

```
[root@gprfc069 ~]# grep "current graph is no longer active" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "0-" | wc -l
48
[root@gprfc069 ~]# grep "current graph is no longer active" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | grep "2-" | wc -l
48
```

There were a bunch of messages indicating that the client was not able to get the remote brick's port number:

```
[root@gprfc069 ~]# grep "failed to get the port number for" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | sort | grep "0-" | wc -l
48
[root@gprfc069 ~]# grep "failed to get the port number for" /var/log/glusterfs/mnt-glustervol.log-20161223 | cut -d" " -f 7 | sort | grep "2-" | wc -l
48
```

Note that we disconnect when we cannot get the port number and then start a fresh connection sequence. So, 1 disconnect per transport, in both graphs, can be attributed to port-map failure.
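The same breakdown can be read off in a single pass instead of repeated grep | wc pipelines; a minimal sketch using the same log file and field position as the commands above (field 7 is the client translator name, whose "0-"/"2-" prefix identifies the graph generation):

```sh
LOG=/var/log/glusterfs/mnt-glustervol.log-20161223
# Disconnect count per transport (client translator):
grep "disconnected from" "$LOG" | cut -d" " -f 7 | sort | uniq -c | sort -rn
# Disconnect count per graph generation (prefix before the first "-"):
grep "disconnected from" "$LOG" | cut -d" " -f 7 | cut -d- -f 1 | sort | uniq -c
```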
---

(In reply to Raghavendra G from comment #9)
> Some data from logs:
> [...]
> Note that we disconnect when we cannot get port number and start a fresh
> connection sequence. So, 1 disconnect per transport for both transport can
> be attributed to port-map failure

Raghavendra,
This is not an RPC issue: https://bugzilla.redhat.com/show_bug.cgi?id=1406322#c6

We can't believe old logs :-).

Pranith

---

> Raghavendra,
> This is not an RPC issue:
> https://bugzilla.redhat.com/show_bug.cgi?id=1406322#c6
>
> We can't believe old logs :-).

I know. Just posted other evidence to corroborate comment 6 :).

---

I reproduced this issue easily on my laptop. What I have been seeing is that while files are being created and written on the mount point, some lookups are also happening. These lookups happen without locks, so if a parallel version or size update is in progress, there is a possibility that some bricks will have different xattrs at that moment. ec_check_status, called in the ec_lookup -> ec_complete path, will then trigger this message logging.

---

I have run I/O on the fuse mount in the following ways:

1) the same client, with 5 terminals pumping small-file I/O (kernel untar)
2) different clients pumping small-file I/O (kernel untar)

In both cases I didn't see any such warnings as long as all bricks were online and pingable from all clients; hence, moving to verified.

Test build: 3.8.4-13

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
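As a footnote to the RCA above: the xattr divergence it describes can be observed directly on the bricks. A minimal sketch, assuming hypothetical host names and a hypothetical brick path and file name; trusted.ec.version and trusted.ec.size are the disperse metadata xattrs whose mismatch across bricks is what the RCA says leads to the warning:

```sh
# Compare the disperse metadata xattrs for one file across all bricks.
# Hosts, brick path, and file name are hypothetical; adjust to your layout.
for host in server1 server2 server3; do
    echo "== $host =="
    ssh "$host" "getfattr -d -m . -e hex /bricks/brick1/vol/dir/file 2>/dev/null | grep '^trusted.ec'"
done
# Per the RCA, trusted.ec.version / trusted.ec.size differing between
# bricks while a lockless lookup races with an in-flight update is what
# makes ec_check_status log the repeated "operation failed" warnings.
```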