Bug 1408158
Summary: | I/O is paused for at least one and a half minutes when a node hosting an EC volume in the cluster goes down. | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Byreddy <bsrirama> |
Component: | rpc | Assignee: | Raghavendra G <rgowdapp> |
Status: | CLOSED ERRATA | QA Contact: | Sri Vignesh Selvan <sselvan> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | rhgs-3.2 | CC: | amukherj, aspandey, mchangir, nchilaka, rgowdapp, rhs-bugs, sheggodu, storage-qa-internal |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | RHGS 3.4.0 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | rebase | ||
Fixed In Version: | glusterfs-3.12.2-1 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-09-04 06:29:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1408354 | ||
Bug Blocks: | 1503134 |
Description
Byreddy
2016-12-22 10:37:33 UTC
This issue can be seen in replica volumes too. I followed the same steps at the same time, and both the AFR and EC volumes paused for 45 seconds. I think this is an issue with rpc, where it takes some time to detect the network failure caused by the node going down; if we kill the bricks instead, it works fine. Raghavendra, can we move this to rpc?

A possible RCA can be found at https://bugzilla.redhat.com/show_bug.cgi?id=1408354#c26; however, this still needs to be confirmed. Duplicate of bz 1408354. A backport of https://review.gluster.org/16731 is needed to fix the issue.

Build version:
---------------
glusterfs-3.12.2-13.el7rhgs.x86_64

I still see a 40-100 second pause in I/O when the node is halted/rebooted. There is also some pause in I/O even after the brick comes back up (heals are either in progress or pending). Due to the above, I may have to FAILQA the bug.

Tested with the same volume configuration and settings as mentioned in c#2: it took about 60-100 seconds to resume I/O on a node reboot. To validate this fix, as suggested by Dev, I reran the test after changing the tunables mentioned below.

Test#1:
1. transport.tcp-user-timeout set to a small value like 10.
2. transport.socket.keepalive-time set to a small value like 5.

When a node is rebooted, I/O now pauses for 20-30 seconds.

Test#2:
client.tcp-user-timeout 5
gluster v set vol server.tcp-user-timeout 3

Now, when a node is rebooted, I/O pauses for under 10 seconds, which is considerably less than before. The lower the values, the less time it takes for I/O to resume from the paused state.
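For reference, the tuning steps from Test#1 and Test#2 can be collected into one CLI sequence. This is a sketch, not a recommendation: the volume name `vol` is taken from the commands quoted in the thread, and the values are the aggressive test values, which the thread notes can cause spurious disconnects on unreliable networks.

```shell
# Aggressive failure-detection tunables from Test#1/Test#2 above.
# Volume name "vol" is as used in the thread; values are the test
# values, not defaults -- on a flaky network these can cause
# spurious disconnects and loss of service.
gluster volume set vol transport.socket.keepalive-time 5
gluster volume set vol client.tcp-user-timeout 5
gluster volume set vol server.tcp-user-timeout 3
```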
Now, Milind, can you confirm whether we should move this bug to verified based on the above findings and the questions below:

1) What is the impact of these tunables if a customer sets them for all of his volumes? Is there a chance for something else to break, or any other problem?

2) If we don't foresee any problems with these tunables, and given that they are not the defaults, shouldn't we make them the defaults? A customer can't be expected to change them before every node reboot just to avoid this problem.

Based on the answers to the above two questions, we can decide on moving this bug to verified.

(In reply to Sri Vignesh Selvan from comment #22)
> Tested with the same volume configuration and settings as mentioned in c#2.
>
> It took about 60-100 seconds to resume I/O on a node reboot.
>
> To validate this fix, as suggested by Dev, I reran the test after changing
> the tunables mentioned below.
>
> Test#1:
> 1. transport.tcp-user-timeout set to a small value like 10.
> 2. transport.socket.keepalive-time set to a small value like 5.
>
> When a node is rebooted, I/O now pauses for 20-30 seconds.
>
> Test#2:
> client.tcp-user-timeout 5
> gluster v set vol server.tcp-user-timeout 3
>
> Now, when a node is rebooted, I/O pauses for under 10 seconds, which is
> considerably less than before. The lower the values, the less time it
> takes for I/O to resume from the paused state.
>
> Now, Milind, can you confirm whether we should move this bug to verified
> based on the above findings and the questions below:
> 1) What is the impact of these tunables if a customer sets them for all of
> his volumes? Is there a chance for something else to break, or any other
> problem?

Aggressive (lower) settings for keepalive-time and tcp-user-timeout can be used where the network is highly reliable; otherwise customers may face spurious disconnects and loss of service.
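The trade-off described above happens at the TCP layer: the tcp-user-timeout and keepalive tunables correspond to standard Linux socket options, and lowering them makes the kernel declare a peer dead sooner. A minimal Python sketch of those options on a raw socket (illustrative only, not gluster code; the values mirror the aggressive test settings above):

```python
import socket

# Plain TCP socket with aggressive failure detection, mirroring what
# lowering gluster's tcp-user-timeout/keepalive tunables does.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# TCP_USER_TIMEOUT: maximum time (in ms) transmitted data may remain
# unacknowledged before the kernel forcibly closes the connection.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10_000)  # 10 s

# Keepalive: start probing an idle connection after 5 s, probe every
# 2 s, and drop the connection after 3 unanswered probes.
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 2)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
```

With these settings a dead peer is detected within roughly 10-11 seconds instead of the kernel defaults (2+ hours of keepalive idle time), which is why the thread's Test#2 saw I/O resume in under 10 seconds — and also why, on a lossy network, the same settings tear down connections that are merely slow.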
> 2) If we don't foresee any problems with these tunables, and given that
> they are not the defaults, shouldn't we make them the defaults? A customer
> can't be expected to change them before every node reboot just to avoid
> this problem.

Any default value is relative to the network reliability and the workload running on the cluster. This needs discussion and should probably be incorporated into the gluster workload profiles that we maintain for some workloads.

> Based on the answers to the above two questions, we can decide on moving
> this bug to verified.

Based on comment #23, moving this bug to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607