Bug 1774900 - Update default options related to tcp-timeout and ping-timeout
Summary: Update default options related to tcp-timeout and ping-timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhhi
Version: rhhiv-1.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHHI-V 1.8
Assignee: Gobinda Das
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1845064 1848899
Blocks: RHHI-V-1.8-Engineering-Inflight-BZs
 
Reported: 2019-11-21 09:07 UTC by Sahina Bose
Modified: 2020-08-04 14:51 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the detection of disconnected hosts took a long time leading to sanlock timeout. With this release, the socket and rpc timeouts in gluster have been improved so that disconnected hosts are detected before sanlock timeout occurs and reboot of a single host does not impact virtual machines running on other hosts.
Clone Of:
Environment:
Last Closed: 2020-08-04 14:50:58 UTC
Embargoed:




Links:
Red Hat Product Errata RHEA-2020:3314 - last updated 2020-08-04 14:51:25 UTC

Description Sahina Bose 2019-11-21 09:07:44 UTC
Description of problem:

In extensive reboot tests conducted by a customer, it was seen that a node reboot in a 3-node environment causes sanlock lease renewal to time out due to latency seen with storage.
On analysis, this was found to be caused by the time taken for the gluster client to detect that a brick has gone down.
Updating the following parameters helps with the issue (example gluster CLI commands are shown after the list):

Parameter                  Old value -> New value
network.ping-timeout       30 -> 20
client.tcp-user-timeout    0  -> 30
server.tcp-user-timeout    0  -> 30
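
These options can be applied per volume with the gluster CLI. A minimal sketch, assuming a volume named "engine" (the volume name and hostname are placeholders; per comment 2, a volume-set alone may not be enough and the gluster processes may need a restart to pick the values up):

# Apply the revised timeouts to one volume at a time:
[root@host1 ~]# gluster volume set engine network.ping-timeout 20
[root@host1 ~]# gluster volume set engine client.tcp-user-timeout 30
[root@host1 ~]# gluster volume set engine server.tcp-user-timeout 30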


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Hard power reset one of the nodes in the cluster
2. Check the latency reported for sanlock lease renewal (one possible check is sketched below)
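
A hedged way to check the renewal latency on the surviving hosts (the log path and message strings are assumptions and may vary with the sanlock version):

# Look for delayed or failed lease renewals in the sanlock log:
[root@host1 ~]# grep -E 'renewal error|check_our_lease' /var/log/sanlock.log

# Inspect the overall lockspace state:
[root@host1 ~]# sanlock client status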


Actual results:
Lease renewal latency is seen close to or greater than 80s. 80s is an issue, as sanlock then kills the processes holding the lease.

Expected results:
No issues with sanlock timeout

Additional info:

Comment 1 Sahina Bose 2019-11-21 09:11:21 UTC
See bug 1719140 for details. Do we need to change the default virt profile, Krutika?

Comment 2 Krutika Dhananjay 2019-11-21 10:53:49 UTC
(In reply to Sahina Bose from comment #1)
> See bug 1719140 for details. Do we need to change the default virt profile,
> Krutika?

Sure, but the volume-set operation alone won't guarantee that the changes have been applied; all gluster processes need to be restarted afterwards (a possible rolling restart is sketched after this comment).
Maybe it's a better idea to set these at the time of deployment itself?

-Krutika
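
One possible per-node rolling restart so that the brick processes pick up the new timeouts. This is an illustrative sketch only, assuming a replica 3 volume named "engine"; the exact procedure should follow the RHGS administration guide for the installed version, and heals must complete before moving to the next node:

# 1. Confirm there are no pending heals for the volume:
[root@host1 ~]# gluster volume heal engine info

# 2. Restart the gluster processes on this node only (glusterd respawns the
#    bricks of started volumes when it comes back up):
[root@host1 ~]# systemctl stop glusterd
[root@host1 ~]# pkill glusterfsd
[root@host1 ~]# systemctl start glusterd

# 3. Wait for self-heal to finish, then repeat on the next node.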

Comment 5 SATHEESARAN 2020-06-08 14:39:43 UTC
The dependent bug is in POST state.

Comment 7 SATHEESARAN 2020-07-09 09:22:20 UTC
Verified with RHGS 3.5.2-async (glusterfs-6.0-37.1)

New volume options are added to the volumes when they are optimized for the virt store use case.

The newly seen volume options are:
network.ping-timeout: 30
server.tcp-user-timeout: 20
server.keepalive-time: 10
server.keepalive-interval: 2
server.keepalive-count: 5

[root@ ~]# gluster volume info engine
 
Volume Name: engine
Type: Replicate
Volume ID: 6d415e41-b2b3-4199-9cc0-dcc7774e2d94
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: host1.example.com:/gluster_bricks/engine/engine
Brick2: host2.example.com:/gluster_bricks/engine/engine
Brick3: host3.example.com:/gluster_bricks/engine/engine
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
network.ping-timeout: 30
server.tcp-user-timeout: 20
server.keepalive-time: 10
server.keepalive-interval: 2
server.keepalive-count: 5
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable


One problem in this context: though the gluster virt profile sets ping-timeout to 20, RHV Manager sets it to 30. This needs to be tracked in a separate bug (a quick way to check the effective value is shown below).
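
A hedged check of the value that is actually in effect on a volume (the volume name "engine" is taken from the output above):

# Query the effective ping-timeout for the volume:
[root@host1 ~]# gluster volume get engine network.ping-timeout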

Comment 11 errata-xmlrpc 2020-08-04 14:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHHI for Virtualization 1.8 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3314

