Bug 1774900
| Field | Value |
|---|---|
| Summary: | Update default options related to tcp-timeout and ping-timeout |
| Product: | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter: | Sahina Bose <sabose> |
| Component: | rhhi |
| Assignee: | Gobinda Das <godas> |
| Status: | CLOSED ERRATA |
| QA Contact: | SATHEESARAN <sasundar> |
| Severity: | high |
| Priority: | high |
| Version: | rhhiv-1.6 |
| CC: | dwalveka, godas, pasik, pkesavap, rhs-bugs |
| Target Milestone: | --- |
| Target Release: | RHHI-V 1.8 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | Bug Fix |
| Doc Text: | Previously, the detection of disconnected hosts took a long time, leading to sanlock timeouts. With this release, the socket and RPC timeouts in gluster have been improved so that disconnected hosts are detected before the sanlock timeout occurs, and a reboot of a single host does not impact virtual machines running on other hosts. |
| Story Points: | --- |
| Last Closed: | 2020-08-04 14:50:58 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Bug Depends On: | 1845064, 1848899 |
| Bug Blocks: | 1779977 |
Description

Sahina Bose 2019-11-21 09:07:44 UTC

See bug 1719140 for details. Do we need to change the default virt profile, Krutika?

(In reply to Sahina Bose from comment #1)
> See bug 1719140 for details. Do we need to change the default virt profile,
> Krutika?

Sure, but the volume-set operation alone won't guarantee the changes have been applied; all gluster processes need to be restarted afterwards. Maybe it would be a better idea to set these at deployment time itself?

-Krutika

Dependent bug is in POST state.

Verified with RHGS 3.5.2-async (glusterfs-6.0-37.1). The new volume options are added to volumes when optimizing for the virt store use case.

Newly seen volume options:

```
network.ping-timeout: 30
server.tcp-user-timeout: 20
server.keepalive-time: 10
server.keepalive-interval: 2
server.keepalive-count: 5
```

```
[root@ ~]# gluster volume info engine

Volume Name: engine
Type: Replicate
Volume ID: 6d415e41-b2b3-4199-9cc0-dcc7774e2d94
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: host1.example.com:/gluster_bricks/engine/engine
Brick2: host2.example.com:/gluster_bricks/engine/engine
Brick3: host3.example.com:/gluster_bricks/engine/engine
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
network.ping-timeout: 30
server.tcp-user-timeout: 20
server.keepalive-time: 10
server.keepalive-interval: 2
server.keepalive-count: 5
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable
```

One remaining problem in this context: although the gluster profile sets ping-timeout to 20, RHV Manager sets it to 30. This needs to be tracked in a separate bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHHI for Virtualization 1.8 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3314
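The keepalive and timeout values verified above imply a bounded window for detecting a silently dead peer. The following is a back-of-the-envelope sketch of that arithmetic (plain Python, not gluster code; actual kernel TCP keepalive behavior can vary with platform and connection state):

```python
# Values taken from the verified volume options above.
KEEPALIVE_TIME = 10      # server.keepalive-time: idle seconds before the first probe
KEEPALIVE_INTERVAL = 2   # server.keepalive-interval: seconds between probes
KEEPALIVE_COUNT = 5      # server.keepalive-count: unanswered probes before the
                         # connection is declared dead
TCP_USER_TIMEOUT = 20    # server.tcp-user-timeout: cap on unacknowledged data (seconds)
PING_TIMEOUT = 30        # network.ping-timeout: gluster RPC-level ping timeout

# Worst-case TCP-level detection of a dead peer on an idle connection:
# wait for the idle period, then exhaust all unanswered probes.
keepalive_detect = KEEPALIVE_TIME + KEEPALIVE_INTERVAL * KEEPALIVE_COUNT
print(keepalive_detect)  # 20 seconds

# Both TCP-level mechanisms (keepalive and tcp-user-timeout) resolve within
# the 30-second RPC ping timeout, so a disconnected host is detected at the
# socket layer before the higher-level timeouts fire.
assert keepalive_detect <= PING_TIMEOUT
assert TCP_USER_TIMEOUT <= PING_TIMEOUT
```

This illustrates why the reported failure mode goes away: the socket-level timeouts now expire well before the sanlock timeout window, so a rebooting host is evicted quickly instead of stalling I/O for the other hosts.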