Description of problem:
In extensive reboot tests conducted by a customer, it was seen that a node reboot in a 3-node environment causes sanlock lease renewal to time out due to latency seen with storage. On analysis, this was found to be caused by the time taken for the client/gluster to detect that a brick has gone down. Updating these parameters helps with the issue:

# Parameter: Old value -> New value
network.ping-timeout: 30 -> 20
client.tcp-user-timeout: 0 -> 30
server.tcp-user-timeout: 0 -> 30

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Hard power reset one of the nodes in the cluster
2. Check the latency reported for sanlock lease renewal

Actual results:
Latency seen close to or greater than 80s. 80s is an issue, as sanlock would kill the processes holding the lease.

Expected results:
No issues with sanlock timeout

Additional info:
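For reference, a minimal sketch of how these values could be applied manually with the gluster CLI (assuming a volume named "engine", as in the verification output later in this bug; in practice the values are expected to come from the virt profile / deployment):

  gluster volume set engine network.ping-timeout 20
  gluster volume set engine client.tcp-user-timeout 30
  gluster volume set engine server.tcp-user-timeout 30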
See bug 1719140 for details. Do we need to change the default virt profile, Krutika?
(In reply to Sahina Bose from comment #1)
> See bug 1719140 for details. Do we need to change the default virt profile,
> Krutika?

Sure, but the volume-set operation alone won't guarantee the changes have been applied. All gluster processes need to be restarted after this. Maybe it's a better idea to set these at the time of deployment itself?

-Krutika
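(For illustration only: a rough sketch of what restarting the gluster processes for one volume could look like, assuming a maintenance window and a volume named "engine"; the client mount point shown is hypothetical and would differ per deployment.)

  gluster volume stop engine
  gluster volume start engine
  # on each client, remount so the new TCP timeouts take effect
  umount /mnt/engine
  mount -t glusterfs host1.example.com:/engine /mnt/engine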
Dependent bug is in POST state
Verified with RHGS 3.5.2-async (glusterfs-6.0-37.1).

New volume options are added to the volumes when optimizing for the virt store use case. Newly seen volume options are:

network.ping-timeout: 30
server.tcp-user-timeout: 20
server.keepalive-time: 10
server.keepalive-interval: 2
server.keepalive-count: 5

[root@ ~]# gluster volume info engine

Volume Name: engine
Type: Replicate
Volume ID: 6d415e41-b2b3-4199-9cc0-dcc7774e2d94
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: host1.example.com:/gluster_bricks/engine/engine
Brick2: host2.example.com:/gluster_bricks/engine/engine
Brick3: host3.example.com:/gluster_bricks/engine/engine
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
network.ping-timeout: 30
server.tcp-user-timeout: 20
server.keepalive-time: 10
server.keepalive-interval: 2
server.keepalive-count: 5
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
cluster.granular-entry-heal: enable

One problem in this context: although the gluster virt profile sets ping-timeout to 20, RHV Manager sets it to 30. This needs to be tracked in a separate bug.
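(As a quick cross-check of the individual values, a sketch using the standard gluster CLI; the option names are the ones reported above:)

  gluster volume get engine network.ping-timeout
  gluster volume get engine server.tcp-user-timeout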
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHHI for Virtualization 1.8 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:3314