1648210 – Bumping up of op-version times out on a scaled system with ~1200 volumes

Summary: Bumping up of op-version times out on a scaled system with ~1200 volumes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	cns-3.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.4.z Batch Update 2
Assignee:	Atin Mukherjee
QA Contact:	Bala Konda Reddy M
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1648237

Reported:	2018-11-09 07:03 UTC by krishnaram Karthick
Modified:	2019-05-29 08:43 UTC
CC List:	14 users
Fixed In Version:	glusterfs-3.12.2-28
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1648237
Environment:
Last Closed:	2018-12-17 17:07:27 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1651645	0	high	CLOSED	cluster.op-version should be set with a higher timeout value	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2018:3827	0	None	None	None	2018-12-17 17:07:35 UTC

Internal Links: 1651645

Description krishnaram Karthick 2018-11-09 07:03:27 UTC

Description of problem:
While trying to bump up the op-version after upgrading a CNS setup from 3.9 to 3.10, the operation timed out.

sh-4.2# gluster volume set all cluster.op-version 31302
Error : Request timed out

However, after a few minutes the op-version had been bumped up, so the command timing out has no functional impact.

sh-4.2# gluster vol get all all
Option                                  Value                                   
------                                  -----                                   
cluster.server-quorum-ratio             51                                      
cluster.enable-shared-storage           disable                                 
cluster.op-version                      31302                                   
cluster.max-op-version                  31302                                   
cluster.brick-multiplex                 on                                      
cluster.max-bricks-per-process          0                                       
cluster.daemon-log-level                INFO
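The reporter confirmed the bump by reading the option back after the timeout. A minimal bash sketch of that check, assuming it runs on a cluster node with the `gluster` CLI on PATH, polls `gluster volume get all cluster.op-version` until the expected value appears. The function names and the retry defaults (30 attempts, 10 seconds apart) are hypothetical choices for illustration, not part of gluster.

```shell
#!/usr/bin/env bash

# Print the value of cluster.op-version from `gluster volume get` output on stdin.
# Matches the option name exactly, so cluster.max-op-version is ignored.
op_version_from_output() {
    awk '$1 == "cluster.op-version" { print $2 }'
}

# Poll until cluster.op-version equals the target, or give up after N attempts.
# Hypothetical helper; assumes the gluster CLI is available on this node.
wait_for_op_version() {
    local target=$1 attempts=${2:-30} interval=${3:-10} current
    while :; do
        # If the get fails (e.g. while the bump is still in progress),
        # current is empty and we simply retry.
        current=$(gluster volume get all cluster.op-version | op_version_from_output)
        if [ "$current" = "$target" ]; then
            echo "cluster.op-version is $current"
            return 0
        fi
        attempts=$((attempts - 1))
        [ "$attempts" -gt 0 ] || break
        sleep "$interval"
    done
    echo "cluster.op-version is ${current:-unknown}, expected $target" >&2
    return 1
}
```

For example, after `gluster volume set all cluster.op-version 31302` reports `Error : Request timed out`, running `wait_for_op_version 31302` waits for the value to settle. Because the pending transaction may hold up other management commands, the `get` itself can fail for a while; the loop treats that as "not yet" rather than as an error.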

Version-Release number of selected component (if applicable):
glusterfs-3.12.2-18.el7rhgs.x86_64

# glusterd --version
glusterfs 3.12.2

How reproducible:
1/1

Steps to Reproduce:
1. Create a CNS setup with ~1200 volumes.
2. Upgrade from CNS 3.9 to OCS 3.10.
3. After the upgrade, bump up the op-version by running 'gluster volume set all cluster.op-version 31302'.

Actual results:
The command timed out with "Error : Request timed out".

Expected results:
Setting the op-version should succeed and the command should return success.

Additional info:

Comment 2 Atin Mukherjee 2018-11-09 08:23:49 UTC

Root cause: on a setup with 1200 volumes, the code path restarts all the shd (self-heal daemon) services 1200 times, once per volume, which is overkill. Restarting all the per-node daemons only once should be sufficient.

upstream patch : https://review.gluster.org/#/c/glusterfs/+/21608/
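To make the scaling problem in the root cause concrete, here is a toy bash model of the two behaviours. It only counts calls to a hypothetical `restart_node_daemons` stand-in; it does not touch glusterd, and it is not how the upstream C patch is written. Restarting inside the per-volume loop costs one restart per volume, while recording that a restart is needed and doing it once after the loop costs a single restart regardless of the volume count.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for restarting the per-node daemons (e.g. shd).
# It only counts invocations; no real service is touched.
restart_count=0
restart_node_daemons() {
    restart_count=$((restart_count + 1))
}

# Behaviour described in the root cause: a restart for every volume visited.
apply_op_version_per_volume() {
    local vol
    restart_count=0
    for vol in "$@"; do
        # ... per-volume op-version handling would go here ...
        restart_node_daemons
    done
    echo "$restart_count"
}

# Behaviour the comment proposes: note that a restart is needed,
# then restart the per-node daemons once after walking all volumes.
apply_op_version_once() {
    local vol need_restart=0
    restart_count=0
    for vol in "$@"; do
        # ... per-volume op-version handling would go here ...
        need_restart=1
    done
    if [ "$need_restart" -eq 1 ]; then
        restart_node_daemons
    fi
    echo "$restart_count"
}
```

Called with 1200 volume names, `apply_op_version_per_volume` prints 1200 and `apply_op_version_once` prints 1.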

Comment 3 Atin Mukherjee 2018-11-09 17:46:02 UTC

I'm bumping up the severity to high: even though the transaction eventually completes, it blocks the other transactions in the queue, which means all gluster management commands are stuck while cluster.op-version is being bumped up.

Comment 32 errata-xmlrpc 2018-12-17 17:07:27 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3827