Bug 1728184
Summary: | Task 'Check for GlusterFS cluster health' takes more than one and a half hours for a single retry when upgrading OCS with the upgrade playbook | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Arun Kumar <arukumar> |
Component: | cns-ansible | Assignee: | John Mulligan <jmulligan> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Arun Kumar <arukumar> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | | |
Version: | ocs-3.11 | CC: | dpivonka, hchiramm, knarra, kramdoss, madam, rhs-bugs, rtalur, sarumuga |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | OCS 3.11.z Batch Update 4 | | |
Hardware: | Unspecified | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | openshift-ansible-3.11.153-1 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-02-13 05:22:21 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1703695 | | |
Comment 10
Saravanakumar
2019-07-24 08:15:15 UTC
PR merged.

When the openshift_storage_glusterfs_health_timeout parameter is set to 3, or any other number, in the inventory file, the playbook fails after the first attempt only. The playbook does not go through 30 attempts, or the number of attempts specified. e.g. 'openshift_storage_glusterfs_health_timeout=3' (see the inventory sketch after the log below).

The snippet of ansible logs
==============================================
2019-08-12 15:03:33,516 p=67311 u=root | TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health] ****************************************
2019-08-12 15:03:33,516 p=67311 u=root | task path: /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml:4
2019-08-12 15:03:33,905 p=67311 u=root | Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2019-08-12 15:03:35,512 p=67311 u=root | fatal: [master -> master]: FAILED! => { "attempts": 1, "changed": false, "invocation": { "module_args": { "check_bricks": true, "cluster_name": "storage", "exclude_node": "master", "oc_bin": "oc", "oc_conf": "/etc/origin/master/admin.kubeconfig", "oc_namespace": "arun-glusterfs", "target_nodes": null } }, "msg": "volume heketidbstorage is not ready", "state": "unknown" }
2019-08-12 15:03:35,518 p=67311 u=root | to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/upgrade.retry
2019-08-12 15:03:35,518 p=67311 u=root | PLAY RECAP ****************************************
2019-08-12 15:03:35,519 p=67311 u=root | gluster1 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster2 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster3 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster4 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster5 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | infra : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | localhost : ok=12 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | master : ok=53 changed=0 unreachable=0 failed=1
2019-08-12 15:03:35,520 p=67311 u=root | registry1 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | registry2 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | registry3 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,521 p=67311 u=root | INSTALLER STATUS ****************************************
2019-08-12 15:03:35,523 p=67311 u=root | Initialization : Complete (0:01:21)
2019-08-12 15:03:35,524 p=67311 u=root | GlusterFS Upgrade : In Progress (0:00:06)
2019-08-12 15:03:35,524 p=67311 u=root | This phase can be restarted by running: playbooks/openshift-glusterfs/upgrade.yml

Thanks
Arun
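For context, here is a minimal sketch of where such an override typically sits in an openshift-ansible inventory file. Only openshift_storage_glusterfs_health_timeout=3 and the arun-glusterfs namespace are taken from the report above; the [OSEv3:vars] group placement and the namespace variable are standard-inventory assumptions, not a copy of the reporter's actual inventory.

```ini
# Hypothetical inventory snippet (not the reporter's real file).
[OSEv3:vars]
# Namespace seen in the module_args of the log above.
openshift_storage_glusterfs_namespace=arun-glusterfs
# Health-timeout value quoted in the comment above.
openshift_storage_glusterfs_health_timeout=3
```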
PR above is merged. Fixed In Version updated.

I have investigated why this check takes so long myself. It depends on the number of volumes in the cluster, as the check linearly goes through and runs 'gluster volume heal info' on each volume, as Saravanakumar stated before. Each check takes roughly ten seconds, so at 1000 volumes that will take roughly 2.5 hours. There are a few other things that are checked and times may vary a little, but it roughly adds up to the time you are experiencing. So there is no easy way to improve the time it takes to run this health check; improving the 'gluster volume heal info' time would be a gluster issue. I will attempt to add a warning that runs before the health check to inform the user "this will take roughly 10*number_of_volumes seconds to complete" (see the sketch below), but that is all that can be done here from an openshift-ansible perspective.

As for the number of retries, that has been addressed in this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1726608

PR: https://github.com/openshift/openshift-ansible/pull/11945 (this adds the warning message)

PR merged
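To make the arithmetic above concrete: 10 seconds per volume times 1000 volumes is about 10,000 seconds, i.e. roughly 2.5 to 3 hours. The following is a minimal, hypothetical Python sketch of that linear per-volume pattern and of an up-front time warning. It is not the actual glusterfs_check_containerized.py code; the oc/gluster command shapes and the pod name are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Rough sketch only: shows why a per-volume health check scales linearly
(about 10 s * number_of_volumes). Command shapes and pod name are assumed."""
import subprocess


def list_volumes(pod: str) -> list:
    # Assumed command shape for listing volumes inside a glusterfs pod.
    result = subprocess.run(
        ["oc", "exec", pod, "--", "gluster", "volume", "list"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def check_cluster_health(pod: str) -> None:
    volumes = list_volumes(pod)
    # Up-front warning, in the spirit of the merged PR: total run time is
    # roughly 10 seconds multiplied by the number of volumes.
    print(f"Health check will take roughly {10 * len(volumes)} seconds "
          f"for {len(volumes)} volumes.")
    for vol in volumes:
        # Each heal-info call costs ~10 s, so 1000 volumes adds up to hours.
        subprocess.run(
            ["oc", "exec", pod, "--", "gluster", "volume", "heal", vol, "info"],
            capture_output=True, text=True, check=True,
        )


if __name__ == "__main__":
    check_cluster_health("glusterfs-storage-example")  # hypothetical pod name
```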