Bug 1728184
Summary: | Task 'Check for GlusterFS cluster health' takes more than one and a half hours for a single retry when upgrading OCS with the upgrade playbook | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Arun Kumar <arukumar> |
Component: | cns-ansible | Assignee: | John Mulligan <jmulligan> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Arun Kumar <arukumar> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | | |
Version: | ocs-3.11 | CC: | dpivonka, hchiramm, knarra, kramdoss, madam, rhs-bugs, rtalur, sarumuga |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | OCS 3.11.z Batch Update 4 | | |
Hardware: | Unspecified | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | openshift-ansible-3.11.153-1 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-02-13 05:22:21 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1703695 | | |
Comment 10
Saravanakumar
2019-07-24 08:15:15 UTC
PR merged.

When the openshift_storage_glusterfs_health_timeout parameter is set to 3, or any other number, in the inventory file, the playbook fails after the first attempt only. The playbook does not go through 30 attempts, or the number of attempts specified. e.g. 'openshift_storage_glusterfs_health_timeout=3' (see the inventory sketch after the log below).

The snippet of ansible logs
==============================================
2019-08-12 15:03:33,516 p=67311 u=root | TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health] ****************************************
2019-08-12 15:03:33,516 p=67311 u=root | task path: /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml:4
2019-08-12 15:03:33,905 p=67311 u=root | Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2019-08-12 15:03:35,512 p=67311 u=root | fatal: [master -> master]: FAILED! => { "attempts": 1, "changed": false, "invocation": { "module_args": { "check_bricks": true, "cluster_name": "storage", "exclude_node": "master", "oc_bin": "oc", "oc_conf": "/etc/origin/master/admin.kubeconfig", "oc_namespace": "arun-glusterfs", "target_nodes": null } }, "msg": "volume heketidbstorage is not ready", "state": "unknown" }
2019-08-12 15:03:35,518 p=67311 u=root | to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/upgrade.retry
2019-08-12 15:03:35,518 p=67311 u=root | PLAY RECAP ****************************************
2019-08-12 15:03:35,519 p=67311 u=root | gluster1 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster2 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster3 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster4 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster5 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | infra : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | localhost : ok=12 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | master : ok=53 changed=0 unreachable=0 failed=1
2019-08-12 15:03:35,520 p=67311 u=root | registry1 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | registry2 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | registry3 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,521 p=67311 u=root | INSTALLER STATUS ****************************************
2019-08-12 15:03:35,523 p=67311 u=root | Initialization : Complete (0:01:21)
2019-08-12 15:03:35,524 p=67311 u=root | GlusterFS Upgrade : In Progress (0:00:06)
2019-08-12 15:03:35,524 p=67311 u=root | This phase can be restarted by running: playbooks/openshift-glusterfs/upgrade.yml

Thanks
Arun
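For context, here is a minimal sketch of where such an override typically sits in an openshift-ansible inventory file. Only openshift_storage_glusterfs_health_timeout=3 and the arun-glusterfs namespace are taken from the report above; the [OSEv3:vars] group placement and the namespace variable are standard-inventory assumptions, not a copy of the reporter's actual inventory.

```ini
# Hypothetical inventory snippet (not the reporter's real file).
[OSEv3:vars]
# Namespace seen in the module_args of the log above.
openshift_storage_glusterfs_namespace=arun-glusterfs
# Health-timeout value quoted in the comment above.
openshift_storage_glusterfs_health_timeout=3
```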
PR above is merged. Fixed In Version updated.

I have investigated why this check takes so long myself. It depends on the number of volumes in the cluster, as the check linearly goes through and runs 'gluster volume heal info' on each volume, as Saravanakumar stated before. Each check takes roughly ten seconds, so at 1000 volumes that will take roughly 2.5 hours. There are a few other things that are checked and times may vary a little, but it roughly adds up to the time you are experiencing. So there is no easy way to improve the time it takes to run this health check; improving the 'gluster volume heal info' time would be a gluster issue. I will attempt to add a warning that runs before the health check to inform the user "this will take roughly 10*number_of_volumes seconds to complete" (see the sketch below), but that is all that can be done here from an openshift-ansible perspective.

As for the number of retries, that has been addressed in this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1726608

PR: https://github.com/openshift/openshift-ansible/pull/11945 (this adds the warning message)

PR merged
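To make the arithmetic above concrete: 10 seconds per volume times 1000 volumes is about 10,000 seconds, i.e. roughly 2.5 to 3 hours. The following is a minimal, hypothetical Python sketch of that linear per-volume pattern and of an up-front time warning. It is not the actual glusterfs_check_containerized.py code; the oc/gluster command shapes and the pod name are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Rough sketch only: shows why a per-volume health check scales linearly
(about 10 s * number_of_volumes). Command shapes and pod name are assumed."""
import subprocess


def list_volumes(pod: str) -> list:
    # Assumed command shape for listing volumes inside a glusterfs pod.
    result = subprocess.run(
        ["oc", "exec", pod, "--", "gluster", "volume", "list"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def check_cluster_health(pod: str) -> None:
    volumes = list_volumes(pod)
    # Up-front warning, in the spirit of the merged PR: total run time is
    # roughly 10 seconds multiplied by the number of volumes.
    print(f"Health check will take roughly {10 * len(volumes)} seconds "
          f"for {len(volumes)} volumes.")
    for vol in volumes:
        # Each heal-info call costs ~10 s, so 1000 volumes adds up to hours.
        subprocess.run(
            ["oc", "exec", pod, "--", "gluster", "volume", "heal", vol, "info"],
            capture_output=True, text=True, check=True,
        )


if __name__ == "__main__":
    check_cluster_health("glusterfs-storage-example")  # hypothetical pod name
```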