PR posted upstream: https://github.com/openshift/openshift-ansible/pull/11777
PR merged.
When the openshift_storage_glusterfs_health_timeout parameter is set to 3 (or any other number) in the inventory file, the playbook fails after the first attempt only. The playbook does not go through 30 attempts, or the number of attempts specified, e.g. 'openshift_storage_glusterfs_health_timeout=3'.

Snippet of the ansible logs
==============================================
2019-08-12 15:03:33,516 p=67311 u=root | TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health] *******************************************************************************************************************************************
2019-08-12 15:03:33,516 p=67311 u=root | task path: /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml:4
2019-08-12 15:03:33,905 p=67311 u=root | Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2019-08-12 15:03:35,512 p=67311 u=root | fatal: [master -> master]: FAILED! => {
    "attempts": 1,
    "changed": false,
    "invocation": {
        "module_args": {
            "check_bricks": true,
            "cluster_name": "storage",
            "exclude_node": "master",
            "oc_bin": "oc",
            "oc_conf": "/etc/origin/master/admin.kubeconfig",
            "oc_namespace": "arun-glusterfs",
            "target_nodes": null
        }
    },
    "msg": "volume heketidbstorage is not ready",
    "state": "unknown"
}
2019-08-12 15:03:35,518 p=67311 u=root | to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/upgrade.retry
2019-08-12 15:03:35,518 p=67311 u=root | PLAY RECAP ********************************************************************************************************************************************************************************************************
2019-08-12 15:03:35,519 p=67311 u=root | gluster1 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster2 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster3 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster4 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,519 p=67311 u=root | gluster5 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | infra : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | localhost : ok=12 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | master : ok=53 changed=0 unreachable=0 failed=1
2019-08-12 15:03:35,520 p=67311 u=root | registry1 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | registry2 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,520 p=67311 u=root | registry3 : ok=16 changed=0 unreachable=0 failed=0
2019-08-12 15:03:35,521 p=67311 u=root | INSTALLER STATUS **************************************************************************************************************************************************************************************************
2019-08-12 15:03:35,523 p=67311 u=root | Initialization : Complete (0:01:21)
2019-08-12 15:03:35,524 p=67311 u=root | GlusterFS Upgrade : In Progress (0:00:06)
2019-08-12 15:03:35,524 p=67311 u=root | This phase can be restarted by running: playbooks/openshift-glusterfs/upgrade.yml

Thanks,
Arun
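For context, the health check in cluster_health.yml is meant to be retried rather than fail on the first unhealthy result, with openshift_storage_glusterfs_health_timeout controlling how long it keeps trying. Below is a minimal sketch of that retry pattern; the module arguments mirror the ones visible in the log above, but the until/retries/delay wiring, the default retry count, the delay value, and treating the timeout value directly as a retry count are all assumptions for illustration, not the role's actual code.

# Sketch only: illustrates the retries/until pattern the health check is
# expected to follow. The default retry count and the 10-second delay are
# assumptions, not the real role code.
- name: Check for GlusterFS cluster health
  glusterfs_check_containerized:
    oc_bin: "oc"
    oc_conf: "/etc/origin/master/admin.kubeconfig"
    oc_namespace: "arun-glusterfs"
    cluster_name: "storage"
    exclude_node: "master"
    check_bricks: true
  register: glusterfs_health
  # The reported bug: with openshift_storage_glusterfs_health_timeout set in
  # the inventory, the task gave up after "attempts": 1 instead of retrying
  # for the configured number of attempts.
  until: glusterfs_health is succeeded
  retries: "{{ openshift_storage_glusterfs_health_timeout | default(30) | int }}"
  delay: 10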
The PR above is merged; Fixed In Version has been updated.
I have investigated why this check takes so long myself. The duration depends on the number of volumes in the cluster, because the check linearly goes through the volumes and runs 'gluster volume heal info' on each one, as Sarvana stated before. Each check takes roughly ten seconds, so at 1000 volumes that adds up to roughly 10,000 seconds (just under 3 hours). There are a few other things that are checked and the times may vary a little, but it roughly adds up to the time you're experiencing. So there is no easy way to improve the time it takes to run this health check; improving the 'gluster volume heal info' time would be a gluster issue. I will attempt to add a warning that runs before the health check to inform the user that it will take roughly 10 * number_of_volumes seconds to complete (see the sketch below), but that's all that can be done here from an openshift-ansible perspective. As for the number of retries, that has been addressed in this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1726608
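To make that cost visible before the check starts, the estimate can be surfaced as a warning task. The sketch below only illustrates the 10-seconds-per-volume arithmetic; the glusterfs_volume_count variable is hypothetical, and this is not necessarily how the linked PR implements the message.

# Sketch only: surfaces the ~10s-per-volume estimate before the health check.
# glusterfs_volume_count is a hypothetical variable assumed to hold the
# number of volumes in the cluster.
- name: Warn about expected GlusterFS health check duration
  debug:
    msg: >-
      The GlusterFS health check runs 'gluster volume heal info' once per
      volume and each run takes roughly 10 seconds; with
      {{ glusterfs_volume_count }} volumes expect roughly
      {{ 10 * (glusterfs_volume_count | int) }} seconds to complete.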
PR: https://github.com/openshift/openshift-ansible/pull/11945. This adds the warning message.
PR merged.