Bug 1728184 - Task "Check for GlusterFS cluster health" takes more than one and a half hours for a single retry when upgrading OCS with the upgrade playbook
Summary: Task "Check for GlusterFS cluster health" takes more than one and a half hours ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: cns-ansible
Version: ocs-3.11
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 3.11.z Batch Update 4
Assignee: John Mulligan
QA Contact: Arun Kumar
URL:
Whiteboard:
Depends On:
Blocks: 1703695
 
Reported: 2019-07-09 09:18 UTC by Arun Kumar
Modified: 2020-02-13 05:22 UTC (History)
8 users

Fixed In Version: openshift-ansible-3.11.153-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-13 05:22:21 UTC
Embargoed:



Comment 10 Saravanakumar 2019-07-24 08:15:15 UTC
PR posted upstream:
https://github.com/openshift/openshift-ansible/pull/11777

Comment 15 Saravanakumar 2019-08-09 07:39:45 UTC
PR merged.

Comment 16 Arun Kumar 2019-08-13 05:13:17 UTC
When the value of the openshift_storage_glusterfs_health_timeout parameter is set to 3 or any other number in the inventory file, the playbook fails after the first attempt. The playbook does not go through 30 attempts, or the number of attempts specified.

e.g.

'openshift_storage_glusterfs_health_timeout=3'
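For illustration, here is a minimal sketch (not the actual openshift-ansible code) of how a health-check task should honor a retries/delay setting, analogous to Ansible's `until`/`retries`/`delay` keywords on the "Check for GlusterFS cluster health" task. `check_health` is a hypothetical stand-in for the real glusterfs_check_containerized module; the bug reported here is that the run gave up with "attempts": 1 instead of retrying.

```python
import time

def wait_for_health(check_health, retries=30, delay=10, sleep=time.sleep):
    """Call check_health() up to `retries` times, pausing `delay` seconds
    between attempts. Returns the attempt number that succeeded; raises
    if the cluster never becomes healthy."""
    for attempt in range(1, retries + 1):
        if check_health():
            return attempt
        if attempt < retries:
            sleep(delay)
    # Mirrors the failure message seen in the log below.
    raise RuntimeError("volume heketidbstorage is not ready")
```

With this logic, a transient "volume not ready" on the first attempt would be retried up to the configured limit rather than failing the play immediately.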



A snippet of the Ansible logs:
==============================================


2019-08-12 15:03:33,516 p=67311 u=root |  TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health] *******************************************************************************************************************************************
2019-08-12 15:03:33,516 p=67311 u=root |  task path: /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/cluster_health.yml:4
2019-08-12 15:03:33,905 p=67311 u=root |  Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2019-08-12 15:03:35,512 p=67311 u=root |  fatal: [master -> master]: FAILED! => {
    "attempts": 1,
    "changed": false,
    "invocation": {
        "module_args": {
            "check_bricks": true,
            "cluster_name": "storage",
            "exclude_node": "master",
            "oc_bin": "oc",
            "oc_conf": "/etc/origin/master/admin.kubeconfig",
            "oc_namespace": "arun-glusterfs",
            "target_nodes": null
        }
    },
    "msg": "volume heketidbstorage is not ready",
    "state": "unknown"
}
2019-08-12 15:03:35,518 p=67311 u=root |        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/upgrade.retry

2019-08-12 15:03:35,518 p=67311 u=root |  PLAY RECAP ********************************************************************************************************************************************************************************************************
2019-08-12 15:03:35,519 p=67311 u=root |  gluster1                   : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,519 p=67311 u=root |  gluster2                   : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,519 p=67311 u=root |  gluster3                   : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,519 p=67311 u=root |  gluster4                   : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,519 p=67311 u=root |  gluster5                   : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,520 p=67311 u=root |  infra                      : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,520 p=67311 u=root |  localhost                  : ok=12   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,520 p=67311 u=root |  master                     : ok=53   changed=0    unreachable=0    failed=1
2019-08-12 15:03:35,520 p=67311 u=root |  registry1                  : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,520 p=67311 u=root |  registry2                  : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,520 p=67311 u=root |  registry3                  : ok=16   changed=0    unreachable=0    failed=0
2019-08-12 15:03:35,521 p=67311 u=root |  INSTALLER STATUS **************************************************************************************************************************************************************************************************
2019-08-12 15:03:35,523 p=67311 u=root |  Initialization     : Complete (0:01:21)
2019-08-12 15:03:35,524 p=67311 u=root |  GlusterFS Upgrade  : In Progress (0:00:06)
2019-08-12 15:03:35,524 p=67311 u=root |        This phase can be restarted by running: playbooks/openshift-glusterfs/upgrade.yml



Thanks 
Arun

Comment 18 Daniel Pivonka 2019-09-16 16:42:16 UTC
The PR above is merged. Fixed-in version updated.

Comment 22 Daniel Pivonka 2019-10-09 19:37:44 UTC
I have investigated why this check takes so long myself. It is dependent on the number of volumes in the cluster, as it linearly goes through and runs 'gluster volume heal info' on each volume, as Saravanakumar stated before. Each check takes roughly ten seconds, so at 1000 volumes that will take roughly 2.5 hours. A few other things are checked as well and times may vary a little, but it roughly adds up to the time you are experiencing.

So there is no easy way to improve the time it takes to run this health check; improving the 'gluster volume heal info' time would be a Gluster issue. I will attempt to add a warning that runs before the health check to inform the user 'this will take roughly 10*number_of_volumes seconds to complete', but that is all that can be done here from an openshift-ansible perspective.
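The back-of-the-envelope estimate above can be written out directly; the 10-second per-volume figure is the rough 'gluster volume heal info' cost observed in this comment, not a measured constant.

```python
SECONDS_PER_VOLUME = 10  # rough per-volume cost of 'gluster volume heal info'

def estimated_check_seconds(num_volumes):
    """Estimated health-check duration: it scales linearly with volume count."""
    return SECONDS_PER_VOLUME * num_volumes

print(estimated_check_seconds(1000))  # 10000 seconds, i.e. ~2.8 hours
```

At 1000 volumes this gives 10000 seconds, consistent with the roughly 2.5-hour figure quoted above.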

As for the number of retries, that has been addressed in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1726608

Comment 23 Daniel Pivonka 2019-10-09 20:48:26 UTC
PR: https://github.com/openshift/openshift-ansible/pull/11945
This PR adds the warning message.

Comment 24 Daniel Pivonka 2019-10-10 18:42:09 UTC
PR merged

