Bug 1726608

Summary: [RFE] Limit the number of retries for pre-requisite tasks in upgrade playbook
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nitin Goyal <nigoyal>
Component: cns-ansible
Assignee: John Mulligan <jmulligan>
Status: CLOSED CURRENTRELEASE
QA Contact: Prasanth <pprakash>
Severity: high
Docs Contact:
Priority: unspecified
Version: ocs-3.11
CC: arukumar, dpivonka, hchiramm, jarrpa, knarra, kramdoss, madam, pasik, rhs-bugs, rtalur, sarumuga
Target Milestone: ---
Keywords: ZStream
Target Release: OCS 3.11.z Batch Update 4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openshift-ansible-3.11.147-1
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-02-13 05:22:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1703695
Attachments: logs of ansible which shows the last task where it is trying again and again. (flags: none)

Description Nitin Goyal 2019-07-03 09:28:39 UTC
Description of problem:

Pre-requisite tasks such as the following:
1. All gluster pods should be up and running.
2. The heketi and gluster-block-prov pods should be up and running.
3. Bricks should not be near capacity.
4. Gluster volume heal should be complete.
5. There should not be any pending entries in heketi.

These checks should cause the playbook to fail after no more than 5 retries, since issues such as bricks being near capacity cannot be resolved without manual intervention. The playbook does not need to retry many times only to fail eventually on these pre-requisite failures. This applies specifically to the first few tasks where the playbook checks whether the cluster is healthy and ready to be upgraded (TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health]).

Raising this bug to save a customer's time while upgrading.
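
For illustration, a bounded version of that health check could look like the sketch below. The module name and its arguments are taken from the ansible logs later in this bug; the retries/delay values and the surrounding task wiring are assumptions for illustration, not the current openshift-ansible code.

    # Hedged sketch only -- not the actual openshift-ansible task.
    # Module name/arguments come from the failure logs in this bug;
    # capping retries at 5 reflects the request above.
    - name: Check for GlusterFS cluster health
      glusterfs_check_containerized:
        oc_bin: "oc"
        oc_conf: "/etc/origin/master/admin.kubeconfig"
        oc_namespace: "glusterfs"
        cluster_name: "storage"
        check_bricks: true
      register: glusterfs_health
      until: glusterfs_health is succeeded
      retries: 5      # fail fast instead of retrying 120 times
      delay: 10       # seconds between attempts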


Version-Release number of selected component (if applicable):
openshift-ansible-3.11.123-1.git.0.db681ba.el7.noarch


How reproducible:

Steps to Reproduce:
1. Upgrade the GlusterFS cluster with the upgrade playbook.


Actual results:

2019-07-02 16:26:25,958 p=125669 u=root |  ok: [dhcp46-218.lab.eng.blr.redhat.com]
2019-07-02 16:26:26,333 p=125669 u=root |  TASK [openshift_storage_glusterfs : Check for GlusterFS cluster health]**********************************************************************************************************************
2019-07-02 16:28:00,145 p=125669 u=root |  FAILED - RETRYING: Check for GlusterFS cluster health (120 retries left).
.
.
.
2019-07-02 20:11:32,535 p=125669 u=root |  FAILED - RETRYING: Check for GlusterFS cluster health (1 retries left).
2019-07-02 20:13:41,017 p=125669 u=root |  fatal: [dhcp46-218.lab.eng.blr.redhat.com -> dhcp46-218.lab.eng.blr.redhat.com]: FAILED! => {"attempts": 120, "changed": false, "msg": "bricks near capacity found: {u'dhcp46-19.lab.eng.blr.redhat.com': [u'/var/lib/heketi/mounts/vg_966da3e87480fa95ddacd2ad91f49518/brick_7d2df4baf0b1a547ed6f38f57aff7969'], u'dhcp47-107.lab.eng.blr.redhat.com': [u'/var/lib/heketi/mounts/vg_ab9409e4645b20dd76bf550cab8ef1eb/brick_0cf9feb79abf33fc48eedcab4b3d4c54'], u'dhcp46-35.lab.eng.blr.redhat.com': [u'/var/lib/heketi/mounts/vg_4c95c52d5ee3d5d5d685c57aeaf5ac04/brick_2571c4a94a15468a37952641e5176da2']}", "state": "unknown"}
2019-07-02 20:13:41,019 p=125669 u=root |   to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/upgrade.retry


Expected results:

IMO it should not retry this many times, because some issues cannot be resolved by retrying again and again.


Additional info:

Comment 2 Nitin Goyal 2019-07-03 09:43:53 UTC
Created attachment 1587013 [details]
logs of ansible which shows the last task where it is trying again and again.

Comment 3 Jose A. Rivera 2019-07-09 14:39:40 UTC
You can set "openshift_storage_glusterfs_timeout" to a smaller interval if so desired. I do not want to change the default.
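
For example, in the inventory file (the group header is the usual openshift-ansible one and the value is illustrative only; pick whatever interval suits the environment):

    [OSEv3:vars]
    openshift_storage_glusterfs_timeout=50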

Comment 12 Saravanakumar 2019-07-24 08:15:53 UTC
PR posted upstream:

https://github.com/openshift/openshift-ansible/pull/11777

Also, fixes bz#1728184

Comment 19 Daniel Pivonka 2019-09-16 16:44:03 UTC
PR above is merged. Fixed in version updated.

Comment 22 Daniel Pivonka 2019-09-26 15:24:12 UTC
This bug can be verified by putting the cluster into a bad state, such as unhealed volumes or full bricks. With the fix in place, the health check should be attempted 3 times and then the playbook should fail.
The 3 attempts come from the variable openshift_storage_glusterfs_timeout, which defaults to 30 and is then divided by 10, resulting in 3 retries.
If this variable is changed, it should be set to a multiple of 10.
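
Expressed as the kind of Ansible expression presumably behind this (a sketch of the relationship only, not the exact role code; "health_timeout" stands in for whichever timeout variable applies, and the 10-second delay is inferred from the divide-by-10 rule):

    retries: "{{ (health_timeout | int / 10) | int }}"   # 30 -> 3, 50 -> 5, 70 -> 7
    delay: 10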

Comment 24 Daniel Pivonka 2019-09-30 14:29:06 UTC
That is not the expected result. Setting openshift_storage_glusterfs_timeout to 50, for example, should result in 5 retries.

Comment 25 Arun Kumar 2019-09-30 14:33:16 UTC
Based on comment 24 I am moving this bug to fail-qa state.

Comment 26 Daniel Pivonka 2019-09-30 17:05:16 UTC
The variable to change the number of health retries is openshift_storage_glusterfs_health_timeout not openshift_storage_glusterfs_timeout.

This variable is still overwritten to 30 here: https://github.com/openshift/openshift-ansible/blob/5218f3c57ea9b5b5570bf0bc61b9bfea6df0632d/roles/openshift_storage_glusterfs/tasks/glusterfs_upgrade.yml#L12

and its default value is 1200 here: https://github.com/openshift/openshift-ansible/blob/5218f3c57ea9b5b5570bf0bc61b9bfea6df0632d/roles/openshift_storage_glusterfs/defaults/main.yml#L27


I will open a PR to change the default to 30 and remove the overwrite. That should resolve this.
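
In effect, the proposed change amounts to something like the following (a sketch of the intent described above, written with the inventory-facing variable name; the actual file contents and internal variable names may differ):

    # roles/openshift_storage_glusterfs/defaults/main.yml
    openshift_storage_glusterfs_health_timeout: 30    # default lowered from 1200

    # roles/openshift_storage_glusterfs/tasks/glusterfs_upgrade.yml
    # drop the task at the line linked above that forces the health timeout
    # back to 30, so an inventory-supplied value is honored during upgrades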

Comment 27 Daniel Pivonka 2019-09-30 20:16:50 UTC
PR: https://github.com/openshift/openshift-ansible/pull/11931

Comment 28 Daniel Pivonka 2019-10-02 16:09:16 UTC
PR merged

Comment 29 Daniel Pivonka 2019-10-03 14:58:38 UTC
fixed in version updated

Comment 30 Arun Kumar 2019-10-07 12:00:19 UTC
I have verified the bug and am moving it to the verified state. A snippet of the ansible logs and the inventory file arguments follows:

version:
========
[root@master ~]# rpm -qa|grep ansible
openshift-ansible-playbooks-3.11.147-1.git.0.bd6c010.el7.noarch
ansible-2.6.19-1.el7ae.noarch
openshift-ansible-3.11.147-1.git.0.bd6c010.el7.noarch
openshift-ansible-roles-3.11.147-1.git.0.bd6c010.el7.noarch
openshift-ansible-docs-3.11.147-1.git.0.bd6c010.el7.noarch


case 1: openshift_storage_glusterfs_health_timeout=70
=====================================================
inventory file arguments:
-------------------------
openshift_storage_glusterfs_block_host_vol_create=true
openshift_storage_glusterfs_block_host_vol_size=100
openshift_storage_glusterfs_health_timeout=70
openshift_storage_gluster_update_techpreview=true

attempts=7 (passed), ansible logs:
---------------------------------
2019-10-07 12:56:26,245 p=100682 u=root |  Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2019-10-07 12:56:29,208 p=100682 u=root |  fatal: [master -> master]: FAILED! => {
    "attempts": 7, 
    "changed": false,
    "invocation": {
        "module_args": {
            "check_bricks": true, 
            "cluster_name": "storage",
            "exclude_node": "master",
            "oc_bin": "oc", 
            "oc_conf": "/etc/origin/master/admin.kubeconfig",
            "oc_namespace": "glusterfs",
            "target_nodes": null
        }
    }, 
    "msg": "volume vol_82011932030d7bc34672f20a537b35d3 is not ready",
    "state": "unknown"
}



case 2: openshift_storage_glusterfs_health_timeout=20
=====================================================
inventory file arguments:
-------------------------
openshift_storage_glusterfs_block_host_vol_size=100
openshift_storage_glusterfs_health_timeout=20
openshift_storage_gluster_update_techpreview=true

attempts=2 (passed), ansible logs:
---------------------------------
2019-10-07 13:02:28,038 p=113347 u=root |  Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2019-10-07 13:02:30,989 p=113347 u=root |  fatal: [master -> master]: FAILED! => {
    "attempts": 2, 
    "changed": false,
    "invocation": {
        "module_args": {
            "check_bricks": true, 
            "cluster_name": "storage",
            "exclude_node": "master",
            "oc_bin": "oc", 
            "oc_conf": "/etc/origin/master/admin.kubeconfig",
            "oc_namespace": "glusterfs",
            "target_nodes": null
        }
    }, 
    "msg": "volume vol_82011932030d7bc34672f20a537b35d3 is not ready",
    "state": "unknown"
}


case 3: variable not mentioned in the inventory file
====================================================
attempts=3 (default value, passed), ansible logs:
-------------------------------------------------
2019-10-07 13:17:14,103 p=2657 u=root |  Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2019-10-07 13:17:17,028 p=2657 u=root |  fatal: [master -> master]: FAILED! => {
    "attempts": 3, 
    "changed": false,
    "invocation": {
        "module_args": {
            "check_bricks": true, 
            "cluster_name": "storage",
            "exclude_node": "master",
            "oc_bin": "oc", 
            "oc_conf": "/etc/origin/master/admin.kubeconfig",
            "oc_namespace": "glusterfs",
            "target_nodes": null
        }
    }, 
    "msg": "volume vol_82011932030d7bc34672f20a537b35d3 is not ready",
    "state": "unknown"
}
2019-10-07 13:17:17,036 p=2657 u=root |         to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-glusterfs/upgrade.retry