Bug 1484324 - The playbook should abort immediately once the pre-checks finish if pre_check failed
Summary: The playbook should abort immediately once the pre-checks finish if pre_check failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.7.0
Assignee: Russell Teague
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-23 09:10 UTC by Anping Li
Modified: 2017-11-28 22:07 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In some instances, a host failure would not result in the playbook exiting during checks. The play has been updated to set any_errors_fatal to true, ensuring the play exits as expected.
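The fix described above can be sketched at the play level: with any_errors_fatal set, Ansible aborts the entire play as soon as any host fails, instead of only removing the failed hosts and carrying on with the rest (which is how localhost kept running to the etcd backup gate). The play below is a minimal illustration, not the actual upstream play; the host group and check list are assumptions based on the output in this report.

```yaml
# Minimal sketch of the fix: any_errors_fatal makes a failure on any
# host fatal to the whole play, so the run stops at the pre-checks
# instead of continuing on unaffected hosts (e.g. localhost).
# The play name, host group, and check list are illustrative.
- name: Verify Host Requirements
  hosts: oo_all_hosts
  any_errors_fatal: true
  tasks:
  - action: openshift_health_check
    args:
      checks:
      - disk_availability
      - memory_availability
```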
Clone Of:
Environment:
Last Closed: 2017-11-28 22:07:41 UTC
Target Upstream Version:
Embargoed:


Attachments
The upgrade logs (deleted)
2017-08-23 09:10 UTC, Anping Li


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Anping Li 2017-08-23 09:10:21 UTC
Description of problem:
When the disk/memory checks failed, there was no fatal error on localhost, so the upgrade continued until it failed on the task 'Gate on etcd backup' [1]. The output wrongly reports this as an etcd backup failure. It would be better to abort at the play pre/gate_checks.yml.

[1] task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:18


Version-Release number of the following components:
openshift-ansible: master

How reproducible:
always

Steps to Reproduce:
1. RPM install OCP v3.6 and make sure the disk size is less than 10G
2. Upgrade to v3.7
3. Check the playbook output.

Actual results:
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre/verify_health_checks.yml:9

CHECK [disk_availability : openshift-181.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-181.lab.eng.nay.redhat.com] 
fatal: [openshift-181.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {
            "failed": true, 
            "msg": "Available disk space in \"/var\" (6.7 GB) is below minimum recommended (10.0 GB)"
        }, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-221.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-221.lab.eng.nay.redhat.com] 
fatal: [openshift-221.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {
            "failed": true, 
            "msg": "Available disk space in \"/var\" (6.7 GB) is below minimum recommended (10.0 GB)"
        }, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-182.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-182.lab.eng.nay.redhat.com] 
fatal: [openshift-182.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {
            "failed": true, 
            "msg": "Available disk space in \"/var\" (6.6 GB) is below minimum recommended (10.0 GB)"
        }, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-217.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-217.lab.eng.nay.redhat.com] 
fatal: [openshift-217.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-210.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-210.lab.eng.nay.redhat.com] 
fatal: [openshift-210.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-220.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-220.lab.eng.nay.redhat.com] 
ok: [openshift-220.lab.eng.nay.redhat.com] => {
    "changed": false, 
    "checks": {
        "disk_availability": {
            "skipped": true, 
            "skipped_reason": "Not active for this host"
        }, 
        "memory_availability": {
            "skipped": true, 
            "skipped_reason": "Not active for this host"
        }
    }, 
    "playbook_context": "upgrade"
}
META: ran handlers

PLAY [Verify master processes] 
PLAY [Validate configuration for rolling restart] 
PLAY [Create temp file on localhost] **************************
PLAY [Check if temp file exists on any masters] 
PLAY [Cleanup temp file on localhost] 
PLAY [Warn if restarting the system where ansible is running] 
PLAY [Verify upgrade targets] 
PLAY [Verify docker upgrade targets] 
PLAY [Verify 3.7 specific upgrade checks] 
PLAY [Flag pre-upgrade checks complete for hosts without errors] 
PLAY [Cleanup unused Docker images] 
PLAY [Pre master upgrade - Upgrade all storage] 
PLAY [Set master embedded_etcd fact] 
PLAY [Backup etcd] 
PLAY [Gate on etcd backup] 
TASK [Gathering Facts] 
META: ran handlers
TASK [set_fact] 
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:18
ok: [localhost] => {
    "ansible_facts": {
        "etcd_backup_completed": []
    }, 
    "changed": false
}

TASK [set_fact] 
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:22
ok: [localhost] => {
    "ansible_facts": {
        "etcd_backup_failed": [
            "openshift-181.lab.eng.nay.redhat.com", 
            "openshift-182.lab.eng.nay.redhat.com", 
            "openshift-221.lab.eng.nay.redhat.com"
        ]
    }, 
    "changed": false
}

TASK [fail] 
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:24
fatal: [localhost]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com,openshift-182.lab.eng.nay.redhat.com,openshift-221.lab.eng.nay.redhat.com
    to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

PLAY RECAP 
localhost                  : ok=18   changed=0    unreachable=0    failed=1   
openshift-181.lab.eng.nay.redhat.com : ok=80   changed=7    unreachable=0    failed=1   
openshift-182.lab.eng.nay.redhat.com : ok=77   changed=7    unreachable=0    failed=1   
openshift-210.lab.eng.nay.redhat.com : ok=76   changed=7    unreachable=0    failed=1   
openshift-217.lab.eng.nay.redhat.com : ok=76   changed=7    unreachable=0    failed=1   
openshift-220.lab.eng.nay.redhat.com : ok=38   changed=2    unreachable=0    failed=0   
openshift-221.lab.eng.nay.redhat.com : ok=77   changed=7    unreachable=0    failed=1   

Failure summary:

  1. Host:     openshift-181.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space in "/var" (6.7 GB) is below minimum recommended (10.0 GB)
               
               check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  2. Host:     openshift-221.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space in "/var" (6.7 GB) is below minimum recommended (10.0 GB)
               
               check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  3. Host:     openshift-182.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space in "/var" (6.6 GB) is below minimum recommended (10.0 GB)
               
               check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  4. Host:     openshift-217.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

  5. Host:     openshift-210.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

  6. Host:     localhost
     Play:     Gate on etcd backup
     Task:     fail
     Message:  Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com,openshift-182.lab.eng.nay.redhat.com,openshift-221.lab.eng.nay.redhat.com

The execution of "/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml"
includes checks designed to fail early if the requirements
of the playbook are not met. One or more of these checks
failed. To disregard these results, you may choose to
disable failing checks by setting an Ansible variable:

   openshift_disable_check=disk_availability,memory_availability

Failing check names are shown in the failure details above.
Some checks may be configurable by variables if your requirements
are different from the defaults; consult check documentation.
Variables can be set in the inventory or passed on the
command line using the -e flag to ansible-playbook.
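As the message above says, the variable can be set in the inventory or passed on the command line with -e. A hedged example invocation follows; the inventory file name ("hosts") is an assumption, while the playbook path and check names are taken from the output above.

```shell
# Example only: re-run the upgrade with the failing checks disabled,
# passing the variable via -e as the failure message describes.
ansible-playbook -i hosts \
  /root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml \
  -e openshift_disable_check=disk_availability,memory_availability
```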


Expected results:
The upgrade playbook aborts at the pre-check play as soon as any host fails the checks, instead of continuing on localhost to the 'Gate on etcd backup' task.

Additional info:

Comment 1 Russell Teague 2017-09-13 18:30:14 UTC
Could you verify this is still an issue and provide the version of openshift-ansible?  In my testing, I found the upgrade playbook exited immediately when the health checks failed.

$ git describe
openshift-ansible-3.7.0-0.126.0-19-ge1754cbde

$ ansible-playbook -i hosts openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml

...

PLAY [Verify Host Requirements] **************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************
ok: [ec2-52-90-73-245.compute-1.amazonaws.com]
ok: [ec2-52-90-164-78.compute-1.amazonaws.com]
ok: [ec2-34-229-99-90.compute-1.amazonaws.com]

TASK [openshift_health_check] ****************************************************************************************************************

CHECK [disk_availability : ec2-34-229-99-90.compute-1.amazonaws.com] *************************************************************************

CHECK [memory_availability : ec2-34-229-99-90.compute-1.amazonaws.com] ***********************************************************************
fatal: [ec2-34-229-99-90.compute-1.amazonaws.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed


CHECK [disk_availability : ec2-52-90-164-78.compute-1.amazonaws.com] *************************************************************************

CHECK [memory_availability : ec2-52-90-164-78.compute-1.amazonaws.com] ***********************************************************************

CHECK [disk_availability : ec2-52-90-73-245.compute-1.amazonaws.com] *************************************************************************

CHECK [memory_availability : ec2-52-90-73-245.compute-1.amazonaws.com] ***********************************************************************
fatal: [ec2-52-90-73-245.compute-1.amazonaws.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

fatal: [ec2-52-90-164-78.compute-1.amazonaws.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed


PLAY RECAP ***********************************************************************************************************************************
ec2-34-229-99-90.compute-1.amazonaws.com : ok=87   changed=9    unreachable=0    failed=1   
ec2-52-90-164-78.compute-1.amazonaws.com : ok=82   changed=10   unreachable=0    failed=1   
ec2-52-90-73-245.compute-1.amazonaws.com : ok=82   changed=10   unreachable=0    failed=1   
localhost                  : ok=11   changed=0    unreachable=0    failed=0   



Failure summary:


  1. Hosts:    ec2-34-229-99-90.compute-1.amazonaws.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  2. Hosts:    ec2-52-90-164-78.compute-1.amazonaws.com, ec2-52-90-73-245.compute-1.amazonaws.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

The execution of "openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, explicitly disable checks by setting an Ansible variable:
   openshift_disable_check=memory_availability
Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation.
Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.

Comment 2 Russell Teague 2017-09-28 19:12:13 UTC
Also, if this is still an issue, what version of Ansible is in use? There is a known bug with Ansible 2.4/devel that could cause this problem.
https://github.com/ansible/ansible/issues/30691

Comment 3 Anping Li 2017-10-09 14:25:34 UTC
Russell, 

The result is the same as before. Both the nodes and masters failed the health checks, but localhost continued until the task 'Gate on etcd backup'. It doesn't harm the feature, so I downgraded the Severity to low.

# rpm -qa|grep ansible
openshift-ansible-docs-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-callback-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-filter-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-playbooks-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
ansible-2.3.2.0-2.el7.noarch
openshift-ansible-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-roles-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch


TASK [set_fact] ****************************************************************
fatal: [openshift-217.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  pre_tasks:\n  - set_fact:\n    ^ here\n"}
fatal: [openshift-210.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  pre_tasks:\n  - set_fact:\n    ^ here\n"}
fatal: [openshift-226.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  pre_tasks:\n  - set_fact:\n    ^ here\n"}

PLAY [Validate configuration for rolling restart] ******************************

PLAY [Create temp file on localhost] *******************************************

TASK [command] *****************************************************************
ok: [localhost -> localhost]

PLAY [Check if temp file exists on any masters] ********************************

PLAY [Cleanup temp file on localhost] ******************************************

TASK [file] ********************************************************************
ok: [localhost]

PLAY [Warn if restarting the system where ansible is running] ******************

PLAY [Verify upgrade targets] **************************************************

PLAY [Verify docker upgrade targets] *******************************************

PLAY [Verify 3.7 specific upgrade checks] **************************************

PLAY [Flag pre-upgrade checks complete for hosts without errors] ***************

PLAY [Cleanup unused Docker images] ********************************************

PLAY [Pre master upgrade - Upgrade all storage] ********************************

PLAY [Set master embedded_etcd fact] *******************************************

PLAY [Backup etcd] *************************************************************

PLAY [Gate on etcd backup] *****************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [set_fact] ****************************************************************
ok: [localhost]

TASK [set_fact] ****************************************************************
ok: [localhost]

TASK [fail] ********************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com"}

PLAY RECAP *********************************************************************
localhost                  : ok=16   changed=0    unreachable=0    failed=1   
openshift-181.lab.eng.nay.redhat.com : ok=35   changed=2    unreachable=0    failed=1   
openshift-182.lab.eng.nay.redhat.com : ok=30   changed=2    unreachable=0    failed=1   
openshift-210.lab.eng.nay.redhat.com : ok=67   changed=8    unreachable=0    failed=1   
openshift-217.lab.eng.nay.redhat.com : ok=67   changed=8    unreachable=0    failed=1   
openshift-226.lab.eng.nay.redhat.com : ok=67   changed=8    unreachable=0    failed=1   



Failure summary:


  1. Hosts:    openshift-182.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

  2. Hosts:    openshift-181.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space in "/var" (6.8 GB) is below minimum recommended (10.0 GB)
               
               check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  3. Hosts:    openshift-210.lab.eng.nay.redhat.com, openshift-217.lab.eng.nay.redhat.com, openshift-226.lab.eng.nay.redhat.com
     Play:     Set openshift_version for etcd, node, and master hosts
     Task:     set_fact
     Message:  the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'
               
               The error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may
               be elsewhere in the file depending on the exact syntax problem.
               
               The offending line appears to be:
               
                 pre_tasks:
                 - set_fact:
                   ^ here
               

  4. Hosts:    localhost
     Play:     Gate on etcd backup
     Task:     fail
     Message:  Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com

The execution of "/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, explicitly disable checks by setting an Ansible variable:
   openshift_disable_check=disk_availability,memory_availability
Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation.
Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.

Comment 4 Russell Teague 2017-10-12 15:34:03 UTC
I am still unable to reproduce this failure. Please attach the complete Ansible log (run with '-vv') and the inventory file in use.

Comment 5 Russell Teague 2017-10-12 19:03:13 UTC
Proposed: https://github.com/openshift/openshift-ansible/pull/5741

Comment 6 Russell Teague 2017-10-16 14:09:41 UTC
Merged: https://github.com/openshift/openshift-ansible/pull/5741

The commit has been merged and is included since openshift-ansible-3.7.0-0.150.0

Comment 8 Anping Li 2017-11-02 02:57:58 UTC
Verified and passed on openshift-ansible-3.7.0-0.189.0

Comment 11 errata-xmlrpc 2017-11-28 22:07:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

