Bug 1484324 - The playbook should abort immediately once pre check finish if pre_check failed
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.7.0
Hardware/OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Russell Teague
QA Contact: Anping Li
Reported: 2017-08-23 05:10 EDT by Anping Li
Modified: 2017-11-28 17:07 EST
CC: 5 users
Doc Type: Bug Fix
Doc Text:
In some instances, a host failure would not result in the playbook exiting during checks. The play has been updated to set any_errors_fatal to true, ensuring the play exits as expected.
Last Closed: 2017-11-28 17:07:41 EST
Type: Bug

Attachments:
The upgrade logs (deleted), 2017-08-23 05:10 EDT, Anping Li


External Trackers:
Red Hat Product Errata RHSA-2017:3188 (normal, SHIPPED_LIVE): Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update (last updated 2017-11-28 21:34:54 EST)

Description Anping Li 2017-08-23 05:10:21 EDT
Description of problem:
When the disk/memory checks fail, no fatal error is raised on localhost, so the upgrade continues until it fails on the "Gate on etcd backup" task [1]. The output then wrongly reports the failure as an etcd backup problem. The playbook should instead abort at the pre/gate_checks.yml play.

[1] task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:18


Version-Release number of the following components:
openshift-ansible: master

How reproducible:
always

Steps to Reproduce:
1. RPM install OCP v3.6, ensuring the disk size is less than 10G
2. Upgrade to v3.7
3. Check the playbook output

Actual results:
*
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre/verify_health_checks.yml:9

CHECK [disk_availability : openshift-181.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-181.lab.eng.nay.redhat.com] fatal: [openshift-181.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {
            "failed": true, 
            "msg": "Available disk space in \"/var\" (6.7 GB) is below minimum recommended (10.0 GB)"
        }, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-221.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-221.lab.eng.nay.redhat.com] fatal: [openshift-221.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {
            "failed": true, 
            "msg": "Available disk space in \"/var\" (6.7 GB) is below minimum recommended (10.0 GB)"
        }, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-182.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-182.lab.eng.nay.redhat.com] fatal: [openshift-182.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {
            "failed": true, 
            "msg": "Available disk space in \"/var\" (6.6 GB) is below minimum recommended (10.0 GB)"
        }, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-217.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-217.lab.eng.nay.redhat.com] fatal: [openshift-217.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-210.lab.eng.nay.redhat.com] 
CHECK [memory_availability : openshift-210.lab.eng.nay.redhat.com] fatal: [openshift-210.lab.eng.nay.redhat.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

CHECK [disk_availability : openshift-220.lab.eng.nay.redhat.com] CHECK [memory_availability : openshift-220.lab.eng.nay.redhat.com] ok: [openshift-220.lab.eng.nay.redhat.com] => {
    "changed": false, 
    "checks": {
        "disk_availability": {
            "skipped": true, 
            "skipped_reason": "Not active for this host"
        }, 
        "memory_availability": {
            "skipped": true, 
            "skipped_reason": "Not active for this host"
        }
    }, 
    "playbook_context": "upgrade"
}
META: ran handlers

PLAY [Verify master processes] 
PLAY [Validate configuration for rolling restart] 
PLAY [Create temp file on localhost] **************************
PLAY [Check if temp file exists on any masters] 
PLAY [Cleanup temp file on localhost] 
PLAY [Warn if restarting the system where ansible is running] 
PLAY [Verify upgrade targets] 
PLAY [Verify docker upgrade targets] 
PLAY [Verify 3.7 specific upgrade checks] 
PLAY [Flag pre-upgrade checks complete for hosts without errors] 
PLAY [Cleanup unused Docker images] 
PLAY [Pre master upgrade - Upgrade all storage] 
PLAY [Set master embedded_etcd fact] 
PLAY [Backup etcd] 
PLAY [Gate on etcd backup] 
TASK [Gathering Facts] 
META: ran handlers
TASK [set_fact] task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:18
ok: [localhost] => {
    "ansible_facts": {
        "etcd_backup_completed": []
    }, 
    "changed": false
}

TASK [set_fact] task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:22
ok: [localhost] => {
    "ansible_facts": {
        "etcd_backup_failed": [
            "openshift-181.lab.eng.nay.redhat.com", 
            "openshift-182.lab.eng.nay.redhat.com", 
            "openshift-221.lab.eng.nay.redhat.com"
        ]
    }, 
    "changed": false
}

TASK [fail] 
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:24
fatal: [localhost]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com,openshift-182.lab.eng.nay.redhat.com,openshift-221.lab.eng.nay.redhat.com
    to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry

PLAY RECAP localhost                  : ok=18   changed=0    unreachable=0    failed=1   
openshift-181.lab.eng.nay.redhat.com : ok=80   changed=7    unreachable=0    failed=1   
openshift-182.lab.eng.nay.redhat.com : ok=77   changed=7    unreachable=0    failed=1   
openshift-210.lab.eng.nay.redhat.com : ok=76   changed=7    unreachable=0    failed=1   
openshift-217.lab.eng.nay.redhat.com : ok=76   changed=7    unreachable=0    failed=1   
openshift-220.lab.eng.nay.redhat.com : ok=38   changed=2    unreachable=0    failed=0   
openshift-221.lab.eng.nay.redhat.com : ok=77   changed=7    unreachable=0    failed=1   

Failure summary:

  1. Host:     openshift-181.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space in "/var" (6.7 GB) is below minimum recommended (10.0 GB)
               
               check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  2. Host:     openshift-221.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space in "/var" (6.7 GB) is below minimum recommended (10.0 GB)
               
               check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  3. Host:     openshift-182.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space in "/var" (6.6 GB) is below minimum recommended (10.0 GB)
               
               check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  4. Host:     openshift-217.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

  5. Host:     openshift-210.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

  6. Host:     localhost
     Play:     Gate on etcd backup
     Task:     fail
     Message:  Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com,openshift-182.lab.eng.nay.redhat.com,openshift-221.lab.eng.nay.redhat.com

The execution of "/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml"
includes checks designed to fail early if the requirements
of the playbook are not met. One or more of these checks
failed. To disregard these results, you may choose to
disable failing checks by setting an Ansible variable:

   openshift_disable_check=disk_availability,memory_availability

Failing check names are shown in the failure details above.
Some checks may be configurable by variables if your requirements
are different from the defaults; consult check documentation.
Variables can be set in the inventory or passed on the
command line using the -e flag to ansible-playbook.
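As the message above notes, the variable can also be set in the inventory. A minimal sketch of what that might look like for this cluster (the `[OSEv3:vars]` group follows the usual openshift-ansible inventory convention; treat this as an illustration, not the reporter's actual inventory):

```ini
# Hypothetical inventory fragment: disable the two failing pre-flight checks.
# Equivalent to passing -e openshift_disable_check=... on the command line.
[OSEv3:vars]
openshift_disable_check=disk_availability,memory_availability
```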


Expected results:
The playbook aborts immediately after the pre-check play fails, instead of continuing to the "Gate on etcd backup" task.

Additional info:
Comment 1 Russell Teague 2017-09-13 14:30:14 EDT
Could you verify this is still an issue and provide the version of openshift-ansible?  In my testing, I found the upgrade playbook exited immediately when the health checks failed.

$ git describe
openshift-ansible-3.7.0-0.126.0-19-ge1754cbde

$ ansible-playbook -i hosts openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml

...

PLAY [Verify Host Requirements] **************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************
ok: [ec2-52-90-73-245.compute-1.amazonaws.com]
ok: [ec2-52-90-164-78.compute-1.amazonaws.com]
ok: [ec2-34-229-99-90.compute-1.amazonaws.com]

TASK [openshift_health_check] ****************************************************************************************************************

CHECK [disk_availability : ec2-34-229-99-90.compute-1.amazonaws.com] *************************************************************************

CHECK [memory_availability : ec2-34-229-99-90.compute-1.amazonaws.com] ***********************************************************************
fatal: [ec2-34-229-99-90.compute-1.amazonaws.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed


CHECK [disk_availability : ec2-52-90-164-78.compute-1.amazonaws.com] *************************************************************************

CHECK [memory_availability : ec2-52-90-164-78.compute-1.amazonaws.com] ***********************************************************************

CHECK [disk_availability : ec2-52-90-73-245.compute-1.amazonaws.com] *************************************************************************

CHECK [memory_availability : ec2-52-90-73-245.compute-1.amazonaws.com] ***********************************************************************
fatal: [ec2-52-90-73-245.compute-1.amazonaws.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed

fatal: [ec2-52-90-164-78.compute-1.amazonaws.com]: FAILED! => {
    "changed": false, 
    "checks": {
        "disk_availability": {}, 
        "memory_availability": {
            "failed": true, 
            "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)"
        }
    }, 
    "failed": true, 
    "playbook_context": "upgrade"
}

MSG:

One or more checks failed


PLAY RECAP ***********************************************************************************************************************************
ec2-34-229-99-90.compute-1.amazonaws.com : ok=87   changed=9    unreachable=0    failed=1   
ec2-52-90-164-78.compute-1.amazonaws.com : ok=82   changed=10   unreachable=0    failed=1   
ec2-52-90-73-245.compute-1.amazonaws.com : ok=82   changed=10   unreachable=0    failed=1   
localhost                  : ok=11   changed=0    unreachable=0    failed=0   



Failure summary:


  1. Hosts:    ec2-34-229-99-90.compute-1.amazonaws.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  2. Hosts:    ec2-52-90-164-78.compute-1.amazonaws.com, ec2-52-90-73-245.compute-1.amazonaws.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

The execution of "openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, explicitly disable checks by setting an Ansible variable:
   openshift_disable_check=memory_availability
Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation.
Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.
Comment 2 Russell Teague 2017-09-28 15:12:13 EDT
Also, if this is still an issue what version of Ansible is in use?  There is a known bug with Ansible 2.4/devel that could cause this problem.
https://github.com/ansible/ansible/issues/30691
Comment 3 Anping Li 2017-10-09 10:25:34 EDT
Russell,

The result is the same as before. Both node and master failed the health checks, but localhost continued until the 'Gate on etcd backup' task. It does not harm the feature, so I downgraded the Severity to low.

# rpm -qa|grep ansible
openshift-ansible-docs-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-callback-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-filter-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-playbooks-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
ansible-2.3.2.0-2.el7.noarch
openshift-ansible-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-roles-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch


TASK [set_fact] ****************************************************************
fatal: [openshift-217.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  pre_tasks:\n  - set_fact:\n    ^ here\n"}
fatal: [openshift-210.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  pre_tasks:\n  - set_fact:\n    ^ here\n"}
fatal: [openshift-226.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  pre_tasks:\n  - set_fact:\n    ^ here\n"}

PLAY [Validate configuration for rolling restart] ******************************

PLAY [Create temp file on localhost] *******************************************

TASK [command] *****************************************************************
ok: [localhost -> localhost]

PLAY [Check if temp file exists on any masters] ********************************

PLAY [Cleanup temp file on localhost] ******************************************

TASK [file] ********************************************************************
ok: [localhost]

PLAY [Warn if restarting the system where ansible is running] ******************

PLAY [Verify upgrade targets] **************************************************

PLAY [Verify docker upgrade targets] *******************************************

PLAY [Verify 3.7 specific upgrade checks] **************************************

PLAY [Flag pre-upgrade checks complete for hosts without errors] ***************

PLAY [Cleanup unused Docker images] ********************************************

PLAY [Pre master upgrade - Upgrade all storage] ********************************

PLAY [Set master embedded_etcd fact] *******************************************

PLAY [Backup etcd] *************************************************************

PLAY [Gate on etcd backup] *****************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [set_fact] ****************************************************************
ok: [localhost]

TASK [set_fact] ****************************************************************
ok: [localhost]

TASK [fail] ********************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com"}

PLAY RECAP *********************************************************************
localhost                  : ok=16   changed=0    unreachable=0    failed=1   
openshift-181.lab.eng.nay.redhat.com : ok=35   changed=2    unreachable=0    failed=1   
openshift-182.lab.eng.nay.redhat.com : ok=30   changed=2    unreachable=0    failed=1   
openshift-210.lab.eng.nay.redhat.com : ok=67   changed=8    unreachable=0    failed=1   
openshift-217.lab.eng.nay.redhat.com : ok=67   changed=8    unreachable=0    failed=1   
openshift-226.lab.eng.nay.redhat.com : ok=67   changed=8    unreachable=0    failed=1   



Failure summary:


  1. Hosts:    openshift-182.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

  2. Hosts:    openshift-181.lab.eng.nay.redhat.com
     Play:     Verify Host Requirements
     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "disk_availability":
               Available disk space in "/var" (6.8 GB) is below minimum recommended (10.0 GB)
               
               check "memory_availability":
               Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)

  3. Hosts:    openshift-210.lab.eng.nay.redhat.com, openshift-217.lab.eng.nay.redhat.com, openshift-226.lab.eng.nay.redhat.com
     Play:     Set openshift_version for etcd, node, and master hosts
     Task:     set_fact
     Message:  the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'
               
               The error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may
               be elsewhere in the file depending on the exact syntax problem.
               
               The offending line appears to be:
               
                 pre_tasks:
                 - set_fact:
                   ^ here
               

  4. Hosts:    localhost
     Play:     Gate on etcd backup
     Task:     fail
     Message:  Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com

The execution of "/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, explicitly disable checks by setting an Ansible variable:
   openshift_disable_check=disk_availability,memory_availability
Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation.
Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.
Comment 4 Russell Teague 2017-10-12 11:34:03 EDT
I am still unable to reproduce this failure.  Please attach complete ansible log using '-vv' output and the inventory file in use.
Comment 5 Russell Teague 2017-10-12 15:03:13 EDT
Proposed: https://github.com/openshift/openshift-ansible/pull/5741
Comment 6 Russell Teague 2017-10-16 10:09:41 EDT
Merged: https://github.com/openshift/openshift-ansible/pull/5741

Commit has been merged since openshift-ansible-3.7.0-0.150.0
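The doc text describes the fix as setting any_errors_fatal to true on the check play, so that one failed host ends the whole run. A hedged sketch of what such a play could look like (play name, group name, and check list here are illustrative assumptions, not the literal contents of PR 5741):

```yaml
# Illustrative sketch only: a verification play that aborts the entire
# playbook run as soon as any host fails its checks.
- name: Verify Host Requirements
  hosts: oo_all_hosts          # assumed group name
  any_errors_fatal: true       # one failed host ends the whole run
  roles:
    - openshift_health_checker
  vars:
    openshift_checks:
      - disk_availability
      - memory_availability
```

Without any_errors_fatal, Ansible drops only the failed hosts from subsequent plays, which is why localhost kept running until the etcd backup gate in the reports above.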
Comment 8 Anping Li 2017-11-01 22:57:58 EDT
Verified and pass on openshift-ansible-3.7.0-0.189.0
Comment 11 errata-xmlrpc 2017-11-28 17:07:41 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
