Description of problem:
When the disk/memory health checks fail, no fatal error is raised on localhost, so the upgrade continues until it fails on the task "Gate on etcd backup" [1]. The output then wrongly reports that the etcd backup failed. It would be better to abort in the play pre/gate_checks.yml.

[1] task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:18

Version-Release number of the following components:
openshift-ansible: master

How reproducible:
always

Steps to Reproduce:
1. RPM install OCP v3.6 with a disk size of less than 10G
2. Upgrade to v3.7
3. Check the playbook output

Actual results:
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre/verify_health_checks.yml:9
CHECK [disk_availability : openshift-181.lab.eng.nay.redhat.com]
CHECK [memory_availability : openshift-181.lab.eng.nay.redhat.com]
fatal: [openshift-181.lab.eng.nay.redhat.com]: FAILED! => { "changed": false, "checks": { "disk_availability": { "failed": true, "msg": "Available disk space in \"/var\" (6.7 GB) is below minimum recommended (10.0 GB)" }, "memory_availability": { "failed": true, "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)" } }, "failed": true, "playbook_context": "upgrade" }
MSG: One or more checks failed
CHECK [disk_availability : openshift-221.lab.eng.nay.redhat.com]
CHECK [memory_availability : openshift-221.lab.eng.nay.redhat.com]
fatal: [openshift-221.lab.eng.nay.redhat.com]: FAILED!
=> { "changed": false, "checks": { "disk_availability": { "failed": true, "msg": "Available disk space in \"/var\" (6.7 GB) is below minimum recommended (10.0 GB)" }, "memory_availability": { "failed": true, "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)" } }, "failed": true, "playbook_context": "upgrade" }
MSG: One or more checks failed
CHECK [disk_availability : openshift-182.lab.eng.nay.redhat.com]
CHECK [memory_availability : openshift-182.lab.eng.nay.redhat.com]
fatal: [openshift-182.lab.eng.nay.redhat.com]: FAILED! => { "changed": false, "checks": { "disk_availability": { "failed": true, "msg": "Available disk space in \"/var\" (6.6 GB) is below minimum recommended (10.0 GB)" }, "memory_availability": { "failed": true, "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)" } }, "failed": true, "playbook_context": "upgrade" }
MSG: One or more checks failed
CHECK [disk_availability : openshift-217.lab.eng.nay.redhat.com]
CHECK [memory_availability : openshift-217.lab.eng.nay.redhat.com]
fatal: [openshift-217.lab.eng.nay.redhat.com]: FAILED! => { "changed": false, "checks": { "disk_availability": {}, "memory_availability": { "failed": true, "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)" } }, "failed": true, "playbook_context": "upgrade" }
MSG: One or more checks failed
CHECK [disk_availability : openshift-210.lab.eng.nay.redhat.com]
CHECK [memory_availability : openshift-210.lab.eng.nay.redhat.com]
fatal: [openshift-210.lab.eng.nay.redhat.com]: FAILED!
=> { "changed": false, "checks": { "disk_availability": {}, "memory_availability": { "failed": true, "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)" } }, "failed": true, "playbook_context": "upgrade" }
MSG: One or more checks failed
CHECK [disk_availability : openshift-220.lab.eng.nay.redhat.com]
CHECK [memory_availability : openshift-220.lab.eng.nay.redhat.com]
ok: [openshift-220.lab.eng.nay.redhat.com] => { "changed": false, "checks": { "disk_availability": { "skipped": true, "skipped_reason": "Not active for this host" }, "memory_availability": { "skipped": true, "skipped_reason": "Not active for this host" } }, "playbook_context": "upgrade" }
META: ran handlers
PLAY [Verify master processes]
PLAY [Validate configuration for rolling restart]
PLAY [Create temp file on localhost] **************************
PLAY [Check if temp file exists on any masters]
PLAY [Cleanup temp file on localhost]
PLAY [Warn if restarting the system where ansible is running]
PLAY [Verify upgrade targets]
PLAY [Verify docker upgrade targets]
PLAY [Verify 3.7 specific upgrade checks]
PLAY [Flag pre-upgrade checks complete for hosts without errors]
PLAY [Cleanup unused Docker images]
PLAY [Pre master upgrade - Upgrade all storage]
PLAY [Set master embedded_etcd fact]
PLAY [Backup etcd]
PLAY [Gate on etcd backup]
TASK [Gathering Facts]
META: ran handlers
TASK [set_fact]
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:18
ok: [localhost] => { "ansible_facts": { "etcd_backup_completed": [] }, "changed": false }
TASK [set_fact]
task path: /root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:22
ok: [localhost] => { "ansible_facts": { "etcd_backup_failed": [ "openshift-181.lab.eng.nay.redhat.com", "openshift-182.lab.eng.nay.redhat.com", "openshift-221.lab.eng.nay.redhat.com" ] }, "changed": false }
TASK [fail]
task path:
/root/openshift-ansible/playbooks/common/openshift-cluster/upgrades/etcd/backup.yml:24
fatal: [localhost]: FAILED! => { "changed": false, "failed": true }
MSG: Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com,openshift-182.lab.eng.nay.redhat.com,openshift-221.lab.eng.nay.redhat.com
to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.retry
PLAY RECAP
localhost                            : ok=18 changed=0 unreachable=0 failed=1
openshift-181.lab.eng.nay.redhat.com : ok=80 changed=7 unreachable=0 failed=1
openshift-182.lab.eng.nay.redhat.com : ok=77 changed=7 unreachable=0 failed=1
openshift-210.lab.eng.nay.redhat.com : ok=76 changed=7 unreachable=0 failed=1
openshift-217.lab.eng.nay.redhat.com : ok=76 changed=7 unreachable=0 failed=1
openshift-220.lab.eng.nay.redhat.com : ok=38 changed=2 unreachable=0 failed=0
openshift-221.lab.eng.nay.redhat.com : ok=77 changed=7 unreachable=0 failed=1
Failure summary:
1. Host: openshift-181.lab.eng.nay.redhat.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "disk_availability": Available disk space in "/var" (6.7 GB) is below minimum recommended (10.0 GB)
            check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)
2. Host: openshift-221.lab.eng.nay.redhat.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "disk_availability": Available disk space in "/var" (6.7 GB) is below minimum recommended (10.0 GB)
            check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)
3.
Host: openshift-182.lab.eng.nay.redhat.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "disk_availability": Available disk space in "/var" (6.6 GB) is below minimum recommended (10.0 GB)
            check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)
4. Host: openshift-217.lab.eng.nay.redhat.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)
5. Host: openshift-210.lab.eng.nay.redhat.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)
6. Host: localhost
   Play: Gate on etcd backup
   Task: fail
   Message: Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com,openshift-182.lab.eng.nay.redhat.com,openshift-221.lab.eng.nay.redhat.com

The execution of "/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, you may choose to disable failing checks by setting an Ansible variable:

   openshift_disable_check=disk_availability,memory_availability

Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation. Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.

Expected results:

Additional info:
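The abort requested above (stopping right after the health-check play instead of deferring the failure to the etcd backup gate) could be sketched as a standalone gate play like the one below. This is a hypothetical illustration, not the actual pre/gate_checks.yml: the group name `oo_all_hosts` and the per-host flag `pre_upgrade_complete` are assumptions inspired by the "Flag pre-upgrade checks complete for hosts without errors" play visible in the log.

```yaml
# Hypothetical gate play, placed immediately after "Verify Host Requirements".
# It fails on localhost as soon as any host did not pass its pre-upgrade
# checks, instead of continuing until "Gate on etcd backup".
- name: Gate on pre-upgrade checks
  hosts: localhost
  gather_facts: false
  tasks:
  - set_fact:
      # Hosts that never set the (assumed) completion flag
      pre_upgrade_failed: "{{ groups.oo_all_hosts
                              | difference(groups.oo_all_hosts
                                           | map('extract', hostvars)
                                           | selectattr('pre_upgrade_complete', 'defined')
                                           | map(attribute='inventory_hostname')
                                           | list) }}"
  - fail:
      msg: >-
        Upgrade cannot continue. Pre-upgrade checks failed on:
        {{ pre_upgrade_failed | join(',') }}
    when: pre_upgrade_failed | length > 0
```

This mirrors the aggregate-then-fail pattern the upgrade playbook already uses for the etcd backup gate, just applied earlier in the run.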
Could you verify this is still an issue and provide the version of openshift-ansible? In my testing, the upgrade playbook exited immediately when the health checks failed.

$ git describe
openshift-ansible-3.7.0-0.126.0-19-ge1754cbde

$ ansible-playbook -i hosts openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml
...
PLAY [Verify Host Requirements] **************************************************************************************************************
TASK [Gathering Facts] ***********************************************************************************************************************
ok: [ec2-52-90-73-245.compute-1.amazonaws.com]
ok: [ec2-52-90-164-78.compute-1.amazonaws.com]
ok: [ec2-34-229-99-90.compute-1.amazonaws.com]
TASK [openshift_health_check] ****************************************************************************************************************
CHECK [disk_availability : ec2-34-229-99-90.compute-1.amazonaws.com] *************************************************************************
CHECK [memory_availability : ec2-34-229-99-90.compute-1.amazonaws.com] ***********************************************************************
fatal: [ec2-34-229-99-90.compute-1.amazonaws.com]: FAILED!
=> { "changed": false, "checks": { "disk_availability": {}, "memory_availability": { "failed": true, "msg": "Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)" } }, "failed": true, "playbook_context": "upgrade" }
MSG: One or more checks failed
CHECK [disk_availability : ec2-52-90-164-78.compute-1.amazonaws.com] *************************************************************************
CHECK [memory_availability : ec2-52-90-164-78.compute-1.amazonaws.com] ***********************************************************************
CHECK [disk_availability : ec2-52-90-73-245.compute-1.amazonaws.com] *************************************************************************
CHECK [memory_availability : ec2-52-90-73-245.compute-1.amazonaws.com] ***********************************************************************
fatal: [ec2-52-90-73-245.compute-1.amazonaws.com]: FAILED! => { "changed": false, "checks": { "disk_availability": {}, "memory_availability": { "failed": true, "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)" } }, "failed": true, "playbook_context": "upgrade" }
MSG: One or more checks failed
fatal: [ec2-52-90-164-78.compute-1.amazonaws.com]: FAILED! => { "changed": false, "checks": { "disk_availability": {}, "memory_availability": { "failed": true, "msg": "Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)" } }, "failed": true, "playbook_context": "upgrade" }
MSG: One or more checks failed
PLAY RECAP ***********************************************************************************************************************************
ec2-34-229-99-90.compute-1.amazonaws.com : ok=87 changed=9 unreachable=0 failed=1
ec2-52-90-164-78.compute-1.amazonaws.com : ok=82 changed=10 unreachable=0 failed=1
ec2-52-90-73-245.compute-1.amazonaws.com : ok=82 changed=10 unreachable=0 failed=1
localhost                                : ok=11 changed=0 unreachable=0 failed=0
Failure summary:
1.
Hosts: ec2-34-229-99-90.compute-1.amazonaws.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)
2. Hosts: ec2-52-90-164-78.compute-1.amazonaws.com, ec2-52-90-73-245.compute-1.amazonaws.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)

The execution of "openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, explicitly disable checks by setting an Ansible variable:

   openshift_disable_check=memory_availability

Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation. Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.
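For reference, the variable named in the failure summary can be set either in the inventory or on the ansible-playbook command line with -e. A minimal inventory fragment might look like the following; the `[OSEv3:vars]` group is the standard openshift-ansible variable section, and the specific check names are just those reported above:

```ini
# Inventory fragment: skip the named health checks (at your own risk).
# Check names come from the failure details in the summary above.
[OSEv3:vars]
openshift_disable_check=memory_availability
```

Disabling checks only suppresses the gate; hosts that genuinely lack memory or disk may still fail later in the upgrade for the underlying reason.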
Also, if this is still an issue, what version of Ansible is in use? There is a known bug with Ansible 2.4/devel that could cause this problem: https://github.com/ansible/ansible/issues/30691
Russell,
The result is the same as before: both node and master hosts failed the health checks, but localhost continued until the task "Gate on etcd backup". It doesn't harm the feature, so I am downgrading the Severity to low.

# rpm -qa | grep ansible
openshift-ansible-docs-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-callback-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-filter-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-playbooks-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
ansible-2.3.2.0-2.el7.noarch
openshift-ansible-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch
openshift-ansible-roles-3.7.0-0.144.2.git.0.da1dd6c.el7.noarch

TASK [set_fact] ****************************************************************
fatal: [openshift-217.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n pre_tasks:\n - set_fact:\n ^ here\n"}
fatal: [openshift-210.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined.
The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n pre_tasks:\n - set_fact:\n ^ here\n"}
fatal: [openshift-226.lab.eng.nay.redhat.com]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'\n\nThe error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n pre_tasks:\n - set_fact:\n ^ here\n"}
PLAY [Validate configuration for rolling restart] ******************************
PLAY [Create temp file on localhost] *******************************************
TASK [command] *****************************************************************
ok: [localhost -> localhost]
PLAY [Check if temp file exists on any masters] ********************************
PLAY [Cleanup temp file on localhost] ******************************************
TASK [file] ********************************************************************
ok: [localhost]
PLAY [Warn if restarting the system where ansible is running] ******************
PLAY [Verify upgrade targets] **************************************************
PLAY [Verify docker upgrade targets] *******************************************
PLAY [Verify 3.7 specific upgrade checks] **************************************
PLAY [Flag pre-upgrade checks complete for hosts without errors] ***************
PLAY [Cleanup unused Docker images] ********************************************
PLAY [Pre master upgrade - Upgrade all storage] ********************************
PLAY [Set master embedded_etcd fact] *******************************************
PLAY [Backup etcd] *************************************************************
PLAY [Gate on etcd backup] *****************************************************
TASK [Gathering Facts] *********************************************************
ok: [localhost]
TASK [set_fact] ****************************************************************
ok: [localhost]
TASK [set_fact] ****************************************************************
ok: [localhost]
TASK [fail] ********************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com"}
PLAY RECAP *********************************************************************
localhost                            : ok=16 changed=0 unreachable=0 failed=1
openshift-181.lab.eng.nay.redhat.com : ok=35 changed=2 unreachable=0 failed=1
openshift-182.lab.eng.nay.redhat.com : ok=30 changed=2 unreachable=0 failed=1
openshift-210.lab.eng.nay.redhat.com : ok=67 changed=8 unreachable=0 failed=1
openshift-217.lab.eng.nay.redhat.com : ok=67 changed=8 unreachable=0 failed=1
openshift-226.lab.eng.nay.redhat.com : ok=67 changed=8 unreachable=0 failed=1
Failure summary:
1. Hosts: openshift-182.lab.eng.nay.redhat.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (8.0 GiB)
2.
Hosts: openshift-181.lab.eng.nay.redhat.com
   Play: Verify Host Requirements
   Task: openshift_health_check
   Message: One or more checks failed
   Details: check "disk_availability": Available disk space in "/var" (6.8 GB) is below minimum recommended (10.0 GB)
            check "memory_availability": Available memory (3.7 GiB) is too far below recommended value (16.0 GiB)
3. Hosts: openshift-210.lab.eng.nay.redhat.com, openshift-217.lab.eng.nay.redhat.com, openshift-226.lab.eng.nay.redhat.com
   Play: Set openshift_version for etcd, node, and master hosts
   Task: set_fact
   Message: the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: {{ hostvars[groups.oo_first_master.0].openshift_version }}: 'dict object' has no attribute 'openshift_version'
            The error appears to have been in '/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/initialize_openshift_version.yml': line 27, column 5, but may be elsewhere in the file depending on the exact syntax problem.
            The offending line appears to be:
              pre_tasks:
              - set_fact:
                ^ here
4. Hosts: localhost
   Play: Gate on etcd backup
   Task: fail
   Message: Upgrade cannot continue. The following hosts did not complete etcd backup: openshift-181.lab.eng.nay.redhat.com

The execution of "/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade.yml" includes checks designed to fail early if the requirements of the playbook are not met. One or more of these checks failed. To disregard these results, explicitly disable checks by setting an Ansible variable:

   openshift_disable_check=disk_availability,memory_availability

Failing check names are shown in the failure details above. Some checks may be configurable by variables if your requirements are different from the defaults; consult check documentation. Variables can be set in the inventory or passed on the command line using the -e flag to ansible-playbook.
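For context, the "Gate on etcd backup" play that ultimately aborts the run appears to work roughly as follows, judging from the set_fact output and fail message in the logs above. This is a reconstruction, not the verbatim upstream backup.yml; the group name `oo_etcd_hosts_to_backup` and the flag `r_etcd_backup_complete` are assumptions for illustration:

```yaml
# Sketch of the gating logic at the end of upgrades/etcd/backup.yml,
# reconstructed from the log output above.
- name: Gate on etcd backup
  hosts: localhost
  tasks:
  - set_fact:
      # Hosts whose backup play set the (assumed) completion flag
      etcd_backup_completed: "{{ groups.oo_etcd_hosts_to_backup
                                 | map('extract', hostvars)
                                 | selectattr('r_etcd_backup_complete', 'defined')
                                 | map(attribute='inventory_hostname')
                                 | list }}"
  - set_fact:
      etcd_backup_failed: "{{ groups.oo_etcd_hosts_to_backup
                              | difference(etcd_backup_completed) }}"
  - fail:
      msg: >-
        Upgrade cannot continue. The following hosts did not complete
        etcd backup: {{ etcd_backup_failed | join(',') }}
    when: etcd_backup_failed | length > 0
```

This explains the misleading symptom in this bug: hosts that bailed out during the earlier health checks never reach the backup play, never set the completion flag, and so show up in `etcd_backup_failed` even though etcd backup itself was never the problem.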
I am still unable to reproduce this failure. Please attach the complete ansible log (run with '-vv') and the inventory file in use.
Proposed: https://github.com/openshift/openshift-ansible/pull/5741
Merged: https://github.com/openshift/openshift-ansible/pull/5741

The commit has been included since openshift-ansible-3.7.0-0.150.0.
Verified and passed on openshift-ansible-3.7.0-0.189.0.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188