Bug 1377619
| Summary: | [3.3] ansible playbooks continue running after get error | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Wenkai Shi <weshi> |
| Component: | Installer | Assignee: | Tim Bielawa <tbielawa> |
| Status: | CLOSED ERRATA | QA Contact: | Wenkai Shi <weshi> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.3.0 | CC: | abutcher, aos-bugs, jokerman, mmccomas, tbielawa |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, an OpenShift Ansible failure on one node would cause installation to fail entirely when verifying node registration with the master. OpenShift Ansible will now continue running on nodes which have not failed and will only ensure that passing hosts have been registered. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-31 21:10:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I was able to reproduce this error in a 1 master, 1 node setup. I renamed redhat.repo to redhat.repo.orig after the playbooks started running.

> Play 22/29 (Configure nodes)
> ......................................................................
> fatal: [n01.example.com]: FAILED! => {"changed": false, "failed": true,
> "msg": "No package matching 'atomic-openshift-3.3.0.35' found available,
> installed or updated", "rc": 126,
> "results": ["No package matching 'atomic-openshift-3.3.0.35' found available, installed or updated"]}
> .....................................................................................................

Which was then followed by

> Play 23/29 (Additional node config)

That play finished and then we get to the "Set node schedulability" play. After a few tasks we see the "Wait for Node Registration" task run and retry over and over. This should be failing much earlier. I'll see what I can do to make that happen.

I'm honestly confused as to why it **isn't** causing the install to fail already. Intuitively you'd think it would. The task it failed on doesn't even have "ignore_errors: yes" set. So why does the playbook continue running?

I can see that after the schedulability task fails, no more tasks run on my example **node**. However, tasks still run on the master at that point. Per the documentation [1] this is normal behavior: "Generally playbooks will stop executing any more steps on a host that has a failure". Note that it says "on a host", not "on all hosts".

However, we're in luck. Ansible does provide us with some knobs we can frob to tune the error handling behavior. For example, there are the "max_fail_percentage" [2] and "any_errors_fatal" [3] options. I think the "max fail %" option isn't quite what we're looking for, but the "any errors fatal" option may be more relevant. I'm going to try playing around with that and see how the behavior changes.

[1] http://docs.ansible.com/ansible/playbooks_error_handling.html#ignoring-failed-commands
[2] http://docs.ansible.com/ansible/playbooks_delegation.html#maximum-failure-percentage
[3] http://docs.ansible.com/ansible/playbooks_error_handling.html#aborting-the-play

After adding the "any_errors_fatal" param to the node configuration plays (along the lines of the sketch at the end of this comment), the installation aborts immediately.

> Play 22/29 (Configure nodes)
> ......................................................................
> fatal: [n01.example.com]: FAILED! => {"changed": false, "failed": true, "msg":
> "No package matching 'atomic-openshift-3.3.0.35' found available
> , installed or updated", "rc": 126, "results": ["No package matching
> 'atomic-openshift-3.3.0.35' found available, installed or updated"]}
>
> NO MORE HOSTS LEFT
> *************************************************************
>
> localhost : ok=14 changed=9 unreachable=0 failed=0
> m01.example.com : ok=343 changed=28 unreachable=0 failed=0
> n01.example.com : ok=78 changed=3 unreachable=0 failed=1

Proposed fix: https://github.com/openshift/openshift-ansible/pull/2827

Code has been merged.

Verified with version openshift-ansible-3.4.43-1.git.0.a9dbe87.el7.noarch.
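For reference, a minimal sketch of a play using `any_errors_fatal`. This is an illustration only, not the actual openshift-ansible play or the code from PR 2827; the play name, host group, and package name are placeholders.

```yaml
# Illustrative only -- not the actual openshift-ansible play from PR 2827.
# With any_errors_fatal set, a failure on any single host aborts the whole
# play for every host ("NO MORE HOSTS LEFT") instead of only dropping the
# failed host from the rest of the run.
- name: Configure nodes (illustrative)
  hosts: nodes
  any_errors_fatal: true
  tasks:
    - name: Install the node package
      # Package name is a placeholder for the versioned atomic-openshift
      # package the installer actually requests.
      yum:
        name: atomic-openshift-node
        state: present
```

Without `any_errors_fatal`, Ansible's default behavior is exactly what the comment above describes: only the failed host is removed from the run, and the remaining hosts keep executing.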
Prepare two hosts and remove one host's yum repo file to make sure the installation will fail. The installation stops when it hits the error.
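A minimal sketch of taking the repo file out of the way with a small Ansible play before the run, mirroring the earlier redhat.repo rename. The repo path and target host here are assumptions for illustration, not taken verbatim from the verification output.

```yaml
# Illustrative only: move one host's repo file aside so that the package
# install step fails there. Path and host are assumptions.
- name: Break yum repos on one host to reproduce the failure
  hosts: master.example.com
  become: true
  tasks:
    - name: Move redhat.repo out of the way
      command: mv /etc/yum.repos.d/redhat.repo /etc/yum.repos.d/redhat.repo.orig
      args:
        # Only run if the repo file is still in place.
        removes: /etc/yum.repos.d/redhat.repo
```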
[root@ansible ~]# ansible-playbook -i hosts -v /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
...
TASK [openshift_facts : Ensure yum-utils is installed] *************************
ok: [node.example.com] => {
"changed": false,
"rc": 0,
"results": [
"yum-utils-1.1.31-40.el7.noarch providing yum-utils is already installed"
]
}
fatal: [master.example.com]: FAILED! => {
"changed": false,
"failed": true,
"rc": 126,
"results": [
"No package matching 'yum-utils' found available, installed or updated"
]
}
MSG:
No package matching 'yum-utils' found available, installed or updated
...
PLAY RECAP *********************************************************************
localhost : ok=8 changed=0 unreachable=0 failed=0
node.example.com : ok=10 changed=0 unreachable=0 failed=1
master.example.com : ok=7 changed=0 unreachable=0 failed=1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0224
Description of problem:
Ansible playbooks continue running after getting an error.

Version-Release number of selected component (if applicable):
openshift-ansible-3.3.22

How reproducible:
100%

Steps to Reproduce:
1. Deploy an environment with two hosts (master and node).
2. Make sure the node hits an error during the deployment; for example, remove the yum repo on the node after the ansible playbooks have started running.
3. Ansible gets an error on the node at the package install step, while the master succeeds.

TASK [openshift_common : Install the base package for versioning] **************
Tuesday 20 September 2016 07:41:25 +0000 (0:00:06.755) 0:21:09.607 *****
ok: [openshift-156.lab.sjc.redhat.com] => {"changed": false, "msg": "", "rc": 0, "results": ["atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64 providing atomic-openshift-3.3.0.31 is already installed"]}
fatal: [openshift-167.lab.sjc.redhat.com]: FAILED! => {"changed": true, "failed": true, "msg": "Error: Package: atomic-openshift-clients-3.3.0.31-1.git.0.aede597.el7.x86_64 (aos)\n Requires: git\n", "rc": 1, "results": ["Loaded plugins: product-id, search-disabled-repos, subscription-manager\nThis system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.\nResolving Dependencies\n--> Running transaction check\n---> Package atomic-openshift.x86_64 0:3.3.0.31-1.git.0.aede597.el7 will be installed\n--> Processing Dependency: atomic-openshift-clients = 3.3.0.31-1.git.0.aede597.el7 for package: atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64\n--> Running transaction check\n---> Package atomic-openshift-clients.x86_64 0:3.3.0.31-1.git.0.aede597.el7 will be installed\n--> Processing Dependency: git for package: atomic-openshift-clients-3.3.0.31-1.git.0.aede597.el7.x86_64\n--> Finished Dependency Resolution\n You could try using --skip-broken to work around the problem\n You could try running: rpm -Va --nofiles --nodigest\n"]}

4. The playbook keeps running, but no further node tasks show up.
5. The ansible run eventually fails, because the master cannot find the node.

TASK [openshift_manage_node : Wait for Node Registration] **********************
Tuesday 20 September 2016 07:45:22 +0000 (0:00:03.248) 0:25:06.807 *****
ok: [openshift-156.lab.sjc.redhat.com] => (item=openshift-156.lab.sjc.redhat.com) => {"changed": false, "cmd": ["oc", "get", "node", "openshift-156.lab.sjc.redhat.com", "--config=/tmp/openshift-ansible-ottsjx/admin.kubeconfig", "-n", "default"], "delta": "0:00:00.197822", "end": "2016-09-20 03:45:26.313879", "item": "openshift-156.lab.sjc.redhat.com", "rc": 0, "start": "2016-09-20 03:45:26.116057", "stderr": "", "stdout": "NAME STATUS AGE\nopenshift-156.lab.sjc.redhat.com Ready 34s", "stdout_lines": ["NAME STATUS AGE", "openshift-156.lab.sjc.redhat.com Ready 34s"], "warnings": []}
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (50 retries left).
...
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (1 retries left).
failed: [openshift-156.lab.sjc.redhat.com] (item=openshift-167.lab.sjc.redhat.com) => {"changed": false, "cmd": ["oc", "get", "node", "openshift-167.lab.sjc.redhat.com", "--config=/tmp/openshift-ansible-ottsjx/admin.kubeconfig", "-n", "default"], "delta": "0:00:00.176960", "end": "2016-09-20 03:52:20.163324", "failed": true, "item": "openshift-167.lab.sjc.redhat.com", "rc": 1, "start": "2016-09-20 03:52:19.986364", "stderr": "Error from server: nodes \"openshift-167.lab.sjc.redhat.com\" not found", "stdout": "", "stdout_lines": [], "warnings": []}

Actual results:
After the error step, the deployment continues but fails in the end.

Expected results:
Installation should stop when it gets an error.
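The Doc Text above describes the eventual fix as only verifying registration for hosts that have not failed. A minimal sketch of that idea follows; it is an illustration of the approach, not the code from PR 2827, and the inventory groups (nodes, masters) and kubeconfig path are assumptions. Because Ansible drops a failed host from all later plays, running the check per node and delegating it to the master means a node that failed earlier is never waited on.

```yaml
# Illustrative sketch only -- not the actual change from PR 2827.
# A node that failed earlier in the run is no longer part of this play,
# so its registration is never waited on; only passing nodes are checked.
- name: Verify node registration (illustrative)
  hosts: nodes
  tasks:
    - name: Wait for this node to register with the master
      # Kubeconfig path is an assumption; the real installer uses a
      # temporary copy of the admin kubeconfig.
      command: >
        oc get node {{ inventory_hostname }}
        --config=/etc/origin/master/admin.kubeconfig -n default
      delegate_to: "{{ groups['masters'][0] }}"
      register: node_check
      until: node_check.rc == 0
      retries: 50
      delay: 5
      changed_when: false
```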