Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1377619

Summary: [3.3] ansible playbooks continue running after get error
Product: OpenShift Container Platform Reporter: Wenkai Shi <weshi>
Component: Installer    Assignee: Tim Bielawa <tbielawa>
Status: CLOSED ERRATA QA Contact: Wenkai Shi <weshi>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.3.0    CC: abutcher, aos-bugs, jokerman, mmccomas, tbielawa
Target Milestone: ---    Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Previously, an OpenShift Ansible failure on one node would cause installation to fail entirely when verifying node registration with the master. OpenShift Ansible will now continue running on nodes which have not failed and will only ensure that passing hosts have been registered.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-31 21:10:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Wenkai Shi 2016-09-20 08:38:42 UTC
Description of problem:
Ansible playbooks continue running after getting an error.

Version-Release number of selected component (if applicable):
openshift-ansible-3.3.22

How reproducible:
100%

Steps to Reproduce:
1. Deploy an environment with two hosts (master and node).

2. Make sure one node errors during deployment.
For example: remove the yum repo on the node (after the ansible playbooks begin running).

3. Ansible gets an error on the node at the package install step, while the master succeeds:
TASK [openshift_common : Install the base package for versioning] **************
Tuesday 20 September 2016  07:41:25 +0000 (0:00:06.755)       0:21:09.607 *****
ok: [openshift-156.lab.sjc.redhat.com] => {"changed": false, "msg": "", "rc": 0, "results": ["atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64 providing atomic-openshift-3.3.0.31 is already installed"]}
fatal: [openshift-167.lab.sjc.redhat.com]: FAILED! => {"changed": true, "failed": true, "msg": "Error: Package: atomic-openshift-clients-3.3.0.31-1.git.0.aede597.el7.x86_64 (aos)\n           Requires: git\n", "rc": 1, "results": ["Loaded plugins: product-id, search-disabled-repos, subscription-manager\nThis system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.\nResolving Dependencies\n--> Running transaction check\n---> Package atomic-openshift.x86_64 0:3.3.0.31-1.git.0.aede597.el7 will be installed\n--> Processing Dependency: atomic-openshift-clients = 3.3.0.31-1.git.0.aede597.el7 for package: atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64\n--> Running transaction check\n---> Package atomic-openshift-clients.x86_64 0:3.3.0.31-1.git.0.aede597.el7 will be installed\n--> Processing Dependency: git for package: atomic-openshift-clients-3.3.0.31-1.git.0.aede597.el7.x86_64\n--> Finished Dependency Resolution\n You could try using --skip-broken to work around the problem\n You could try running: rpm -Va --nofiles --nodigest\n"]}


4. The playbook keeps on running, but no further node plays are shown.

5. Ansible fails in the end, because the master cannot find the node:
TASK [openshift_manage_node : Wait for Node Registration] **********************
Tuesday 20 September 2016  07:45:22 +0000 (0:00:03.248)       0:25:06.807 *****
ok: [openshift-156.lab.sjc.redhat.com] => (item=openshift-156.lab.sjc.redhat.com) => {"changed": false, "cmd": ["oc", "get", "node", "openshift-156.lab.sjc.redhat.com", "--config=/tmp/openshift-ansible-ottsjx/admin.kubeconfig", "-n", "default"], "delta": "0:00:00.197822", "end": "2016-09-20 03:45:26.313879", "item": "openshift-156.lab.sjc.redhat.com", "rc": 0, "start": "2016-09-20 03:45:26.116057", "stderr": "", "stdout": "NAME                               STATUS    AGE\nopenshift-156.lab.sjc.redhat.com   Ready     34s", "stdout_lines": ["NAME                               STATUS    AGE", "openshift-156.lab.sjc.redhat.com   Ready     34s"], "warnings": []}
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (50 retries left).
...
FAILED - RETRYING: TASK: openshift_manage_node : Wait for Node Registration (1 retries left).
failed: [openshift-156.lab.sjc.redhat.com] (item=openshift-167.lab.sjc.redhat.com) => {"changed": false, "cmd": ["oc", "get", "node", "openshift-167.lab.sjc.redhat.com", "--config=/tmp/openshift-ansible-ottsjx/admin.kubeconfig", "-n", "default"], "delta": "0:00:00.176960", "end": "2016-09-20 03:52:20.163324", "failed": true, "item": "openshift-167.lab.sjc.redhat.com", "rc": 1, "start": "2016-09-20 03:52:19.986364", "stderr": "Error from server: nodes \"openshift-167.lab.sjc.redhat.com\" not found", "stdout": "", "stdout_lines": [], "warnings": []}



Actual results:
After the error step, the deployment continues but fails in the end.

Expected results:
The installation should stop when it gets an error.

Comment 1 Tim Bielawa 2016-10-25 16:58:08 UTC
I was able to reproduce this error in a 1 master 1 node setup. I renamed redhat.repo to redhat.repo.orig after the playbooks started running.

> Play 22/29 (Configure nodes)
> ......................................................................
> fatal: [n01.example.com]: FAILED! => {"changed": false, "failed": true,
> "msg": "No package matching 'atomic-openshift-3.3.0.35' found available,
> installed or updated", "rc": 126,
> "results": ["No package matching 'atomic-openshift-3.3.0.35' found available, installed or updated"]}
> .....................................................................................................

Which was then followed by 

> Play 23/29 (Additional node config)

That play finished and then we get to the "Set node schedulability" play. After a few tasks we see the "Wait for Node Registration" task run and retry over and over.
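
For reference, the "Wait for Node Registration" retry pattern (see the task output in comment 0) looks roughly like this. This is only a simplified sketch, not the actual openshift_manage_node role task; kubeconfig_path and nodes_to_check are hypothetical variable names standing in for the real ones.

- hosts: masters
  gather_facts: false
  tasks:
    - name: Wait for Node Registration
      # Poll "oc get node <host>" on the master until it succeeds, matching
      # the "FAILED - RETRYING ... (50 retries left)" output in comment 0.
      command: "oc get node {{ item }} --config={{ kubeconfig_path }} -n default"
      register: node_get
      until: node_get.rc == 0
      retries: 50
      delay: 5
      with_items: "{{ nodes_to_check }}"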

This should be failing much earlier. I'll see what I can do to make that happen. I'm honestly confused as to why it **isn't** causing the install to fail already; intuitively you'd think it would. The task it failed on doesn't even have "ignore_errors: yes" set. So why does the playbook continue running?

I can see that after the schedulability task fails, no more tasks run on my example **node**. However, tasks still run on the master at that point. Per the documentation [1] this is normal behavior: "Generally playbooks will stop executing any more steps on a host that has a failure". Note that it says "on a host", not "on all hosts".

However, we're in luck. Ansible does provide us with some knobs we can frob to tune the error handling behavior. For example, there are the "max_fail_percentage" [2] and "any_errors_fatal" [3] options.

I think the "max fail %" option isn't quite what we're looking for. But the "any errors fatal" option may be more relevant. I'm going to try playing around with that and see how the behavior changes.


[1] http://docs.ansible.com/ansible/playbooks_error_handling.html#ignoring-failed-commands 
[2] http://docs.ansible.com/ansible/playbooks_delegation.html#maximum-failure-percentage
[3] http://docs.ansible.com/ansible/playbooks_error_handling.html#aborting-the-play
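
For illustration, here is a minimal sketch of how those two play-level knobs would be set (hypothetical play, group, and package names, not copied from openshift-ansible). With any_errors_fatal a single failed host aborts the play for every host; max_fail_percentage instead aborts only once the share of failed hosts crosses the given threshold.

- hosts: nodes
  any_errors_fatal: true
  # max_fail_percentage: 20   # the alternative knob: tolerate up to 20% failed hosts
  tasks:
    - name: Install the base package for versioning
      # If this fails on any host, any_errors_fatal stops the whole play
      # instead of letting the remaining hosts continue into later plays.
      yum:
        name: atomic-openshift
        state: present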

Comment 3 Tim Bielawa 2016-10-25 17:26:04 UTC
After adding the "any_errors_fatal" param to the node configuration plays, the installation aborts immediately.

> Play 22/29 (Configure nodes)
> ......................................................................
> fatal: [n01.example.com]: FAILED! => {"changed": false, "failed": true, "msg":
> "No package matching 'atomic-openshift-3.3.0.35' found available
> , installed or updated", "rc": 126, "results": ["No package matching 
> 'atomic-openshift-3.3.0.35' found available, installed or updated"]}
>
> NO MORE HOSTS LEFT
> *************************************************************
>
> localhost                  : ok=14   changed=9    unreachable=0    failed=0
> m01.example.com            : ok=343  changed=28   unreachable=0    failed=0
> n01.example.com            : ok=78   changed=3    unreachable=0    failed=1

Comment 4 Andrew Butcher 2016-11-21 19:34:47 UTC
Proposed fix: https://github.com/openshift/openshift-ansible/pull/2827

Comment 5 Tim Bielawa 2016-12-15 17:58:20 UTC
Code has been merged.

Comment 7 Wenkai Shi 2017-01-09 08:27:11 UTC
Verified with version openshift-ansible-3.4.43-1.git.0.a9dbe87.el7.noarch.

Prepared two hosts and removed one host's yum repo file to make sure the installation would fail. The installation stops when it gets an error.

[root@ansible ~]# ansible-playbook -i hosts -v /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
...
TASK [openshift_facts : Ensure yum-utils is installed] *************************
ok: [node.example.com] => {
    "changed": false, 
    "rc": 0, 
    "results": [
        "yum-utils-1.1.31-40.el7.noarch providing yum-utils is already installed"
    ]
}
fatal: [master.example.com]: FAILED! => {
    "changed": false, 
    "failed": true, 
    "rc": 126, 
    "results": [
        "No package matching 'yum-utils' found available, installed or updated"
    ]
}

MSG:

No package matching 'yum-utils' found available, installed or updated
...
PLAY RECAP *********************************************************************
localhost                  : ok=8    changed=0    unreachable=0    failed=0   
node.example.com           : ok=10   changed=0    unreachable=0    failed=1   
master.example.com         : ok=7    changed=0    unreachable=0    failed=1

Comment 9 errata-xmlrpc 2017-01-31 21:10:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0224