Bug 1534538 - Metrics Apply continues when it should fail
Summary: Metrics Apply continues when it should fail
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: 3.9.0
Assignee: Vadim Rutkovsky
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-01-15 13:08 UTC by Sebastian Mossberg
Modified: 2018-12-13 19:26 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Playbook for openshift-metrics component was not verifying the return code of some commands Consequence: Some incorrect parameters could be used to setup metrics. The playbook succedeed but metrics was incorrectly setup Fix: The playbook verifies exit code for `oc apply` commands Result: Installation fails when incorrect commands are used
Clone Of:
Environment:
Last Closed: 2018-12-13 19:26:48 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Full ansible log including the additional debug message (318.43 KB, text/plain)
2018-01-15 13:08 UTC, Sebastian Mossberg
no flags Details
ansible log (2.33 MB, text/plain)
2018-01-22 14:02 UTC, Junqi Zhao
no flags Details
ansible running log, it throws out the failed error (2.30 MB, text/plain)
2018-01-26 01:13 UTC, Junqi Zhao
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3748 0 None None None 2018-12-13 19:26:58 UTC

Description Sebastian Mossberg 2018-01-15 13:08:19 UTC
Created attachment 1381434 [details]
Full ansible log including the additional debug message

Description of problem:

The openshift-metrics role does not detect error from the "oc create " when applying templates for ReplicationControllers. With an incorrect inventory, this results in a completed install, showing no errors, but the RC is not existing afterwards.  

This causes great confusion when deploying metrics through the installer. 

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.7.14-1.git.0.4b35b2d.el7.noarch

rpm -q ansible
# rpm -q ansible
ansible-2.4.1.0-1.el7.noarch

ansible --version
[root@bastion ~]# ansible --version
ansible 2.4.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

How reproducible:

Steps to Reproduce:
1. Set wrong settings in the inventory file.
### Metrics ###
openshift_metrics_install_metrics=true

---
#- Heapster
openshift_metrics_heapster_nodeselector={'region': 'infra'}

openshift_metrics_heapster_requests_memory=4Gi  <-- request is larger than limit
openshift_metrics_heapster_limits_memory=1Gi
openshift_metrics_heapster_requests_cpu=1000m
openshift_metrics_heapster_limits_cpu=1000m

2. Run the openshift-metrics playbook
# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml


Actual results:
The installer finishes without indicating any errors, although the command in Apply task in roles/openshift_metrics/tasks/oc_apply.yaml fails.

Expected results:
The installer should fail or warn that applying the tempalte failed and display the errors message from the "oc apply". 



Additional info:
I've added a debug task that displays the output from the apply command in roles/openshift_metrics/tasks/oc_apply.yaml.

TASK [openshift_metrics : Applying /tmp/openshift-metrics-ansible-8ypfqX/templates/metrics-heapster-rc.yaml] **************************************************************************************************
ok: [master1.openshift.mitzicom.int.m0sslab.org]

TASK [openshift_metrics : Prinout apply] **********************************************************************************************************************************************************************
ok: [master1.openshift.mitzicom.int.m0sslab.org] => {
"msg": {
"changed": false,
"cmd": [
"oc",
"--config=/tmp/openshift-metrics-ansible-8ypfqX/admin.kubeconfig",
"apply",
"-f",
"/tmp/openshift-metrics-ansible-8ypfqX/templates/metrics-heapster-rc.yaml",
"-n",
"openshift-infra"
],
"delta": "0:00:02.524603",
"end": "2018-01-12 17:36:20.996628",
"failed": false,
"failed_when_result": false,
"msg": "non-zero return code",
"rc": 1,
"start": "2018-01-12 17:36:18.472025",
"stderr": "The ReplicationController \"heapster\" is invalid: spec.template.spec.containers[0].resources.limits: Invalid value: \"1Gi\": must be greater than or equal to memory request", 
"stderr_lines": [
"The ReplicationController \"heapster\" is invalid: spec.template.spec.containers[0].resources.limits: Invalid value: \"1Gi\": must be greater than or equal to memory request" 
],
"stdout": "",
"stdout_lines": []
}
}

TASK [openshift_metrics : Determine change status of ReplicationController heapster] **************************************************************************************************************************
ok: [master1.openshift.mitzicom.int.m0sslab.org]


Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Sebastian Mossberg 2018-01-15 13:11:33 UTC
A suggestion would be to expand the failed when: to include "Invalid" as well, which would catch this specific error.

 https://github.com/openshift/openshift-ansible/blob/8de2b99d915fadc6579320b7b74354ab33213067/roles/openshift_metrics/tasks/oc_apply.yaml#L19

Comment 2 Vadim Rutkovsky 2018-01-17 11:18:44 UTC
Created a PR for master: https://github.com/openshift/openshift-ansible/pull/6751

It seems the safest way to check rc code

Comment 3 Vadim Rutkovsky 2018-01-18 10:08:13 UTC
Created PR for 3.7: https://github.com/openshift/openshift-ansible/pull/6769

Comment 4 Scott Dodson 2018-01-19 20:51:44 UTC
Since there are two pull requests i've marked this BZ as 3.9.0 target release and moved it to modified ON_QA so they can begin testing it in 3.9. Once verified we can backport the fix and clone the bug.

Comment 5 Junqi Zhao 2018-01-22 14:02:11 UTC
Set the paramters in inventory
openshift_metrics_heapster_nodeselector={'region': 'infra'}
openshift_metrics_heapster_requests_memory=4Gi  <-- request is larger than limit
openshift_metrics_heapster_limits_memory=1Gi
openshift_metrics_heapster_requests_cpu=1000m
openshift_metrics_heapster_limits_cpu=1000m

although there is warn info in log, the installation process is passed, there is not fail or warn that applying the tempalte failed and display the errors message from the "oc apply". 
***********************************************************
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2018-01-22 08:47:32.080078", 
    "stderr": "The ReplicationController \"heapster\" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: \"4Gi\": must be less than or equal to memory limit", 
    "stderr_lines": [
        "The ReplicationController \"heapster\" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: \"4Gi\": must be less than or equal to memory limit"
    ], 
    "stdout": "", 
    "stdout_lines": []
}
**********************************************************
PLAY RECAP *******************************************************************************
MASTER : ok=22   changed=0    unreachable=0    failed=0   
NODE : ok=198  changed=44   unreachable=0    failed=0   
localhost                  : ok=12   changed=0    unreachable=0    failed=0   


INSTALLER STATUS ********************************************************************************
Initialization             : Complete (0:00:21)
Metrics Install            : Complete (0:01:20)

more info please the ansible log
# rpm -qa | grep openshift-ansible
openshift-ansible-docs-3.9.0-0.22.0.git.0.0e9d896.el7.noarch
openshift-ansible-playbooks-3.9.0-0.22.0.git.0.0e9d896.el7.noarch
openshift-ansible-3.9.0-0.22.0.git.0.0e9d896.el7.noarch
openshift-ansible-roles-3.9.0-0.22.0.git.0.0e9d896.el7.noarch

Comment 6 Junqi Zhao 2018-01-22 14:02:46 UTC
Created attachment 1384448 [details]
ansible log

Comment 7 Vadim Rutkovsky 2018-01-22 17:19:29 UTC
(In reply to Junqi Zhao from comment #5)
> Set the paramters in inventory
> openshift_metrics_heapster_nodeselector={'region': 'infra'}
> openshift_metrics_heapster_requests_memory=4Gi  <-- request is larger than
> limit
> openshift_metrics_heapster_limits_memory=1Gi
> openshift_metrics_heapster_requests_cpu=1000m
> openshift_metrics_heapster_limits_cpu=1000m
> 
> although there is warn info in log, the installation process is passed,
> there is not fail or warn that applying the tempalte failed and display the
> errors message from the "oc apply". 


Good catch, Ansible considered these clauses as 'ANDs' while I expected those to be ORs.

Create PR https://github.com/openshift/openshift-ansible/pull/6815 to fix this

Comment 8 openshift-github-bot 2018-01-24 15:50:50 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/0344a8f71af49cdf3827ccf6be339097369c12f7
Merge pull request #6815 from vrutkovs/failed_when-rc-to-int

Automatic merge from submit-queue.

failed_when lists are implicitely ANDs, not ORs

For some reason I believed `failed_when` lists are considered to be ORs, but it turns out these are ANDs.

Fixes bug 1534538
Relates to #6751

Comment 9 Vadim Rutkovsky 2018-01-25 07:35:08 UTC
The fix is available in openshift-ansible-3.9.0-0.24.0.git.0.735690f.el7

Comment 10 Junqi Zhao 2018-01-26 01:12:47 UTC
Tested again with openshift-ansible-3.9.0-0.24.0.git.0.735690f.el7.noarch, used the same settings in Comment 7, and tested other similar scenarios, such as openshift_metrics_hawkular_requests_memory > openshift_metrics_hawkular_limits_memory.

The installer fail and warn that applying the tempalte failed and display the errors message from the "oc apply". 

    "stderr": "The ReplicationController \"heapster\" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: \"4Gi\": must be less than or equal to memory limit", 
    "stderr_lines": [
        "The ReplicationController \"heapster\" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: \"4Gi\": must be less than or equal to memory limit"
    ], 
    "stdout": "", 
    "stdout_lines": []
}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/openshift-metrics/config.retry

PLAY RECAP *******************************************************************************
MASTER : ok=148  changed=26   unreachable=0    failed=1   
NODE : ok=24   changed=0    unreachable=0    failed=0   


Attached the ansible logs

Comment 11 Junqi Zhao 2018-01-26 01:13:44 UTC
Created attachment 1386402 [details]
ansible running log, it throws out the failed error

Comment 14 errata-xmlrpc 2018-12-13 19:26:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748


Note You need to log in before you can comment on or make changes to this bug.