Bug 1340524

Summary: [RFE] Add sanity checks before starting overcloud deployment
Product: Red Hat OpenStack
Reporter: Raoul Scarazzini <rscarazz>
Component: openstack-tripleo-validations
Assignee: Florian Fuchs <flfuchs>
Status: CLOSED WONTFIX
QA Contact: nlevinki <nlevinki>
Severity: low
Docs Contact: RHOS Documentation Team <rhos-docs>
Priority: high
Version: 9.0 (Mitaka)
CC: achernet, beth.white, dbecker, dmsimard, jcoufal, jjoyce, jpichon, jrist, jschluet, mburns, mcornea, morazi, rhel-osp-director-maint, rscarazz, sclewis, slinaber, tvignaud
Target Milestone: Upstream M2
Keywords: FutureFeature, Triaged
Target Release: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Whiteboard: NeedsAllocation
Last Closed: 2018-09-04 13:39:41 UTC
Type: Bug
Bug Blocks: 1442136    

Description Raoul Scarazzini 2016-05-27 16:59:16 UTC
Description of problem:

Sometimes you lose a lot of time (up to 240 minutes) waiting for a failed overcloud deployment to complete, due to timeouts.
Most of the time those timeouts are caused by environment misconfigurations or connectivity problems that could be caught by running some sanity checks before the overcloud is deployed.

For example, today I faced this:

Stdout: u'ERROR    : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host already uses address 172.18.0.15.\n'

I found this message while looking through the os-collect-config logs on one of the controllers, after noticing that the deployment was stuck and tracing the steps from the undercloud.
Other times, if for some reason the default gateway cannot be reached from the overcloud controllers/computes, the deployment simply runs until it hits the timeout (after 240 minutes).

Having some sanity checks like these:

- Are the network connections working?
- Are the declared VLANs accessible from the undercloud and overcloud nodes?
- Are the IPs that DHCP is going to assign actually free?

before starting the deployment would save users a lot of time.
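
To make this concrete, here is a minimal sketch of what such a check could look like (only an illustration, not actual tripleo-validations code; the gateway and pool addresses below are made up):

===
#!/usr/bin/env python3
# Sketch of a pre-deployment sanity check: verify that the default
# gateway answers and that the addresses DHCP will hand out are free.
import subprocess
import sys

GATEWAY = "172.18.0.1"                      # hypothetical example value
DHCP_POOL = ["172.18.0.15", "172.18.0.16"]  # hypothetical example values

def pings(ip):
    """Return True if `ip` answers one ICMP echo within 1 second."""
    return subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

errors = []
if not pings(GATEWAY):
    errors.append("default gateway %s is unreachable" % GATEWAY)
for ip in DHCP_POOL:
    # If an address in the pool already answers, some other host is
    # using it -- exactly the ifup-eth failure quoted above.
    if pings(ip):
        errors.append("address %s is already in use" % ip)

if errors:
    print("Sanity checks failed:\n  - " + "\n  - ".join(errors))
    sys.exit(1)
print("Sanity checks passed.")
===

Failing fast like this would turn a 240-minute timeout into an error within a few seconds of starting the deployment.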

Comment 4 Julie Pichon 2017-04-20 14:31:03 UTC
This may or may not be relevant, but for additional information: there are "pre-deployment checks" that are run before starting a deployment on the CLI at the moment. These checks were recently (Pike) migrated to workflows/actions:

https://github.com/openstack/tripleo-common/blob/9ae174/workbooks/validations.yaml#L642

There doesn't seem to be any overlap with the sanity checks requested in the description; however, it may be worthwhile to figure out whether the new checks fit better as validations or alongside these in tripleo-common (or whether the tripleo-common checks should perhaps become proper validations in the future?).

Comment 5 Florian Fuchs 2017-05-31 11:20:18 UTC
In addition to the changes mentioned in comment #4 there is also some work in progress for a network environment validation, including checking for duplicate IPs, network overlaps and more: 

https://review.openstack.org/#/c/341586/
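
For reference, checks like duplicate-IP and network-overlap detection can be done statically against the parsed network environment before anything is deployed. A rough sketch of the idea (this is not the code under review; the addresses are invented):

===
# Sketch: detect duplicate static IPs and overlapping CIDRs in a
# parsed network environment, before anything is deployed.
import ipaddress
from itertools import combinations

static_ips = ["172.18.0.15", "172.18.0.16", "172.18.0.15"]  # invented
networks = {"InternalApi": "172.18.0.0/24",
            "Storage": "172.18.0.128/25"}  # invented; overlaps InternalApi

# Duplicate IP detection: any address assigned more than once.
seen = set()
for ip in map(ipaddress.ip_address, static_ips):
    if ip in seen:
        print("duplicate IP: %s" % ip)
    seen.add(ip)

# Network overlap detection: pairwise CIDR comparison.
for (n1, c1), (n2, c2) in combinations(networks.items(), 2):
    if ipaddress.ip_network(c1).overlaps(ipaddress.ip_network(c2)):
        print("networks %s (%s) and %s (%s) overlap" % (n1, c1, n2, c2))
===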

Comment 7 David Moreau Simard 2017-07-24 14:20:10 UTC
Just sharing something I thought was nice UX when I was deploying OpenShift with OpenShift-Ansible; maybe we could draw some ideas from it.

I was running our integration jobs and stumbled on the following error:
===
14:28:30 PLAY RECAP *********************************************************************
14:28:30 localhost                  : ok=8    changed=0    unreachable=0    failed=0   
14:28:30 registry.rdoproject.org    : ok=91   changed=7    unreachable=0    failed=1   
14:28:30 
14:28:30 
14:28:30 Failure summary:
14:28:30 
14:28:30   1. Host:     registry.rdoproject.org
14:28:30      Play:     Verify Requirements
14:28:30      Task:     openshift_health_check
14:28:30      Message:  One or more checks failed
14:28:30      Details:  check "memory_availability":
14:28:30                Available memory (7.8 GiB) is too far below recommended value (16.0 GiB)
14:28:30 
14:28:30 The execution of "openshift-ansible/playbooks/byo/config.yml"
14:28:30 includes checks designed to fail early if the requirements
14:28:30 of the playbook are not met. One or more of these checks
14:28:30 failed. To disregard these results, you may choose to
14:28:30 disable failing checks by setting an Ansible variable:
14:28:30 
14:28:30    openshift_disable_check=memory_availability
14:28:30 
14:28:30 Failing check names are shown in the failure details above.
14:28:30 Some checks may be configurable by variables if your requirements
14:28:30 are different from the defaults; consult check documentation.
14:28:30 Variables can be set in the inventory or passed on the
14:28:30 command line using the -e flag to ansible-playbook.
14:28:31 ERROR: InvocationError: '/home/jenkins/workspace/rdo-registry-integration/rdo-infra/rdo-container-registry/.tox/ansible-playbook/bin/ansible-playbook -b -i hosts openshift-ansible/playbooks/byo/config.yml -e ansible_ssh_user=jenkins'
===

It turns out they have a custom Ansible action plugin [1] that can run a variety of checks at any point during their roles or playbooks [2], such as, in this case, checking for memory availability [3].

[1]: https://github.com/openshift/openshift-ansible/blob/1da90af63656f127b21720248b5c8c25ebc728ed/roles/openshift_health_checker/action_plugins/openshift_health_check.py
[2]: https://github.com/openshift/openshift-ansible/blob/7b0acaff56fe08e4d302a86ed4b00db6739f16a9/playbooks/common/openshift-cluster/config.yml#L11
[3]: https://github.com/openshift/openshift-ansible/blob/1da90af63656f127b21720248b5c8c25ebc728ed/roles/openshift_health_checker/openshift_checks/memory_availability.py
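
The core pattern is simple enough to sketch; here is a rough illustration of what such a memory check boils down to (my own sketch, not the openshift-ansible implementation; the disable knob mirrors the openshift_disable_check variable shown in the output above):

===
# Sketch: read total memory from /proc/meminfo and fail early, with
# an escape hatch, if it is below a recommended minimum.
RECOMMENDED_GIB = 16.0
disabled_checks = set()  # e.g. {"memory_availability"} to skip it

def total_memory_gib():
    """Parse MemTotal (reported in kB) out of /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024.0 ** 2)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

if "memory_availability" not in disabled_checks:
    available = total_memory_gib()
    if available < RECOMMENDED_GIB:
        raise SystemExit(
            'check "memory_availability": available memory '
            "(%.1f GiB) is too far below recommended value (%.1f GiB)"
            % (available, RECOMMENDED_GIB))
===

Running a set of such checks in a dedicated "Verify Requirements" play, before any real work starts, is what produces the friendly failure summary shown above.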