Bug 1538445

Summary:	Need to do some pre-check to make sure there are available nodes to deploy TSB and web console pods
Product:	OpenShift Container Platform	Reporter:	Weihua Meng <wmeng>
Component:	Installer	Assignee:	Vadim Rutkovsky <vrutkovs>
Status:	CLOSED ERRATA	QA Contact:	Weihua Meng <wmeng>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.9.0	CC:	aos-bugs, dmoessne, jialiu, jokerman, mmccomas, wmeng
Target Milestone:	---
Target Release:	3.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-03-28 14:22:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Weihua Meng 2018-01-25 06:09:45 UTC

Description of problem:
Need to check node labels and node selectors before installing, or the installation may fail after running for dozens of minutes.
Some components, such as the template service broker, use the default node selector "region=infra"; if no node has that label, the installation will fail.
Running an installation for dozens of minutes only to get a failure is quite frustrating, especially for customers.
  
Version-Release number of the following components:


How reproducible:

Steps to Reproduce:
1.


Actual results:


Expected results:
Installation fails at an early stage and shows the problem.

Comment 1 Johnny Liu 2018-01-26 01:26:43 UTC

When a customer does not set template_service_broker_selector and openshift_web_console_nodeselector in the inventory host file, and also does not define a "region=infra" node in the cluster, the installation exits with a failure at the "Verify that TSB is running" or "Verify that the web console is running" task. That is because the installer tries to deploy the TSB and web console pods onto "region=infra" nodes by default, but there are no such nodes in the cluster.

Once the installer exits, there is no prompt telling the user what went wrong, and a lot of time has been wasted running the installation.

As a user experience enhancement, openshift-ansible should run a pre-check early in the installation to make sure there are nodes available to deploy the TSB and web console.
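
The kind of pre-check requested here can be sketched roughly as follows. This is only an illustration of the matching logic, not the openshift-ansible implementation: the `Node` type, `matching_nodes`, and `precheck` are hypothetical names, and the node list is assumed to have been collected from the cluster or inventory beforehand.

```python
from typing import Dict, List, NamedTuple


class Node(NamedTuple):
    """Hypothetical, simplified view of a cluster node."""
    name: str
    labels: Dict[str, str]
    schedulable: bool = True


def matching_nodes(nodes: List[Node], selector: Dict[str, str]) -> List[str]:
    """Return names of schedulable nodes whose labels satisfy every selector pair."""
    return [
        n.name
        for n in nodes
        if n.schedulable
        and all(n.labels.get(k) == v for k, v in selector.items())
    ]


def precheck(component: str, nodes: List[Node], selector: Dict[str, str]) -> None:
    """Raise early if no schedulable node matches the component's selector."""
    if not matching_nodes(nodes, selector):
        raise RuntimeError(
            f"No schedulable nodes found matching node selector "
            f"for {component}: {selector}"
        )
```

Calling `precheck("Template Service Broker", nodes, {"region": "infra"})` before any deployment task would then fail immediately when no node carries the label, instead of after the retry loop on the health check.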

Comment 2 Vadim Rutkovsky 2018-02-06 13:38:44 UTC

Created https://github.com/openshift/openshift-ansible/pull/7022 to fix this.

The PR won't check the web console node selector, as by default it's set to the k8s master and it might not be specified in the Ansible inventory.

Comment 3 Vadim Rutkovsky 2018-02-16 09:53:43 UTC

A partial fix for this is available in openshift-ansible-3.9.0-0.45.0.git.0.05f6826.el7. It only checks the labels set in the inventory, so the web console check is not enabled yet.

Comment 5 Weihua Meng 2018-02-22 03:27:39 UTC

Not fixed.
openshift-ansible-3.9.0-0.45.0.git.0.05f6826.el7.noarch.rpm

Installation failed and exited after running for 21 minutes.

Steps:
Set a nonexistent node label for the template service broker in the inventory file:
template_service_broker_selector={"label123": "cannotfind"}


TASK [template_service_broker : Ensure that Template Service Broker has nodes to run on] ***
Thursday 22 February 2018  03:09:20 +0000 (0:00:00.066)       0:09:25.007 ***** 
skipping: [host-xxxx.redhat.com] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [template_service_broker : Apply template file] ***************************
Thursday 22 February 2018  03:09:24 +0000 (0:00:00.186)       0:09:28.636 ***** 

changed: [host-xxxx.redhat.com] => {"changed": true, "cmd": "oc process --config=/tmp/tsb-ansible-ii4vC4/admin.kubeconfig -f \"/tmp/tsb-ansible-ii4vC4/apiserver-template.yaml\" --param API_SERVER_CONFIG=\"kind: TemplateServiceBrokerConfig\napiVersion: config.templateservicebroker.openshift.io/v1\ntemplateNamespaces:\n- openshift\n\" --param IMAGE=\"registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.0\" --param NODE_SELECTOR='{\"label123\": \"cannotfind\"}' | oc apply --config=/tmp/tsb-ansible-ii4vC4/admin.kubeconfig -f -", "delta": "0:00:00.555409", "end": "2018-02-21 22:09:25.039356", "rc": 0, "start": "2018-02-21 22:09:24.483947", "stderr": "", "stderr_lines": [], "stdout": "daemonset \"apiserver\" created\nconfigmap \"apiserver-config\" created\nserviceaccount \"apiserver\" created\nservice \"apiserver\" created\nserviceaccount \"templateservicebroker-client\" created\nsecret \"templateservicebroker-client\" created", "stdout_lines": ["daemonset \"apiserver\" created", "configmap \"apiserver-config\" created", "serviceaccount \"apiserver\" created", "service \"apiserver\" created", "serviceaccount \"templateservicebroker-client\" created", "secret \"templateservicebroker-client\" created"]}

TASK [template_service_broker : Verify that TSB is running] ********************
Thursday 22 February 2018  03:09:25 +0000 (0:00:00.809)       0:09:30.194 ***** 

FAILED - RETRYING: Verify that TSB is running (60 retries left).

FAILED - RETRYING: Verify that TSB is running (59 retries left).

...

FAILED - RETRYING: Verify that TSB is running (2 retries left).

FAILED - RETRYING: Verify that TSB is running (1 retries left).

fatal: [host-xxxx.redhat.com]: FAILED! => {"attempts": 60, "changed": false, "cmd": ["curl", "-k", "https://apiserver.openshift-template-service-broker.svc/healthz"], "delta": "0:00:01.024979", "end": "2018-02-21 22:15:35.435460", "msg": "non-zero return code", "rc": 7, "start": "2018-02-21 22:15:34.410481", "stderr": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\r  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused", "stderr_lines": ["  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current", "                                 Dload  Upload   Total   Spent    Left  Speed", "", "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0", "  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused"], "stdout": "", "stdout_lines": []}


# oc get all -n openshift-template-service-broker
NAME           DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR         AGE
ds/apiserver   0         0         0         0            0           label123=cannotfind   17m

NAME            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
svc/apiserver   ClusterIP   172.30.251.209   <none>        443/TCP   17m
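
The DESIRED count of 0 above is the actual symptom: the daemonset's node selector matches no node, so no pod is ever scheduled and the health check can only time out. A minimal sketch of turning that state into a readable hint is below. It assumes the daemonset has been fetched as JSON (for example with `oc get ds apiserver -n openshift-template-service-broker -o json`) and parsed into a dict; `diagnose_daemonset` is a hypothetical helper, not part of the installer.

```python
from typing import Any, Dict, Optional


def diagnose_daemonset(ds: Dict[str, Any]) -> Optional[str]:
    """Return a hint if the DaemonSet wants zero pods, else None.

    `ds` is assumed to be the parsed JSON of a DaemonSet object.
    """
    # status.desiredNumberScheduled is the number of nodes that should run the pod.
    desired = ds.get("status", {}).get("desiredNumberScheduled", 0)
    if desired > 0:
        return None
    selector = (
        ds.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("nodeSelector", {})
    )
    name = ds.get("metadata", {}).get("name", "<unknown>")
    return f"DaemonSet {name} has no eligible nodes for nodeSelector {selector}"
```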

Comment 6 Vadim Rutkovsky 2018-02-22 12:01:46 UTC

Would you mind trying that on openshift-ansible-3.9.0-0.47.0.git.0.f8847bb.el7?

The fix in https://github.com/openshift/openshift-ansible/pull/7220 might have fixed it.

Comment 7 Weihua Meng 2018-02-22 12:46:17 UTC

failed after running 15 mins for 1 master + 1 node, could be longer for ha cluster.
can we have an early notice/failure for that?

openshift-ansible-3.9.0-0.47.0.git.0.f8847bb.el7.noarch.rpm

TASK [template_service_broker : Ensure that Template Service Broker has nodes to run on] ***
Thursday 22 February 2018  12:33:44 +0000 (0:00:00.065)       0:10:04.752 ***** 
fatal: [host-xxxx.redhat.com]: FAILED! => {"changed": false, "msg": "No schedulable nodes found matching node selector for Template Service Broker - '{'label123': 'cannotfind'}'"}

PLAY RECAP *********************************************************************
host-xxxx.redhat.com : ok=626  changed=249  unreachable=0    failed=1   
host-xxxx.redhat.com : ok=149  changed=52   unreachable=0    failed=0

Comment 8 Vadim Rutkovsky 2018-02-22 12:54:47 UTC

(In reply to Weihua Meng from comment #7)
> failed after running 15 mins for 1 master + 1 node, could be longer for ha
> cluster.
> can we have an early notice/failure for that?

Unfortunately I don't think we can do this earlier:

* component playbooks can be run independently - if we embed the check in playbooks/deploy_cluster.yml it won't show when just playbooks/openshift-service-catalog/config.yml is being run
* existing node labels are being set later on, so we can't fully predict which labels would be set

This is a tradeoff, of course; we'll try to come up with better checks later on, though.
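
To illustrate the second point: an earlier check could only predict the final labels by overlaying the labels the installer plans to set on top of the labels nodes currently have. The sketch below shows that overlay; `predicted_matches` and both input mappings are hypothetical, and the result is only a prediction, since labels applied later in the run may still differ.

```python
from typing import Dict, List

Labels = Dict[str, str]


def predicted_matches(
    current: Dict[str, Labels],
    planned: Dict[str, Labels],
    selector: Labels,
) -> List[str]:
    """Return nodes predicted to match `selector` once planned labels are applied.

    `current` maps node name to labels it has now; `planned` maps node name
    to labels that are expected to be set later (assumed inputs).
    """
    matches = []
    for node in sorted(set(current) | set(planned)):
        # Planned labels override current ones with the same key.
        merged = {**current.get(node, {}), **planned.get(node, {})}
        if all(merged.get(k) == v for k, v in selector.items()):
            matches.append(node)
    return matches
```

A node that does not match today can match after labeling, and the reverse is also possible, which is why a check placed before the labeling step cannot be fully reliable.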

Comment 9 Scott Dodson 2018-02-28 20:13:30 UTC

I don't think we can do any better before 3.9.

Comment 12 errata-xmlrpc 2018-03-28 14:22:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489